Seminar in Computational Linguistics

  • Date: –15:00
  • Location: Engelska parken 9-3042
  • Lecturer: Dirk Hovy
  • Contact person: Miryam de Lhoneux
  • Seminar

Hidden Biases. Ethical Issues in NLP, and What to Do about Them

Language is probably the most human endeavor: through language, we fundamentally express who we are. Precisely because we express ourselves through language, we can use language to infer information about the authors of texts. This property makes text a fantastic resource for research into the complexity of the human mind, from social sciences to humanities.

However, it is exactly that human property of text that also creates some ethical problems. While we can make use of the way text reflects its authors' biases, those biases can also have unintended consequences for our analysis, and statistical models magnify them. If our data does not reflect the population we want to study, or if we do not pay attention to biases enshrined in language, we can easily draw the wrong conclusions and create disadvantages for our subjects.

In this talk, I will discuss four types of bias that affect the statistical analysis of text, their sources, and potential countermeasures. First, I will cover biases stemming from data, i.e., selection bias (if our texts do not adequately reflect the population we want to study) and label bias (if the labels we use are skewed).

We will then look at biases deriving from the models themselves, i.e., their tendency to amplify any imbalances that are present in the data.

Finally, we will look at design bias, i.e., the biases arising from our (the researchers') decisions about which topics to analyze, which data sets to use, and what to do with them.

For each bias, I will provide examples and discuss the possible ramifications for a wide range of applications.

Over the last few years, though, a growing body of work has not only uncovered such biases but also shown various ways to address and counteract them, ranging from simple labeling considerations to new types of models.

I hope to leave the audience with a better, more nuanced understanding of the possible pitfalls in working with text, but also with a sense of how effectively these biases can be addressed with a little bit of forethought.