Reading time: 7 minutes
Algorithmic bias results from flawed data, flawed processing of that data, or both. It can lead intelligent systems to discriminate against certain groups of people or minorities. One example is discrimination against female applicants in an automated selection procedure. But how does flawed data come about?
Today, machine learning and algorithms are the basis for decisions that influence individual fates or entire population groups. Intelligent assistants calculate the suitability of applicants, work out the most efficient route or detect obstacles for self-driving cars, and identify cancer on X-ray images. Data is the blood in the veins of such machines: it is the basis for self-learning systems and the ultimate template for all later calculations and recommendations.
Modern learning algorithms use predefined collections of information (e.g. texts or images) to recognize patterns or logical connections and to uncover regularities on which later decisions can be based. They learn from examples. An algorithm is therefore only as good as the information on which it is based. This fact becomes a challenge as machine learning spreads.
"However, an algorithm is only as good as the data it works with." (translated from English, Barocas/Selbst)
Discrimination through intelligent systems
Data is generated and processed by humans and, just like its creators, it is not perfect. For example, it can reflect widespread prejudices or capture only certain groups of people. If an intelligent system works on the basis of such a data set, the result is often discrimination.
"Algorithmic bias occurs when a computer system reflects the implicit values of the people involved in coding, collecting, selecting, or using data to train the algorithm." (translated from English, Wikipedia)
There are numerous examples that demonstrate this challenge. Almost all large tech companies working with AI have already encountered the problem. In 2015, a Google algorithm identified people with darker skin as gorillas. In October 2018, Amazon made headlines because an intelligent system rejected applications that contained the words "women's" or "women's college".
The serious consequences that algorithmic bias can have, for example in image recognition, are demonstrated by computer scientist Joy Buolamwini.
It is left to one's imagination what would happen if a self-driving car used similar software to identify obstacles.
How algorithmic bias arises
But how does algorithmic bias arise? Solon Barocas of Cornell University and Andrew D. Selbst of Yale Law School define five technical mechanisms in the processing of data that can affect how meaningful the results are.
For those who would like to study the technical mechanisms in detail, we recommend reading the 56-page paper Big Data's Disparate Impact (PDF) by Solon Barocas and Andrew D. Selbst. For the sake of completeness, it should be mentioned that the two authors refer to data mining, a close relative of machine learning. The aim of both methods is to identify patterns in data. Data mining aims to discover new patterns, whereas machine learning aims to recognize known ones.
"By definition, data mining is always a form of statistical (and therefore seemingly reasonable) discrimination. The purpose of data mining is to create a rational basis on which to distinguish between individuals (...)." (translated from English, Barocas/Selbst)
A simplified version of the most important error sources is as follows:
1. Subjective definition of target variables
A target variable translates a problem into a question. It defines what a data scientist wants to find out. Even when done by an expert, defining the target variable is a challenging, subjective process that can lead to discrimination, even unintentionally. It is not for nothing that this step is referred to as the "art of data mining". Let us assume, for example, that the target variable is the best employee in the company. To identify this person, the word "best" must first be expressed in measurable values. This definition can be shaped by the individual perspective of the data scientist and thus lead to discrimination.
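The "best employee" example can be made concrete with a minimal sketch. The employee records, the field names and the three scoring rules below are all invented for illustration and are not taken from the cited paper; the point is only that equally measurable definitions of "best" can select different people.

```python
# Hypothetical employee records; every name and number is invented for illustration.
employees = [
    {"name": "A", "units_sold": 120, "hours_worked": 60, "customer_rating": 3.9},
    {"name": "B", "units_sold": 100, "hours_worked": 40, "customer_rating": 4.6},
    {"name": "C", "units_sold": 90, "hours_worked": 30, "customer_rating": 4.2},
]


def best_by(records, score):
    """Return the name of the record with the highest score."""
    return max(records, key=score)["name"]


# Three plausible ways to turn "best" into a measurable target variable.
best_total_sales = best_by(employees, lambda e: e["units_sold"])
best_sales_per_hour = best_by(employees, lambda e: e["units_sold"] / e["hours_worked"])
best_rating = best_by(employees, lambda e: e["customer_rating"])

print(best_total_sales, best_sales_per_hour, best_rating)  # prints: A C B
```

In this toy data, a target based on total sales favors whoever works the most hours, while the per-hour and rating-based targets each pick someone else. Which definition is chosen is a judgment call, and that is where the data scientist's perspective enters.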
2. Incorrect handling of training data
Modern algorithms based on machine learning require training data (to train the algorithms) and test data (to test how well they work).
- Incorrect labeling: In some cases, training data is labeled by a human (supervised learning). That person decides in advance which picture shows a dog and which picture shows a cat. If this assignment is incorrect, it directly influences the learning outcome.
- Sample bias: Most of the training data set covers one part of the population (such as light-skinned people), while another part (such as dark-skinned people) is under-represented. The system then produces better results on average for light-skinned people.
- Historical bias: This occurs when an algorithm is trained on an old data set that carries values and moral concepts from the past (such as the role of women).
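Sample bias in particular can be illustrated with a small sketch. The data, the group names "majority" and "minority" and the one-feature threshold classifier below are assumptions made for illustration only; real systems are far more complex. The sketch assumes that the true decision boundary differs between the two groups and that the minority group is under-represented in the training data.

```python
from collections import defaultdict

# Toy data, invented for illustration: (group, feature value, true label).
# The majority group dominates the training set; for the minority group
# the true boundary between label 0 and label 1 lies lower.
train = [("majority", x, 0) for x in (2, 3, 4, 5)]
train += [("majority", x, 1) for x in (6, 7, 8, 9)]
train += [("minority", 2, 0), ("minority", 4, 1)]

# Balanced test set with four samples per group.
test = [
    ("majority", 3, 0), ("majority", 5, 0), ("majority", 6, 1), ("majority", 8, 1),
    ("minority", 1, 0), ("minority", 3, 0), ("minority", 4, 1), ("minority", 5, 1),
]


def predict(x, threshold):
    """Predict label 1 if the feature value exceeds the threshold."""
    return 1 if x > threshold else 0


def fit_threshold(samples):
    """Pick the threshold that makes the fewest errors on all training samples."""
    candidates = sorted({x for _, x, _ in samples})

    def errors(t):
        return sum(predict(x, t) != y for _, x, y in samples)

    return min(candidates, key=errors)


def accuracy_by_group(samples, threshold):
    """Compute the share of correct predictions separately for each group."""
    hits, totals = defaultdict(int), defaultdict(int)
    for group, x, y in samples:
        totals[group] += 1
        hits[group] += int(predict(x, threshold) == y)
    return {group: hits[group] / totals[group] for group in totals}


threshold = fit_threshold(train)
group_accuracy = accuracy_by_group(test, threshold)
print(threshold, group_accuracy)  # prints: 5 {'majority': 1.0, 'minority': 0.5}
```

Because the training error is dominated by the majority group, the learned threshold fits that group perfectly and misclassifies half of the minority test samples. Reporting accuracy per group, rather than one overall number, is what makes the gap visible.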
3. Inaccurate feature selection
Feature selection is the decision about which attributes to consider and include in the analysis of data. It is considered impossible to capture all attributes of a subject or to account for all environmental factors in a model. As a result, relevant details may not receive enough attention, and the resulting recommendations may be inaccurate. Suppose we want to find the most suitable candidate for an open position. A degree from an elite university is defined as a qualifying criterion. However, neither the final grade nor the duration of study is taken into account. Because these features are ignored, the best candidate may not be identified. It is therefore crucial to include the context and to find the right balance between the number of features and the size of the data set.
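The hiring example can be sketched in a few lines. The applicants, the grade scale (0 to 100, higher is better), the nominal study length of four years and the weights are all hypothetical choices made for illustration, not a validated scoring model.

```python
# Hypothetical applicants; all values are invented for illustration.
# final_grade is assumed to be on a 0-100 scale where higher is better.
candidates = [
    {"name": "P", "elite_degree": True, "final_grade": 62, "years_of_study": 7},
    {"name": "Q", "elite_degree": False, "final_grade": 95, "years_of_study": 4},
    {"name": "R", "elite_degree": True, "final_grade": 70, "years_of_study": 6},
]

NOMINAL_YEARS = 4  # assumed standard duration of the degree


def score_degree_only(c):
    """Use the elite degree as the only feature."""
    return 1.0 if c["elite_degree"] else 0.0


def score_with_context(c):
    """Also consider final grade and study duration (illustrative weights)."""
    degree = 1.0 if c["elite_degree"] else 0.0
    grade = c["final_grade"] / 100
    pace = min(1.0, NOMINAL_YEARS / c["years_of_study"])
    return 0.2 * degree + 0.5 * grade + 0.3 * pace


def ranking(records, score):
    """Return names ordered from highest to lowest score."""
    return [c["name"] for c in sorted(records, key=score, reverse=True)]


rank_degree_only = ranking(candidates, score_degree_only)
rank_with_context = ranking(candidates, score_with_context)
print(rank_degree_only)   # prints: ['P', 'R', 'Q']
print(rank_with_context)  # prints: ['Q', 'R', 'P']
```

With the degree as the only feature, applicant Q ends up last and P and R cannot be told apart. Once grade and study duration are included, Q moves to the top. The weights themselves are another subjective choice, so adding features shifts the problem rather than removing it.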
4. Masking / hidden discrimination
Masking refers to deliberate, concealed discrimination by prejudiced decision-makers, for example when a programmer intentionally distorts a data collection.
"A biased programmer could deliberately implement discrimination, for example by inserting discriminatory features in the definition of target variables." (translated from English, Barocas/Selbst)
The fight against algorithmic bias
Collecting and generating large amounts of data takes a lot of time (and money). Many data scientists therefore fall back on existing collections of information and download them from the Internet. As a result, biased data sets spread rapidly and influence many different systems worldwide. More than 15 million engineers have imported Word2Vec, a word library provided by Google that is known to contain all sorts of historical prejudices.
The high costs reduce the motivation of those responsible to rebuild flawed data sets from scratch. Because the algorithms are often closely guarded company secrets, it is also difficult for victims of discrimination to produce legally valid evidence or to gain access to the data or the way it is processed.
This situation and the human factor in algorithmic bias are currently the subject of intense debate among scientists, experts, politicians and journalists. Organizations such as the Algorithmic Justice League or AI Now are actively committed to fighting algorithmic bias. Initial proposals call, for example, for a diversification of the industry, which to this day employs predominantly white, male specialists. Other experts suggest comprehensive legal measures that would, for example, force companies to make their algorithms transparent and to publish them.
Conclusion: Artificial intelligence and machine learning are only as good as the people who design them. With the growing popularity of new technologies, data scientists and programmers are under more public scrutiny than ever. However, neither placing these professionals under general suspicion nor bringing more diversity to the industry solves the problem of algorithmic bias on its own. Diversity and empowerment are important, but every human being, regardless of origin or gender, can be influenced by conscious or unconscious prejudice. Therefore, the technical process of data processing must be questioned and, where possible, improved. A legal obligation to be transparent can also motivate companies to prioritise and improve the quality of data and its processing.
- Picture: Photo by Bekir Mülazımoğlu, EyeEm
- Man is to Computer Programmer as Woman is to Homemaker? Debiasing Word Embeddings, Tolga Bolukbasi, Kai-Wei Chang, James Zou, Venkatesh Saligrama, Adam Kalai, arxiv.org/pdf/1607.06520.pdf
- Teaching Fairness to Artificial Intelligence: Existing and Novel Strategies against Algorithmic Discrimination under EU Law, Dr. Philipp Hacker, LL.M. (Yale), http://bit.ly/2BYX3sp
- Big Data's Disparate Impact, Solon Barocas & Andrew D. Selbst, http://bit.ly/2SyG5YZ