Algorithms Are Almost Fluent in Human Speech, so Why Are They So Biased?
Voice recognition software is everywhere. In 2020, almost two-thirds of Americans reported using some type of voice-operated automated assistant. It’s no surprise that these virtual helpers run on artificial intelligence: they behave like “people” that listen and respond to commands.
Voice recognition falls under the umbrella of natural language processing, a field of computer science that focuses on training AI and computers to identify and respond to the spoken and written word.
But natural language processing isn’t quite as artificial as the name may imply. Many of its models are loosely inspired by the human brain.
Billions of neurons run throughout the nervous system, along the spinal cord and through the nooks and crannies of the brain. These neurons carry messages between locations, and they meet at synapses. A synapse passes a message from one neuron to the next by stimulating the target neuron, the next step on the message’s journey.
NLP’s “nervous system” is remarkably similar. The “map” of an artificial neural network looks like a web: thousands of circles connected by lines, which connect to more circles, which connect to more lines, and so on. Here, each circle is an artificial neuron that receives signals, called inputs, applies a mathematical transformation to them, and passes along an output. The lines play the role of “synapses”: each connection carries a weight that controls how strongly one neuron’s output influences the next, typically through a weighted sum. The information travels along this path of neurons and synapses until it reaches the end, generating a final output.
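The neuron-and-synapse picture can be sketched in a few lines of Python. This is a minimal illustration, not how any production speech system is built: the sigmoid squashing function, the weights, and the names `neuron` and `layer` are all assumptions chosen for the example.

```python
import math


def neuron(inputs, weights, bias):
    """One artificial neuron: weighted sum of inputs, then a squashing function."""
    # Each weight plays the role of a "synapse" strength (illustrative values only).
    total = sum(x * w for x, w in zip(inputs, weights)) + bias
    # Sigmoid activation (an assumption; real networks use many different functions).
    return 1.0 / (1.0 + math.exp(-total))


def layer(inputs, weight_rows, biases):
    """A layer is several neurons that all read the same inputs."""
    return [neuron(inputs, w, b) for w, b in zip(weight_rows, biases)]


# A toy two-step path: two inputs -> two hidden neurons -> one output neuron.
hidden = layer([0.5, -1.0], [[0.8, 0.2], [-0.4, 0.9]], [0.0, 0.1])
output = neuron(hidden, [1.0, -1.0], 0.0)
print(hidden, output)
```

Training a network amounts to adjusting those weights until the final output matches the examples in the dataset, which is why the contents of the dataset matter so much.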
It’s all remarkably human — too human even, because just like humans, NLP often falls victim to bias.
In humans, auditory bias can come in many forms. For example, confirmation bias occurs when we hear only what we want to hear, picking out the details that fall in line with our beliefs. Anchoring bias occurs when the first piece of information we hear changes how we perceive the rest, as in bargaining, when the starting price sets the stage for the rest of the deal.
Bias in how we hear and process sound goes far deeper, though, into territories involving racism, sexism, and xenophobia. A 2010 study on accents showed that we judge individuals more on how they speak than on how they look. This tendency for accents to creep into our impressions of a person has rather dramatic consequences in the real world. One study found that, in phone interviews, people with Chinese-, Mexican-, and Indian-accented English were actively discriminated against by managers, while individuals with British-accented English were treated the same as, and at times better than, American-accented individuals.
NLP systems, like humans, tend to have biases in favor of certain accents and against others. One study, “Gender and Dialect Bias in YouTube’s Automatic Captions,” measured the accuracy of YouTube’s caption system, which runs on NLP, to assess whether it is biased in captioning English dialects. The study took advantage of a popular trend, known as the Accent Challenge, in which individuals from different parts of the world read a list of predetermined words, anything from “avocado” to “Halloween.” The results showed that speakers of Scottish and New Zealand dialects had statistically significantly higher word error rates (WER), the share of words the system transcribes incorrectly, indicating that the captioning system has a degree of bias against these populations.
The study went a step further. It investigated the impact of gender on the word error rate. While the algorithm incorrectly identified the men’s speech roughly 40% of the time, it incorrectly identified more than 50% of the women’s speech. Depending on the accent, discrepancies between female and male speech could be as high as 30%.
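Word error rate is conventionally computed as the number of word substitutions, deletions, and insertions needed to turn the system's transcript into the reference, divided by the number of words in the reference. The sketch below implements that standard definition; the article does not describe the study's exact scoring procedure, so the function name `word_error_rate`, the lowercasing step, and the sample transcript are assumptions for illustration.

```python
def word_error_rate(reference, hypothesis):
    """WER = (substitutions + deletions + insertions) / words in the reference."""
    ref = reference.lower().split()
    hyp = hypothesis.lower().split()
    if not ref:
        raise ValueError("reference must contain at least one word")

    # prev[j] holds the edit distance between the reference words seen so far
    # and the first j hypothesis words (classic dynamic programming).
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i]
        for j, hyp_word in enumerate(hyp, start=1):
            cost = 0 if ref_word == hyp_word else 1
            curr.append(min(
                prev[j] + 1,         # reference word missing from the transcript
                curr[j - 1] + 1,     # extra word inserted by the transcript
                prev[j - 1] + cost,  # match or substitution
            ))
        prev = curr
    return prev[-1] / len(ref)


# Made-up example: the captioner hears "hollow in" instead of "Halloween".
print(word_error_rate("avocado Halloween water", "avocado hollow in water"))
```

In the made-up example, one substitution plus one insertion against a three-word reference gives a WER of 2/3. Because insertions count as errors, WER can exceed 1.0 on badly garbled transcripts.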
Gender bias in NLP goes far beyond word misidentification. Word embedding is an NLP technique that represents each word as a point in space, so that words with similar meanings sit close together. Picture a field filled with scattered points, each one standing for a word. For example, “dinner” and “lunch” may be located close together on a plane, while “shoe” would be farther away. A 2016 paper used such an embedding to investigate common word associations with gender. For “he” (the identifier used by the group to designate males), the four jobs most strongly associated with men were maestro, skipper, protégé, and philosopher, respectively.
For women, the four most strongly associated jobs were homemaker, nurse, receptionist, and librarian.
The team also used the word embeddings to generate analogies, the famous “a is to b as x is to y” questions from far too many SAT prep classes. Among the biased analogies the embedding generated were “father is to doctor as mother is to nurse” and “man is to computer programmer as woman is to homemaker.” The data used to create the word embedding was derived from Google News articles, suggesting that these articles perpetuate outdated gender stereotypes and roles. These patterns reflect a disappointing trend within NLP. Computers are learning archaic human biases: that women are homemakers and the submissive sex, while men are the innovative breadwinners.
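Both ideas, nearby points for related words and analogies by vector arithmetic, can be shown with a toy example. The three-dimensional vectors below are invented by hand for illustration, with the gender skew deliberately built in. They are not taken from the 2016 paper or from any real embedding, and `cosine` and `complete_analogy` are hypothetical helper names.

```python
import math

# Hand-made toy vectors (assumption): axis 0 ~ gender, axis 1 ~ medical job, axis 2 ~ meals.
TOY_VECTORS = {
    "father": [1.0, 0.0, 0.0],
    "mother": [-1.0, 0.0, 0.0],
    "doctor": [0.9, 1.0, 0.0],   # skewed toward "father" on purpose
    "nurse": [-0.9, 1.0, 0.0],   # skewed toward "mother" on purpose
    "dinner": [0.0, 0.0, 1.0],
    "lunch": [0.0, 0.1, 1.0],
    "shoe": [0.1, -0.2, -1.0],
}


def cosine(u, v):
    """Cosine similarity: close to 1 for vectors pointing the same way."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def complete_analogy(a, b, c, vectors):
    """Answer 'a is to b as c is to ?' using the offset b - a + c."""
    target = [vb - va + vc for va, vb, vc in zip(vectors[a], vectors[b], vectors[c])]
    candidates = [w for w in vectors if w not in (a, b, c)]
    return max(candidates, key=lambda w: cosine(target, vectors[w]))


print(cosine(TOY_VECTORS["dinner"], TOY_VECTORS["lunch"]))  # high: related words
print(cosine(TOY_VECTORS["dinner"], TOY_VECTORS["shoe"]))   # low: unrelated words
print(complete_analogy("father", "doctor", "mother", TOY_VECTORS))
```

With these invented vectors the analogy resolves to “nurse” only because the skew was placed there by hand. In a real embedding nobody writes the skew in; it is learned from the patterns in the training text.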
Racism is another prevalent issue in the world of biased NLP. In “Racial disparities in automated speech recognition,” a research team compared how five state-of-the-art automatic speech recognition (ASR) systems performed for white and Black speakers. The study examined some of the most common ASR technologies in use today, developed by Amazon, Apple, Google, IBM, and Microsoft.
Every one showed statistically significant racial disparity.
The average word error rate for white subjects was 0.19, while the word error rate among Black subjects was 0.35, almost twice as high. For Apple, the worst-performing ASR, the word error rate was 0.45 for Black individuals, but just 0.23 for white individuals.
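The “almost twice as high” comparison is a simple ratio of the figures quoted above. The short sketch below only redoes that arithmetic; the dictionary layout and the name `disparity_ratio` are illustrative assumptions.

```python
# Word error rates as quoted in the article (white speakers, Black speakers).
REPORTED_WER = {
    "average of five systems": (0.19, 0.35),
    "Apple": (0.23, 0.45),
}


def disparity_ratio(wer_black, wer_white):
    """How many times higher the error rate is for Black speakers."""
    return wer_black / wer_white


for label, (wer_white, wer_black) in REPORTED_WER.items():
    print(f"{label}: {disparity_ratio(wer_black, wer_white):.2f}x")
# average of five systems: 1.84x
# Apple: 1.96x
```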
The study points to African American Vernacular English (AAVE) as part of the reason for the discrepancy. Many speech databases do not include an adequate share of AAVE speech samples, even though it is a recognized English dialect with millions of native speakers.
African American Vernacular English was born out of slavery. When people were kidnapped and sold into slavery, they were often separated from others who spoke similar languages and dialects and forced to work on plantations alongside people with whom they had difficulty communicating. Two theories have emerged to explain the formation of AAVE: the dialect hypothesis and the Creole hypothesis. The dialect hypothesis proposes that enslaved people came into contact with southern whites and learned English out of necessity, creating a branch that later became AAVE. The Creole hypothesis suggests that the dialect’s formation was more of a melting pot: West African languages and English combined into a Creole language that then converged with Standard English to form AAVE.
Today, AAVE remains highly scrutinized. Some people call it “broken,” “lazy,” and ungrammatical, closely associating it with poor education and a lack of linguistic knowledge. AAVE’s negative connotations are rooted in racism. African American Vernacular English is, by definition, overwhelmingly spoken by African Americans, a group that has historically been stereotyped and exploited. The gap in NLP performance between white and Black speakers perpetuates the idea that AAVE is a “lesser-than” dialect or a sign of “lower education.” Yet AAVE is a recognized dialect of English, and it has developed over centuries to have its own distinct grammar, slang, and syntax, the facets of any “valid” language.
Language is constantly evolving. The benefit of living languages is that they regularly update and adapt to incorporate new ideas, technologies, and innovations, or to make sure we understand the latest slang from our favorite TikTok videos. Our AI needs to adapt along with them. It is humans who choose the words, sentence structures, and speech samples that go into our datasets. Unlike humans, our AI-based natural language processing systems don’t have hundreds or even thousands of years of socialized bias to overcome. They can be adjusted by improving and expanding their datasets, which means we can train NLP systems to break language bias faster than the planet’s almost 8 billion people can break it organically.
So what will it take to incorporate more diverse datasets into our constantly evolving NLP systems?
This article is part of a series on bias in artificial intelligence.