Picking out one signal hidden in a mass of noisy data while ignoring the others is a well-known and still largely unsolved problem for computer scientists working in speech and audio processing. Known as the cocktail party problem, the challenge is to build algorithms that can identify a specific voice and amplify it while dampening the cacophony of other voices, noises, and distortion produced by the environment; such algorithms have remained elusive to date.
Fortunately, scientists have a system to model to help them solve this problem: the human brain. Human beings are social animals, and as such our brains are highly evolved to isolate and focus on the voice of the person we are talking to, dampening and often entirely suppressing other voices and environmental noise so we can hear what that person is saying. Now, researchers are beginning to make significant progress both in understanding how the brain isolates and processes a specific voice among many and in developing new approaches to solving the problem.
The cocktail party effect
The cocktail party effect, as it is known, is the ability of the human brain to focus on a single voice in a crowd and isolate it from the surrounding environmental noise. Because it feels so effortless, it's easy to take the cocktail party effect for granted and not appreciate just how extraordinary a neurological process it is.
In a crowd, voices are disturbances in the surrounding air that collide and scatter off one another, making it difficult to hear any one voice unless it simply overpowers the rest, by yelling or something similar. Since that isn't an ideal solution to the cocktail party problem, our brains do something far more subtle instead.
In fractions of a second, our brain identifies and isolates the voice signal of the person we want to listen to and amplifies it. It then filters or masks all other voice signals and noise, suppressing them so that we can hear what a person is saying in most social circumstances.
Every day, our brains process an endless stream of sound, prioritizing it in fractions of a second. And just as they continually edit out the image of the bit of our nose that physically extends into our otherwise unobstructed field of vision, our brains amplify the sounds we are focusing on and suppress the lower-priority noise in the environment so that it functionally disappears.
But how exactly our brains achieve this incredible cocktail party effect remained a mystery for decades after the 'cocktail party problem' was first discussed by researchers in the 1950s. Fortunately, research from the past few years has shed light on how our brains identify and isolate these all-important voice signals in social settings, bringing us closer than ever to replicating the same process in a machine.
Segregation of different voice signals in the auditory cortex
The last decade has seen major improvements in our understanding of how humans identify and process speech and language. A pair of researchers supported by the US National Institute on Deafness and Other Communication Disorders published a remarkable paper in the journal Nature in 2012 showing not only how the brain filters and distinguishes between competing voice signals, but also that it is possible to predict, from neural activity alone, which words the subject was listening to.
Edward Chang, Ph.D., a neurosurgeon and an associate professor at the University of California at San Francisco (UCSF), initially wasn't looking to identify how humans achieve the cocktail party effect; he was treating patients with epilepsy. He implanted a sheet of 256 electrodes just underneath the skull of his patients to monitor the electrical activity in the outer layer of neurons of their temporal lobes.
Chang and Nima Mesgarani, Ph.D., a postdoctoral fellow at UCSF, realized that these patients presented them with a rare opportunity. Their equipment was sensitive enough to detect the firing of a single neuron, and because the intracranial electrodes could also monitor the auditory cortex, which is located in the temporal lobe, they could study how the brain processes sound in unprecedented detail.
Three volunteer subjects listened to two simultaneous audio recordings, one read by a woman and the other by a man, with instructions to listen for one of two specific target words that would begin the audio sample and then report what that voice said after those words. The researchers analyzed the readings from the electrodes using a decoding algorithm that could identify patterns and reconstruct what the subject heard. They found that the readings only picked up the pattern of the targeted speaker, meaning that the auditory cortex effectively ignores the non-target speaker.
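Decoders of this kind are typically built as regularized linear maps from multichannel neural recordings back to a spectrogram of the stimulus. The study's actual pipeline is not reproduced here; the sketch below is a toy version using entirely synthetic data and plain ridge regression, just to show the idea of reconstructing what was heard from distributed electrode activity.

```python
# Toy stimulus-reconstruction decoder (synthetic data, not the study's
# recordings): learn a ridge-regression map from "electrode" activity
# back to the spectrogram that produced it.
import numpy as np

rng = np.random.default_rng(0)
T, F, E = 500, 16, 64                    # time bins, frequency bands, electrodes

spec = rng.standard_normal((T, F))       # stand-in for the attended spectrogram
W_true = rng.standard_normal((F, E))     # unknown stimulus-to-neural mapping
neural = spec @ W_true + 0.1 * rng.standard_normal((T, E))  # noisy recordings

# Ridge regression: W_dec = (N'N + lam*I)^-1 N'S
lam = 1.0
W_dec = np.linalg.solve(neural.T @ neural + lam * np.eye(E), neural.T @ spec)
recon = neural @ W_dec                   # reconstructed spectrogram

# How closely does the reconstruction match what was actually "heard"?
r = np.corrcoef(recon.ravel(), spec.ravel())[0, 1]
print(f"reconstruction correlation: {r:.3f}")
```

With clean synthetic data the reconstruction is nearly perfect; with real cortical recordings the same approach yields noisier but still speaker-identifiable reconstructions.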
“A lot of people thought that the auditory cortex was just passing this information up to the cognitive part of the brain, the frontal cortex and the executive control areas, where it would be really processed,” said Chang. “What we found was that the auditory cortex is in and of itself pretty sophisticated. It’s as if it knows which sounds should be grouped together and only extracts those that are relevant to the single speaker.”
Even more remarkable is the fact that the decoding algorithm was able to predict which speaker the subject was listening to based on the neural activity alone, and that it could detect the moment the subject's attention shifted or strayed to the other speaker. What this tells us is that the auditory cortex holds the key to understanding how the human brain can deal with the cocktail party problem in a way that computers currently cannot.
Differentiating the voice from the sound
While a computer can decode the brain's neural activity and know exactly what the auditory cortex heard, that isn't enough to overcome the cocktail party problem on its own; we still need to know how the brain actually makes these distinctions, differentiating voice signals from other environmental noise in order to focus on the targeted voice.
Researchers at the University of Geneva, Switzerland (UNIGE) and the University of Maastricht in the Netherlands published a paper this summer in the journal Nature Human Behaviour that tried to get at the root mechanism of this process: how the brain processes the voices we hear and the words being spoken.
To do this, the researchers devised a collection of pseudowords--words that have no meaning--spoken by a trained phonetician at three different pitches. Subjects hearing the voice samples were then asked to perform specific auditory tasks: differentiating between different pitches of the same voice, or attending to the speech sounds themselves, known as phonemes.
“We created 120 pseudowords that comply with the phonology of the French language but that make no sense, to make sure that semantic processing would not interfere with the pure perception of the phonemes,” said Narly Golestani, a professor in the Psychology Section at UNIGE’s Faculty of Psychology and Educational Sciences (FPES) and a co-author of the paper.
Sanne Rutten, a researcher at UNIGE's FPES and a co-author of the paper, said that differentiating the speakers' voices needed to be as difficult as possible for the subjects in order to accurately study how the brain performs this auditory processing. “To make the differentiation of the voices as difficult as the differentiation of the speech sounds, we created the percept of three different voices from the recorded stimuli, rather than recording three actual different people.”
Before the test, the researchers analyzed how the voice sounds and phoneme sounds differ in acoustic parameters such as frequency--high or low--temporal modulation--the perceived speed of the spoken sound--and spectral modulation--the way the sound energy is distributed across the various frequencies. They determined that high spectral modulations were most useful for differentiating the voice samples, while low spectral modulations combined with fast temporal modulations were most useful for identifying differences between phonemes.
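Temporal and spectral modulation can both be read off the two-dimensional Fourier transform of a spectrogram, often called its modulation spectrum. The sketch below is a generic signal-processing illustration, not the study's analysis pipeline: two synthetic spectrograms, one rippling quickly in time but smoothly across frequency (phoneme-like) and one with fine structure across frequency but slow change in time (voice-like).

```python
# Illustration of temporal vs. spectral modulation (generic signal
# processing, not the study's pipeline): the 2D Fourier transform of a
# spectrogram separates how fast energy changes over time (temporal
# modulation) from how finely it is structured across frequency
# (spectral modulation).
import numpy as np

def peak_modulation(spectrogram):
    """Return (spectral, temporal) position of the strongest modulation
    component, measured in bins from the center of the 2D spectrum."""
    ms = np.abs(np.fft.fftshift(np.fft.fft2(spectrogram)))
    cf, ct = ms.shape[0] // 2, ms.shape[1] // 2
    f_mod, t_mod = np.unravel_index(np.argmax(ms), ms.shape)
    return abs(f_mod - cf), abs(t_mod - ct)

freqs = np.arange(64)[:, None]           # frequency bands (rows)
times = np.arange(128)[None, :]          # time frames (columns)

# Fast change over time, smooth across frequency (phoneme-like) vs.
# fine structure across frequency, slow change over time (voice-like).
phoneme_like = np.cos(2 * np.pi * 0.25 * times) * np.cos(2 * np.pi * 0.02 * freqs)
voice_like = np.cos(2 * np.pi * 0.02 * times) * np.cos(2 * np.pi * 0.25 * freqs)

print("phoneme-like (spectral, temporal):", peak_modulation(phoneme_like))
print("voice-like   (spectral, temporal):", peak_modulation(voice_like))
```

The phoneme-like spectrogram's energy concentrates at high temporal and low spectral modulation, and the voice-like one at the reverse, mirroring the acoustic distinction the researchers measured.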
During the test itself, subjects were asked either to identify three specific speech sounds--/p/, /t/, or /k/, as in the pseudowords preperibion, gabratade, and ecalimacre--or to identify whether a sample had been spoken by voice one, two, or three. Meanwhile, their brains were scanned by a functional magnetic resonance imaging (fMRI) machine, which monitors blood oxygenation--a highly effective way to identify which parts of the brain are most active, since active regions consume more oxygen than less active ones.
Using a computer model to analyze the fMRI results, the researchers found that the auditory cortex amplified the higher spectral modulations when tasked with differentiating voices; when asked to identify the specific phonemes in the samples, it instead focused on the faster temporal modulations and lower spectral modulations.
“The results show large similarities between the task information in the sounds themselves and the neural, fMRI data,” Golestani said.
This demonstrates that the auditory cortex processes the same sound differently depending on the specific task it is performing, revealing essential mechanisms of how we listen to people who are speaking to us and how our brains distinguish between different voices. “This is the first time that it’s been shown, in humans and using non-invasive methods, that the brain adapts to the task at hand in a manner that’s consistent with the acoustic information that is attended to in speech sounds,” said Rutten.
Solving the cocktail party problem with algorithms modeled on the auditory cortex
As our understanding of what goes on inside the auditory cortex grows and we discover more of the mechanics of the cocktail party effect, we can use these new insights to improve the way computer systems process the sound of the human voice. While natural language processing systems like Google's speech-to-text API are certainly powerful, their best algorithms for the cocktail party problem are still inadequate. It will be several years at least before neurological research on the auditory cortex yields the kind of breakthroughs that allow us to develop the right algorithms to reproduce the cocktail party effect in computers.
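One engineering family that borrows this amplify-and-suppress strategy is time-frequency masking: score each cell of the mixture's spectrogram by how much of its energy belongs to the target voice, then scale the cell accordingly. The sketch below uses synthetic tones standing in for two talkers and an oracle mask computed from the known sources; real separation systems must estimate the mask from the mixture alone, typically with a trained neural network.

```python
# Minimal time-frequency masking demo: suppress spectrogram cells
# dominated by the interfering source, much as the auditory cortex
# suppresses non-target speech. All signals here are synthetic.
import numpy as np

n, sr, win = 4096, 8000, 256
t = np.arange(n) / sr
target = np.sin(2 * np.pi * 220 * t)     # stand-in for the attended voice
masker = np.sin(2 * np.pi * 1400 * t)    # stand-in for a competing voice
mix = target + masker

def stft(x):
    # Non-overlapping rectangular frames: crude, but exactly invertible.
    return np.fft.rfft(x.reshape(-1, win), axis=1)

def istft(X):
    return np.fft.irfft(X, n=win, axis=1).ravel()

TGT, MSK, MIX = stft(target), stft(masker), stft(mix)

# Ratio mask: the fraction of each time-frequency cell's energy that
# belongs to the target. Cells dominated by the masker are attenuated.
mask = np.abs(TGT) ** 2 / (np.abs(TGT) ** 2 + np.abs(MSK) ** 2 + 1e-12)
est = istft(mask * MIX)

def snr_db(ref, x):
    return 10 * np.log10(np.sum(ref ** 2) / np.sum((ref - x) ** 2))

print(f"SNR before: {snr_db(target, mix):.1f} dB, "
      f"after: {snr_db(target, est):.1f} dB")
```

Even this oracle version makes the point: the hard part of the cocktail party problem is not applying the mask but estimating it, which is exactly the selection the auditory cortex performs so effortlessly.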
Until then, voice-controlled computer interfaces like those seen on Star Trek will remain out of reach. But research into the auditory cortex shows a lot of promise, and the data gleaned so far from neurological studies suggests that further study of this region of the brain will likely reveal neurological mechanisms that are essential for developing efficient algorithms for the cocktail party problem.