Whisper is an open-source system for automatic speech recognition on the fly
OpenAI, the AI research organization that aims to develop and direct artificial intelligence (AI) to benefit all of humanity, has open-sourced Whisper. Whisper is an automatic speech recognition system that OpenAI says will enable “robust” transcription in multiple languages. It can also translate speech in those languages into English automatically.
Automatic speech recognition (ASR) has long been a challenge for AI and machine learning. With Whisper, OpenAI is taking a step in a positive direction.
Many variations on a theme
Highly capable speech recognition systems are nothing new; they sit at the heart of software and services from technology giants including Google, Meta, and Amazon. What makes Whisper different, according to OpenAI, is that it was trained on 680,000 hours of multilingual and multitask data collected from the web.
OpenAI says this led to improved recognition of distinctive accents, speech over background noise, and technical terminology and jargon.
“The primary intended users of Whisper models are AI researchers, studying robustness, capabilities, biases, generalization, and constraints of the current model. However, Whisper is also potentially useful as an automatic speech recognition solution for developers, especially for English speech recognition,” OpenAI said in the notes that accompany Whisper in its GitHub repo. Anyone can download Whisper from GitHub; it is entirely free to use.
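For developers who want to try it, the following is a minimal sketch of how transcription and translation into English might be run with the open-source `whisper` Python package. It assumes the package is installed (`pip install openai-whisper`) and that ffmpeg is available; the helper names `whisper_options` and `run_whisper` and the file name `interview.mp3` are hypothetical and used only for illustration.

```python
# Sketch only: assumes `pip install openai-whisper` and ffmpeg on the PATH.
# The function names and the audio file name are illustrative, not part of Whisper.


def whisper_options(translate_to_english: bool) -> dict:
    """Build the keyword options passed to model.transcribe()."""
    # "transcribe" keeps the spoken language; "translate" outputs English text.
    return {"task": "translate" if translate_to_english else "transcribe"}


def run_whisper(audio_path: str,
                translate_to_english: bool = False,
                model_size: str = "base") -> str:
    """Load a Whisper model and return the recognized text for one audio file."""
    # Imported lazily so that whisper_options() works without the package.
    import whisper

    model = whisper.load_model(model_size)  # downloads the weights on first use
    result = model.transcribe(audio_path, **whisper_options(translate_to_english))
    return result["text"].strip()
```

Calling `run_whisper("interview.mp3")` would return a transcript in the spoken language, while `run_whisper("interview.mp3", translate_to_english=True)` would return an English translation. Larger model sizes are generally slower but more accurate.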
The models are showing strength
Also in the repo, OpenAI wrote, “The models show strong ASR results in about 10 languages. They may exhibit additional capabilities, if fine-tuned on certain tasks, like voice activity detection, speaker diarization, and speaker classification, but have not been robustly evaluated in these areas.”
There are some limits
Whisper has limitations in particular areas, such as text prediction. The system was trained on a great deal of “noisy” data, so OpenAI cautions that Whisper might include words in its transcriptions that weren’t actually spoken. This may be because the model is trying to predict the next word in the audio while also trying to transcribe the audio itself.
Furthermore, Whisper doesn’t perform equally well across languages. The system has a higher error rate for speakers of languages that aren’t well represented in the training data.
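To make “error rate” concrete: ASR systems are commonly scored by word error rate (WER), the number of word substitutions, deletions, and insertions needed to turn the system's output into the reference transcript, divided by the number of words in the reference. The article does not say exactly how the figures it mentions were computed, so the sketch below is only an illustration of the standard metric; the function name `word_error_rate` is a hypothetical helper.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """Word-level edit distance between two transcripts, divided by reference length."""
    ref = reference.lower().split()
    hyp = hypothesis.lower().split()
    if not ref:
        raise ValueError("reference must contain at least one word")

    # prev[j] holds the edit distance between the reference words seen so far
    # (before the current one) and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i]  # i deletions turn i reference words into an empty hypothesis
        for j, hyp_word in enumerate(hyp, start=1):
            substitution = prev[j - 1] + (0 if ref_word == hyp_word else 1)
            deletion = prev[j] + 1
            insertion = curr[j - 1] + 1
            curr.append(min(substitution, deletion, insertion))
        prev = curr
    return prev[-1] / len(ref)


# One wrong word out of four gives a WER of 0.25.
print(word_error_rate("turn the lights off", "turn the light off"))
```

Words that the system adds but that were never spoken count as insertions, so the invented words OpenAI warns about raise the WER just as misheard words do.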
There is the ever-present problem of racial bias
Unfortunately, this is nothing new in the world of ASR. Biases have long plagued even the best systems. A 2020 Stanford study found that ASR systems from big tech companies, including Amazon, Apple, Google, Microsoft, and IBM, made far fewer errors with users who were white (an error rate of about 19%) than with users who were Black.
Despite the problems, Whisper thrives
Despite these problems, OpenAI sees Whisper’s transcription capabilities as an overall improvement on existing accessibility tools.
The company goes on to say on GitHub, “While Whisper models cannot be used for real-time transcription out of the box, their speed and size suggest that others may be able to build applications on top of them that allow for near-real-time speech recognition and translation. The real value of the beneficial applications built on top of Whisper models suggests that the disparate performance of these models may have real economic implications.”
The release of Whisper isn’t necessarily indicative of OpenAI’s future plans. The company is also focused on its more commercial efforts, DALL-E 2 and GPT-3. It continues to pursue purely theoretical research as well, including AI systems that learn by observing videos.