Date: 2023-03-07

What is the Universal Speech Model?

In November 2022, Google unveiled its 1,000 Languages Initiative, a commitment to build a machine learning model that supports the world's one thousand most spoken languages and so brings greater inclusion to billions of people around the globe.

According to Google's blog post, the Universal Speech Model (USM) is a family of speech models with two billion parameters, trained on 12 million hours of speech and 28 billion sentences of text. The model currently covers a little over 300 languages and is already in use in Google's products, such as YouTube.

If you have relied on automatically generated captions while watching a YouTube video in a language you are not familiar with, it is the USM's Automatic Speech Recognition (ASR) that made the content easier to understand. Google researchers Yu Zhang and James Qin further elaborated on how the machine learning model was trained.

The researchers state that the fundamental difficulty in training a model such as USM is access to enough data. In a conventional supervised learning approach, the audio data must be manually labeled or paired with a pre-existing transcription. Depending on the language and how well it is represented, such labeled data is expensive, time-consuming to produce, or hard to find.

Figure: USM's overall training pipeline (Image: Google Research)

Google instead used a self-supervised learning approach that leverages audio-only data, which is available in large quantities across languages and is therefore easier to scale. After self-supervised learning on audio, Google put the model through a second step in which its quality and language coverage were improved using text data, and then fine-tuned it on downstream tasks such as ASR.
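
The three steps described above can be laid out as a small Python sketch. It only records the order of the stages and the kind of data each one consumes, as the article describes them. The names `Stage`, `PIPELINE`, and `describe` are hypothetical, chosen for illustration, and are not taken from Google's code. No actual training happens here.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Stage:
    """One step of the training recipe (illustrative names only)."""
    name: str
    data: str  # kind of data the stage consumes, per the article


# Stage order follows the description in the text; identifiers are assumptions.
PIPELINE = (
    Stage("self_supervised_learning", "audio-only data"),
    Stage("quality_and_coverage_step", "text data"),
    Stage("fine_tuning", "downstream task data, such as ASR"),
)


def describe(pipeline):
    """Return one human-readable line per stage, in training order."""
    return [
        f"{index}. {stage.name}: {stage.data}"
        for index, stage in enumerate(pipeline, start=1)
    ]


if __name__ == "__main__":
    print("\n".join(describe(PIPELINE)))
```

The ordering is the point of the sketch: the stage that needs no transcripts comes first and draws on the most plentiful data, while task-specific data is only needed at the end.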