Microsoft unveils VALL-E, a text-to-speech AI that can be trained in just 3 seconds

Is AI coming for voice artists now?
Ameya Paleja

Researchers at technology major Microsoft have unveiled their latest text-to-speech (TTS) generator, VALL-E, which can learn to mimic anybody's voice from just three seconds of audio. Unlike previous voice generators that sounded robotic, VALL-E sounds naturally human, and that may not be a very good thing.

Text-to-speech generators that gave voice to one of the greatest minds on the planet, Stephen Hawking, have come a long way. From reading messages on your smartphone to reading out pages from a book, these services are now everywhere and used by everyone.

Major tech companies such as Google, Meta, and Microsoft have also been working in this space to make their products more accessible. However, these products are not aimed at mimicking a user's voice; they need countless hours of training to do so, and even then the results come off poorly.

VALL-E's mind-boggling capabilities

Conventionally, TTS generators rely on manipulating waveforms to synthesize speech. VALL-E, on the other hand, generates discrete audio codec codes from text and audio prompts and uses them, together with what it has learned from training, to work out how the voice would sound if it spoke other phrases.
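To make the idea of "discrete audio codes" concrete, here is a minimal sketch that turns a continuous waveform into a sequence of integer codes and back. It uses classic mu-law quantization as a simple stand-in and is not the codec VALL-E uses; the function names, the 440 Hz test tone, and the 256-level setting are all assumptions chosen for illustration.

```python
import math

def mu_law_encode(sample: float, levels: int = 256) -> int:
    """Map a sample in [-1, 1] to an integer code in [0, levels - 1]."""
    mu = levels - 1
    clipped = max(-1.0, min(1.0, sample))
    # Compress the amplitude logarithmically, keeping the sign.
    compressed = math.copysign(
        math.log1p(mu * abs(clipped)) / math.log1p(mu), clipped
    )
    # Shift from [-1, 1] to [0, mu] and round to the nearest integer code.
    return int(round((compressed + 1) / 2 * mu))

def mu_law_decode(code: int, levels: int = 256) -> float:
    """Approximate inverse of mu_law_encode."""
    mu = levels - 1
    compressed = 2 * code / mu - 1
    return math.copysign(
        math.expm1(abs(compressed) * math.log1p(mu)) / mu, compressed
    )

def tone(freq_hz: float, sample_rate: int, n_samples: int) -> list[float]:
    """Generate a half-amplitude sine wave as a toy 'voice' signal."""
    return [
        0.5 * math.sin(2 * math.pi * freq_hz * i / sample_rate)
        for i in range(n_samples)
    ]

# Hypothetical input: 10 ms of a 440 Hz tone sampled at 16 kHz.
waveform = tone(440.0, 16000, 160)
codes = [mu_law_encode(s) for s in waveform]        # discrete token sequence
reconstructed = [mu_law_decode(c) for c in codes]   # back to a waveform
max_error = max(abs(a - b) for a, b in zip(waveform, reconstructed))
```

Once audio is a sequence of integers like `codes`, a model can predict the next code much as a text model predicts the next word, which is the general shift in approach described above.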

The research team claims that the audio prompt, in this case, could be as short as three seconds, and that would be sufficient for VALL-E to do its job. This makes VALL-E a zero-shot TTS generator, meaning the software can work with voices that it has not observed during training.

Interestingly, VALL-E's training was conducted using LibriLight, an audio library put together by Meta that contains nearly 60,000 hours of English-language speech from LibriVox audiobooks, which are available in the public domain.

What VALL-E does is match the three-second audio sample to the voice of one of the 7,000 people it was trained on and then deliver the text in a voice similar to that one from the training data, producing an accurate imitation.

Microsoft claims that VALL-E can not only simulate a voice in a given acoustic environment, such as a phone call, but also deliver the speech in accordance with the emotion used in the speaker prompt, making it much more personalized and natural.

What it could lead to

While this is a great leap for technology, it is not very surprising. This might be because it comes close on the heels of ChatGPT's success, where the algorithm can churn out essays for college students and might as well have written this piece if it weren't so occupied.

OpenAI's other product, DALL-E, can dish out images in response to text prompts, and now Microsoft's technology could revive a long-deceased actor's voice in a future movie. The bottom line of these technologies appears to be the ability to save money for companies, which could get the job done by paying a fraction of what they pay a human.

However, the technology could also be used to spoof another human by making a distress call or accessing sensitive information that is locked behind voice-enabled passwords. Microsoft may currently be holding the keys to avoid such manipulation, but as we have seen with AI tech before, it does not take very long for it to be copied and applied for a nefarious purpose.