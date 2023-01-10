Major tech companies such as Google, Meta, and Microsoft have also been working in this space to make their products more accessible. However, these products are not designed to mimic a user's voice. Getting them to do so takes countless hours of training, and the results still come off poorly.

VALL-E's mind-boggling capabilities

Conventionally, TTS generators rely on manipulating waveforms to synthesize speech. VALL-E, on the other hand, generates discrete audio codec codes from text and audio prompts, then uses those codes to predict how the voice would sound if it spoke other phrases.
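
To make the idea of discrete audio codes concrete, here is a minimal Python sketch. It is not VALL-E's codec: it swaps in classic 8-bit mu-law companding as a stand-in, an assumption made purely for illustration, and the function names `encode_mu_law` and `decode_mu_law` are hypothetical. It only shows how a continuous waveform can become a sequence of integer tokens that a model could predict one after another.

```python
import math

MU = 255  # 8-bit mu-law: 256 possible discrete codes (illustrative choice)


def encode_mu_law(sample: float) -> int:
    """Map a waveform sample in [-1.0, 1.0] to an integer code in [0, 255]."""
    sample = max(-1.0, min(1.0, sample))  # clip out-of-range samples
    compressed = math.copysign(
        math.log1p(MU * abs(sample)) / math.log1p(MU), sample
    )
    return int(round((compressed + 1) / 2 * MU))


def decode_mu_law(code: int) -> float:
    """Map an integer code back to an approximate waveform sample."""
    compressed = 2 * code / MU - 1
    return math.copysign(
        math.expm1(abs(compressed) * math.log1p(MU)) / MU, compressed
    )


# A short 440 Hz sine tone sampled at 16 kHz stands in for recorded speech.
sample_rate = 16_000
waveform = [
    0.5 * math.sin(2 * math.pi * 440 * n / sample_rate) for n in range(160)
]

codes = [encode_mu_law(s) for s in waveform]       # discrete tokens
reconstructed = [decode_mu_law(c) for c in codes]  # approximate waveform
max_error = max(abs(a - b) for a, b in zip(waveform, reconstructed))
```

The list `codes` holds only integers between 0 and 255, yet decoding it recovers the tone with a small error. A neural codec plays a similar role at far higher quality, which is what lets speech generation be treated as a token-prediction problem.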

The research team claims that the audio prompt can be as short as three seconds and still be sufficient for VALL-E to do its job. This makes VALL-E a zero-shot TTS generator, meaning the software can work with voices it never encountered during training.
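
For a sense of how little data a three-second prompt is, the sketch below counts the discrete codes it would produce. The frame rate, number of codes per frame, and sample rate are assumptions chosen for illustration, not figures stated in this article, and `prompt_size` is a hypothetical helper.

```python
FRAMES_PER_SECOND = 75    # assumed codec frame rate
NUM_CODEBOOKS = 8         # assumed number of discrete codes per frame
SAMPLE_RATE_HZ = 24_000   # assumed audio sample rate


def prompt_size(seconds: float) -> dict:
    """Return raw-sample, frame, and code counts for a prompt of given length."""
    frames = round(seconds * FRAMES_PER_SECOND)
    return {
        "raw_samples": round(seconds * SAMPLE_RATE_HZ),
        "frames": frames,
        "codes": frames * NUM_CODEBOOKS,
    }


three_second_prompt = prompt_size(3.0)
```

Under these assumptions, three seconds of audio amounts to 72,000 raw samples but only 225 frames, or 1,800 discrete codes in total.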

Interestingly, VALL-E was trained on LibriLight, an audio library assembled by Meta that contains nearly 60,000 hours of English-language speech drawn from public-domain LibriVox audiobooks.

What VALL-E successfully does is match the three-second audio sample to the voice of one of the 7,000 speakers it was trained on, and then deliver the text in a voice similar to the one in the training data, producing an accurate imitation.

Microsoft claims that VALL-E can not only simulate a voice in a particular acoustic environment, such as a phone call, but also deliver the speech with the emotion used in the speaker prompt, making the result much more personalized and natural.