Researchers from the University of Washington have developed a new deep learning technology that lets them craft highly realistic videos by overlaying audio clips onto authentic reference videos. They were able to synchronize audio clips of former US president Barack Obama with four different video scenarios he appeared in.
Lip-syncing 'wild' video content to create synthetic but realistic videos
Previous attempts at syncing audio samples to video clips were easily spotted as fake and, most of the time, were creepy or unpleasant to watch. The new algorithm developed at the University of Washington, however, synchronizes audio and video clips smoothly, overcoming a common problem in creating realistic videos known as the uncanny valley. Supasorn Suwajanakorn, the lead author of the published paper, noted how complex it is to lip-sync video footage.
"People are particularly sensitive to any areas of your mouth that don’t look realistic. If you don’t render teeth right or the chin moves at the wrong time, people can spot it right away and it’s going to look fake. So you have to render the mouth region perfectly to get beyond the uncanny valley".
[Image Source: University of Washington]
Suwajanakorn and his team of researchers used a two-step technique to craft their highly realistic videos. First, they trained a neural network to process videos of a specific person and map the sounds in the audio to basic mouth shapes. They then used a technique from previous research by the UW Graphics and Imaging Laboratory to overlay and blend the resulting mouth shapes onto existing reference videos. Another trick was to permit a small time shift, which lets the neural network anticipate what the subject is about to say. Essentially, Suwajanakorn and his colleagues developed algorithms that can learn from videos found all across the internet, or as the researchers put it, found "in the wild".
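The time shift can be illustrated with a toy sketch. This is not the researchers' code: their system trains a neural network on real audio and video, whereas the snippet below uses made-up sound labels and mouth-shape labels, and assumes (purely for illustration) that the mouth always anticipates the next sound by exactly one frame. The names `make_lookahead_pairs`, `ambiguous_windows` and `SHAPE_FOR_SOUND` are hypothetical. It only shows why letting a model see slightly ahead in the audio can make the sound-to-mouth mapping less ambiguous.

```python
from collections import defaultdict


def make_lookahead_pairs(audio_frames, mouth_shapes, delay):
    """Pair the mouth shape at frame t with audio frames t .. t + delay."""
    if delay < 0:
        raise ValueError("delay must be non-negative")
    n = len(audio_frames) - delay  # frames that have a full look-ahead window
    pairs = []
    for t in range(n):
        window = tuple(audio_frames[t:t + delay + 1])
        pairs.append((window, mouth_shapes[t]))
    return pairs


def ambiguous_windows(pairs):
    """Count audio windows that appear with more than one mouth shape."""
    shapes_by_window = defaultdict(set)
    for window, shape in pairs:
        shapes_by_window[window].add(shape)
    return sum(1 for shapes in shapes_by_window.values() if len(shapes) > 1)


# Toy data (an assumption, not real measurements): the mouth at frame t
# already forms the shape of the sound at frame t + 1.
SHAPE_FOR_SOUND = {"a": "open", "m": "closed", "o": "round"}
audio = ["a", "a", "m", "a", "a", "o", "a", "m"]
mouth = [SHAPE_FOR_SOUND[s] for s in audio[1:]] + ["closed"]

no_shift = make_lookahead_pairs(audio, mouth, delay=0)
one_frame_shift = make_lookahead_pairs(audio, mouth, delay=1)

print(ambiguous_windows(no_shift))         # 2
print(ambiguous_windows(one_frame_shift))  # 0
```

Without the shift, the same sound shows up alongside several different mouth shapes, so no rule based on the current sound alone can be right every time. With a one-frame look-ahead, each audio window in this toy data corresponds to a single mouth shape.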
"There are millions of hours of video that already exist from interviews, video chats, movies, television programs and other sources. And these deep learning algorithms are very data hungry, so it’s a good match to do it this way", said the lead author.
Potential use of the deep learning technology
One of the researchers on the team has suggested a science-fiction-style application for the technology. Ira Kemelmacher-Shlizerman, an assistant professor at the University's School of Computer Science & Engineering, said that the new algorithm could be used for everyday tasks as well as in futuristic settings.
"Realistic audio-to-video conversion has practical applications like improving video conferencing for meetings, as well as futuristic ones such as being able to hold a conversation with a historical figure in virtual reality by creating visuals just from audio. This is the kind of breakthrough that will help enable those next steps".
The deep learning technology could also address a common problem in virtual communication: live video streams often lag and are frustrating to put up with, whereas audio is typically streamed in real time without lag.
"When you watch Skype or Google Hangouts, often the connection is stuttery and low-resolution and really unpleasant, but often the audio is pretty good", said Steve Seitz, co-author of the paper. "So if you could use the audio to produce much higher-quality video, that would be terrific", he added.
The team's technology could also be extended with algorithms capable of detecting whether a video is authentic or manufactured. The researchers are also looking to advance the technology so that it can learn an individual's voice and speech patterns from less data. Doing so would cut the amount of video needed for training to about an hour, down from around 14 hours.
Featured Image Source: Supasorn Suwajanakorn/YouTube