Riffusion tweaks Stable Diffusion to make AI text-to-image spectrograms play audio

Tweaks to the system fine-tune the spectrogram images it generates
Stephen Vicinanza
Sound waves


Stable Diffusion has been tweaked so that its AI routines fine-tune images of spectrograms paired with text, allowing the system to generate more precise sounds. The team calls its version of the Stable Diffusion model Riffusion.

There is also audio processing, but that happens later in the cycle, downstream of the model.

All the Stable Diffusion features remain

The tweaks enable the system to produce unlimited variations of a prompt, accomplished by varying the seed. The out-of-the-box features are all still there, including img2img, negative prompts, and interpolation.
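As a rough sketch of why varying the seed produces variations (the function name and latent shape below are illustrative, not Riffusion's actual code), a diffusion model's starting noise is determined entirely by the seed:

```python
import numpy as np

# Hypothetical sketch: in a diffusion model, the random seed fixes the
# initial noise latent, so the same prompt with a different seed
# denoises into a different output.
def initial_latent(seed: int, shape=(4, 64, 64)) -> np.ndarray:
    """Draw the starting Gaussian noise for a given seed."""
    return np.random.default_rng(seed).normal(size=shape)

a = initial_latent(42)
b = initial_latent(43)  # a different seed gives different starting noise
```

The same seed always reproduces the same latent, which is why a seed plus a prompt pins down a particular output.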

What is a spectrogram?

Spectrograms are visual representations of audio, such as someone singing or talking. In an audio spectrogram, the sound is mapped onto a graph: the x-axis is time and the y-axis is frequency.

Each pixel's position in the image gives a time (its column) and a frequency (its row), and its color gives the amplitude of that frequency at that moment.
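A minimal sketch of that pixel mapping (the function name and log scaling are my own choices, not Riffusion's):

```python
import numpy as np

def spectrogram_to_image(spec: np.ndarray) -> np.ndarray:
    """Map a magnitude spectrogram to 8-bit grayscale pixels.

    Rows are frequency bins, columns are time frames, and pixel
    intensity encodes amplitude.
    """
    log_spec = np.log1p(spec)              # log scale keeps quiet sounds visible
    lo, hi = log_spec.min(), log_spec.max()
    scaled = (log_spec - lo) / (hi - lo)   # normalize to [0, 1]
    return (scaled * 255).astype(np.uint8)

# A toy spectrogram: 4 frequency bins by 5 time frames.
spec = np.abs(np.random.default_rng(0).normal(size=(4, 5)))
img = spectrogram_to_image(spec)
```

Reversing this mapping is what lets a generated image be read back as audio data.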

Stable Diffusion uses a short-time Fourier transform (STFT) to compute the spectrogram image. The STFT approximates the sound as a series of sine waves with various phases and amplitudes.

The STFT captures how the frequency and phase content of local sections of a signal change over time. These variations can be computed, inverted, and displayed as a spectrogram.
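For illustration, a magnitude spectrogram can be computed with SciPy's STFT (the sample rate and window length here are arbitrary, not the values Riffusion uses):

```python
import numpy as np
from scipy.signal import stft

# One second of a 440 Hz sine wave sampled at 22,050 Hz.
sr = 22050
t = np.arange(sr) / sr
audio = np.sin(2 * np.pi * 440 * t)

# Short-time Fourier transform: rows are frequencies, columns are time frames.
freqs, times, Z = stft(audio, fs=sr, nperseg=1024)
spectrogram = np.abs(Z)  # keep the magnitude; the phase lives in np.angle(Z)

# The strongest frequency bin should land near 440 Hz.
peak_freq = freqs[spectrogram.mean(axis=1).argmax()]
```

Taking `np.abs(Z)` is exactly the step that throws the phase away, which is why reconstruction needs a phase-recovery algorithm later.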

How a spectrogram becomes sound

Stable Diffusion's model captures the amplitudes of the sine waves, but not the phases of the audio. This is due in large part to the chaotic nature of phase: its shifts are hard for the AI to learn.


The model instead uses an algorithm called Griffin-Lim to approximate the phases when the spectrogram is reconstructed into an audio clip.

Once the spectrogram is generated, it is converted to audio downstream in the system.
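A bare-bones version of Griffin-Lim phase recovery can be built on SciPy's STFT pair. This is a sketch of the algorithm's core idea, not Riffusion's implementation; in practice, libraries such as librosa ship tuned versions:

```python
import numpy as np
from scipy.signal import stft, istft

def griffin_lim(magnitude: np.ndarray, n_iter: int = 32,
                fs: int = 22050, nperseg: int = 1024) -> np.ndarray:
    """Recover audio from a magnitude-only spectrogram.

    Each iteration inverts the spectrogram with the current phase
    estimate, re-analyzes the result, keeps the new phases, and clamps
    the magnitudes back to the target.
    """
    rng = np.random.default_rng(0)
    phases = np.exp(2j * np.pi * rng.random(magnitude.shape))
    for _ in range(n_iter):
        _, audio = istft(magnitude * phases, fs=fs, nperseg=nperseg)
        _, _, Z = stft(audio, fs=fs, nperseg=nperseg)
        phases = np.exp(1j * np.angle(Z))
    _, audio = istft(magnitude * phases, fs=fs, nperseg=nperseg)
    return audio

# Target magnitudes from a real signal, then reconstruct without its phase.
sr = 22050
tone = np.sin(2 * np.pi * 440 * np.arange(sr) / sr)
_, _, Z = stft(tone, fs=sr, nperseg=1024)
rebuilt = griffin_lim(np.abs(Z), n_iter=8)
```

Because only the magnitudes are constrained, the reconstruction is an approximation of the original waveform rather than an exact inverse.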

Diffusion models for image-to-image generation

It is possible to condition the diffusion model's creations on both text and images. The Stable Diffusion team states that this is a useful way to modify sounds while still keeping the structure of the original sound clip.

The denoising strength parameter controls how far the new clip can deviate from many different types of original clips. In these spectrogram images, you can isolate particular sounds, much like working with separate tracks on a recording tape, all generated from one original image.

Generating long audio clips

The shorter audio clips generated from spectrograms are exciting, but the real goal of the system is to generate AI audio of unlimited length, such as songs and instrumentals.

If the AI generates 100 separate clips, they cannot simply be concatenated, because they will differ in tempo, key, tonal quality, and rhythm. The way around this problem is to start from a single image.

Taking one image, image-to-image denoising generates multiple variations of the original, which can isolate particular instruments and sounds. Using different seeds and prompts produces new versions while preserving the clip's vital properties.

The process begins with an initial image that spans an exact number of measures, with a clear beginning and end, which can then be played as a loop. Even so, this alone doesn't make the audio a smooth-sounding song from beginning to end.
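The measure-aligned trimming can be sketched like this (the BPM arithmetic and names are illustrative, not Riffusion's actual pipeline):

```python
import numpy as np

def loop_measures(clip: np.ndarray, bpm: float, beats_per_measure: int,
                  measures: int, sr: int, repeats: int) -> np.ndarray:
    """Trim a clip to an exact number of measures, then tile it.

    Starting and ending on measure boundaries keeps the loop on the
    beat, though the seams can still sound abrupt without interpolation.
    """
    samples_per_beat = sr * 60 / bpm
    n = int(samples_per_beat * beats_per_measure * measures)
    return np.tile(clip[:n], repeats)

sr = 22050
clip = np.random.default_rng(1).normal(size=5 * sr)  # ~5 s of stand-in audio
looped = loop_measures(clip, bpm=120, beats_per_measure=4,
                       measures=2, sr=sr, repeats=3)
```

At 120 BPM and 4 beats per measure, two measures come to exactly 4 seconds of audio, so the loop point always falls on a downbeat.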

To make it smooth as jazz

Jumping from one image interpretation to the next causes abrupt changes in playback. To create smooth transitions, the seeds and prompts have to be interpolated in the latent space of the model. In diffusion models, the latent space is a feature-vector space that embeds every output the model can produce.

This allows the model to generate smoother transitions and forge close relationships between prompts and seeds. At this point the model can run indefinitely, producing long playback of clips, or a song generated from spectrograms in real time.
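One common way to interpolate between two seeds in latent space is spherical linear interpolation (slerp); this sketch assumes flat latent vectors and is not Riffusion's exact code:

```python
import numpy as np

def slerp(v0: np.ndarray, v1: np.ndarray, t: float) -> np.ndarray:
    """Spherical linear interpolation between two latent vectors.

    Sweeping t from 0 to 1 yields intermediate latents whose decoded
    spectrograms transition smoothly from one seed to the other.
    """
    # Angle between the two (normalized) latents.
    omega = np.arccos(np.clip(
        np.dot(v0 / np.linalg.norm(v0), v1 / np.linalg.norm(v1)), -1.0, 1.0))
    if np.isclose(omega, 0.0):
        return (1 - t) * v0 + t * v1  # nearly parallel: plain lerp is fine
    return (np.sin((1 - t) * omega) * v0 + np.sin(t * omega) * v1) / np.sin(omega)

# Latents drawn from two different seeds; a midpoint blends their structure.
a = np.random.default_rng(0).normal(size=64)
b = np.random.default_rng(1).normal(size=64)
mid = slerp(a, b, 0.5)
```

Slerp is preferred over straight linear blending for Gaussian latents because it keeps the interpolated vectors at a magnitude the model expects.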