Diff-A-Riff is a Latent Diffusion Model capable of generating instrumental accompaniments for any musical audio context.
It relies on a pretrained consistency-model-based autoencoder (CAE) and a generative model operating on its latent embeddings. The LDM follows the framework of Elucidated Diffusion Models (EDMs).
Given a pair of input context and target accompaniment audio segments, the model is trained to reconstruct the accompaniment from the context and a CLAP embedding derived from a randomly selected sub-segment of the target itself. At inference time, Diff-A-Riff can generate single-instrument tracks under different conditioning signals.
First, a user can choose to provide a context, i.e. a piece of music that the generated material has to fit into. If provided, the context is encoded by the CAE into a sequence of latents that we call $$\textit{Context}$$. When a context is provided, we speak of accompaniment generation rather than single-instrument generation.
The user can also rely on CLAP-derived embeddings to further specify the material to be generated. CLAP provides a multimodal embedding space shared between the audio and text modalities. This means that the user can provide either a music reference or a text prompt, which, after being encoded with CLAP, yield $$\textit{CLAP}_\text{A}$$ and $$\textit{CLAP}_\text{T}$$ respectively.
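As a concrete illustration of these conditioning signals, the sketch below shows how context latents and CLAP embeddings could be assembled before sampling. All functions (`cae_encode`, `clap_encode_audio`, `clap_encode_text`) and array shapes are hypothetical stand-ins for the pretrained CAE and CLAP encoders, not the actual Diff-A-Riff API.

```python
import numpy as np

# Hypothetical stand-ins for the pretrained encoders described above.
def cae_encode(audio: np.ndarray) -> np.ndarray:
    """CAE encoder stand-in: audio -> sequence of latents (time, channels)."""
    return np.random.randn(len(audio) // 1024, 64)  # made-up latent shape

def clap_encode_audio(audio: np.ndarray) -> np.ndarray:
    """CLAP audio tower stand-in: audio -> single embedding."""
    return np.random.randn(512)

def clap_encode_text(prompt: str) -> np.ndarray:
    """CLAP text tower stand-in: prompt -> single embedding."""
    return np.random.randn(512)

# The conditioning signals described above: an optional context encoded by the
# CAE, plus a CLAP embedding derived from either a music reference or a text prompt.
context_audio = np.random.randn(10 * 48000)             # 10 s of context audio
context = cae_encode(context_audio)                     # "Context" latent sequence
clap_a = clap_encode_audio(np.random.randn(5 * 48000))  # CLAP_A from a music reference
clap_t = clap_encode_text("Latin Percussion")           # CLAP_T from a text prompt
```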
Sound Examples
In this section, we demonstrate the implicit controls of Diff-A-Riff.
Inpainting
As is common with diffusion models, Diff-A-Riff can be used to perform audio inpainting. At each denoising step, the region to be kept is replaced with the corresponding region of the known signal, noised to the current noise level, while the inpainted region is denoised normally. This ensures that the inpainted region blends seamlessly with its surroundings.
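A minimal sketch of this masking loop is shown below, assuming the EDM convention that a state at noise level $$\sigma$$ is the clean signal plus Gaussian noise scaled by $$\sigma$$. The `denoise_step` function is a placeholder for one step of an actual sampler, not Diff-A-Riff's implementation.

```python
import numpy as np

def denoise_step(x, sigma, sigma_next):
    """Stand-in for one EDM sampler step; a real model would predict x0 here."""
    return x * (sigma_next / sigma)  # placeholder: just shrink the noise

def inpaint(original, mask, sigmas, rng):
    """`mask` is 1 where the latents should be regenerated, 0 where they are kept."""
    x = rng.standard_normal(original.shape) * sigmas[0]
    for sigma, sigma_next in zip(sigmas[:-1], sigmas[1:]):
        # Regenerated region: ordinary denoising.
        x = denoise_step(x, sigma, sigma_next)
        # Kept region: re-noise the known signal to the current noise level,
        # so both regions stay statistically consistent at every step.
        noisy_known = original + sigma_next * rng.standard_normal(original.shape)
        x = mask * x + (1.0 - mask) * noisy_known
    return x

rng = np.random.default_rng(0)
latents = rng.standard_normal((256, 64))           # fake latent sequence
mask = np.zeros_like(latents); mask[100:160] = 1   # region to inpaint
sigmas = np.geomspace(80.0, 1e-3, 50)              # EDM-like noise schedule
result = inpaint(latents, mask, sigmas, rng)
```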
In the following examples, all tracks are inpainted from seconds 5 to 8.
Context
Masked Original Accompaniment
Inpainting w/ Mix
Inpainting Solo
Text-Audio Interpolation
We can interpolate between different references in the CLAP space. Here, we demonstrate the effect of interpolating between an audio-derived CLAP embedding and a text-derived one, with an interpolation ratio $$r$$.
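A sketch of such an interpolation is given below. Whether the model interpolates linearly or spherically between the normalised embeddings is an assumption here; plain linear interpolation followed by renormalisation is shown as one plausible choice.

```python
import numpy as np

def interpolate_clap(clap_audio, clap_text, r):
    """r = 0 -> pure audio reference, r = 1 -> pure text prompt."""
    mixed = (1.0 - r) * clap_audio + r * clap_text
    return mixed / np.linalg.norm(mixed)  # keep the embedding on the unit sphere

# Random unit vectors stand in for real CLAP embeddings.
clap_a = np.random.randn(512); clap_a /= np.linalg.norm(clap_a)
clap_t = np.random.randn(512); clap_t /= np.linalg.norm(clap_t)
conditions = [interpolate_clap(clap_a, clap_t, r) for r in (0, 0.2, 0.4, 0.6, 0.8, 1.0)]
```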
Audio Reference
$$ r = 0 $$
$$ r = 0.2 $$
$$ r = 0.4 $$
$$ r = 0.6 $$
$$ r = 0.8 $$
$$ r = 1 $$
Text Prompt
"Latin Percussion"
"Oriental Percussion Texture"
"The Catchiest Darbouka Rhythm"
"Rhythm on Huge Industrial Metal Plates"
Zero-Shot Slider
After generating an accompaniment for a context, based on a text or audio reference, we can adjust the generated accompaniment in arbitrary ways using the "zero-shot slider": we define the meaning of the left and right slider positions by writing two opposing text prompts, and move the slider to push the generated accompaniment towards either the left or the right prompt.
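The sketch below shows one plausible way to realise such a slider in CLAP space: treat the difference between the right and left prompt embeddings as a direction and shift the conditioning embedding along it. The exact mechanism behind the examples below is not specified here, so this is an illustrative assumption rather than the method used.

```python
import numpy as np

def slider(clap_condition, clap_left, clap_right, position):
    """position in [-1, 1]: -1 leans towards the left prompt, +1 towards the right."""
    direction = clap_right - clap_left
    shifted = clap_condition + position * direction
    return shifted / np.linalg.norm(shifted)

rng = np.random.default_rng(0)
# Random vectors stand in for the CLAP embeddings of the accompaniment
# condition and the two slider prompts.
cond, left, right = (rng.standard_normal(512) for _ in range(3))
softer = slider(cond, left, right, -0.5)   # towards the left prompt
harder = slider(cond, left, right, +0.5)   # towards the right prompt
```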
Example 1
Example 2
Example 3
Example 4
Generate an accompaniment for a context, based on a text/audio reference
Context
Text/Audio Reference
"electric guitar"
"percussive drum rhythm with kick and snare"
"acoustic guitar"
Result (context + accompaniment)
Make zero-shot slider adjustments of the generated accompaniment
Left slider prompt
"soft, sweet and tranquil"
"funny childrens toys"
"sad and mellow"
"legato, sustained notes"
(original)
Right slider prompt
"hard hitting aggressive"
"deep and ominous"
"happy and delirious"
"staccato, short accentuated notes"
Variations
Given an audio file, we can encode it into the CAE latent space to obtain the corresponding latent sequence. By adding noise to this sequence and denoising it again, we end up with a variation of the original sequence, which we can then decode to obtain a variation of the input audio. The amount of noise added is controlled by the variation strength parameter $$s_\text{Var}$$, which determines how different from the original a variation can be.
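A sketch of this noise-and-denoise procedure, under the same EDM-style assumptions as the inpainting example, is given below; `denoise_step` is again a placeholder rather than the real sampler, and the exact mapping from $$s_\text{Var}$$ to a starting noise level is an assumption.

```python
import numpy as np

def denoise_step(x, sigma, sigma_next):
    return x * (sigma_next / sigma)  # placeholder for the real model

def make_variation(latents, s_var, sigmas, rng):
    """s_var in [0, 1]: 0 returns the input (almost) unchanged, 1 resamples from scratch."""
    start = int(round((1.0 - s_var) * (len(sigmas) - 1)))  # how far back up the schedule to jump
    x = latents + sigmas[start] * rng.standard_normal(latents.shape)
    for sigma, sigma_next in zip(sigmas[start:-1], sigmas[start + 1:]):
        x = denoise_step(x, sigma, sigma_next)
    return x

rng = np.random.default_rng(0)
latents = rng.standard_normal((256, 64))   # fake CAE latent sequence
sigmas = np.geomspace(80.0, 1e-3, 50)      # EDM-like noise schedule
variation = make_variation(latents, s_var=0.5, sigmas=sigmas, rng=rng)
```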
Reference
$$s_\text{Var} = 0.2$$
$$0.5$$
$$0.8$$
Stereo width
Following the same principle as for variations, we can take any mono signal, create a slight variation of it, and use the original and the variation as the left and right channels, producing what we call pseudo-stereo. Below are examples of pseudo-stereo files generated with different stereo widths $$s_\text{Stereo}$$.
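A sketch of this construction is shown below; `make_variation` stands in for the variation procedure above (here reduced to simple noise addition), and $$s_\text{Stereo}$$ plays the role of the variation strength.

```python
import numpy as np

def make_variation(latents, strength, rng):
    """Placeholder: a real implementation would noise and re-denoise the latents."""
    return latents + strength * rng.standard_normal(latents.shape)

def pseudo_stereo(mono_latents, s_stereo, rng):
    """Pair the mono signal with a variation of itself as left/right channels."""
    left = mono_latents
    right = make_variation(mono_latents, s_stereo, rng)
    return np.stack([left, right])  # (2, time, channels) before CAE decoding

rng = np.random.default_rng(0)
mono = rng.standard_normal((256, 64))
stereo = pseudo_stereo(mono, s_stereo=0.4, rng=rng)
```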
$$s_\text{Stereo} = 0$$
$$0.2$$
$$0.4$$
$$0.5$$
Loop Sampling
By repeating a portion of the data being denoised, we can enforce repetitions in the generated material. We enforce this repetition for a fraction $$s_\text{Loop}$$ of the diffusion steps and let the model denoise normally for the remaining steps, which introduces slight variations between repetitions.
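The sketch below illustrates this idea: for the first $$s_\text{Loop}$$ fraction of the sampling steps, the state is overwritten with tiled copies of its first segment, and the remaining steps run normally. As before, `denoise_step` is a placeholder and the details are assumptions, not the exact procedure behind the examples.

```python
import numpy as np

def denoise_step(x, sigma, sigma_next):
    return x * (sigma_next / sigma)  # placeholder for the real model

def loop_sample(shape, n_repeats, s_loop, sigmas, rng):
    x = rng.standard_normal(shape) * sigmas[0]
    seg_len = shape[0] // n_repeats
    n_forced = int(s_loop * (len(sigmas) - 1))  # steps with enforced repetition
    for i, (sigma, sigma_next) in enumerate(zip(sigmas[:-1], sigmas[1:])):
        x = denoise_step(x, sigma, sigma_next)
        if i < n_forced:
            # Overwrite the state with tiled copies of its first segment.
            x[: seg_len * n_repeats] = np.tile(x[:seg_len], (n_repeats, 1))
    return x

rng = np.random.default_rng(0)
sigmas = np.geomspace(80.0, 1e-3, 50)
looped = loop_sample((256, 64), n_repeats=4, s_loop=0.8, sigmas=sigmas, rng=rng)
```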
Generations with 2 repetitions:
$$s_\text{Loop} = 0.5$$
$$s_\text{Loop} = 0.8$$
$$s_\text{Loop} = 1$$
Generations with 4 repetitions:
$$s_\text{Loop} = 0.5$$
$$s_\text{Loop} = 0.8$$
$$s_\text{Loop} = 1$$
Ethics Statement
Sony Computer Science Laboratories is committed to exploring the positive applications of AI in music creation. We collaborate with artists to develop innovative technologies that enhance creativity. We uphold strong ethical standards and actively engage with the music community and industry to align our practices with societal values. Our team is mindful of the extensive work that songwriters and recording artists dedicate to their craft. Our technology must respect, protect, and honour this commitment.
Diff-A-Riff supports and enhances human creativity and emphasises the artist's agency by providing various controls for generating and manipulating musical material. Because it generates one stem at a time, the artist remains responsible for the entire musical arrangement.
Diff-A-Riff has been trained on a dataset that was legally acquired for internal research and development; therefore, neither the data nor the model can be made publicly available. We are doing our best to ensure full legal compliance and address all ethical concerns.
For media, institutional, or industrial inquiries, please contact us via the email address provided in the paper.
<script>
  // When one audio element starts playing, pause and rewind all the others.
  for (const audio of document.getElementsByTagName("audio")) {
    audio.addEventListener("play", function () {
      for (const other of document.getElementsByTagName("audio")) {
        if (this !== other) {
          other.pause();
          other.currentTime = 0;
        }
      }
    });
  }
</script>