Diff-A-Riff is a Latent Diffusion Model capable of generating instrumental accompaniments for any musical audio context.
It relies on a pretrained consistency-model-based autoencoder (CAE) and a generative model operating on its latent embeddings. The LDM follows the framework of Elucidated Diffusion Models (EDMs).
Given a pair of input context and target accompaniment audio segments, the model is trained to reconstruct the accompaniment from the context and a CLAP embedding derived from a randomly selected sub-segment of the target itself. At inference time, Diff-A-Riff can generate single-instrument tracks under different conditioning signals.
First, a user can choose to provide a context, i.e. a piece of music that the generated material has to fit into. If provided, the context is encoded by the CAE into a sequence of latents that we call $$\textit{Context}$$. When a context is provided, we speak of accompaniment generation rather than single-instrument generation.
The user can also rely on CLAP-derived embeddings to further specify the material to be generated. CLAP provides a multimodal embedding space shared between the audio and text modalities. This means that the user can provide either a music reference or a text prompt, which, after being encoded with CLAP, yield $$\textit{CLAP}_\text{A}$$ and $$\textit{CLAP}_\text{T}$$ respectively.
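As a concrete illustration of these conditioning signals, the sketch below shows how context latents and CLAP embeddings could be assembled before sampling. All functions (`cae_encode`, `clap_encode_audio`, `clap_encode_text`) and array shapes are hypothetical stand-ins for the pretrained CAE and CLAP encoders, not the actual Diff-A-Riff API.

```python
import numpy as np

# Hypothetical stand-ins for the pretrained encoders described above.
def cae_encode(audio: np.ndarray) -> np.ndarray:
    """CAE encoder stand-in: audio -> sequence of latents (time, channels)."""
    return np.random.randn(len(audio) // 1024, 64)  # made-up latent shape

def clap_encode_audio(audio: np.ndarray) -> np.ndarray:
    """CLAP audio tower stand-in: audio -> single embedding."""
    return np.random.randn(512)

def clap_encode_text(prompt: str) -> np.ndarray:
    """CLAP text tower stand-in: prompt -> single embedding."""
    return np.random.randn(512)

# The conditioning signals described above: an optional context encoded by the
# CAE, plus a CLAP embedding derived from either a music reference or a text prompt.
context_audio = np.random.randn(10 * 48000)             # 10 s of context audio
context = cae_encode(context_audio)                     # "Context" latent sequence
clap_a = clap_encode_audio(np.random.randn(5 * 48000))  # CLAP_A from a music reference
clap_t = clap_encode_text("Latin Percussion")           # CLAP_T from a text prompt
```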
Sound Examples
In this section, we demonstrate the implicit controls of Diff-A-Riff.
Inpainting
As is common with diffusion models, Diff-A-Riff can be used to perform audio inpainting. At each denoising step, the region to be kept is replaced with the corresponding region of the known signal, noised to the current noise level, while the inpainted region is denoised normally. This ensures that the inpainted region blends seamlessly with its surroundings.
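A minimal sketch of this masking loop is shown below, assuming the EDM convention that a state at noise level $$\sigma$$ is the clean signal plus Gaussian noise scaled by $$\sigma$$. The `denoise_step` function is a placeholder for one step of an actual sampler, not Diff-A-Riff's implementation.

```python
import numpy as np

def denoise_step(x, sigma, sigma_next):
    """Stand-in for one EDM sampler step; a real model would predict x0 here."""
    return x * (sigma_next / sigma)  # placeholder: just shrink the noise

def inpaint(original, mask, sigmas, rng):
    """`mask` is 1 where the latents should be regenerated, 0 where they are kept."""
    x = rng.standard_normal(original.shape) * sigmas[0]
    for sigma, sigma_next in zip(sigmas[:-1], sigmas[1:]):
        # Regenerated region: ordinary denoising.
        x = denoise_step(x, sigma, sigma_next)
        # Kept region: re-noise the known signal to the current noise level,
        # so both regions stay statistically consistent at every step.
        noisy_known = original + sigma_next * rng.standard_normal(original.shape)
        x = mask * x + (1.0 - mask) * noisy_known
    return x

rng = np.random.default_rng(0)
latents = rng.standard_normal((256, 64))           # fake latent sequence
mask = np.zeros_like(latents); mask[100:160] = 1   # region to inpaint
sigmas = np.geomspace(80.0, 1e-3, 50)              # EDM-like noise schedule
result = inpaint(latents, mask, sigmas, rng)
```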
In the following examples, all tracks are inpainted from seconds 5 to 8.
Context
Masked Original Accompaniment
Inpainting w/ Mix
Inpainting Solo
Text-Audio Interpolation
We can interpolate between different references in the CLAP space. Here, we demonstrate the effect of interpolating between an audio-derived CLAP embedding and a text-derived one, with an interpolation ratio $$r$$.
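A sketch of such an interpolation is given below. Whether the model interpolates linearly or spherically between the normalised embeddings is an assumption here; plain linear interpolation followed by renormalisation is shown as one plausible choice.

```python
import numpy as np

def interpolate_clap(clap_audio, clap_text, r):
    """r = 0 -> pure audio reference, r = 1 -> pure text prompt."""
    mixed = (1.0 - r) * clap_audio + r * clap_text
    return mixed / np.linalg.norm(mixed)  # keep the embedding on the unit sphere

# Random unit vectors stand in for real CLAP embeddings.
clap_a = np.random.randn(512); clap_a /= np.linalg.norm(clap_a)
clap_t = np.random.randn(512); clap_t /= np.linalg.norm(clap_t)
conditions = [interpolate_clap(clap_a, clap_t, r) for r in (0, 0.2, 0.4, 0.6, 0.8, 1.0)]
```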
Audio Reference
$$ r = 0 $$
$$ r = 0.2 $$
$$ r = 0.4 $$
$$ r = 0.6 $$
$$ r = 0.8 $$
$$ r = 1 $$
Text Prompt
"Latin Percussion"
"Oriental Percussion Texture"
"The Catchiest Darbouka Rhythm"
"Rhythm on Huge Industrial Metal Plates"
Zero-Shot Slider
After generating an accompaniment for a context, based on a text or audio reference, we can adjust the generated accompaniment in arbitrary ways using the "zero-shot slider": we define the meaning of the left and right slider positions by writing two opposing text prompts, and move the slider to push the generated accompaniment towards either the left or the right prompt.
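The sketch below shows one plausible way to realise such a slider in CLAP space: treat the difference between the right and left prompt embeddings as a direction and shift the conditioning embedding along it. The exact mechanism behind the examples below is not specified here, so this is an illustrative assumption rather than the method used.

```python
import numpy as np

def slider(clap_condition, clap_left, clap_right, position):
    """position in [-1, 1]: -1 leans towards the left prompt, +1 towards the right."""
    direction = clap_right - clap_left
    shifted = clap_condition + position * direction
    return shifted / np.linalg.norm(shifted)

rng = np.random.default_rng(0)
# Random vectors stand in for the CLAP embeddings of the accompaniment
# condition and the two slider prompts.
cond, left, right = (rng.standard_normal(512) for _ in range(3))
softer = slider(cond, left, right, -0.5)   # towards the left prompt
harder = slider(cond, left, right, +0.5)   # towards the right prompt
```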
Example 1
Example 2
Example 3
Example 4
Generate an accompaniment for a context, based on a text/audio reference
Context
Text/Audio Reference
"electric guitar"
"percussive drum rhythm with kick and snare"
"acoustic guitar"
Result (context + accompaniment)
Make zero-shot slider adjustments of the generated accompaniment
Left slider prompt
"soft, sweet and tranquil"
"funny childrens toys"
"sad and mellow"
"legato, sustained notes"
(original)
Right slider prompt
"hard hitting aggressive"
"deep and ominous"
"happy and delirious"
"staccato, short accentuated notes"
Variations
Given an audio file, we can encode it into the CAE latent space to obtain the corresponding latent sequence. By adding noise to this sequence and denoising it again, we end up with a variation of the original sequence, which we can then decode to obtain a variation of the input audio. The amount of noise added is controlled by the variation strength parameter $$s_\text{Var}$$, which determines how different from the original a variation can be.
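A sketch of this noise-and-denoise procedure, under the same EDM-style assumptions as the inpainting example, is given below; `denoise_step` is again a placeholder rather than the real sampler, and the exact mapping from $$s_\text{Var}$$ to a starting noise level is an assumption.

```python
import numpy as np

def denoise_step(x, sigma, sigma_next):
    return x * (sigma_next / sigma)  # placeholder for the real model

def make_variation(latents, s_var, sigmas, rng):
    """s_var in [0, 1]: 0 returns the input (almost) unchanged, 1 resamples from scratch."""
    start = int(round((1.0 - s_var) * (len(sigmas) - 1)))  # how far back up the schedule to jump
    x = latents + sigmas[start] * rng.standard_normal(latents.shape)
    for sigma, sigma_next in zip(sigmas[start:-1], sigmas[start + 1:]):
        x = denoise_step(x, sigma, sigma_next)
    return x

rng = np.random.default_rng(0)
latents = rng.standard_normal((256, 64))   # fake CAE latent sequence
sigmas = np.geomspace(80.0, 1e-3, 50)      # EDM-like noise schedule
variation = make_variation(latents, s_var=0.5, sigmas=sigmas, rng=rng)
```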
Reference
$$s_\text{Var} = 0.2$$
$$0.5$$
$$0.8$$
Stereo width
Following the same principle as for variations, we can take any mono signal, create a slight variation of it, and use the original and the variation as the left and right channels, producing what we call pseudo-stereo. Below are examples of pseudo-stereo files generated with different stereo widths $$s_\text{Stereo}$$.
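A sketch of this construction is shown below; `make_variation` stands in for the variation procedure above (here reduced to simple noise addition), and $$s_\text{Stereo}$$ plays the role of the variation strength.

```python
import numpy as np

def make_variation(latents, strength, rng):
    """Placeholder: a real implementation would noise and re-denoise the latents."""
    return latents + strength * rng.standard_normal(latents.shape)

def pseudo_stereo(mono_latents, s_stereo, rng):
    """Pair the mono signal with a variation of itself as left/right channels."""
    left = mono_latents
    right = make_variation(mono_latents, s_stereo, rng)
    return np.stack([left, right])  # (2, time, channels) before CAE decoding

rng = np.random.default_rng(0)
mono = rng.standard_normal((256, 64))
stereo = pseudo_stereo(mono, s_stereo=0.4, rng=rng)
```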
$$s_\text{Stereo} = 0$$
$$0.2$$
$$0.4$$
$$0.5$$
Loop Sampling
By repeating a portion of the data being denoised, we can enforce repetitions in the generated material. We enforce this repetition for a fraction $$s_\text{Loop}$$ of the diffusion steps and let the model denoise normally for the remaining steps, which introduces slight variations between repetitions.
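The sketch below illustrates this idea: for the first $$s_\text{Loop}$$ fraction of the sampling steps, the state is overwritten with tiled copies of its first segment, and the remaining steps run normally. As before, `denoise_step` is a placeholder and the details are assumptions, not the exact procedure behind the examples.

```python
import numpy as np

def denoise_step(x, sigma, sigma_next):
    return x * (sigma_next / sigma)  # placeholder for the real model

def loop_sample(shape, n_repeats, s_loop, sigmas, rng):
    x = rng.standard_normal(shape) * sigmas[0]
    seg_len = shape[0] // n_repeats
    n_forced = int(s_loop * (len(sigmas) - 1))  # steps with enforced repetition
    for i, (sigma, sigma_next) in enumerate(zip(sigmas[:-1], sigmas[1:])):
        x = denoise_step(x, sigma, sigma_next)
        if i < n_forced:
            # Overwrite the state with tiled copies of its first segment.
            x[: seg_len * n_repeats] = np.tile(x[:seg_len], (n_repeats, 1))
    return x

rng = np.random.default_rng(0)
sigmas = np.geomspace(80.0, 1e-3, 50)
looped = loop_sample((256, 64), n_repeats=4, s_loop=0.8, sigmas=sigmas, rng=rng)
```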
Generations with 2 repetitions:
$$s_\text{Loop} = 0.5$$
$$s_\text{Loop} = 0.8$$
$$s_\text{Loop} = 1$$
Generations with 4 repetitions:
$$s_\text{Loop} = 0.5$$
$$s_\text{Loop} = 0.8$$
$$s_\text{Loop} = 1$$
Ethics Statement
Sony Computer Science Laboratories is committed to exploring the positive applications of AI in music creation. We collaborate with artists to develop innovative technologies that enhance creativity. We uphold strong ethical standards and actively engage with the music community and industry to align our practices with societal values. Our team is mindful of the extensive work that songwriters and recording artists dedicate to their craft. Our technology must respect, protect, and honour this commitment.
Diff-A-Riff supports and enhances human creativity and emphasises the artist's agency by providing various controls for generating and manipulating musical material. Because it generates one stem at a time, the artist remains responsible for the entire musical arrangement.
Diff-A-Riff has been trained on a dataset that was legally acquired for internal research and development; therefore, neither the data nor the model can be made publicly available. We are doing our best to ensure full legal compliance and address all ethical concerns.
For media, institutional, or industrial inquiries, please contact us via the email address provided in the paper.
<script>
  // When one audio element starts playing, pause and rewind all the others.
  for (const audio of document.getElementsByTagName("audio")) {
    audio.addEventListener("play", function () {
      for (const other of document.getElementsByTagName("audio")) {
        if (this !== other) {
          other.pause();
          other.currentTime = 0;
        }
      }
    });
  }
</script>