EvoMusic is an adaptive music generation system designed to evolve music in alignment with user preferences. By analyzing user interactions, it continuously refines its understanding of musical tastes and generates personalized compositions.
At its core, EvoMusic combines a music scoring mechanism, user feedback modeling, conditional music generation, and evolutionary strategies. The system follows a loop where it evolves music based on inferred preferences, generates a playlist, collects feedback, and fine-tunes its understanding of user tastes. This iterative process ensures that the music adapts dynamically to each user.
More details about the project can be found in the technical report:
Click to download report
├── EvoMusic
│ ├── evolution
│ │ ├── evolve.py
│ │ ├── fitness.py
│ │ ├── logger.py
│ │ ├── operators.py
│ │ ├── problem.py
│ │ └── searchers.py
│ ├── music_generation
│ │ ├── generators.py
│ │ ├── init_prompts.txt
│ │ ├── musicgen_server.py
│ │ ├── musicLDM_server.py
│ │ └── riffusion_server.py
│ ├── user_embs
│ │ └── model.py
│ ├── usrapprox
│ │ ├── models
│ │ │ └── usr_emb.py
│ │ └── utils
│ │   ├── dataset.py
│ │   ├── memory.py
│ │   ├── user_manager.py
│ │   ├── user.py
│ │   ├── user_train_manager.py
│ │   └── utils.py
│ ├── application.py
│ └── configuration.py
├── generated_audio
│ └── ...
├── img
├── usrapprox
│ └── ...
├── usrembeds
│ ├── checkpoints
│ │ └── ...
│ ├── data
│ │ └── ...
│ ├── datautils
│ │ └── dataset.py
│ ├── exp
│ ├── models
│ │ ├── __init__.py
│ │ └── model.py
│ ├── align.py
│ ├── embedgen_MERT.py
│ ├── embedgen.py
│ ├── __init__.py
│ ├── main.py
│ ├── test.py
│ └── utils.py
├── visualizations
│ ├── music_embedding.py
│ └── users.py
├── application.py
├── evolution_pipeline.py
├── README.md
└── setup.py
The EvoMusic pipeline aims to generate music tailored to user preferences using evolutionary algorithms. It avoids costly retraining and excessive user input by dynamically refining generated music through interactions. The system integrates four key components: a music scorer that evaluates alignment with user tastes, an approximation model that infers user preferences from feedback, a conditional music generator, and evolutionary strategies to refine music iteratively.
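At a high level, the interaction loop can be sketched as follows. All names below are illustrative placeholders, not the actual EvoMusic API; the real pipeline is implemented in evolution_pipeline.py and configured via YAML:

# High-level sketch of the EvoMusic loop; every name below is a hypothetical
# placeholder used only to illustrate the flow described above.
def evolve_population(user_model):
    # evolutionary search guided by the music scorer and the current user model
    return ["calm piano with soft strings", "upbeat synthwave with heavy bass"]

def generate_playlist(candidates):
    # conditional music generation (e.g. MusicGen or Riffusion) from the candidates
    return [f"audio rendered from: {c}" for c in candidates]

def collect_feedback(playlist):
    # user interaction: likes/dislikes or ratings, from the GUI or a synthetic user
    return [1.0 for _ in playlist]

def update_user_model(user_model, playlist, feedback):
    # fine-tune the approximation of the user's tastes on the new feedback
    return user_model

user_model = {}  # placeholder for the approximated user preferences
for _ in range(3):  # iterative refinement loop
    candidates = evolve_population(user_model)
    playlist = generate_playlist(candidates)
    feedback = collect_feedback(playlist)
    user_model = update_user_model(user_model, playlist, feedback)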
Music evolution employs both text-based prompt optimization and direct embedding optimization. The prompt-based approach utilizes large language models (LLMs) to refine text prompts through evolutionary techniques like roulette-wheel selection, novelty injection, and elite retention. Two strategies were tested: Full LLM, which optimizes prompts via meta-prompt reasoning, and LLM Evolve, which applies genetic operators like crossover and mutation to text prompts.
Embedding optimization directly manipulates the token embeddings used in the conditional music generator. Methods tested included CMA-ES, SNES, and genetic algorithms, all of which search the latent space for optimized solutions. Genetic algorithms showed promise in balancing exploration and exploitation, while numerical techniques struggled due to the sparsity of the embedding space.
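As a rough illustration of the embedding-space search, the sketch below runs CMA-ES from the evotorch library over a toy embedding vector. The dimensionality, the cosine-similarity fitness, and the target vector are stand-ins for the generator's token embeddings and the user-alignment score used in EvoMusic:

import torch
import torch.nn.functional as F
from evotorch import Problem
from evotorch.algorithms import CMAES
from evotorch.logging import StdOutLogger

EMB_DIM = 128                    # hypothetical embedding dimensionality
target = torch.randn(EMB_DIM)    # stand-in for an embedding the user would like

def fitness(x: torch.Tensor) -> torch.Tensor:
    # Toy fitness: cosine similarity to the target embedding. In EvoMusic this
    # role is played by the user-alignment score computed by the music scorer.
    return F.cosine_similarity(x, target, dim=0)

problem = Problem("max", fitness, solution_length=EMB_DIM, initial_bounds=(-1.0, 1.0))
searcher = CMAES(problem, stdev_init=0.5)   # SNES or a genetic algorithm are drop-in alternatives
StdOutLogger(searcher, interval=10)         # print progress every 10 generations
searcher.run(50)                            # evolve for 50 generations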
This work addresses the challenge of measuring song quality, which is subjective and difficult to quantify. We propose AlignerV2, a model that aligns user and music embeddings in a shared latent space, allowing similarity calculations between them. Since real user embeddings are not available, synthetic user embeddings are generated from a large dataset of music interactions (Last.fm). The MERT encoder is used to extract music embeddings, leveraging all 13 hidden layers through a gating mechanism. Finally, we can calculate the similarity between user and music embeddings, allowing us to approximate a fitness function for the downstream tasks. AlignerV2 is trained with contrastive learning, using an InfoNCE loss with a learnable temperature.
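A minimal PyTorch sketch of a symmetric InfoNCE objective with a learnable temperature, in the spirit of how AlignerV2 is described above (an illustration under assumptions, not the actual training code):

import torch
import torch.nn.functional as F

class InfoNCEWithTemperature(torch.nn.Module):
    """Symmetric InfoNCE loss over aligned user/music embedding batches."""

    def __init__(self):
        super().__init__()
        # learnable temperature, parameterized in log space for stability
        self.log_temp = torch.nn.Parameter(torch.tensor(0.0))

    def forward(self, user_emb: torch.Tensor, music_emb: torch.Tensor) -> torch.Tensor:
        # user_emb, music_emb: (batch, dim); row i of each forms a positive pair
        user_emb = F.normalize(user_emb, dim=-1)
        music_emb = F.normalize(music_emb, dim=-1)
        logits = user_emb @ music_emb.T / self.log_temp.exp()  # (batch, batch) similarity matrix
        targets = torch.arange(logits.size(0), device=logits.device)
        # match users to their music and music to their users
        return 0.5 * (F.cross_entropy(logits, targets) + F.cross_entropy(logits.T, targets))

The loss returned by InfoNCEWithTemperature()(user_batch, music_batch) can then be minimized jointly with the aligner's parameters.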
We implemented a GUI to facilitate user interaction with the system. The GUI allows the user to listen to the generated music and provide feedback. An emergent behavior surfaced by injecting real human feedback into the loop: the quality of the generated music improved markedly for the numerical optimization methods, where synthesized feedback would sometimes fail to weed out noisy solutions.
To install the required packages, run the following commands:
Create and activate a conda environment with Python 3.10:
conda create -n evomusic python=3.10
conda activate evomusic
Install the required packages via setup.py:
pip install -e .
Ensure the needed files are in the correct directories:
- the AlignerV2_best_model.pt checkpoint in usrembeds/checkpoints/
- embeddings_full_split in usrembeds/data/embeddings/
To run the application you need to provide evolution_pipeline.py with a valid configuration file. The configuration file is in YAML format and contains the fields that describe the parameters of the pipeline. An example configuration file is available under the example_conf/ directory.
python evolution_pipeline.py --config_path example_conf/config.yaml
In the configuration file it is possible to specify which music generation model to use in the pipeline. EvoMusic supports two different models, Riffusion and MusicGen; both can be configured, and the music_model parameter dictates which one to use.
music_generator:
  model: "facebook/musicgen-small"
  input_type: "text"
  output_dir: "generated_audio"
  name: "musicgen"
riffusion_pipeline:
  input_type: "text"
  output_dir: "generated_audio"
  name: "riffusion"
  inference_steps: 50
music_model: "musicgen"
input_type specifies the type of input that the model expects; it can be either text, embedding, or token_embedding. output_dir specifies the directory where the generated audio files will be saved.
In user_model it is possible to configure the user embedding model, the user configuration, and the training configuration.
user_model:
  aligner:
    abs_file_path: "usrembeds/checkpoints/AlignerV2_best_model.pt"
  user_conf:
    memory_length: 50
    amount: 1
    init: "rmean" # mean rand rmean
  train_conf:
    lr: 0.001
    epochs: 5
    best_solutions: 10
  users:
    - user_type: "synth"
      target_user_id: 0
aligner specifies the path to the alignment model checkpoint. user_conf specifies the configuration of the user embedding model; it is possible to set the memory_length (the number of previous generations to consider when training the approximated user), the amount (the number of users to approximate), and the init method, which specifies how to initialize the user embedding. init can be either mean (initialize the user embedding with the average of all the synthesized users), rand (initialize the user embedding with random values), or rmean (same as mean but with added random noise). train_conf specifies the training configuration of the user embedding model; it is possible to set the learning rate (lr) and the number of epochs. best_solutions sets the number of best solutions from each generation to be used for the user embedding approximation. users specifies the users to approximate; it is possible to set the user_type, which can be either synth or real, and the target_user_id, which specifies the id of the user to approximate among the 987 available.
Under the evolution section of the configuration file we can specify the parameters of the genetic algorithm as well as the logging and LLM options.
evolution:
  exp_name: "base"
  generations: 5
  max_seq_len: 25
  duration: 1
  best_duration: 3
  device: "cpu"
  initialization: "file"
  init_file: "EvoMusic/music_generation/init_prompts.txt"
  logger:
    wandb: True
    project: "MusicEvo"
    wandb_token: "WANDB_TOKEN"
    visualizations: False
  LLM:
    api_key: "API_KEY"
    temperature: 0.7
    model: "gpt-4o-mini"
    api_uri: "https://api.openai.com/v1/chat/completions" # needs to be an OpenAI API compatible endpoint
exp_name specifies the name of the experiment. generations specifies the number of generations that the genetic algorithm will run for each evolve step. max_seq_len specifies the maximum length in tokens of the generated sequence (including 2 tokens for the start and end tokens). duration specifies the duration in seconds of the generated audio files. best_duration specifies the duration in seconds of the best solution of each generation. device specifies the device to use for computation in the evolutionary strategy. initialization specifies the initialization method for the population; it can be either LLM or file. If the initialization is set to file, the init_file parameter specifies the path to the file containing the initial prompts; otherwise, the LLM section specifies the parameters for the LLM initialization, and any OpenAI API compatible endpoint can be used. logger specifies the logging options for the pipeline; it is possible to log the results to WandB and to save the visualizations of the population. It is recommended to set wandb to True and to provide a valid WandB token.
fitness:
  mode: "user" # can either be user, music or dynamic
  target_music: "" # path to the target music for mode music
  noise_weight: 0.5 # noise weight for the fitness function
mode specifies the mode of the fitness function; it can be either user, music, or dynamic. If the mode is set to user, the fitness function uses as target the user embedding specified in the user_model section; if the mode is set to music, it uses as target the music specified in the target_music field; if the mode is set to dynamic, it uses the dynamically approximated user embedding. noise_weight specifies the weight of the penalty for noise and artifacts in the generated audio; it is not recommended to use it with the LLM and LLM evolve modes.
Under the search section you can specify the parameters of the evolutionary strategy; the available modes are LLM evolve, full LLM, GA, CMAES, and SNES.
search:
  mode: "LLM evolve"
  # general search parameters
  population_size: 100
  sample: True
  temperature: 0.1 # (note: the original values are [-1,1] so we advise lower values)
  novel_prompts: 0.1
  elites: 0.1
mode specifies the mode of the evolutionary strategy; it can be one of:
- full LLM, which uses the LLM to generate the whole population and perform the evolution and crossover; the prompt can be set via the full_LLM_prompt field.
- LLM evolve, which uses the LLM to generate the initial population and then evolves it using LLM-based operators defined in the LLM_genetic_operators section.
- GA, which uses a genetic algorithm to evolve the population.
- CMAES, which uses the CMA-ES algorithm to evolve the population.
- SNES, which uses the SNES algorithm to evolve the population.
GA, CMAES and SNES use the evotorch library implementation of the algorithms.
population_size specifies the size of the population. sample specifies whether to use sampling for all operations [LLM modes only]. temperature specifies the temperature to use for the sampling [LLM modes only]. novel_prompts specifies the fraction of the population to create ex novo [LLM modes only]. elites specifies the fraction of the population to keep from the previous generation.
In order to be interpreted by EvoMusic, the prompt for the full LLM mode must be formatted as follows:
full_LLM_prompt:
  "Generate {num_generate} music prompts ...
  Here is the current population with their similarity scores and ranking for the current generation:
  {ranking}
  After the reasoning, generate only the next generation of prompts with a population of {num_generate} prompts."
More detailed instructions can be added in the prompt as long as it follows the same general structure as the example above.
As for the LLM evolve mode, the genetic operators can be defined in the configuration file under the LLM_genetic_operators field. The operators are applied to the whole population sequentially, one by one, and you can create operators that apply multiple operations at the same time by describing what you want the LLM to do. Do not use the tags anywhere, as they are used to extract the final output from the LLM.
# LLM evolve parameters
tournament_size: 5 # size of the tournament for selection
LLM_genetic_operators:
  # genetic operators to use when using the LLM evolve mode
  - name: "cross over"
    prompt: "take the two prompts provided and cross them over by mixing components of both {prompts}"
    input: 2 # number of parents
    output: 1 # number of children
    probability: 0.5 # the probability of applying the operator
  - name: "change genre"
    prompt: "take the prompt and change the genre of the music used {prompts}"
    input: 1 # number of parents
    output: 1 # number of children
    probability: 0.5
  - name: "random mutation"
    input: 1 # number of parents
    output: 1 # number of children
    prompt: "take the prompt and mutate it, you can choose to mutate any of the words in the prompt {prompts}"
    probability: 0.5
name specifies the name of the operator. prompt specifies the prompt to use for the operator, i.e., the operation that the LLM should perform. input specifies the number of parents that the operator needs. output specifies the number of children that the operator will generate. probability specifies the probability of applying the operator.
When using GA, CMAES, or SNES, the parameters can be defined under the evotorch field; any parameter available in the evotorch library can be specified here.
evotorch: # additional parameters for the search algorithm when using evotorch's algorithms
  elitist: True
GA_operators:
  - name: "OnePointCrossOver"
    parameters:
      tournament_size: 4
      cross_over_rate: 0.5
  - name: "GaussianMutation"
    parameters:
      stdev: 20
elitist specifies whether to use elitism in the algorithm; if set to false, it will override the elites parameter in the search section. GA_operators specifies the genetic operators to use with the GA algorithm; the available operators are OnePointCrossOver, GaussianMutation, CosynePermutation, MultiPointCrossOver, PolynomialMutation, SimulatedBinaryCrossOver, and TwoPointCrossOver. Check the documentation of the evotorch library for more information on the parameters of the operators.