Estimated Audio–Caption Correspondences Improve Language-Based Audio Retrieval

This repository contains the implementation of [1], which was accepted at the DCASE Workshop 2024.

Our submission [2] to the DCASE Challenge 2024, which is based on the proposed method, ranked first in Task 8 [3].

Motivation: Missing Audio–Caption Correspondences

Audio retrieval systems are typically trained on audio–caption datasets (e.g., ClothoV2 [4]), which contain pairs of audio recordings and corresponding descriptions $\{ (a_i, c_i)\}_{i=1 \dots N}$. Unfortunately, for these datasets, the pairwise correspondence between audio $i$ and caption $j$ is not known for $i \neq j$; it is therefore common practice (e.g., during contrastive training and during evaluation) to assume that pairs with $i \neq j$ do not match.
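
In other words, training and evaluation implicitly use binary correspondence targets of the form (our notation, not taken from the paper):

$$
y_{ij} = \begin{cases} 1 & \text{if } i = j, \\ 0 & \text{if } i \neq j. \end{cases}
$$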

However, relying on this assumption is not ideal. The following example shows a query and the five best-matching audio recordings in the ClothoV2 test set according to our retrieval model.

  • Recordings marked with ✅ are associated with the description ($i = j$), whereas
  • recordings marked with ❔ are associated with another caption ($i \neq j$); we thus do not know if the caption describes the audio.

(Hint: Use CTRL + click to open a recording in a new tab.)

Query: A large gathering of people are talking loudly with each other.
Results: rank 1 ❔, rank 2 ❔, rank 3 ❔, rank 4 ❔, rank 5 ✅

All audio recordings marked with ❔ actually match the description, and should not be treated as non-matching audio recordings during training. We thus argue that additional correspondence annotations are required to give better guidance during training.

Estimating Audio–Caption Correspondences

Since there are (currently) no large-scale datasets with partial or complete correspondence annotations, we estimate them with one or more other audio retrieval models.

The figure below illustrates the procedure:

[Figure: system illustration]

In stage 1, we assume that audio $a_i$ and caption $c_j$ do not match if $i \neq j$ and train the model with the contrastive loss $L_{\mathrm{sup}}$.
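
As a rough sketch (the typical form of such a loss, not necessarily the exact definition used in the code or in [1]), a standard NT-Xent-style contrastive loss over a batch of $N$ pairs, with similarity scores $s_{ij} = \mathrm{sim}(a_i, c_j)$ and temperature $\tau$, reads:

$$
L_{\mathrm{sup}} = -\frac{1}{N} \sum_{i=1}^{N} \log \frac{\exp(s_{ii}/\tau)}{\sum_{j=1}^{N} \exp(s_{ij}/\tau)},
$$

usually symmetrized by averaging the audio-to-text and text-to-audio directions. The denominator is where the assumption enters: every pair with $i \neq j$ is treated as a negative.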

Stage 2 uses predictions ensembled from several stage-1 models (bottom left) to estimate the correspondence between $a_i$ and $c_j$. These estimates then serve as training targets in place of the binary correspondences assumed in stage 1. Stage-2 model parameters are initialized with the stage-1 parameters, and the corresponding loss is denoted $L_{\mathrm{dist}}$.
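
One plausible instantiation (a sketch under our assumptions, not necessarily the exact formulation used in [1]): average the similarity scores $s^{(k)}_{ij}$ of $K$ stage-1 models into soft targets and train the stage-2 model to match them, e.g.

$$
q_{ij} = \frac{\exp\!\big(\tfrac{1}{K}\sum_{k=1}^{K} s^{(k)}_{ij} / \tau\big)}{\sum_{j'} \exp\!\big(\tfrac{1}{K}\sum_{k=1}^{K} s^{(k)}_{ij'} / \tau\big)}, \qquad
L_{\mathrm{dist}} = -\frac{1}{N}\sum_{i=1}^{N}\sum_{j=1}^{N} q_{ij} \log \frac{\exp(s_{ij}/\tau)}{\sum_{j'} \exp(s_{ij'}/\tau)},
$$

i.e., a knowledge-distillation-style cross-entropy with soft targets instead of the one-hot targets of stage 1.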

The figure below shows the top 50 correspondences for three queries; the orange bar corresponds to the audio that is associated with the query ($i = j$).

[Figure: top-50 estimated correspondences for three example queries]

Setting up the environment

The following describes how to set up a conda environment on Ubuntu 18.04 for training and inference.

Create environment:

  • conda env create -f environment.yml
  • conda activate salsa
  • CFLAGS='-O3 -march=native' pip install https://github.com/f0k/minimp3py/archive/master.zip
  • sudo apt-get install p7zip p7zip-full p7zip-rar

Activate the environment:

  • conda activate salsa

Log in to your wandb account:

  • wandb login

Test our pre-trained model on the ClothoV2 benchmark

Download ClothoV2 [4]:

  • run source scripts/download_clothov2.sh
  • the script downloads the dataset into a folder called clotho_v2

A checkpoint of the model is available here: https://cloud.cp.jku.at/index.php/s/ZZkWXQ7f3aXRXYW

Download and assemble the checkpoint with this command:

  • run source scripts/download_checkpoint.sh

Then use this command to test on the ClothoV2 benchmark:

CUDA_VISIBLE_DEVICES=0 python -m experiments.ex_dcase24 cmd_test_on_clothov2 with \
data_loader.batch_size_eval=32 \
audio_features.segment_length=10 \
audio_features.model=passt \
sentence_features.model=roberta-large \
load_model=passt_roberta.ckpt

The expected performance is:

map@10 R@1 R@5 R@10
40.11 27.69 57.05 70.50

Training

The following section describes training on the ClothoV2 dataset; training with additional datasets is covered in the section below.

Training was done on a single Nvidia A40 GPU.

Set up the environment and download the ClothoV2 dataset as described above.

Stage 1 training:

CUDA_VISIBLE_DEVICES=0 python -m experiments.ex_dcase24 with \
data_loader.batch_size=64 \
data_loader.batch_size_eval=32 \
audio_features.segment_length=10 \
audio_features.model=passt \
sentence_features.model=roberta-large \
rampdown_type=cosine \
max_epochs=20 \
rampdown_stop=15 \
warmup_length=1 \
rampdown_start=1 \
train_on=clothov2 \
seed=409194

The expected performance is:

map@10 R@1 R@5 R@10
28.20 17.24 42.31 56.47

The result will be stored in the model_checkpoints directory.

Estimate the correspondences (replace mild-mountain-1 with your stage-1 experiment name); the results are stored in the same directory as the checkpoint:

MODEL_NAME=mild-mountain-1
CUDA_VISIBLE_DEVICES=0 python -m experiments.ex_dcase24 cmd_generate_embeddings with \
data_loader.batch_size_eval=32 \
audio_features.segment_length=10 \
audio_features.model=passt \
sentence_features.model=roberta-large \
load_parameters=$MODEL_NAME

Stage 2 training:

MODEL_NAME=mild-mountain-1
CUDA_VISIBLE_DEVICES=0 python -m experiments.ex_dcase24 with \
data_loader.batch_size=64 \
data_loader.batch_size_eval=32 \
audio_features.segment_length=10 \
audio_features.model=passt \
sentence_features.model=roberta-large \
lr_audio_encoder=2e-5 \
lr_audio_project=2e-5 \
lr_sentence_encoder=2e-5 \
lr_sentence_project=2e-5 \
rampdown_type=cosine \
max_epochs=20 \
rampdown_stop=15 \
warmup_length=1 \
rampdown_start=1 \
train_on=clothov2 \
load_parameters=$MODEL_NAME \
load_last=best \
loss_weight=0.0 \
distill_weight=1.0 \
distill_from=$MODEL_NAME \
seed=523528930

The test performance should improve to:

map@10 R@1 R@5 R@10
29.94 18.48 45.58 60.40

To further improve the performance, train more stage-1 models (ATST, MN, ...), generate audio–caption correspondences for each of them, and add their names to distill_from; this argument takes a list of models separated by semicolons (;). Use quotes to escape the semicolons, e.g., "distill_from=model-1;model-2;model-3"; a full example follows below.
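
For illustration, assuming two additional stage-1 experiments named atst-run-1 and mn-run-1 (hypothetical names) whose correspondences have already been generated with cmd_generate_embeddings, the stage-2 command above would change only in the distill_from argument:

MODEL_NAME=mild-mountain-1
CUDA_VISIBLE_DEVICES=0 python -m experiments.ex_dcase24 with \
data_loader.batch_size=64 \
data_loader.batch_size_eval=32 \
audio_features.segment_length=10 \
audio_features.model=passt \
sentence_features.model=roberta-large \
lr_audio_encoder=2e-5 \
lr_audio_project=2e-5 \
lr_sentence_encoder=2e-5 \
lr_sentence_project=2e-5 \
rampdown_type=cosine \
max_epochs=20 \
rampdown_stop=15 \
warmup_length=1 \
rampdown_start=1 \
train_on=clothov2 \
load_parameters=$MODEL_NAME \
load_last=best \
loss_weight=0.0 \
distill_weight=1.0 \
"distill_from=$MODEL_NAME;atst-run-1;mn-run-1" \
seed=523528930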

Additional Datasets

To achieve state-of-the-art performance, the system needs to be trained on ClothoV2, AudioCaps, and WavCaps.

First, download WavCaps [5]:

  • run source scripts/download_wavcaps.sh
  • the script downloads the dataset into a folder called wavcaps

Then download AudioCaps [6]:

  • unfortunately, the audio recordings of AudioCaps are not publicly available
  • you can download the dataset yourself or reach out to me for the download link (for research purposes only)
  • replace the links in scripts/download_audiocaps.sh
  • run source scripts/download_audiocaps.sh
  • the script downloads the compressed dataset into a folder called tmp

Finally, set the flag train_on=all in stage 1 (keep train_on=clothov2 in stage 2) and repeat the training procedure described above.
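
For reference, a stage-1 run over all three datasets is the stage-1 command from above with only the dataset flag changed (all other arguments stay the same):

CUDA_VISIBLE_DEVICES=0 python -m experiments.ex_dcase24 with \
data_loader.batch_size=64 \
data_loader.batch_size_eval=32 \
audio_features.segment_length=10 \
audio_features.model=passt \
sentence_features.model=roberta-large \
rampdown_type=cosine \
max_epochs=20 \
rampdown_stop=15 \
warmup_length=1 \
rampdown_start=1 \
train_on=all \
seed=409194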

References

  • [1] P. Primus, F. Schmid, and G. Widmer, “Estimated Audio–Caption Correspondences Improve Language-Based Audio Retrieval,” in Proc. of the Detection and Classification of Acoustic Scenes and Events Workshop, DCASE, 2024
  • [2] P. Primus and G. Widmer, “A Knowledge Distillation Approach to Improving Language-Based Audio Retrieval Models,” DCASE2024 Challenge, Tech. Rep., June 2024
  • [3] H. Xie, S. Lipping, and T. Virtanen, "Language-Based Audio Retrieval Task in DCASE 2022 Challenge", in Proc. of the Detection and Classification of Acoustic Scenes and Events Workshop, DCASE, Nancy, France, 2022
  • [4] K. Drossos, S. Lipping, and T. Virtanen, “Clotho: an Audio Captioning Dataset,” in Proc. of the IEEE Int. Conf. Acoustic., Speech and Signal Process., ICASSP, Barcelona, Spain, 2020
  • [5] X. Mei, C. Meng, H. Liu, Q. Kong, T. Ko, C. Zhao, M. D. Plumbley, Y. Zou, and W. Wang, “WavCaps: A ChatGPT-assisted weakly-labelled audio captioning dataset for audio-language multimodal research,” CoRR, vol. abs/2303.17395, 2023.
  • [6] C. D. Kim, B. Kim, H. Lee, and G. Kim, “AudioCaps: Generating captions for audios in the wild,” in Proc. of the North American Ch. of the Ass. for Computational Linguistics: Human Language Technologies, NAACL-HLT, 2019.
