- files: the folder contains all the files necessary to reconstruct the audio clips of the multimodal dataset.
- alignment_results: the folder contains the results of aligning the audio files with the transcripts. There is one subfolder per debate, and each subfolder contains one .json file for each chunk into which the original audio file of the debate was divided. The alignment was performed with aeneas. For more details on the structure of the .json files, please see the aeneas documentation.
- datasets: the folder contains several sub-folders (one per debate). Each sub-folder contains several .csv files representing the intermediate results of the final dataset construction process. For each debate there is also a duplicates.txt file listing the duplicated sentences and their number of occurrences. The folder also includes a YesWeCan folder containing the contents of the original dataset (USElecDeb60to16).
- MM-USElecDeb60to16: the official dataset folder. It currently contains the .csv file corresponding to the new dataset and an `audio_clips` folder that will be created/populated after downloading and processing the audio files with the files in the `files/audio_clips` folder.
- transcripts: the folder contains several sub-folders (one per debate). Each sub-folder contains:
  - the original transcript
  - the plain version of the transcript
  - a `splits` sub-folder containing the portions of text corresponding to each chunk
- debug.csv: a debugging file containing the information necessary for downloading and trimming the audio files of two debates. The columns in this file are the same as those of the `dictionary.csv` file.
- dictionary.csv: this file contains the information needed to download and trim all debates. The columns in this file are:
  - `id`: debate identifier number. Corresponds to the `id` of the debates in `USElecDeb60to16` and `MM-USElecDeb60to16`
  - `link`: link to the corresponding YouTube video of the debate
  - `startMin`: number of minutes to be cut from the beginning of the file
  - `startSec`: number of seconds to be cut from the beginning of the file
  - `endMin`: number of minutes to be cut from the end of the file
  - `endSec`: number of seconds to be cut from the end of the file
- run_aeneas: folder containing the bash script needed to run aeneas
- audio_pipepline.py: Python script to perform the operations for reconstructing the audio part of MM-USElecDeb60to16
- full_pipeline.py: Python script to perform all dataset construction operations (i.e. text part, audio part, creation of folders, datasets and alignment)
- utils.py: contains all the functions needed to construct the dataset.
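The trimming information in dictionary.csv can be turned into a start and end time as in the minimal sketch below. It is not the code from utils.py: the function name `trim_window`, the sample row, and the one-hour duration are hypothetical, and the sketch assumes a comma-separated file with integer values in the four offset columns.

```python
import csv
import io


def trim_window(row, total_seconds):
    """Return (start, end) in seconds of the audio kept after trimming.

    Assumes startMin/startSec are cut from the beginning of the file and
    endMin/endSec are cut from the end, as described for dictionary.csv.
    """
    start = int(row["startMin"]) * 60 + int(row["startSec"])
    cut_from_end = int(row["endMin"]) * 60 + int(row["endSec"])
    end = total_seconds - cut_from_end
    if not 0 <= start < end:
        raise ValueError("trim offsets exceed the duration of the recording")
    return start, end


# Hypothetical row with the column names listed above (not real data).
sample = (
    "id,link,startMin,startSec,endMin,endSec\n"
    "1,https://example.com/video,2,30,1,15\n"
)
rows = list(csv.DictReader(io.StringIO(sample)))

# Assumed duration of the downloaded recording: one hour.
start, end = trim_window(rows[0], total_seconds=3600.0)
print(start, end)  # 150 3525.0
```

The same function would apply to debug.csv, assuming it shares these columns.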
- Download the `multimodal-debates` folder.
- Install all the required packages. The list of required packages can be found in `requirements.txt` (files/requirements.txt).
- Run audio_pipepline.py. While the script is running, several folders will be created:
  - audio_clips: after the clips have been generated, this folder will contain several sub-folders (one per debate), each of which will contain as many clips as there are text samples in the dataset for that debate.
  - debates_audio_recordings: this folder is initially empty and will be populated with several sub-folders (one per debate). Each sub-folder will contain:
    - a splits sub-folder containing the new audio files after splitting into chunks
    - a _trim.wav file, the trimmed version of the original audio file
    - the original audio file
- The dataset will be available in the MM-USElecDeb60to16 folder.
In addition to the information present in the original dataset (please see USElecDeb60to16 for detailed information), MM-USElecDeb60to16 contains 3 additional columns:
- `NewBegin`: the number of seconds corresponding to the beginning of the phrase with respect to the duration of the trimmed original audio file
- `NewEnd`: the number of seconds corresponding to the end of the phrase with respect to the duration of the trimmed original audio file
- `idClip`: the identifier of the audio clip corresponding to the sentence. This `id` is needed to reconstruct the audio-text pairs for each part of the speech
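
To illustrate how `NewBegin` and `NewEnd` relate to the trimmed audio, the sketch below cuts one clip from a WAV file using only the Python standard library. It is an illustration under stated assumptions, not the procedure implemented in audio_pipepline.py: the function name `cut_clip` and the file names are hypothetical, and the demonstration runs on a synthetic silent recording rather than on a real debate.

```python
import os
import tempfile
import wave


def cut_clip(trimmed_wav, clip_wav, new_begin, new_end):
    """Copy the [new_begin, new_end) range (in seconds) into a new WAV file.

    Returns the number of audio frames written to clip_wav.
    """
    if new_end <= new_begin:
        raise ValueError("new_end must be greater than new_begin")
    with wave.open(trimmed_wav, "rb") as src:
        params = src.getparams()
        rate = src.getframerate()
        first = int(round(new_begin * rate))
        last = int(round(new_end * rate))
        src.setpos(first)  # raises wave.Error if beyond the end of the file
        frames = src.readframes(last - first)
    with wave.open(clip_wav, "wb") as dst:
        dst.setparams(params)
        dst.writeframes(frames)  # the frame count in the header is fixed here
    return len(frames) // (params.sampwidth * params.nchannels)


with tempfile.TemporaryDirectory() as tmp:
    # Synthetic stand-in for a trimmed recording: 2 s of silence, 8 kHz mono.
    source = os.path.join(tmp, "debate_trim.wav")
    target = os.path.join(tmp, "clip.wav")
    with wave.open(source, "wb") as w:
        w.setnchannels(1)
        w.setsampwidth(2)
        w.setframerate(8000)
        w.writeframes(b"\x00\x00" * 16000)

    # Hypothetical NewBegin/NewEnd values for one sentence.
    written = cut_clip(source, target, new_begin=0.5, new_end=1.25)
    print(written)  # 6000
```

In a real run, each row of the dataset would supply its own `NewBegin` and `NewEnd`, and `idClip` would identify which output clip belongs to which sentence; how the clip files are actually named is not specified here.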