diff --git a/README.md b/README.md
index 3931e50..4a16c0c 100644
--- a/README.md
+++ b/README.md
@@ -1,5 +1,5 @@
-
+
 # MOSEL: 950,000 Hours of Speech Data for Open-Source Speech Foundation Model Training on EU Languages
@@ -17,119 +17,119 @@
-[CommonVoice](https://commonvoice.mozilla.org/en/datasets)
+CommonVoice
 CC 0
 6,732
 bg, cs, da, nl, en, et, fi, fr, de, el, hu, ga, it, lv, lt, mt, pl, pt, ro, sk, sl, es, sv
 ✅
-[CoVoST2](https://github.com/facebookresearch/covost)
+CoVoST2
 CC 0
 687
 en, fr, it, es, pt, et, nl, sv, lv, sl
 ✅
-[CSS10](https://github.com/Kyubyong/css10)
+CSS10
 Public Domain
 99
 nl, fi, fr, de, el, hu, es
 ✅
-[EMU](https://ips-lmu.github.io/EMU.html)
+EMU
 CC BY 3.0
 56
 pl
 ✅
-[EU Parliament](https://clarin-pl.eu/dspace/handle/11321/821)
+EU Parliament
 CC BY 4.0
 32
 pl
 ✅
-[FLEURS](https://huggingface.co/datasets/google/fleurs)
+FLEURS
 CC BY 4.0
 215
 bg, cs, da, nl, en, et, fi, fr, de, el, hu, ga, it, lv, lt, mt, pl, pt, ro, sk, sl, es, sv
 ✅
-[Large Corpus of Czech Parliament Plenary Hearings](https://lindat.cz/repository/xmlui/handle/11234/1-3126)
+Large Corpus of Czech Parliament Plenary Hearings
 CC BY 4.0
 444
 cs
 ✅
-[LibriLight](https://github.com/facebookresearch/libri-light)
+LibriLight
 Public Domain
 57,706
 en
 ❌
-[LibriTTS](https://www.openslr.org/60/)
+LibriTTS
 CC BY 4.0
 585
 en
 ✅
-[LibriSpeech](https://www.openslr.org/12)
+LibriSpeech
 CC BY 4.0
 360
 en
 ✅
-[LibriVoxDeEn](https://www.cl.uni-heidelberg.de/statnlpgroup/librivoxdeen/)
+LibriVoxDeEn
 Public Domain
 547
 de
 ✅
-[MC Speech](https://github.com/czyzi0/the-mc-speech-dataset)
+MC Speech
 CC 0
 22
 pl
 ✅
-[Multilingual LibriSpeech](https://www.openslr.org/94/)
+Multilingual LibriSpeech
 CC BY 4.0
 50,687
 nl, en, fr, de, it, pl, pt, es
 ✅
-[SIWIS](https://datashare.ed.ac.uk/handle/10283/2353)
+SIWIS
 CC BY 4.0
 11
 fr
 ✅
-[Speech Commands](http://download.tensorflow.org/data/speech_commands_v0.02.tar.gz)
+Speech Commands
 CC BY 4.0
 18
 en
 ✅
-[VCTK](https://datashare.ed.ac.uk/handle/10283/3443)
+VCTK
 CC BY 4.0
 44
 en
 ✅
-[VoxPopuli](https://github.com/facebookresearch/voxpopuli)
+VoxPopuli
 CC 0
 383,500
 bg, hr, cs, da, nl, en, et, fi, fr, de, el, hu, it, lv, lt, mt, pl, pt, ro, sk, sl, es, sv
@@ -141,7 +141,7 @@
 ✅
-[YouTube-Commons](https://huggingface.co/datasets/PleIAs/YouTube-Commons)
+YouTube-Commons
 CC BY 4.0
 3,261
 bg, cs, nl, en, et, fr, de, el, hu, it, pl, pt, ro, es
@@ -153,7 +153,7 @@
 ✅
-[MOSEL :grapes:](https://huggingface.co/datasets/FBK-MT/mosel)
+MOSEL :grapes:
 CC BY 4.0
 441,206
 bg, hr, cs, da, nl, en, et, fi, fr, de, el, hu, it, lv, lt, mt, pl, pt, ro, sk, sl, es, sv