Feat/select testset #59

AJDERS · 2024-01-10T10:51:27Z

Select a test set according to the following restrictions:

Quantity: The test set will total 15 hours, divided into 7.5 hours of read-aloud data and 7.5 hours of conversational data.
Dialects: The speakers must be distributed so that each region accounts for at least 15%.
Gender: Each gender must represent at least 45% of the total hours for both types of data.
Age: At least 20% from each age group, i.e. <25, 25-50 and >50.
Accent: At least 10% of the total hours for both types of data must be represented by individuals with accents.
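
A selection can be checked against the restrictions above with a small script. The following is a minimal sketch, not the code used in this PR: the `Speaker` record, its field names, the gender labels, the region names passed in by the caller and the 0.25-hour tolerance are all assumptions made for illustration. It also assumes that the region and age shares are measured in hours over both data types, which the list above does not specify.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass
class Speaker:
    """Hypothetical per-speaker record; the field names are illustrative."""

    region: str
    gender: str
    age_group: str  # one of "<25", "25-50", ">50"
    has_accent: bool
    read_hours: float
    conv_hours: float


def read_hours(s):
    return s.read_hours


def conv_hours(s):
    return s.conv_hours


def total_hours(s):
    return s.read_hours + s.conv_hours


def share_by(speakers, key, hours):
    """Return each group's share of the hours measured by `hours`."""
    totals = defaultdict(float)
    for s in speakers:
        totals[key(s)] += hours(s)
    grand_total = sum(totals.values())
    if grand_total == 0:
        return {}
    return {group: value / grand_total for group, value in totals.items()}


def check_criteria(speakers, regions, tol=0.25):
    """Map each criterion name to True (satisfied) or False."""

    def min_share(key, groups, hours, threshold):
        shares = share_by(speakers, key, hours)
        return all(shares.get(g, 0.0) >= threshold for g in groups)

    per_type = (read_hours, conv_hours)
    return {
        # 7.5 hours of each data type, within the assumed tolerance.
        "quantity": all(
            abs(sum(h(s) for s in speakers) - 7.5) <= tol for h in per_type
        ),
        "dialects": min_share(lambda s: s.region, regions, total_hours, 0.15),
        "gender": all(
            min_share(lambda s: s.gender, ("female", "male"), h, 0.45)
            for h in per_type
        ),
        "age": min_share(
            lambda s: s.age_group, ("<25", "25-50", ">50"), total_hours, 0.20
        ),
        "accent": all(
            min_share(lambda s: s.has_accent, (True,), h, 0.10) for h in per_type
        ),
    }
```

`check_criteria` returns one boolean per restriction, so a selection that only "almost" satisfies the list shows directly which restriction it misses.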

AJDERS · 2024-02-28T12:11:53Z

I ended up handpicking a test set. This selection of speakers (almost) satisfies the above-mentioned criteria:

The shortcoming of this selection is the lack of Fynske dialects and Bornholmsk, but apart from this the dialect distribution is quite good. Once we've made recordings on Fyn and Bornholm, the test set should be finished. The dialect and dialect group distribution can be seen here:

Once the database structure on the NAS has been implemented and the anonymization scheme has been decided, we can create a script for uploading to Hugging Face, but we'll do that in a separate PR.

sorenmulli

Looks like reasonable criteria and very interesting graphs with nice distributions!

The accent category simply means all speakers with noticeable accents for whom Danish is not their first language?

And as I understand it, the test set division is made in blocks of speakers, that is, a speaker cannot both have some audio in test and train, right? We test generalization to new speakers? That makes sense to me.

sorenmulli · 2024-03-05T09:04:07Z

Additional thoughts after talking with Martin:

As I understand it, the graphs shown use the proportion of audio time as the x-axis. We should of course also remember to consider the distribution of unique speakers, e.g. the current ~18 minutes of audio from Fyn is from one speaker, right? :)
Similar criteria as you had for region would be nice to have for dialect in future versions.
15 hours (and more with Fyn+Bornholm additions) is a rather large test set for ASR - is it chosen b/c it is 10% of data?
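
The first point above, audio time versus unique speakers, can be checked by summarising both per group. This is a minimal sketch under assumptions: the `(speaker_id, group, minutes)` tuple layout and the function name are hypothetical and not taken from the repository.

```python
from collections import defaultdict


def summarise_groups(recordings):
    """Return {group: (total_minutes, unique_speaker_count)}.

    `recordings` is an iterable of (speaker_id, group, minutes) tuples,
    a hypothetical layout chosen for illustration.
    """
    minutes = defaultdict(float)
    speakers = defaultdict(set)
    for speaker_id, group, mins in recordings:
        minutes[group] += mins
        speakers[group].add(speaker_id)
    return {group: (minutes[group], len(speakers[group])) for group in minutes}
```

A group whose minutes all come from a single speaker then shows up with a speaker count of 1, even if its share of audio time looks reasonable in the graphs.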

saattrupdan · 2024-03-09T11:17:17Z

> 15 hours (and more with Fyn+Bornholm additions) is a rather large test set for ASR - is it chosen b/c it is 10% of data?

The large size is due to us wanting both a large diversity in the test set (different dialects, genders, ages and so on) and, at the same time, no speaker overlap between the train and test splits (each speaker has around 20-30 minutes of speech).
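
The no-overlap requirement can be expressed as a split keyed on speaker rather than on individual recordings. The following is a minimal sketch, assuming each sample is a dict with a `"speaker_id"` key; that schema and both function names are hypothetical, not the repository's.

```python
def split_by_speaker(samples, test_speaker_ids):
    """Split samples so that every speaker lands in exactly one split.

    `samples` is an iterable of dicts with a "speaker_id" key (a
    hypothetical schema); `test_speaker_ids` is the handpicked test set.
    """
    test_ids = set(test_speaker_ids)
    train, test = [], []
    for sample in samples:
        (test if sample["speaker_id"] in test_ids else train).append(sample)
    return train, test


def speaker_overlap(train, test):
    """Return the speaker IDs that occur in both splits (ideally empty)."""
    return {s["speaker_id"] for s in train} & {s["speaker_id"] for s in test}
```

With 20-30 minutes per speaker, a 15-hour (900-minute) test set corresponds to roughly 30 to 45 speakers, which is why covering every dialect, gender and age group without overlap needs a set of this size.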

saattrupdan · 2024-03-09T11:18:49Z

@AJDERS Can we get this merged in and get the test set up asap?

src/scripts/push_coral_to_hub.py

AJDERS · 2024-03-11T12:40:06Z

I have found a few inconsistencies in how the student workers have entered languages, which I am going to open an issue for and fix upstream.

saattrupdan

LGTM!

AJDERS added 11 commits January 8, 2024 13:28

feat: add first stab at selection of testset

6bfcf6c

feat: add selection methods

d42a14e

feat: change from small to correct size selection to large to small size selection

f658c91

feat: change from small to correct size selection to large to small size selection

059e4ca

feat: change from small to correct size selection to large to small size selection

0cbe930

feat: fix selection criteria order

20c115c

feat: deselect uniformly from each region when removing w / wo accent

8d0f794

feat: take as large set of speakers as possible with uniform dist of regions

79f00f1

feat: finalize accent selection

0de4bc4

feat: simplify select from accent

5fb63cc

feat: fix selection to avoid too small dataset

b28fbc6

AJDERS self-assigned this Jan 10, 2024

AJDERS marked this pull request as draft January 10, 2024 10:51

AJDERS added 6 commits January 10, 2024 12:33

feat: hand select a few speakers and filter to get correct amounts of conv/read

aebae0e

feat: need to select on dialect group and not just region

9c22e57

feat: use handpicked speakers instead of selecting them using programming

62655d0

feat: remove select_testset.py

1f9a29c

Merge remote-tracking branch 'origin/main' into feat/select-testset

e355e46

fix: black & update black

f27802d

AJDERS added 2 commits February 29, 2024 11:13

feat: poetry lock

634c81f

feat: fix datasets error

71b5782

AJDERS requested review from saattrupdan and sorenmulli March 1, 2024 12:15

saattrupdan requested changes Mar 1, 2024

config/config.yaml (outdated, resolved)

config/datasets/coral_test_set.yaml (outdated, resolved)

src/coral_models/protocols.py (resolved)

saattrupdan marked this pull request as ready for review March 1, 2024 15:05

sorenmulli approved these changes Mar 4, 2024

AJDERS added 2 commits March 4, 2024 13:00

feat: remove dataset config

0c24bd1

feat: add test speakers to build coral test dataset

3433269

AJDERS added 2 commits March 11, 2024 11:23

feat: rename and change to push all of the data

bb0c5b0

feat: add iteration data, fix dtypes and add versions to hub id

526e741

AJDERS commented Mar 11, 2024

src/scripts/push_coral_to_hub.py (outdated, resolved)

AJDERS requested a review from saattrupdan March 11, 2024 12:40

saattrupdan requested changes Mar 11, 2024

saattrupdan reviewed Mar 11, 2024

src/scripts/push_coral_to_hub.py (outdated, resolved)

saattrupdan reviewed Mar 11, 2024

src/scripts/push_coral_to_hub.py (outdated, resolved)

AJDERS and others added 9 commits March 11, 2024 14:00

Update src/scripts/push_coral_to_hub.py

d014f72

Co-authored-by: Dan Saattrup Nielsen <47701536+saattrupdan@users.noreply.github.com>

Update src/scripts/push_coral_to_hub.py

5be1040

Co-authored-by: Dan Saattrup Nielsen <47701536+saattrupdan@users.noreply.github.com>

Update src/scripts/push_coral_to_hub.py

526a589

Co-authored-by: Dan Saattrup Nielsen <47701536+saattrupdan@users.noreply.github.com>

Update src/scripts/push_coral_to_hub.py

7217eca

Co-authored-by: Dan Saattrup Nielsen <47701536+saattrupdan@users.noreply.github.com>

Update src/scripts/push_coral_to_hub.py

b9734fb

Co-authored-by: Dan Saattrup Nielsen <47701536+saattrupdan@users.noreply.github.com>

fix: change click argument to options and remove unused defaults

b6e1771

feat: check if datset is on hub

a071c6e

fix: remove unused config

95fa9de

feat: append both speakers metadata

bf74872

AJDERS requested a review from saattrupdan March 11, 2024 14:57

feat: make sure that user dont upload a lower version than what exists

24db3c1

saattrupdan requested changes Mar 12, 2024

src/scripts/push_coral_to_hub.py (outdated, resolved)

src/scripts/push_coral_to_hub.py (outdated, resolved)

src/scripts/push_coral_to_hub.py (resolved)

feat: better checking of existing dataset on hub

01eb3ac

AJDERS requested a review from saattrupdan March 13, 2024 11:23

saattrupdan approved these changes Mar 13, 2024

AJDERS merged commit efe634a into main Mar 14, 2024
5 checks passed

saattrupdan deleted the feat/select-testset branch July 1, 2024 13:58
