Feat/select testset #59

Merged
merged 34 commits into from
Mar 14, 2024

AJDERS
Contributor

@AJDERS AJDERS commented Jan 10, 2024

Select a test set according to the following restrictions:

  1. Quantity: The total amount of test data will be 15 hours, divided as follows: 7.5 hours of read-aloud and 7.5 hours of conversational data.

  2. Dialects: The speakers must be distributed so that each region accounts for at least 15%.

  3. Gender: Each gender must represent at least 45% of the total hours for both types of data.

  4. Age: 20% from each age group, i.e. <25, 25-50 and >50.

  5. Accent: At least 10% of the total hours for both types of data must be represented by individuals with accents.
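
The share-based restrictions above can be checked mechanically for a candidate selection. The following is a minimal sketch, not the code in this PR. It assumes each speaker is a record with the hypothetical fields `hours`, `region`, `gender`, `age_group` and `has_accent`, that genders are labelled `"female"` and `"male"`, that shares are measured in hours, and that the 20% age figure is a lower bound. The list of regions is supplied by the caller. It should be run once per data type (read aloud and conversational).

```python
from collections import defaultdict


def share_by(speakers, key):
    """Return each group's share of the total hours, grouped by `key`."""
    totals = defaultdict(float)
    for speaker in speakers:
        totals[speaker[key]] += speaker["hours"]
    total_hours = sum(totals.values())
    if total_hours == 0:
        return {}
    return {group: hours / total_hours for group, hours in totals.items()}


def check_selection(speakers, regions, min_region=0.15, min_gender=0.45,
                    min_age=0.20, min_accent=0.10):
    """Return the names of the violated criteria for one data type.

    Field names, group labels and `regions` are assumptions made for this
    sketch; the default thresholds are taken from the list above.
    """
    violations = []

    region_share = share_by(speakers, "region")
    if any(region_share.get(r, 0.0) < min_region for r in regions):
        violations.append("dialects")

    gender_share = share_by(speakers, "gender")
    if any(gender_share.get(g, 0.0) < min_gender for g in ("female", "male")):
        violations.append("gender")

    age_share = share_by(speakers, "age_group")
    if any(age_share.get(a, 0.0) < min_age for a in ("<25", "25-50", ">50")):
        violations.append("age")

    accent_share = share_by(speakers, "has_accent")
    if accent_share.get(True, 0.0) < min_accent:
        violations.append("accent")

    return violations
```

An empty list means the candidate selection satisfies all four share-based criteria; the quantity criterion (7.5 hours per data type) would need a separate check on the summed hours.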

@AJDERS AJDERS self-assigned this Jan 10, 2024
@AJDERS AJDERS marked this pull request as draft January 10, 2024 10:51
@AJDERS
Contributor Author

AJDERS commented Feb 28, 2024

I ended up handpicking a test set. This selection of speakers (almost) satisfies the above-mentioned criteria:
[Image: distri]

The shortcoming of this selection is the lack of fynske dialects and bornholmsk, but apart from this the dialect distribution is quite good. Once we've made recordings on Fyn and Bornholm the test set should be finished. The dialect and dialect group distribution can be seen here:

[Image: dialect_distri]

Once the database structure on the NAS has been implemented and the anonymization scheme has been decided, we can create a script for uploading to Hugging Face, but we'll do that in a separate PR.

@AJDERS AJDERS requested review from saattrupdan and sorenmulli March 1, 2024 12:15
@saattrupdan saattrupdan marked this pull request as ready for review March 1, 2024 15:05
Collaborator

@sorenmulli sorenmulli left a comment


Looks like reasonable criteria and very interesting graphs with nice distributions!

The accent category simply means all speakers for whom Danish is not their first language and who have noticeable accents?

And as I understand it, the test set division is made in blocks of speakers, that is, a speaker cannot have audio in both test and train, right? So we test generalization to new speakers? That makes sense to me.
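
A split in blocks of speakers, as described in the question above, can be sketched as follows. This is an illustration under assumptions, not the code in this PR: the `speaker_id` field and both helper names are hypothetical.

```python
def split_by_speaker(samples, test_speakers, speaker_key="speaker_id"):
    """Put every sample of a test speaker in test, and the rest in train."""
    test_speakers = set(test_speakers)
    train = [s for s in samples if s[speaker_key] not in test_speakers]
    test = [s for s in samples if s[speaker_key] in test_speakers]
    return train, test


def speaker_overlap(train, test, speaker_key="speaker_id"):
    """Return the speaker IDs that occur in both splits (empty if disjoint)."""
    train_ids = {s[speaker_key] for s in train}
    test_ids = {s[speaker_key] for s in test}
    return train_ids & test_ids
```

Because whole speakers are assigned to one side, `speaker_overlap` returns an empty set for any split produced by `split_by_speaker`, so test scores reflect generalization to unseen speakers.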

@sorenmulli
Collaborator

Additional thoughts after talking with Martin:

  • The shown graphs use audio time proportion as X axis as I understand it. We should of course also remember to consider distribution of unique speakers, e.g. the current ~18 minutes of audio from Fyn is from one speaker, right? :)
  • Similar criteria as you had for region would be nice to have for dialect in future versions.
  • 15 hours (and more with Fyn+Bornholm additions) is a rather large test set for ASR - is it chosen b/c it is 10% of data?
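
To look at unique speakers alongside audio time, as suggested in the first point above, both numbers can be reported per group. This is a minimal sketch with hypothetical field names (`speaker_id`, `hours`, `region`), not code from the repository.

```python
from collections import defaultdict


def summarise_by_group(speakers, key="region"):
    """Return {group: (share of total hours, number of unique speakers)}."""
    hours = defaultdict(float)
    ids = defaultdict(set)
    for speaker in speakers:
        hours[speaker[key]] += speaker["hours"]
        ids[speaker[key]].add(speaker["speaker_id"])
    total = sum(hours.values())
    return {
        group: (hours[group] / total if total else 0.0, len(ids[group]))
        for group in hours
    }
```

A group with a reasonable share of hours but only one unique speaker then stands out immediately in the second value of each pair.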

@saattrupdan
Collaborator

> 15 hours (and more with Fyn+Bornholm additions) is a rather large test set for ASR - is it chosen b/c it is 10% of data?

The large size is due to us wanting both a large diversity in the test set (different dialects, genders, ages and so on) and no speaker overlap between the train and test splits (each speaker has around 20-30 minutes of speech).

@saattrupdan
Collaborator

@AJDERS Can we get this merged in and get the test set up asap?

@AJDERS
Contributor Author

AJDERS commented Mar 11, 2024

I have found a few inconsistencies in how the student workers have entered languages, which I am going to open an issue about and fix upstream.

@AJDERS AJDERS requested a review from saattrupdan March 11, 2024 12:40
AJDERS and others added 9 commits March 11, 2024 14:00
Co-authored-by: Dan Saattrup Nielsen <47701536+saattrupdan@users.noreply.github.com>
@AJDERS AJDERS requested a review from saattrupdan March 11, 2024 14:57
@AJDERS AJDERS requested a review from saattrupdan March 13, 2024 11:23
Collaborator

@saattrupdan saattrupdan left a comment


LGTM!

@AJDERS AJDERS merged commit efe634a into main Mar 14, 2024
5 checks passed
@saattrupdan saattrupdan deleted the feat/select-testset branch July 1, 2024 13:58

3 participants