Feat/select testset #59
Looks like reasonable criteria and very interesting graphs with nice distributions!
Does the accent category simply mean all speakers for whom Danish is not their first language and who have a noticeable accent?
And as I understand it, the test set is split by speaker, that is, a speaker cannot have audio in both test and train, right? So we test generalization to new speakers? That makes sense to me.
Additional thoughts after talking with Martin:
The test set is large because we want both a high diversity in it (different dialects, genders, ages and so on) and no speaker overlap between the train and test splits (each speaker has around 20-30 minutes of speech).
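The no-overlap requirement can be sketched as below. This is only an illustration, not the code in this PR: the segment layout (dicts with a `speaker_id` key) and the function names `split_by_speaker` and `speaker_overlap` are assumptions made for the example.

```python
def split_by_speaker(segments, test_speakers):
    """Split segments so that every speaker lands entirely in one split.

    `segments` is assumed to be a list of dicts with a "speaker_id" key;
    `test_speakers` is the set of speaker ids chosen for the test set.
    """
    train, test = [], []
    for segment in segments:
        # A speaker's segments all follow the speaker, never the segment.
        if segment["speaker_id"] in test_speakers:
            test.append(segment)
        else:
            train.append(segment)
    return train, test


def speaker_overlap(train, test):
    """Return the set of speaker ids that occur in both splits."""
    train_ids = {segment["speaker_id"] for segment in train}
    test_ids = {segment["speaker_id"] for segment in test}
    return train_ids & test_ids
```

An empty result from `speaker_overlap` means the test set only measures generalization to unseen speakers.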
@AJDERS Can we get this merged in and get the test set up asap?
I have found a few inconsistencies in how the student workers have entered languages. I am going to open an issue about this and fix it upstream.
Co-authored-by: Dan Saattrup Nielsen <47701536+saattrupdan@users.noreply.github.com>
LGTM!
Select a test set according to the following restrictions:
Quantity: The test set will contain 15 hours of data in total: 7.5 hours of read-aloud data and 7.5 hours of conversational data.
Dialects: The speakers must be distributed so that each region makes up at least 15% of the test set.
Gender: Each gender must represent at least 45% of the total hours for both types of data.
Age: Each age group (<25, 25-50 and >50) must make up at least 20% of the test set.
Accent: At least 10% of the total hours for both types of data must come from speakers with accents.
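
The restrictions above could be verified for a candidate test set roughly as follows. This is a minimal sketch and not the selection code in this PR. It assumes each segment is a dict with the hypothetical keys `hours`, `data_type`, `region`, `gender`, `age_group` and `has_accent`, that all percentages are measured in hours, and that the region and age constraints apply to the test set as a whole. The gender labels and the `tolerance_hours` parameter are also assumptions.

```python
from collections import defaultdict


def share_by(segments, key):
    """Return each value's share of the total hours for the given key."""
    hours = defaultdict(float)
    for segment in segments:
        hours[segment[key]] += segment["hours"]
    total = sum(hours.values())
    return {value: h / total for value, h in hours.items()} if total else {}


def check_test_set(segments, regions, tolerance_hours=0.25):
    """Return a list of violated criteria (empty if all criteria hold)."""
    violations = []

    # Quantity, gender and accent are checked per data type.
    for data_type in ("read_aloud", "conversation"):
        subset = [s for s in segments if s["data_type"] == data_type]
        total = sum(s["hours"] for s in subset)
        if abs(total - 7.5) > tolerance_hours:
            violations.append(f"quantity/{data_type}: {total:.2f} h")
        gender_share = share_by(subset, "gender")
        for gender in ("female", "male"):  # assumed label set
            if gender_share.get(gender, 0.0) < 0.45:
                violations.append(f"gender/{data_type}: {gender}")
        if share_by(subset, "has_accent").get(True, 0.0) < 0.10:
            violations.append(f"accent/{data_type}")

    # Region and age are checked over the whole test set (an assumption).
    region_share = share_by(segments, "region")
    for region in regions:
        if region_share.get(region, 0.0) < 0.15:
            violations.append(f"dialect: {region}")
    age_share = share_by(segments, "age_group")
    for group in ("<25", "25-50", ">50"):
        if age_share.get(group, 0.0) < 0.20:
            violations.append(f"age: {group}")

    return violations
```

Returning a list of violations rather than a single boolean makes it easier to see which criterion a candidate selection misses.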