Merging module speed strongly depends on file tree structure #16

Open
MartinKl opened this issue May 14, 2021 · 0 comments

When merging data, the merger matches corresponding files by their file path. A large number of files in the same folder (or under the same parent node) seems to slow down processing significantly; I am not even certain the process would ever terminate.
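To make "matching by file path" concrete, here is a minimal sketch of what such a pairing could look like. It is an illustration only, not Pepper's actual merger code: the function names `pairing_key` and `pair_by_path` and the file name `doc1` are hypothetical, and the assumption is that two files correspond when their paths are equal once the top-level format folder and the file extension are dropped.

```python
from pathlib import PurePosixPath


def pairing_key(path: str) -> str:
    """Return the path without its top-level format folder and extension.

    Assumption: the first path component names the format (e.g. FORMAT_A)
    and everything after it identifies the document.
    """
    parts = PurePosixPath(path).parts
    return str(PurePosixPath(*parts[1:]).with_suffix(""))


def pair_by_path(files_a, files_b):
    """Pair each file in files_a with the file in files_b sharing its key."""
    index_b = {pairing_key(f): f for f in files_b}
    pairs = []
    for f in files_a:
        key = pairing_key(f)
        if key in index_b:
            pairs.append((f, index_b[key]))
    return pairs


# Hypothetical file names, following the folder layout used in this issue.
files_a = ["FORMAT_A/CORPUS/MONOLINGUAL/doc1.format_a"]
files_b = ["FORMAT_B/CORPUS/MONOLINGUAL/doc1.format_b"]
pairs = pair_by_path(files_a, files_b)
```

With a dictionary lookup like this, pairing cost grows linearly with the number of files regardless of how they are spread over folders, so the folder-dependent slowdown described here presumably comes from somewhere else in the real implementation.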

My own example, simplified: I have data from speakers of two age groups (adolescents, adults) and two speaker types (monolingual vs. bilingual), and for each I have a file in two formats. Consider the following arrangement (I):

FORMAT_A/CORPUS/MONOLINGUAL/*.format_a (48 files)
FORMAT_A/CORPUS/BILINGUAL/*.format_a (128 files)
FORMAT_B/CORPUS/MONOLINGUAL/*.format_b (48 files)
FORMAT_B/CORPUS/BILINGUAL/*.format_b (128 files)

Trying to merge the imports of FORMAT_A_Importer and FORMAT_B_Importer does not terminate, or is at least extremely slow.

Another view on the data could be (II):

FORMAT_A/CORPUS/ADULTS/MONOLINGUAL/*.format_a (24 files)
FORMAT_A/CORPUS/ADULTS/BILINGUAL/*.format_a (64 files)
FORMAT_A/CORPUS/ADOLESCENTS/MONOLINGUAL/*.format_a (24 files)
FORMAT_A/CORPUS/ADOLESCENTS/BILINGUAL/*.format_a (64 files)
FORMAT_B/CORPUS/ADULTS/MONOLINGUAL/*.format_b (24 files)
FORMAT_B/CORPUS/ADULTS/BILINGUAL/*.format_b (64 files)
FORMAT_B/CORPUS/ADOLESCENTS/MONOLINGUAL/*.format_b (24 files)
FORMAT_B/CORPUS/ADOLESCENTS/BILINGUAL/*.format_b (64 files)

Arranging the data like this leads to successful merging. I am not sure what causes the difference, but I assume that pairing documents works more efficiently this way or does not lock up. Just a guess.
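As a possible workaround, the paths of arrangement (I) can be mapped onto arrangement (II) by inserting an age-group folder. The sketch below only computes the new paths and does not move any files. The names `regrouped_path` and `plan_moves`, the file names, and the `age_groups` lookup table are all hypothetical; how the age group of a given file is determined depends on the corpus and is assumed to be known.

```python
from pathlib import PurePosixPath


def regrouped_path(path: str, age_group: str) -> str:
    """Insert an age-group folder between the corpus and speaker-type folders.

    Assumption: the input follows arrangement (I), i.e. it has exactly
    four components: FORMAT/CORPUS/SPEAKER_TYPE/file.
    """
    fmt, corpus, speaker_type, name = PurePosixPath(path).parts
    return str(PurePosixPath(fmt, corpus, age_group, speaker_type, name))


def plan_moves(paths, age_groups):
    """Map every old path to its new location.

    age_groups is a hypothetical lookup from file stem to age-group folder.
    """
    return {p: regrouped_path(p, age_groups[PurePosixPath(p).stem]) for p in paths}


# Hypothetical file names and age-group assignments.
age_groups = {"doc1": "ADULTS", "doc2": "ADOLESCENTS"}
moves = plan_moves(
    [
        "FORMAT_A/CORPUS/BILINGUAL/doc1.format_a",
        "FORMAT_B/CORPUS/BILINGUAL/doc2.format_b",
    ],
    age_groups,
)
```

The resulting mapping can be reviewed first and then applied with any file-moving tool; the same age-group assignment has to be used for both formats so that corresponding files still end up under matching paths.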

During the non-terminating scenario (I), all processor cores run under full load until Pepper is stopped by a keyboard interrupt. Progress updates are printed, but as far as I can tell no progress is made (I am not entirely sure about that).
