Faster transform-gisaid #90

Merged
merged 2 commits into from
Aug 31, 2020

@kairstenfay kairstenfay commented Aug 28, 2020

Processing in transform-gisaid is represented as a pipeline of steps. Each step either manipulates the data (Transform) or filters the data (Filter). At the start of each pipeline is a DataSource.
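As a rough illustration of the pipeline shape described above, here is a minimal sketch. Only the names `DataSource`, `Transform`, and `Filter` come from this PR; the method names (`transform`, `keep`), the `run_pipeline` helper, and the example steps are hypothetical and are not taken from the actual script.

```python
from abc import ABC, abstractmethod
from typing import Any, Dict, Iterator, List, Union

Record = Dict[str, Any]


class DataSource(ABC):
    """Start of a pipeline: yields records one at a time."""

    @abstractmethod
    def __iter__(self) -> Iterator[Record]: ...


class Transform(ABC):
    """A step that returns a (possibly modified) record."""

    @abstractmethod
    def transform(self, record: Record) -> Record: ...


class Filter(ABC):
    """A step that decides whether a record stays in the stream."""

    @abstractmethod
    def keep(self, record: Record) -> bool: ...


# Hypothetical concrete steps, for illustration only.
class ListSource(DataSource):
    def __init__(self, records: List[Record]) -> None:
        self.records = records

    def __iter__(self) -> Iterator[Record]:
        return iter(self.records)


class StripStrain(Transform):
    def transform(self, record: Record) -> Record:
        return {**record, "strain": record["strain"].strip()}


class DropShort(Filter):
    def __init__(self, min_length: int) -> None:
        self.min_length = min_length

    def keep(self, record: Record) -> bool:
        return record["length"] >= self.min_length


def run_pipeline(
    source: DataSource, steps: List[Union[Transform, Filter]]
) -> Iterator[Record]:
    """Stream each record through the steps in order, one record at a time."""
    for record in source:
        for step in steps:
            if isinstance(step, Filter):
                if not step.keep(record):
                    break  # record rejected; skip the remaining steps
            else:
                record = step.transform(record)
        else:
            # Reached only when no filter rejected the record.
            yield record


records = [
    {"strain": " A ", "length": 30000},
    {"strain": "B", "length": 100},
]
result = list(run_pipeline(ListSource(records), [StripStrain(), DropShort(1000)]))
```

Because `run_pipeline` is a generator, records flow through one at a time rather than being loaded in full, which is the general idea behind a streaming transform.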

To test this, I processed the same gisaid dataset with both the old script and the new script. The key differences in the output are:

  1. The new script does not sort the sequences before outputting by default. It does, however, perform the same deduplication. To sort the sequence data, use the `--sorted-fasta` option.
  2. The new script handles errors in `source-data/gisaid_annotations.tsv` differently from the old script: it ignores every row that does not have exactly four columns. As a result, some annotations are not processed.
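The four-column rule in item 2 could look roughly like the sketch below. The function name `read_annotations`, the warning text, and the sample rows are all hypothetical; the PR says only that rows without four columns are ignored, so printing a warning for each skipped row is an assumption made here for illustration.

```python
import io
import sys
from typing import Iterable, Iterator, List, Optional, TextIO

EXPECTED_COLUMNS = 4


def read_annotations(
    lines: Iterable[str], warn: Optional[TextIO] = None
) -> Iterator[List[str]]:
    """Yield tab-separated annotation rows, skipping malformed ones."""
    out = warn if warn is not None else sys.stdout
    for line_number, line in enumerate(lines, start=1):
        fields = line.rstrip("\r\n").split("\t")
        if len(fields) != EXPECTED_COLUMNS:
            # Hypothetical warning format; the real script's message may differ.
            print(
                f"WARNING: skipping annotation line {line_number}: "
                f"expected {EXPECTED_COLUMNS} columns, got {len(fields)}",
                file=out,
            )
            continue
        yield fields


# Made-up sample data: the second row has only two columns.
sample = io.StringIO(
    "s1\tEPI_1\tcountry\tUSA\n"
    "broken\trow\n"
    "s2\tEPI_2\tdivision\tWA\n"
)
warnings = io.StringIO()
rows = list(read_annotations(sample, warn=warnings))
```

Skipping silently would match the PR description just as well; emitting a message simply makes the dropped annotations visible to whoever maintains the file.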

Additionally, the `--output-unix-newline` option forces all output (FASTA files and metadata CSV files) to use Unix (LF) newlines.

Fixes #77


originally opened as #88 by @ttung

@kairstenfay

Rebased onto master.

@kairstenfay kairstenfay self-assigned this Aug 28, 2020
The latest `transform-gisaid` script prints a warning to stdout when an
annotation could not be applied to the metadata. Send the output of
`transform-gisaid` to Slack to alert nCoV build maintainers of malformed
annotation entries.

./bin/flag-metadata data/gisaid/metadata.tsv > data/gisaid/flagged_metadata.txt
./bin/check-locations data/gisaid/metadata.tsv \
    data/gisaid/location_hierarchy.tsv \
    gisaid_epi_isl

if [[ "$branch" == master ]]; then
    ./bin/notify-slack --upload "flagged-annotations" < "$flagged_annotations"

The printed warnings in Tony's transform-gisaid are now sent to Slack, but only when on the master branch. I just wanted to check that this is the desirable behavior. @eharkins


This makes sense to me!

@kairstenfay kairstenfay merged commit bd2dda0 into master Aug 31, 2020
@kairstenfay kairstenfay deleted the pr/88 branch August 31, 2020 16:15
kairstenfay added a commit that referenced this pull request Aug 31, 2020
PR #90 recently introduced a major refactoring and optimization of
`transform-gisaid` that has resulted in producing Windows (CRLF) line
endings in the output TSVs. As a temporary fix to keep the ingest
operational, always invoke the `--output-unix-newline` option in
`ingest-gisaid` when invoking `transform-gisaid`.
kairstenfay added a commit that referenced this pull request Aug 31, 2020
PR #90 recently introduced a major refactoring and optimization of
`transform-gisaid` that has resulted in producing Windows (CRLF) line
endings in the metadata and additional info output TSVs. This was
somehow not detected in local testing where output files were shown to
be identical between the original and new transform scripts. As a
temporary fix to keep the ingest operational, always use the
`--output-unix-newline` option when invoking `transform-gisaid` in
`ingest-gisaid`.
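One plausible source of CRLF output, offered only as an assumption since the commit messages do not name the cause: Python's standard `csv` writer terminates rows with `\r\n` by default, regardless of platform. The sketch below shows that default and how passing `lineterminator="\n"` changes it. The helper `write_tsv` and the sample rows are hypothetical, and whether `transform-gisaid` writes its TSVs through the `csv` module is not confirmed here.

```python
import csv
import io
from typing import List


def write_tsv(rows: List[List[str]], unix_newline: bool = False) -> str:
    """Serialize rows as TSV and return the raw text, line endings included."""
    buffer = io.StringIO(newline="")  # no newline translation on write
    if unix_newline:
        writer = csv.writer(buffer, delimiter="\t", lineterminator="\n")
    else:
        # The csv module's default dialect ends each row with "\r\n".
        writer = csv.writer(buffer, delimiter="\t")
    writer.writerows(rows)
    return buffer.getvalue()


rows = [["strain", "country"], ["s1", "USA"]]
default_output = write_tsv(rows)
unix_output = write_tsv(rows, unix_newline=True)
```

Comparing files byte for byte (for example with `cmp`) rather than with a line-oriented diff that ignores trailing whitespace is one way to catch this kind of difference in local testing.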
@ttung ttung mentioned this pull request Sep 17, 2020
Development

Successfully merging this pull request may close these issues.

Rewrite transform-gisaid to perform a streaming transform
2 participants