Faster transform-gisaid #90
Processing in `transform-gisaid` is represented as a pipeline of steps. Each step either manipulates the data (`Transform`) or filters the data (`Filter`). At the start of each pipeline is a `DataSource`.

To test this, I processed the same gisaid dataset with both the old script and the new script. The key differences in the output are:

1. The new script does _not_, by default, sort the sequences before outputting. It does perform the same deduplication process however. To sort the sequence data, use the option `--sorted-fasta`.
2. The new script interprets errors in `source-data/gisaid_annotations.tsv` differently than the old script. It ignores all rows that do not have four columns. As a result, some annotations are not processed.

Additionally, there is the option `--output-unix-newline`, which forces all output (fasta files and metadata csv files) to use unix newlines.

Fixes #77
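For readers unfamiliar with the pipeline layout described in this PR, here is a minimal sketch of how a `DataSource` followed by `Transform` and `Filter` steps could fit together. Only the three class names come from the description; the method names, the `run_pipeline` helper, and the record fields are assumptions made for illustration and may not match the actual script.

```python
from typing import Callable, Dict, Iterable, Iterator, List

Record = Dict[str, object]


class DataSource:
    """Hypothetical source: yields records at the start of a pipeline."""

    def __init__(self, records: Iterable[Record]) -> None:
        self.records = list(records)

    def __iter__(self) -> Iterator[Record]:
        return iter(self.records)


class Transform:
    """Hypothetical step that rewrites every record it sees."""

    def __init__(self, fn: Callable[[Record], Record]) -> None:
        self.fn = fn

    def apply(self, records: Iterable[Record]) -> Iterator[Record]:
        for record in records:
            yield self.fn(record)


class Filter:
    """Hypothetical step that keeps only records matching a predicate."""

    def __init__(self, predicate: Callable[[Record], bool]) -> None:
        self.predicate = predicate

    def apply(self, records: Iterable[Record]) -> Iterator[Record]:
        for record in records:
            if self.predicate(record):
                yield record


def run_pipeline(source: DataSource, steps: List[object]) -> List[Record]:
    """Chain the steps lazily, so each record streams through one at a time."""
    stream: Iterable[Record] = iter(source)
    for step in steps:
        stream = step.apply(stream)
    return list(stream)


# Illustrative data only; the field names and length cutoff are made up.
records = [
    {"strain": " hCoV-19/USA/1/2020 ", "length": 29903},
    {"strain": "hCoV-19/USA/2/2020", "length": 100},
]
steps = [
    Transform(lambda r: {**r, "strain": str(r["strain"]).strip()}),
    Filter(lambda r: int(r["length"]) >= 15000),
]
result = run_pipeline(DataSource(records), steps)
```

Chaining generators this way means no step needs the whole dataset in memory, which is one plausible source of a speedup, although the PR does not say that this is how the real implementation works.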
Rebased onto master.
The latest `transform-gisaid` script prints a warning to stdout when an annotation could not be applied to the metadata. Send the output of `transform-gisaid` to Slack to alert nCoV build maintainers of malformed annotation entries.
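As a rough illustration of the kind of warnings that could end up in Slack, the sketch below applies annotation rows to metadata and collects a message for each row it cannot use. The four-column rule comes from the PR description. The column order, the lookup by record ID, the warning wording, and the function name `apply_annotations` are all assumptions, not the script's actual behavior.

```python
import csv
import io
from typing import Dict, List


def apply_annotations(metadata: Dict[str, Dict[str, str]], annotations_tsv: str) -> List[str]:
    """Apply annotation rows to metadata in place and return warning messages.

    Assumed row layout (hypothetical): strain, record ID, field, value.
    """
    warnings: List[str] = []
    reader = csv.reader(io.StringIO(annotations_tsv), delimiter="\t")
    for line_number, row in enumerate(reader, start=1):
        # Rows without exactly four columns are skipped, as the PR describes.
        if len(row) != 4:
            warnings.append(f"line {line_number}: expected 4 columns, got {len(row)}")
            continue
        _strain, record_id, field, value = row
        record = metadata.get(record_id)
        if record is None:
            # Assumption: an annotation "could not be applied" when no record matches.
            warnings.append(f"line {line_number}: no metadata record for {record_id}")
            continue
        record[field] = value
    return warnings


# Illustrative data only.
metadata = {"EPI_1": {"division": "?"}}
annotations = (
    "A\tEPI_1\tdivision\tWashington\n"
    "B\tEPI_2\tdivision\n"
    "C\tEPI_3\tlocation\tSeattle\n"
)
warnings = apply_annotations(metadata, annotations)
for message in warnings:
    print(message)
```

Returning the warnings as a list, rather than only printing them, would make it straightforward to write them to a file for upload.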
./bin/flag-metadata data/gisaid/metadata.tsv > data/gisaid/flagged_metadata.txt
./bin/check-locations data/gisaid/metadata.tsv \
    data/gisaid/location_hierarchy.tsv \
    gisaid_epi_isl

if [[ "$branch" == master ]]; then
    ./bin/notify-slack --upload "flagged-annotations" < "$flagged_annotations"
The warnings printed by Tony's `transform-gisaid` are now sent to Slack, but only on the master branch. I just wanted to check that this is the desired behavior. @eharkins
This makes sense to me!
PR #90 recently introduced a major refactoring and optimization of `transform-gisaid` that causes the output TSVs to be written with Windows (CRLF) line endings. As a temporary fix to keep the ingest operational, always pass the `--output-unix-newline` option when invoking `transform-gisaid` in `ingest-gisaid`.
PR #90 recently introduced a major refactoring and optimization of `transform-gisaid` that causes the metadata and additional info output TSVs to be written with Windows (CRLF) line endings. This was somehow not detected in local testing, where the output files were shown to be identical between the original and new transform scripts. As a temporary fix to keep the ingest operational, always use the `--output-unix-newline` option when invoking `transform-gisaid` in `ingest-gisaid`.
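One common way a Python script ends up emitting CRLF is the standard `csv` module, whose writers default to `lineterminator="\r\n"`. Whether that is the cause in `transform-gisaid` is an assumption, since the commit message does not say. The sketch below only shows the general effect and how a flag like `--output-unix-newline` could plausibly map onto it. The helper `write_tsv` and its `unix_newline` parameter are hypothetical.

```python
import csv
import io
from typing import Iterable, List


def write_tsv(rows: Iterable[List[str]], unix_newline: bool = False) -> str:
    """Serialize rows as TSV text, optionally forcing LF line endings."""
    # newline="" stops the text layer from translating line endings,
    # so the result shows exactly what the csv writer emits.
    buffer = io.StringIO(newline="")
    kwargs = {"lineterminator": "\n"} if unix_newline else {}
    writer = csv.writer(buffer, delimiter="\t", **kwargs)
    writer.writerows(rows)
    return buffer.getvalue()


# Illustrative rows only.
rows = [["strain", "date"], ["A", "2020-01-01"]]
default_output = write_tsv(rows)
unix_output = write_tsv(rows, unix_newline=True)

print(repr(default_output))
print(repr(unix_output))
```

Comparing files with a tool that normalizes line endings can hide this difference, so a byte-level comparison (for example `cmp`) is the safer check when validating output against an older script.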
originally opened as #88 by @ttung