feat(airflow): prototype finemapping batch job #581
Conversation
Force-pushed from 059aea9 to 8f92740.
I've rebased this PR against dev and resolved some minor merge conflicts to prevent it from getting too far out of sync.
I think this is a great idea overall; however, I find this PR a bit difficult to handle as it is because it introduces three changes simultaneously:
- Introducing Google Batch with Docker;
- Making the pipeline incremental;
- Modifying the internal logic of SuSie finemapping.
And furthermore (this could be considered point #4), to make this pipeline work, we would first need to make additional changes to how clumped outputs are produced.
So again, while I agree in principle with all of the changes proposed, at this point I will not be progressing this PR in its current form. Rather, as the first atomic change, I will use the prototype code provided to implement the existing finemapping approach (without any changes in logic) using Docker in the Airflow DAG.
@tskir feel free to close the PR when it's not useful anymore. The intention from @ireneisdoomed and me was to create a proof of concept, not to merge these changes.
The finemapping DAG has been merged in opentargets/orchestration#10, so I'm now closing this pull request. Many ideas from it have been very helpful and have been used in my final PR — thank you @ireneisdoomed @d0choa!
@ireneisdoomed I didn't delete the branch in case there's something useful you planned to use there — please delete it if not!
New Airflow DAG that calls the susie finemapper step on a list of studyLocus IDs to generate credible sets.
✨ Context
This is an incremental pipeline, so the first three tasks generate the list of studyLocus IDs to fine-map. Once the difference is computed, we fine-map those IDs as a job submitted to Google Batch.
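A minimal sketch of the set-difference logic these first tasks implement. The bucket paths, filenames, and the `extract_ids` helper are hypothetical illustrations, not the actual DAG code:

```python
import re

# Hypothetical object names, illustrating the naming conventions the DAG
# relies on: clumped study loci partitioned by studyLocusId, and credible
# sets whose filenames contain the studyLocusId.
clumped_paths = [
    "clumped/studyLocusId=SL_1/part-00000.parquet",
    "clumped/studyLocusId=SL_2/part-00000.parquet",
    "clumped/studyLocusId=SL_3/part-00000.parquet",
]
finemapped_paths = [
    "credible_sets/SL_1.parquet",
]


def extract_ids(paths: list[str], pattern: str) -> set[str]:
    """Extract study locus IDs from object names using a regex."""
    return {m.group(1) for p in paths if (m := re.search(pattern, p))}


all_ids = extract_ids(clumped_paths, r"studyLocusId=([^/]+)")       # get_all_study_locus_ids
done_ids = extract_ids(finemapped_paths, r"([^/]+)\.parquet$")      # get_finemapped_paths
to_finemap = sorted(all_ids - done_ids)                             # get_study_loci_to_finemap
print(to_finemap)  # ['SL_2', 'SL_3']
```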
Graph nodes:

- `get_all_study_locus_ids`: Lists the bucket with the clumped study locus IDs and extracts, from the filenames, all IDs that we could theoretically fine-map. A requirement for this step is that clumped study loci are written partitioned by their ID, which is not currently implemented (e.g. `gs://genetics-portal-dev-analysis/irene/toy_studdy_locus_alzheimer_partitioned`).
- `get_finemapped_paths`: Lists the bucket that contains all credible sets produced by fine-mapping and extracts the IDs from the filenames. Similarly, it is required that the SLID is part of the filename. This is already partially implemented: the data is not partitioned, but the ID is used to build the output path.
- `get_study_loci_to_finemap`: Creates a list of IDs from the difference between `get_all_study_locus_ids` and `get_finemapped_paths`.
- `finemapping_task`: The node that interfaces with the Google Batch operator to create one Batch job that runs the Docker container, with as many tasks in parallel as studyLocus IDs extracted in `get_study_loci_to_finemap`. The command run in the container image calls the fine-mapping step with the appropriate parameters.

🛠 What does this PR implement
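As context for the `finemapping_task` node described above: Google Batch exposes a 0-based `BATCH_TASK_INDEX` environment variable in each task of a task group, which a container entrypoint can use to pick its study locus. A hypothetical sketch (the ID list, module name, and CLI flag are illustrative, not the real step's interface):

```python
import os

# Hypothetical: IDs produced by get_study_loci_to_finemap, baked into the job.
study_locus_ids = ["SL_2", "SL_3"]

# Google Batch sets BATCH_TASK_INDEX (0-based) for each task in a task group;
# defaulting to 0 here so the sketch also runs outside Batch.
task_index = int(os.environ.get("BATCH_TASK_INDEX", "0"))
slid = study_locus_ids[task_index]

# Illustrative container command; the real step's CLI is not shown in this PR.
command = [
    "python",
    "-m",
    "finemapping_step",  # hypothetical module name
    f"--study-locus-id={slid}",  # hypothetical flag
]
print(" ".join(command))
```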
Modifies `SusieFineMapperStep` so that, instead of passing the row of the studyLocus to fine-map, it passes a dataframe containing that one row. This prevents incompatibilities between the input data and the schema (I had the problem that, after partitioning the data by SLID and then reading it back, this column was appended at the end).

🙈 Missing
- Writing the clumped study loci partitioned by their ID, as required by the `get_all_study_locus_ids` node. This involves changing `ld_based_clumping.py`.
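The missing change can be sketched as follows, assuming hive-style partitioning (the bucket path is hypothetical). In the pipeline itself this would be a Spark write, e.g. `clumped_df.write.partitionBy("studyLocusId").parquet(base_path)`; the point is that the resulting object names carry the ID, so `get_all_study_locus_ids` can recover it from a bucket listing alone:

```python
# Hypothetical hive-style layout for clumped study loci.
base_path = "gs://bucket/clumped"  # hypothetical bucket path
ids = ["SL_1", "SL_2"]
paths = [f"{base_path}/studyLocusId={slid}/part-00000.parquet" for slid in ids]

# The ID is recoverable from the object name alone, without reading the data.
recovered = [p.split("studyLocusId=")[1].split("/")[0] for p in paths]
print(recovered)  # ['SL_1', 'SL_2']
```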
🚦 Before submitting
dev
branch?make test
)?poetry run pre-commit run --all-files
)?