
feat(airflow): prototype finemapping batch job #581

Closed
wants to merge 5 commits

Conversation


@ireneisdoomed (Contributor) commented Apr 23, 2024

New Airflow DAG that calls the SuSiE fine-mapping step on a list of studyLocus IDs to generate credible sets.

✨ Context

This is an incremental pipeline: the first three tasks generate the list of studyLocus IDs that still need fine-mapping. Once that difference is computed, we fine-map those IDs as a job on Google Batch.

[Image: DAG graph]

Graph nodes:

  • get_all_study_locus_ids. Lists the bucket containing the clumped study loci and extracts, from the filenames, all IDs that could theoretically be fine-mapped. This step requires the clumped study loci to be written partitioned by their ID, which is currently not implemented (e.g. gs://genetics-portal-dev-analysis/irene/toy_studdy_locus_alzheimer_partitioned)
  • get_finemapped_paths. Lists the bucket that contains all credible sets produced by fine-mapping and extracts the IDs from the filenames. Similarly, it is required that the SLID is part of the filename. This is already partially implemented: the data is not partitioned, but the ID is used to build the output path.
  • get_study_loci_to_finemap. Creates a list of IDs based on the difference between get_all_study_locus_ids and get_finemapped_paths.
  • finemapping_task. Interfaces with the Google Batch operator to create one Batch job that runs the Docker container with as many parallel tasks as there are studyLocus IDs extracted in get_study_loci_to_finemap. The command run in the container image calls the fine-mapping step with the appropriate parameters.
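Taken together, the first three nodes reduce to listing two buckets, parsing IDs out of the paths, and taking a set difference; finemapping_task then fans the remainder out as one Batch task per ID. A minimal pure-Python sketch of that logic — the partition-key name `studyLocusId` and the container CLI shown here are assumptions for illustration, not the real interface:

```python
import re

ID_PATTERN = re.compile(r"studyLocusId=([^/]+)")


def extract_study_locus_ids(paths: list[str]) -> set[str]:
    """Extract studyLocus IDs from Hive-style partition paths
    (.../studyLocusId=<ID>/...), the layout the DAG assumes."""
    return {m.group(1) for p in paths if (m := ID_PATTERN.search(p))}


def study_loci_to_finemap(clumped: list[str], finemapped: list[str]) -> list[str]:
    """IDs present in the clumped output but not yet fine-mapped."""
    return sorted(extract_study_locus_ids(clumped) - extract_study_locus_ids(finemapped))


def build_batch_commands(
    study_locus_ids: list[str], input_path: str, output_path: str
) -> list[list[str]]:
    """One container command per studyLocus ID; Google Batch runs these
    as parallel tasks within a single job. The CLI is hypothetical."""
    return [
        [
            "python", "finemap.py",
            "--study-locus-id", slid,
            "--input", input_path,
            "--output", f"{output_path.rstrip('/')}/studyLocusId={slid}",
        ]
        for slid in study_locus_ids
    ]
```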

🛠 What does this PR implement

  • The DAG explained above.
  • A small change to the logic of SusieFineMapperStep: instead of passing the row of the studyLocus to fine-map, it passes a DataFrame containing that one row. This prevents incompatibilities between the input data and the schema (I had the problem that after partitioning the data by SLID and then reading it back, this column was appended at the end).
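A pure-Python illustration of the schema issue described above (a stand-in for the Spark behaviour, with made-up column names): reading back a dataset written with partitionBy appends the partition column at the end, so positional access on a Row breaks, whereas re-selecting columns by name against the expected schema — which is what passing a one-row DataFrame enables — restores the original order.

```python
def read_back_schema(schema: list[str], partition_col: str) -> list[str]:
    """Simulate Spark reading a dataset written with partitionBy:
    the partition column is appended after the data columns."""
    return [c for c in schema if c != partition_col] + [partition_col]


def align_to_schema(row: dict, expected: list[str]) -> dict:
    """Re-select fields by name so downstream positional logic
    sees the expected column order again."""
    return {col: row[col] for col in expected}
```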

🙈 Missing

  • Testing that the logic works after fixing the schema issues. This requires creating a new image.
  • Fine-tuning the resources allocated to the Batch job. This overlaps with the work done by @tskir
  • Sorting out the input data for this DAG:
    • First, we need to partition the data prior to the fine-mapping step, as required by the get_all_study_locus_ids node. This involves changing ld_based_clumping.py
    • Then, we have to be mindful that we want to fine-map studyLoci from two sources: UKBB PPP and GWAS Catalog. So either we run the DAG twice, or we put all clumped study loci under the same location.
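One way the partition-by-ID requirement could be satisfied (a hypothetical sketch — the partition-key name is an assumption, and in ld_based_clumping.py the real change would be along the lines of Spark's `df.write.partitionBy("studyLocusId")`):

```python
def partitioned_output_path(base: str, study_locus_id: str) -> str:
    """Hive-style partition directory for one study locus -- the layout
    the get_all_study_locus_ids node expects to parse IDs from."""
    return f"{base.rstrip('/')}/studyLocusId={study_locus_id}"
```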

🚦 Before submitting

  • Do these changes cover one single feature (one change at a time)?
  • Did you read the contributor guideline?
  • Did you make sure to update the documentation with your changes?
  • Did you make sure there is no commented out code in this PR?
  • Did you follow conventional commits standards in PR title and commit messages?
  • Did you make sure the branch is up-to-date with the dev branch?
  • Did you write any new necessary tests?
  • Did you make sure the changes pass local tests (make test)?
  • Did you make sure the changes pass pre-commit rules (e.g. poetry run pre-commit run --all-files)?

@tskir force-pushed the ildo-aiflow-finemap branch from 059aea9 to 8f92740 on June 25, 2024 at 11:25

@tskir (Contributor) left a comment


I've rebased this PR against dev and resolved some minor merge conflicts to prevent it from getting too far out of sync.

I think this is a great idea overall; however, I find this PR a bit difficult to handle as it is because it introduces three changes simultaneously:

  1. Introducing Google Batch with Docker;
  2. Making the pipeline incremental;
  3. Modifying the internal logic of SuSie finemapping.

And further (this could be considered point #4), to make this pipeline work we would first need additional changes to how clumped outputs are produced.

So again, while I agree in principle with all of the changes proposed, at this point I will not be progressing this PR in its current form. Rather, as the first atomic change, I will use the prototype code provided to implement the existing finemapping approach (without any changes in logic) using Docker in the Airflow DAG.

@d0choa (Collaborator) commented Jul 18, 2024

@tskir feel free to close the PR when it's no longer useful. The intention from @ireneisdoomed and me was to create a proof of concept, not to merge these changes.

@tskir (Contributor) commented Sep 20, 2024

The finemapping DAG has been merged in opentargets/orchestration#10, so I'm now closing this pull request.

Many ideas from it have been very helpful and have been used in my final PR — thank you @ireneisdoomed @d0choa!

@tskir tskir closed this Sep 20, 2024
@tskir (Contributor) commented Sep 20, 2024

@ireneisdoomed I didn't delete the branch in case there's something useful you planned to use there — please delete if not!

@tskir tskir deleted the ildo-aiflow-finemap branch September 20, 2024 15:00