
feat(airflow): prototype finemapping batch job #581

Closed
wants to merge 5 commits

Conversation


@ireneisdoomed (Contributor) commented Apr 23, 2024

New Airflow DAG that calls the SuSiE fine-mapping step on a list of studyLocus IDs to generate credible sets.

✨ Context

This is an incremental pipeline: the first three tasks generate the list of studyLocus IDs that still need fine-mapping. Once that difference is computed, we fine-map those IDs as a job on Google Batch.

[Image: DAG graph]

Graph nodes:

  • get_all_study_locus_ids. Lists the bucket containing the clumped study loci and extracts, from the filenames, all IDs that could theoretically be fine-mapped. This step requires the clumped study loci to be written partitioned by their ID, which is currently not implemented (e.g. gs://genetics-portal-dev-analysis/irene/toy_studdy_locus_alzheimer_partitioned)
  • get_finemapped_paths. Lists the bucket that contains all credible sets produced by fine-mapping and extracts the IDs from the filenames. Similarly, it is required that the SLID is part of the filename. This is already partially implemented: the data is not partitioned, but the ID is used to build the output path.
  • get_study_loci_to_finemap. Creates a list of IDs based on the difference between get_all_study_locus_ids and get_finemapped_paths.
  • finemapping_task. Interfaces with the Google Batch operator to create one Batch job that runs the Docker container with as many parallel tasks as there are studyLocus IDs extracted in get_study_loci_to_finemap. The command run in the container image calls the fine-mapping step with the appropriate parameters.
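Taken together, the first three nodes reduce to listing two buckets, parsing IDs out of the paths, and taking a set difference; finemapping_task then fans the remainder out as one Batch task per ID. A minimal pure-Python sketch of that logic — the partition-key name `studyLocusId` and the container CLI shown here are assumptions for illustration, not the real interface:

```python
import re

ID_PATTERN = re.compile(r"studyLocusId=([^/]+)")


def extract_study_locus_ids(paths: list[str]) -> set[str]:
    """Extract studyLocus IDs from Hive-style partition paths
    (.../studyLocusId=<ID>/...), the layout the DAG assumes."""
    return {m.group(1) for p in paths if (m := ID_PATTERN.search(p))}


def study_loci_to_finemap(clumped: list[str], finemapped: list[str]) -> list[str]:
    """IDs present in the clumped output but not yet fine-mapped."""
    return sorted(extract_study_locus_ids(clumped) - extract_study_locus_ids(finemapped))


def build_batch_commands(
    study_locus_ids: list[str], input_path: str, output_path: str
) -> list[list[str]]:
    """One container command per studyLocus ID; Google Batch runs these
    as parallel tasks within a single job. The CLI is hypothetical."""
    return [
        [
            "python", "finemap.py",
            "--study-locus-id", slid,
            "--input", input_path,
            "--output", f"{output_path.rstrip('/')}/studyLocusId={slid}",
        ]
        for slid in study_locus_ids
    ]
```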

🛠 What does this PR implement

  • The DAG explained above.
  • A small change to the logic of SusieFineMapperStep: instead of passing the row of the studyLocus to fine-map, it passes a DataFrame containing that one row. This prevents incompatibilities between the input data and the schema (I had the problem that after partitioning the data by SLID and then reading it back, this column was appended at the end).
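A pure-Python illustration of the schema issue described above (a stand-in for the Spark behaviour, with made-up column names): reading back a dataset written with partitionBy appends the partition column at the end, so positional access on a Row breaks, whereas re-selecting columns by name against the expected schema — which is what passing a one-row DataFrame enables — restores the original order.

```python
def read_back_schema(schema: list[str], partition_col: str) -> list[str]:
    """Simulate Spark reading a dataset written with partitionBy:
    the partition column is appended after the data columns."""
    return [c for c in schema if c != partition_col] + [partition_col]


def align_to_schema(row: dict, expected: list[str]) -> dict:
    """Re-select fields by name so downstream positional logic
    sees the expected column order again."""
    return {col: row[col] for col in expected}
```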

🙈 Missing

  • Testing that the logic works after fixing the schema issues. This requires creating a new image.
  • Fine-tuning the resources allocated to the Batch job. This overlaps with the work done by @tskir
  • Sorting out the input data for this DAG:
    • First, we need to partition the data prior to the fine-mapping step, as required by the get_all_study_locus_ids node. This involves changing ld_based_clumping.py
    • Then, we have to be mindful that we want to fine-map studyLoci from two sources: UKBB PPP and GWAS Catalog. So either we run the DAG twice, or we put all clumped study loci under the same location.
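One way the partition-by-ID requirement could be satisfied (a hypothetical sketch — the partition-key name is an assumption, and in ld_based_clumping.py the real change would be along the lines of Spark's `df.write.partitionBy("studyLocusId")`):

```python
def partitioned_output_path(base: str, study_locus_id: str) -> str:
    """Hive-style partition directory for one study locus -- the layout
    the get_all_study_locus_ids node expects to parse IDs from."""
    return f"{base.rstrip('/')}/studyLocusId={study_locus_id}"
```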

🚦 Before submitting

  • Do these changes cover one single feature (one change at a time)?
  • Did you read the contributor guideline?
  • Did you make sure to update the documentation with your changes?
  • Did you make sure there is no commented out code in this PR?
  • Did you follow conventional commits standards in PR title and commit messages?
  • Did you make sure the branch is up-to-date with the dev branch?
  • Did you write any new necessary tests?
  • Did you make sure the changes pass local tests (make test)?
  • Did you make sure the changes pass pre-commit rules (e.g. poetry run pre-commit run --all-files)?

@tskir force-pushed the ildo-aiflow-finemap branch from 059aea9 to 8f92740 on June 25, 2024 at 11:25

@tskir (Contributor) left a comment


I've rebased this PR against dev and resolved some minor merge conflicts to prevent it from getting too far out of sync.

I think this is a great idea overall; however, I find this PR a bit difficult to handle as it is because it introduces three changes simultaneously:

  1. Introducing Google Batch with Docker;
  2. Making the pipeline incremental;
  3. Modifying the internal logic of SuSie finemapping.

And further (this could be considered point #4), to make this pipeline work we would first need additional changes to how clumped outputs are produced.

So again, while I agree in principle with all of the changes proposed, at this point I will not be progressing this PR in its current form. Rather, as the first atomic change, I will use the prototype code provided to implement the existing finemapping approach (without any changes in logic) using Docker in the Airflow DAG.

@d0choa (Collaborator) commented Jul 18, 2024

@tskir feel free to close the PR when it's no longer useful. The intention from @ireneisdoomed and me was to create a proof of concept, not to merge these changes.

@tskir (Contributor) commented Sep 20, 2024

The finemapping DAG has been merged in opentargets/orchestration#10, so I'm now closing this pull request.

Many ideas from it have been very helpful and have been used in my final PR — thank you @ireneisdoomed @d0choa!

@tskir tskir closed this Sep 20, 2024
@tskir (Contributor) commented Sep 20, 2024

@ireneisdoomed I didn't delete the branch in case there's something useful you planned to use there — please delete if not!

@tskir tskir deleted the ildo-aiflow-finemap branch September 20, 2024 15:00