Commit f24c762

Merge branch 'dev' into nf-core-template-merge-3.2.0

JudithBernett authored Jan 29, 2025
2 parents b28687f + 3f276ec
Showing 64 changed files with 2,724 additions and 142 deletions.
3 changes: 2 additions & 1 deletion .github/workflows/template_version_comment.yml
@@ -2,7 +2,8 @@ name: nf-core template version comment
# This workflow is triggered on PRs to check if the pipeline template version matches the latest nf-core version.
# It posts a comment to the PR, even if it comes from a fork.

on: pull_request_target
on:
  pull_request:

jobs:
template_version:
1 change: 1 addition & 0 deletions .gitignore
@@ -7,3 +7,4 @@ testing/
testing*
*.pyc
null/
.idea/
9 changes: 9 additions & 0 deletions CHANGELOG.md
@@ -9,8 +9,17 @@ Initial release of nf-core/drugresponseeval, created with the [nf-core](https://

### `Added`

- Updated to the new template
- Added tests that run with docker, singularity, apptainer, and conda
- Added the Docker container and the conda `env.yml` in `nextflow.config`. A single container suffices for all
  processes because this pipeline automates the PyPI package drevalpy.
- Added usage and output documentation.

### `Fixed`

- Fixed linting issues
- Fixed bugs with `path_data`: it can now be given as an absolute or a relative path

### `Dependencies`

### `Deprecated`
24 changes: 24 additions & 0 deletions CITATIONS.md
@@ -1,5 +1,9 @@
# nf-core/drugresponseeval: Citations

## [DrugResponseEval](https://github.com/nf-core/drugresponseeval/)

> Bernett, J, Iversen, P, Picciani, M, Wilhelm, M, Baum, K, List, M. Will be published soon.
## [nf-core](https://pubmed.ncbi.nlm.nih.gov/32055031/)

> Ewels PA, Peltzer A, Fillinger S, Patel H, Alneberg J, Wilm A, Garcia MU, Di Tommaso P, Nahnsen S. The nf-core framework for community-curated bioinformatics pipelines. Nat Biotechnol. 2020 Mar;38(3):276-278. doi: 10.1038/s41587-020-0439-x. PubMed PMID: 32055031.
@@ -10,6 +14,26 @@
## Pipeline tools

- [DrEvalPy](https://github.com/daisybio/drevalpy): The pipeline mostly automates the individual steps of the DrEvalPy PyPI package.

> Bernett, J, Iversen, P, Picciani, M, Wilhelm, M, Baum, K, List, M. Will be published soon.
- [DIPK](https://doi.org/10.1093/bib/bbae153): Implemented model in the pipeline.

> Li P, Jiang Z, Liu T, Liu X, Qiao H, Yao X. Improving drug response prediction via integrating gene relationships with deep learning. Briefings in Bioinformatics. 2024 May;25(3):bbae153.
- [MOLI](https://doi.org/10.1093/bioinformatics/btz318): Implemented model in the pipeline.

> Sharifi-Noghabi H, Zolotareva O, Collins CC, Ester M. MOLI: multi-omics late integration with deep neural networks for drug response prediction. Bioinformatics. 2019 Jul;35(14):i501-9.
- [SRMF](https://doi.org/10.1186/s12885-017-3500-5): Implemented model in the pipeline.

> Wang L, Li X, Zhang L, Gao Q. Improved anticancer drug response prediction in cell lines using matrix factorization with similarity regularization. BMC cancer. 2017 Dec;17:1-2.
- [SuperFELT](https://doi.org/10.1186/s12859-021-04146-z): Implemented model in the pipeline.

> Park S, Soh J, Lee H. Super.FELT: supervised feature extraction learning using triplet loss for drug response prediction with multi-omics data. BMC bioinformatics. 2021 May 25;22(1):269.
## Software packaging/containerisation tools

- [Anaconda](https://anaconda.com)
76 changes: 41 additions & 35 deletions README.md
@@ -7,6 +7,7 @@

[![GitHub Actions CI Status](https://github.com/nf-core/drugresponseeval/actions/workflows/ci.yml/badge.svg)](https://github.com/nf-core/drugresponseeval/actions/workflows/ci.yml)
[![GitHub Actions Linting Status](https://github.com/nf-core/drugresponseeval/actions/workflows/linting.yml/badge.svg)](https://github.com/nf-core/drugresponseeval/actions/workflows/linting.yml)
[![AWS CI](https://img.shields.io/badge/CI%20tests-full%20size-FF9900?labelColor=000000&logo=Amazon%20AWS)](https://nf-co.re/drugresponseeval/results)
[![Cite with Zenodo](http://img.shields.io/badge/DOI-10.5281/zenodo.XXXXXXX-1073c8?labelColor=000000)](https://doi.org/10.5281/zenodo.XXXXXXX)

[![nf-test](https://img.shields.io/badge/unit_tests-nf--test-337ab7.svg)](https://www.nf-test.com)

[![Nextflow](https://img.shields.io/badge/nextflow%20DSL2-%E2%89%A524.04.2-23aa62.svg)](https://www.nextflow.io/)
@@ -15,52 +16,52 @@
[![run with singularity](https://img.shields.io/badge/run%20with-singularity-1d355c.svg?labelColor=000000)](https://sylabs.io/docs/)
[![Launch on Seqera Platform](https://img.shields.io/badge/Launch%20%F0%9F%9A%80-Seqera%20Platform-%234256e7)](https://cloud.seqera.io/launch?pipeline=https://github.com/nf-core/drugresponseeval)

[![Get help on Slack](http://img.shields.io/badge/slack-nf--core%20%23drugresponseeval-4A154B?labelColor=000000&logo=slack)](https://nfcore.slack.com/channels/drugresponseeval)[![Follow on Twitter](http://img.shields.io/badge/twitter-%40nf__core-1DA1F2?labelColor=000000&logo=twitter)](https://twitter.com/nf_core)[![Follow on Mastodon](https://img.shields.io/badge/mastodon-nf__core-6364ff?labelColor=FFFFFF&logo=mastodon)](https://mstdn.science/@nf_core)[![Watch on YouTube](http://img.shields.io/badge/youtube-nf--core-FF0000?labelColor=000000&logo=youtube)](https://www.youtube.com/c/nf-core)

## Introduction
[![Follow on Twitter](http://img.shields.io/badge/twitter-%40nf__core-1DA1F2?labelColor=000000&logo=twitter)](https://twitter.com/nf_core)
[![Follow on Mastodon](https://img.shields.io/badge/mastodon-nf__core-6364ff?labelColor=FFFFFF&logo=mastodon)](https://mstdn.science/@nf_core)
[![Watch on YouTube](http://img.shields.io/badge/youtube-nf--core-FF0000?labelColor=000000&logo=youtube)](https://www.youtube.com/c/nf-core)

**nf-core/drugresponseeval** is a bioinformatics pipeline that ...
# ![drevalpy_summary](assets/drevalpy-2-qr.svg)

<!-- TODO nf-core:
Complete this sentence with a 2-3 sentence summary of what types of data the pipeline ingests, a brief overview of the
major pipeline sections and the types of output it produces. You're giving an overview to someone new
to nf-core here, in 15-20 seconds. For an example, see https://github.com/nf-core/rnaseq/blob/master/README.md#introduction
-->
## Introduction

<!-- TODO nf-core: Include a figure that guides the user through the major workflow steps. Many nf-core
workflows use the "tube map" design for that. See https://nf-co.re/docs/contributing/design_guidelines#examples for examples. -->
<!-- TODO nf-core: Fill in short bullet-pointed list of the default steps in the pipeline -->
**DrEval** is a bioinformatics framework which includes a PyPI package (drevalpy) and a Nextflow
pipeline (this repo). DrEval ensures that evaluations are statistically sound, biologically
meaningful, and reproducible. DrEval simplifies the implementation of drug response prediction
models, allowing researchers to focus on advancing their modeling innovations by automating
standardized evaluation protocols and preprocessing workflows. With DrEval, hyperparameter
tuning is fair and consistent. With its flexible model interface, DrEval supports any model type,
ranging from statistical models to complex neural networks. By contributing your model to the
DrEval catalog, you can increase your work's exposure, reusability, and transferability.

# ![Pipeline diagram showing the major steps of nf-core/drugresponseeval](assets/drugresponseeval_pipeline_simplified.png)

1. The response data is loaded
2. All models are trained and evaluated in a cross-validation setting
3. For each CV split, the best hyperparameters are determined using a grid search per model
4. The model is trained on the full training set (train & validation) with the best
hyperparameters to predict the test set
5. If randomization tests are enabled, the model is trained on the full training set with the best
hyperparameters to predict the randomized test set
6. If robustness tests are enabled, the model is trained N times on the full training set with the
best hyperparameters
7. Plots are created summarizing the results

For baseline models, no randomization or robustness tests are performed.
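The steps above can be sketched, in heavily simplified form, as a nested loop over CV splits and hyperparameter combinations. All names here (`grid_search`, `evaluate_model`, `fit_score`, `fit_predict`) are hypothetical illustrations of the protocol, not the drevalpy API:

```python
from itertools import product


def grid_search(train, val, param_grid, fit_score):
    """Pick the hyperparameter combination scoring best on the validation split (step 3)."""
    best_params, best_score = None, float("-inf")
    for combo in product(*param_grid.values()):
        params = dict(zip(param_grid.keys(), combo))
        score = fit_score(train, val, params)
        if score > best_score:
            best_params, best_score = params, score
    return best_params


def evaluate_model(cv_splits, param_grid, fit_score, fit_predict):
    """For each CV split: tune on train/validation, then refit on both
    to predict the held-out test set (steps 2-4)."""
    predictions = []
    for split in cv_splits:
        best = grid_search(split["train"], split["validation"], param_grid, fit_score)
        full_train = split["train"] + split["validation"]
        predictions.append(fit_predict(full_train, split["test"], best))
    return predictions
```

The randomization and robustness tests (steps 5-6) reuse the tuned hyperparameters from this loop rather than repeating the grid search.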

## Usage

> [!NOTE]
> If you are new to Nextflow and nf-core, please refer to [this page](https://nf-co.re/docs/usage/installation) on how to set-up Nextflow. Make sure to [test your setup](https://nf-co.re/docs/usage/introduction#how-to-run-a-pipeline) with `-profile test` before running the workflow on actual data.
<!-- TODO nf-core: Describe the minimum required steps to execute the pipeline, e.g. how to prepare samplesheets.
Explain what rows and columns represent. For instance (please edit as appropriate):
First, prepare a samplesheet with your input data that looks as follows:
`samplesheet.csv`:
```csv
sample,fastq_1,fastq_2
CONTROL_REP1,AEG588A1_S1_L002_R1_001.fastq.gz,AEG588A1_S1_L002_R2_001.fastq.gz
```
Each row represents a fastq file (single-end) or a pair of fastq files (paired end).
-->

Now, you can run the pipeline using:

<!-- TODO nf-core: update the following command to include all required parameters for a minimal example -->

```bash
nextflow run nf-core/drugresponseeval \
  -profile <docker/singularity/.../institute> \
  --input samplesheet.csv \
  --outdir <OUTDIR> \
  --models <model1,model2,...> \
  --baselines <baseline1,baseline2,...> \
  --dataset_name <dataset_name> \
  --path_data <path_data>
```

> [!WARNING]
@@ -76,14 +77,19 @@ For more details about the output files and reports, please refer to the

## Credits

nf-core/drugresponseeval was originally written by Judith Bernett.
nf-core/drugresponseeval was originally written by Judith Bernett (TUM) and Pascal Iversen (FU Berlin).

We thank the following people for their extensive assistance in the development of this pipeline:

<!-- TODO nf-core: If applicable, make list of people who have also contributed -->

## Contributions and Support

Contributors to nf-core/drugresponseeval and the drevalpy PyPI package:

- [Judith Bernett](https://github.com/JudithBernett) (TUM)
- [Pascal Iversen](https://github.com/PascalIversen) (FU Berlin)
- [Mario Picciani](https://github.com/picciama) (TUM)

If you would like to contribute to this pipeline, please see the [contributing guidelines](.github/CONTRIBUTING.md).

For further information or help, don't hesitate to get in touch on the [Slack `#drugresponseeval` channel](https://nfcore.slack.com/channels/drugresponseeval) (you can join with [this invite](https://nf-co.re/join/slack)).
1 change: 1 addition & 0 deletions assets/drevalpy-2-qr.svg
Binary file added assets/drugresponseeval_pipeline_simplified.png
13 changes: 13 additions & 0 deletions bin/check_params.py
@@ -0,0 +1,13 @@
#!/usr/bin/env python
import sys

from drevalpy.utils import get_parser, check_arguments


def main(argv=None):
    """Coordinate argument parsing and program execution."""
    args = get_parser().parse_args(argv)
    check_arguments(args)


if __name__ == "__main__":
    sys.exit(main())
59 changes: 59 additions & 0 deletions bin/collect_results.py
@@ -0,0 +1,59 @@
#!/usr/bin/env python
import argparse

import pandas as pd

from drevalpy.visualization.utils import prep_results, write_results


def get_parser():
    parser = argparse.ArgumentParser(description="Collect results and write to single files.")
    parser.add_argument("--outfiles", type=str, nargs="+", required=True, help="Output files.")
    return parser


def parse_results(args):
    # get all files with the pattern f'{model_name}_evaluation_results.csv' from args.outfiles
    result_files = [file for file in args.outfiles if "evaluation_results.csv" in file]
    # get all files with the pattern f'{model_name}_evaluation_results_per_drug.csv' from args.outfiles
    result_per_drug_files = [file for file in args.outfiles if "evaluation_results_per_drug.csv" in file]
    # get all files with the pattern f'{model_name}_evaluation_results_per_cl.csv' from args.outfiles
    result_per_cl_files = [file for file in args.outfiles if "evaluation_results_per_cl.csv" in file]
    # get all files with the pattern f'{model_name}_true_vs_pred.csv' from args.outfiles
    t_vs_pred_files = [file for file in args.outfiles if "true_vs_pred.csv" in file]
    return result_files, result_per_drug_files, result_per_cl_files, t_vs_pred_files


def collapse_file(files):
    out_df = None
    for file in files:
        if out_df is None:
            out_df = pd.read_csv(file, index_col=0)
        else:
            out_df = pd.concat([out_df, pd.read_csv(file, index_col=0)])
    return out_df


if __name__ == "__main__":
    args = get_parser().parse_args()
    # parse the results from args.outfiles
    eval_result_files, eval_result_per_drug_files, eval_result_per_cl_files, true_vs_pred_files = parse_results(args)

    # collapse the results into single dataframes
    eval_results = collapse_file(eval_result_files)
    eval_results_per_drug = collapse_file(eval_result_per_drug_files)
    eval_results_per_cell_line = collapse_file(eval_result_per_cl_files)
    t_vs_p = collapse_file(true_vs_pred_files)

    # prepare the results through introducing new columns algorithm, rand_setting, LPO_LCO_LDO, split, CV_split
    eval_results, eval_results_per_drug, eval_results_per_cell_line, t_vs_p = prep_results(
        eval_results, eval_results_per_drug, eval_results_per_cell_line, t_vs_p
    )

    # save the results to csv files
    write_results(
        path_out="",
        eval_results=eval_results,
        eval_results_per_drug=eval_results_per_drug,
        eval_results_per_cl=eval_results_per_cell_line,
        t_vs_p=t_vs_p,
    )
50 changes: 50 additions & 0 deletions bin/consolidate_results.py
@@ -0,0 +1,50 @@
#!/usr/bin/env python

import os
import argparse
from drevalpy.models import MODEL_FACTORY
from drevalpy.experiment import consolidate_single_drug_model_predictions


def get_parser():
    parser = argparse.ArgumentParser(description="Consolidate results for SingleDrugModels")
    parser.add_argument("--run_id", type=str, required=True, help="Run ID")
    parser.add_argument("--test_mode", type=str, required=True, help="Test mode (LPO, LCO, LDO)")
    parser.add_argument("--model_name", type=str, required=True, help="All model names")
    parser.add_argument("--outdir_path", type=str, required=True, help="Output directory path")
    parser.add_argument("--n_cv_splits", type=int, required=True, help="Number of CV splits")
    parser.add_argument("--cross_study_datasets", type=str, nargs="+", help="All cross-study datasets")
    parser.add_argument("--randomization_modes", type=str, required=True, help="All randomizations")
    parser.add_argument("--n_trials_robustness", type=int, required=True, help="Number of trials")
    return parser


def main():
    parser = get_parser()
    args = parser.parse_args()
    results_path = os.path.join(
        args.outdir_path,
        args.run_id,
        args.test_mode,
    )
    # --randomization_modes arrives as a bracketed list string, e.g. '[SVRC, SVRD]'
    randomizations = args.randomization_modes.split('[')[1].split(']')[0].split(', ')
    model = MODEL_FACTORY[args.model_name]
    if args.cross_study_datasets is None:
        args.cross_study_datasets = []
    consolidate_single_drug_model_predictions(
        models=[model],
        n_cv_splits=args.n_cv_splits,
        results_path=results_path,
        cross_study_datasets=args.cross_study_datasets,
        randomization_mode=randomizations,
        n_trials_robustness=args.n_trials_robustness,
        out_path="",
    )


if __name__ == "__main__":
    main()
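The string surgery on `--randomization_modes` in the script above assumes the value arrives as a bracketed, comma-separated string (the way Nextflow renders a Groovy list, e.g. `[SVRC, SVRD]`). A standalone sketch of that parsing; `parse_bracketed_list` is a hypothetical helper name, not part of drevalpy:

```python
def parse_bracketed_list(value: str) -> list:
    """Parse a Nextflow-style list string like '[SVRC, SVRD]' into Python strings.

    Assumes exactly one '[' ... ']' pair and ', ' as the separator,
    mirroring the split chain used in consolidate_results.py.
    """
    return value.split("[")[1].split("]")[0].split(", ")
```

Note the fragility this hedges around: an element containing `', '`, `[`, or `]` would break the parse, which is acceptable here because randomization mode names are simple identifiers.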
36 changes: 36 additions & 0 deletions bin/cv_split.py
@@ -0,0 +1,36 @@
#!/usr/bin/env python

import argparse
import pickle


def get_parser():
    parser = argparse.ArgumentParser(description="Split data into CV splits")
    parser.add_argument("--response", type=str, required=True, help="Path to response data")
    parser.add_argument("--n_cv_splits", type=int, required=True, help="Number of CV splits")
    parser.add_argument("--test_mode", type=str, default="LPO", help="Test mode (LPO, LCO, LDO)")
    return parser


def main():
    parser = get_parser()
    args = parser.parse_args()
    with open(args.response, "rb") as f:
        response_data = pickle.load(f)
    response_data.remove_nan_responses()
    response_data.split_dataset(
        n_cv_splits=args.n_cv_splits,
        mode=args.test_mode,
        split_validation=True,
        split_early_stopping=True,
        validation_ratio=0.1,
        random_state=42,
    )
    for split_index, split in enumerate(response_data.cv_splits):
        with open(f"split_{split_index}.pkl", "wb") as f:
            pickle.dump(split, f)


if __name__ == "__main__":
    main()
29 changes: 29 additions & 0 deletions bin/draw_cd.py
@@ -0,0 +1,29 @@
#!/usr/bin/env python
import argparse
import pandas as pd

from drevalpy.visualization.critical_difference_plot import CriticalDifferencePlot


def get_parser():
    parser = argparse.ArgumentParser(description="Draw critical difference plots.")
    parser.add_argument("--name", type=str, required=True, help="Name/Setting of plot.")
    parser.add_argument("--data", type=str, required=True, help="Path to data.")
    return parser


def draw_cd(path_to_df: str, setting: str):
    df = pd.read_csv(path_to_df, index_col=0)
    df = df[(df["LPO_LCO_LDO"] == setting) & (df["rand_setting"] == "predictions")]
    cd_plot = CriticalDifferencePlot(eval_results_preds=df, metric="MSE")
    cd_plot.draw_and_save(out_prefix="", out_suffix=setting)


if __name__ == "__main__":
    args = get_parser().parse_args()
    draw_cd(path_to_df=args.data, setting=args.name)
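The boolean-mask filter in `draw_cd` keeps only the rows for the requested test setting whose predictions are un-randomized. A small self-contained sketch with made-up evaluation rows (column names follow the script above; the values are invented):

```python
import pandas as pd

# Made-up evaluation results in the shape draw_cd expects
df = pd.DataFrame(
    {
        "LPO_LCO_LDO": ["LPO", "LPO", "LCO"],
        "rand_setting": ["predictions", "SVRC", "predictions"],
        "MSE": [0.4, 0.9, 0.5],
    }
)

# & combines the two boolean Series element-wise; each comparison
# must be parenthesized because & binds tighter than ==
subset = df[(df["LPO_LCO_LDO"] == "LPO") & (df["rand_setting"] == "predictions")]
```

Only the first row survives: it matches the setting and carries real (non-randomized) predictions.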