
Commit

add support for Cascadia search engine
mriffle committed Feb 24, 2025
1 parent 5b58b2f commit 06671c3
Showing 11 changed files with 345 additions and 13 deletions.
1 change: 1 addition & 0 deletions conf/output_directories.config
@@ -5,6 +5,7 @@ params {
aws: "${params.result_dir}/aws",
msconvert: "${params.result_dir}/msconvert",
diann: "${params.result_dir}/diann",
cascadia: "${params.result_dir}/cascadia",
qc_report: "${params.result_dir}/qc_report",
qc_report_tables: "${params.result_dir}/qc_report/tables",
gene_reports: "${params.result_dir}/gene_reports",
4 changes: 3 additions & 1 deletion container_images.config
@@ -8,6 +8,8 @@ params {
encyclopedia: 'quay.io/protio/encyclopedia:2.12.30-2',
encyclopedia3_mriffle: 'quay.io/protio/encyclopedia:3.0.0-MRIFFLE',
qc_pipeline: 'quay.io/mauraisa/dia_qc_report:2.3.1',
proteowizard: 'quay.io/protio/pwiz-skyline-i-agree-to-the-vendor-licenses:3.0.24172-63d00b1'
proteowizard: 'quay.io/protio/pwiz-skyline-i-agree-to-the-vendor-licenses:3.0.24172-63d00b1',
cascadia: 'quay.io/protio/cascadia:0.0.7',
cascadia_utils: 'quay.io/protio/cascadia-utils:0.0.3'
]
}
9 changes: 5 additions & 4 deletions docs/source/overview.rst
@@ -6,11 +6,12 @@ These documents describe a standardized Nextflow workflow for processing **DIA m
data to quantify peptides and proteins**. The source code for the workflow can be found at:
https://github.com/mriffle/nf-skyline-dia-ms.

Multiple specific workflows may be run with this Nextflow workflow. Note that in all cases, the
workflow can automatically generate requested reports from the Skyline document and can automatically
upload and (optionally) import the Skyline document into PanoramaWeb and ProteomeXchange.
This workflow supports three search engines: DIA-NN, EncyclopeDIA, and Cascadia (which performs *de novo* searches).
Each search engine works as a drop-in replacement for the others, supporting all the same pre- and post-analysis steps.
In all cases, the workflow converts RAW files, integrates with PanoramaWeb (ProteomeXchange) and the Proteomic Data Commons,
and generates a Skyline document suitable for visualization and analysis in Skyline.
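Selecting the engine is a one-line configuration change. A minimal sketch of a user ``pipeline.config`` for the Cascadia workflow (parameter names as documented in this commit; the data path is hypothetical):

```groovy
// Minimal sketch: run the Cascadia de novo workflow.
// The quant_spectra_dir path below is a hypothetical example.
params {
    search_engine      = 'cascadia'
    quant_spectra_dir  = '/data/my_experiment/raw'
    quant_spectra_glob = '*.raw'
}
```

Note that no ``fasta`` or ``spectral_library`` is set here; as described below, Cascadia requires neither.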

Cascadia workflow (coming soon):
Cascadia workflow:
===================================
The workflow will perform *de novo* identification of peptides using user-supplied DIA RAW (or mzML) files.
The workflow will generate a Skyline document where users may visualize the *de novo* results and export
22 changes: 22 additions & 0 deletions docs/source/results.rst
@@ -106,6 +106,28 @@ The files present in this directory will be:
- ``wide.stdout`` - This is the command line output of EncyclopeDIA during this step.
- ``wide.stderr`` - This is the error output of EncyclopeDIA during this step.

``cascadia`` Subdirectory
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
This directory contains the output from Cascadia. There will be a set of files for each scan file that was searched,
where all files in the set share the base name of the scan file. E.g., if the scan file was named
``my_scan_file.raw``, each file in the set would begin with ``my_scan_file``.

The files present for each scan file will be:

- ``my_scan_file.ssl`` - The ssl file containing the search results reported by Cascadia. More about the ssl format: https://skyline.ms/wiki/home/software/BiblioSpec/page.view?name=BiblioSpec%20input%20and%20output%20file%20formats
- ``my_scan_file.fixed.ssl`` - A processed ssl file where scan numbers have been corrected to align with the input mzML.
- ``my_scan_file.stderr`` - Any output to standard error generated by Cascadia when searching this file.
- ``my_scan_file.out`` - Any output to standard out generated by Cascadia when searching this file.
- ``output_file_stats_my_scan_file.txt`` - A text file containing the MD5 hashes and file sizes of the input mzML and the output ssl file generated by Cascadia for this search.

In addition, the following files will be present:

- ``cascadia-utils_version.txt`` - The version of the cascadia-utils image used in the workflow. This Docker image contains utility scripts that transform Cascadia output.
- ``cascadia_version.txt`` - The version of the cascadia image used in the workflow.
- ``combined.ssl`` - The combined Cascadia results from searching all input raw or mzML files.
- ``combined.fasta`` - A FASTA format file containing the peptides identified by Cascadia.
- ``lib.blib`` - A spectral library containing the Cascadia search results.
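For readers unfamiliar with the ssl format: it is a plain tab-separated table with one identified spectrum per row. A hypothetical illustration follows (column set and values are invented for the example; see the BiblioSpec page linked above for the real specification):

```shell
# Hypothetical ssl content; real Cascadia output columns may differ.
printf 'file\tscan\tcharge\tsequence\tscore\n'         >  example.ssl
printf 'my_scan_file.mzML\t1203\t2\tPEPTIDEK\t0.98\n'  >> example.ssl
printf 'my_scan_file.mzML\t1571\t3\tACDEFGHIK\t0.91\n' >> example.ssl
# Count identified spectra (all rows except the header)
tail -n +2 example.ssl | wc -l
```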

``skyline/add-lib`` Subdirectory
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
The first step to creating the final Skyline document is importing the results of EncyclopeDIA into the Skyline template document. This
11 changes: 7 additions & 4 deletions docs/source/workflow_parameters.rst
@@ -58,10 +58,10 @@ The ``params`` Section
- Description
* -
- ``spectral_library``
- The path to the spectral library to use. May be a ``dlib``, ``elib``, ``blib``, ``speclib`` (DIA-NN), ``tsv`` (DIA-NN), or other formats supported by EncyclopeDIA or DIA-NN. This parameter is required for EncyclopeDIA. If omitted when using DIA-NN, DIA-NN will be run in library-free mode.
* -
- The path to the spectral library to use. May be a ``dlib``, ``elib``, ``blib``, ``speclib`` (DIA-NN), ``tsv`` (DIA-NN), or other formats supported by EncyclopeDIA or DIA-NN. This parameter is required for EncyclopeDIA. If omitted when using DIA-NN, DIA-NN will be run in library-free mode. This parameter is ignored when running Cascadia.
* -
- ``fasta``
- The path to the background FASTA file to use.
- The path to the background FASTA file to use. This parameter is required, except when running Cascadia.
* - ✓
- ``quant_spectra_dir``
- The path to the directory containing the raw data to be quantified. If using narrow window DIA and GPF to generate a chromatogram library, this is the location of the wide-window data to be searched using the chromatogram library.
@@ -76,7 +76,7 @@ The ``params`` Section
- Which files in this directory to use. Default: ``*.raw``
* -
- ``search_engine``
- Must be set to either ``'encyclopedia'`` or ``'diann'``. If set to ``'diann'``, ``chromatogram_library_spectra_dir``, ``chromatogram_library_spectra_glob``, and EncyclopeDIA-specific parameters will be ignored. Default: ``'encyclopedia'``.
- Must be set to one of ``'encyclopedia'``, ``'diann'``, or ``'cascadia'``. If set to ``'diann'`` or ``'cascadia'``, ``chromatogram_library_spectra_dir``, ``chromatogram_library_spectra_glob``, and EncyclopeDIA-specific parameters will be ignored. Default: ``'encyclopedia'``.
* -
- ``pdc.study_id``
- When this option is set, raw files and metadata will be downloaded from the PDC. Default: ``null``.
@@ -116,6 +116,9 @@ The ``params`` Section
* -
- ``diann.params``
- The parameters passed to DIA-NN when it is run. Default: ``'--unimod4 --qvalue 0.01 --cut \'K*,R*,!*P\' --reanalyse --smart-profiling'``
* -
- ``cascadia.use_gpu``
- If set to ``true``, Cascadia will attempt to use the GPU(s) installed on the system where it is running. Do not set this to ``true`` unless a GPU is available; otherwise, an error will be generated. Default: ``false``.
* -
- ``panorama.upload``
- Whether or not to upload results to PanoramaWeb. Default: ``false``.
46 changes: 46 additions & 0 deletions main.nf
@@ -7,6 +7,7 @@ include { get_input_files } from "./workflows/get_input_files"
include { encyclopedia_search as encyclopeda_export_elib } from "./workflows/encyclopedia_search"
include { encyclopedia_search as encyclopedia_quant } from "./workflows/encyclopedia_search"
include { diann_search } from "./workflows/diann_search"
include { cascadia_search } from "./workflows/cascadia_search"
include { get_mzmls as get_narrow_mzmls } from "./workflows/get_mzmls"
include { get_mzmls as get_wide_mzmls } from "./workflows/get_mzmls"
include { skyline_import } from "./workflows/skyline_import"
@@ -166,8 +167,14 @@ workflow {
error "The parameter \'spectral_library\' is required when using EncyclopeDIA."
}

if(!params.fasta) {
error "The parameter \'fasta\' is required when using EncyclopeDIA."
}

all_diann_file_ch = Channel.empty() // will be no diann
all_cascadia_file_ch = Channel.empty()
diann_version = Channel.empty()
cascadia_version = Channel.empty()

// convert blib to dlib if necessary
if(params.spectral_library.endsWith(".blib")) {
@@ -231,6 +238,10 @@ workflow {

} else if(params.search_engine.toLowerCase() == 'diann') {

if(!params.fasta) {
error "The parameter \'fasta\' is required when using DIA-NN."
}

if (params.chromatogram_library_spectra_dir != null) {
log.warn "The parameter 'chromatogram_library_spectra_dir' is set to a value (${params.chromatogram_library_spectra_dir}) but will be ignored."
}
@@ -275,7 +286,10 @@


all_elib_ch = Channel.empty() // will be no encyclopedia
all_cascadia_file_ch = Channel.empty()
encyclopedia_version = Channel.empty()
cascadia_version = Channel.empty()

all_mzml_ch = wide_mzml_ch

diann_search(
@@ -311,6 +325,36 @@
).concat(
diann_search.out.predicted_speclib
)
} else if(params.search_engine.toLowerCase() == 'cascadia') {

if (params.spectral_library != null) {
log.warn "The parameter 'spectral_library' is set to a value (${params.spectral_library}) but will be ignored."
}

all_elib_ch = Channel.empty() // will be no encyclopedia
all_diann_file_ch = Channel.empty() // will be no diann
encyclopedia_version = Channel.empty()
diann_version = Channel.empty()

all_mzml_ch = wide_mzml_ch

cascadia_search(
wide_mzml_ch
)

cascadia_version = cascadia_search.out.cascadia_version
search_file_stats = cascadia_search.out.output_file_stats
final_elib = cascadia_search.out.blib
fasta = cascadia_search.out.fasta

// all files to upload to panoramaweb (if requested)
all_cascadia_file_ch = cascadia_search.out.blib.concat(
cascadia_search.out.fasta
).concat(
cascadia_search.out.stdout
).concat(
cascadia_search.out.stderr
)

} else {
error "'${params.search_engine}' is an invalid argument for params.search_engine!"
@@ -385,6 +429,7 @@

version_files = encyclopedia_version.concat(diann_version,
proteowizard_version,
cascadia_version,
dia_qc_version).splitText()

input_files = fasta.map{ it -> ['Fasta file', it.name] }.concat(
@@ -410,6 +455,7 @@
params.panorama.upload_url,
all_elib_ch,
all_diann_file_ch,
all_cascadia_file_ch,
final_skyline_file,
all_mzml_ch,
fasta,
193 changes: 193 additions & 0 deletions modules/cascadia.nf
@@ -0,0 +1,193 @@
process CASCADIA_SEARCH {
publishDir params.output_directories.cascadia, failOnError: true, mode: 'copy'
label 'process_high_constant'
container params.images.cascadia

containerOptions = {

def options = ''
if (params.cascadia.use_gpu) {
if (workflow.containerEngine == "singularity" || workflow.containerEngine == "apptainer") {
options += ' --nv'
} else if (workflow.containerEngine == "docker") {
options += ' --gpus all'
}
}

return options
}

// don't melt the GPU
if (params.cascadia.use_gpu) {
maxForks = 1
}

input:
path ms_file

output:
path("*.stderr"), emit: stderr
path("*.stdout"), emit: stdout
tuple(path(ms_file), path("${ms_file.baseName}.ssl"), emit: ssl)
path("${ms_file.baseName}.ssl"), emit: published_ssl
path("cascadia_version.txt"), emit: version
path("output_file_stats_${ms_file.baseName}.txt"), emit: output_file_stats

script:

"""
cascadia sequence ${ms_file} /usr/local/bin/cascadia.ckpt --out ${ms_file.baseName} \
> >(tee "${ms_file.baseName}.stdout") 2> >(tee "${ms_file.baseName}.stderr" >&2)
echo "${params.images.cascadia}" | egrep -o '[0-9]+\\.[0-9]+\\.[0-9]+' | xargs printf "cascadia_version=%s\n" > cascadia_version.txt
md5sum '${ms_file.join('\' \'')}' ${ms_file.baseName}.ssl | sed -E 's/([a-f0-9]{32}) [ \\*](.*)/\\2\\t\\1/' | sort > hashes.txt
stat -L --printf='%n\t%s\n' '${ms_file.join('\' \'')}' ${ms_file.baseName}.ssl | sort > sizes.txt
join -t'\t' hashes.txt sizes.txt > output_file_stats_${ms_file.baseName}.txt
"""

stub:
"""
touch stub.ssl
touch stub.stderr stub.stdout
echo "${params.images.cascadia}" | egrep -o '[0-9]+\\.[0-9]+\\.[0-9]+' | xargs printf "cascadia_version=%s\n" > cascadia_version.txt
md5sum '${ms_file.join('\' \'')}' stub.ssl | sed -E 's/([a-f0-9]{32}) [ \\*](.*)/\\2\\t\\1/' | sort > hashes.txt
stat -L --printf='%n\t%s\n' '${ms_file.join('\' \'')}' stub.ssl | sort > sizes.txt
join -t'\t' hashes.txt sizes.txt > output_file_stats_${ms_file.baseName}.txt
"""
}
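The ``cascadia_version.txt`` bookkeeping in the script block above simply scrapes the semantic version out of the container image tag. The same pipeline can be exercised standalone (image tag copied from ``container_images.config``; ``grep -E`` is the modern spelling of ``egrep``):

```shell
# Extract the semantic version from a container image tag, as the script block does
echo "quay.io/protio/cascadia:0.0.7" \
  | grep -Eo '[0-9]+\.[0-9]+\.[0-9]+' \
  | xargs printf "cascadia_version=%s\n" > cascadia_version.txt
cat cascadia_version.txt   # cascadia_version=0.0.7
```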

process CASCADIA_FIX_SCAN_NUMBERS {
publishDir params.output_directories.cascadia, failOnError: true, mode: 'copy'
label 'process_medium'
container params.images.cascadia_utils

input:
tuple path(ms_file), path(ssl_file)

output:
path("*.stderr"), emit: stderr
path("*.stdout"), emit: stdout
path("${ssl_file.baseName}.fixed.ssl"), emit: fixed_ssl
path("cascadia-utils_version.txt"), emit: version
path("output_file_stats.txt"), emit: output_file_stats

script:

"""
python3 /usr/local/bin/fix_scan_numbers.py ${ssl_file} ${ms_file} ${ssl_file.baseName}.fixed.ssl \
> >(tee "fix_scan_numbers.stdout") 2> >(tee "fix_scan_numbers.stderr" >&2)
echo "${params.images.cascadia_utils}" | egrep -o '[0-9]+\\.[0-9]+\\.[0-9]+' | xargs printf "cascadia-utils_version=%s\n" > cascadia-utils_version.txt
md5sum ${ms_file} ${ssl_file} ${ssl_file.baseName}.fixed.ssl | sed -E 's/([a-f0-9]{32}) [ \\*](.*)/\\2\\t\\1/' | sort > hashes.txt
stat -L --printf='%n\t%s\n' ${ms_file} ${ssl_file} ${ssl_file.baseName}.fixed.ssl | sort > sizes.txt
join -t'\t' hashes.txt sizes.txt > output_file_stats.txt
"""

stub:
"""
touch stub.fixed.ssl
touch fix_scan_numbers.stdout fix_scan_numbers.stderr
echo "${params.images.cascadia_utils}" | egrep -o '[0-9]+\\.[0-9]+\\.[0-9]+' | xargs printf "cascadia-utils_version=%s\n" > cascadia-utils_version.txt
md5sum ${ms_file} ${ssl_file} stub.fixed.ssl | sed -E 's/([a-f0-9]{32}) [ \\*](.*)/\\2\\t\\1/' | sort > hashes.txt
stat -L --printf='%n\t%s\n' ${ms_file} ${ssl_file} stub.fixed.ssl | sort > sizes.txt
join -t'\t' hashes.txt sizes.txt > output_file_stats.txt
"""
}
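The ``md5sum``/``stat``/``join`` pattern that builds ``output_file_stats.txt`` in these script blocks produces a tab-separated name/hash/size table. A standalone sketch with throwaway files (GNU coreutils assumed):

```shell
# Build a <name><TAB><md5><TAB><size> table, mirroring the script blocks above
printf 'hello\n' > a.txt
printf 'world\n' > b.txt
# md5sum prints "<hash>  <name>"; swap the columns into "<name><TAB><hash>"
md5sum a.txt b.txt | sed -E 's/([a-f0-9]{32}) [ *](.*)/\2\t\1/' | sort > hashes.txt
# "<name><TAB><size-in-bytes>" for the same files
stat -L --printf='%n\t%s\n' a.txt b.txt | sort > sizes.txt
# Join on the file name (field 1) to get name, hash, size
join -t "$(printf '\t')" hashes.txt sizes.txt > output_file_stats.txt
cut -f1,3 output_file_stats.txt   # file names and byte sizes
```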

process CASCADIA_CREATE_FASTA {
publishDir params.output_directories.cascadia, failOnError: true, mode: 'copy'
label 'process_medium'
container params.images.cascadia_utils

input:
path ssl_file

output:
path("*.stderr"), emit: stderr
path("*.stdout"), emit: stdout
path("${ssl_file.baseName}.fasta"), emit: fasta
path("cascadia-utils_version.txt"), emit: version
path("output_file_stats.txt"), emit: output_file_stats

script:

"""
python3 /usr/local/bin/create_fasta_from_ssl.py ${ssl_file} ${ssl_file.baseName}.fasta \
> >(tee "create_fasta.stdout") 2> >(tee "create_fasta.stderr" >&2)
echo "${params.images.cascadia_utils}" | egrep -o '[0-9]+\\.[0-9]+\\.[0-9]+' | xargs printf "cascadia-utils_version=%s\n" > cascadia-utils_version.txt
md5sum ${ssl_file} ${ssl_file.baseName}.fasta | sed -E 's/([a-f0-9]{32}) [ \\*](.*)/\\2\\t\\1/' | sort > hashes.txt
stat -L --printf='%n\t%s\n' ${ssl_file} ${ssl_file.baseName}.fasta | sort > sizes.txt
join -t'\t' hashes.txt sizes.txt > output_file_stats.txt
"""

stub:
"""
touch stub.ssl stub.fasta
touch create_fasta.stdout create_fasta.stderr
echo "${params.images.cascadia_utils}" | egrep -o '[0-9]+\\.[0-9]+\\.[0-9]+' | xargs printf "cascadia-utils_version=%s\n" > cascadia-utils_version.txt
md5sum stub.ssl stub.fasta | sed -E 's/([a-f0-9]{32}) [ \\*](.*)/\\2\\t\\1/' | sort > hashes.txt
stat -L --printf='%n\t%s\n' stub.ssl stub.fasta | sort > sizes.txt
join -t'\t' hashes.txt sizes.txt > output_file_stats.txt
"""
}
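``create_fasta_from_ssl.py`` is a repo-internal utility whose source is not shown in this commit, but the core idea (one FASTA entry per identified peptide) can be sketched. Entry names below are invented for illustration:

```shell
# Hypothetical sketch only: wrap each peptide sequence in a FASTA entry.
# The real create_fasta_from_ssl.py may name and format entries differently.
printf 'PEPTIDEK\nACDEFGHIK\n' \
  | awk '{ printf(">peptide_%d\n%s\n", NR, $0) }' > peptides.fasta
head -n 2 peptides.fasta
```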

process CASCADIA_COMBINE_SSL_FILES {
publishDir params.output_directories.cascadia, failOnError: true, mode: 'copy'
label 'process_medium'
container params.images.cascadia_utils

input:
path ssl_files

output:
path("combined.ssl"), emit: ssl
path("output_file_stats.txt"), emit: output_file_stats

script:

"""
python3 /usr/local/bin/combine_ssl_files.py *.ssl > combined.ssl
md5sum '${ssl_files.join('\' \'')}' combined.ssl | sed -E 's/([a-f0-9]{32}) [ \\*](.*)/\\2\\t\\1/' | sort > hashes.txt
stat -L --printf='%n\t%s\n' '${ssl_files.join('\' \'')}' combined.ssl | sort > sizes.txt
join -t'\t' hashes.txt sizes.txt > output_file_stats.txt
"""

stub:
"""
touch combined.ssl
md5sum '${ssl_files.join('\' \'')}' combined.ssl | sed -E 's/([a-f0-9]{32}) [ \\*](.*)/\\2\\t\\1/' | sort > hashes.txt
stat -L --printf='%n\t%s\n' '${ssl_files.join('\' \'')}' combined.ssl | sort > sizes.txt
join -t'\t' hashes.txt sizes.txt > output_file_stats.txt
"""
}
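``combine_ssl_files.py`` is likewise repo-internal. Conceptually, combining ssl files means keeping a single header row and concatenating the data rows from every input file; a hedged sketch:

```shell
# Hypothetical sketch: merge two ssl files, keeping one header row.
# The real combine_ssl_files.py may handle headers and columns differently.
printf 'file\tscan\nrun1.mzML\t10\n' > a.ssl
printf 'file\tscan\nrun2.mzML\t20\n' > b.ssl
head -n 1 a.ssl > combined.ssl
for f in a.ssl b.ssl; do tail -n +2 "$f" >> combined.ssl; done
wc -l < combined.ssl   # 3 (one header + two data rows)
```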

process BLIB_BUILD_LIBRARY {
publishDir params.output_directories.cascadia, failOnError: true, mode: 'copy'
label 'process_medium'
container params.images.bibliospec

input:
path ssl
path mzml_files

output:
path('lib.blib'), emit: blib

script:
"""
BlibBuild "${ssl}" lib_redundant.blib
BlibFilter -b 1 lib_redundant.blib lib.blib
"""

stub:
"""
touch lib.blib
"""
}
3 changes: 3 additions & 0 deletions nextflow.config
@@ -53,6 +53,9 @@ params {
// the generated chromatogram library (elib) will always be saved, regardless of this setting
encyclopedia.save_output = true

// options for Cascadia (de novo DIA search)
cascadia.use_gpu = false // whether or not to use an available GPU; must be false if no GPU is available

// optional user-supplied parameters
email = null // email to notify of workflow outcome, leave null to send no email
skyline.template_file = null // the skyline template, if null use default_skyline_template_file
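Since ``cascadia.use_gpu`` defaults to ``false``, GPU use is strictly opt-in. A hedged sketch of a user-side override (only sensible when a GPU and a GPU-aware container engine are actually present):

```groovy
// Hypothetical override in a user pipeline.config.
// Requires an NVIDIA GPU; enables '--nv' (Singularity/Apptainer)
// or '--gpus all' (Docker) via the containerOptions logic in modules/cascadia.nf.
params {
    cascadia.use_gpu = true
}
```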