
Commit

add support for Cascadia search engine
mriffle committed Feb 24, 2025
1 parent 5b58b2f commit 06671c3
Showing 11 changed files with 345 additions and 13 deletions.
1 change: 1 addition & 0 deletions conf/output_directories.config
@@ -5,6 +5,7 @@ params {
aws: "${params.result_dir}/aws",
msconvert: "${params.result_dir}/msconvert",
diann: "${params.result_dir}/diann",
cascadia: "${params.result_dir}/cascadia",
qc_report: "${params.result_dir}/qc_report",
qc_report_tables: "${params.result_dir}/qc_report/tables",
gene_reports: "${params.result_dir}/gene_reports",
4 changes: 3 additions & 1 deletion container_images.config
@@ -8,6 +8,8 @@ params {
encyclopedia: 'quay.io/protio/encyclopedia:2.12.30-2',
encyclopedia3_mriffle: 'quay.io/protio/encyclopedia:3.0.0-MRIFFLE',
qc_pipeline: 'quay.io/mauraisa/dia_qc_report:2.3.1',
proteowizard: 'quay.io/protio/pwiz-skyline-i-agree-to-the-vendor-licenses:3.0.24172-63d00b1'
proteowizard: 'quay.io/protio/pwiz-skyline-i-agree-to-the-vendor-licenses:3.0.24172-63d00b1',
cascadia: 'quay.io/protio/cascadia:0.0.7',
cascadia_utils: 'quay.io/protio/cascadia-utils:0.0.3'
]
}
9 changes: 5 additions & 4 deletions docs/source/overview.rst
@@ -6,11 +6,12 @@ These documents describe a standardized Nextflow workflow for processing **DIA m
data to quantify peptides and proteins**. The source code for the workflow can be found at:
https://github.com/mriffle/nf-skyline-dia-ms.

Multiple specific workflows may be run with this Nextflow workflow. Note that in all cases, the
workflow can automatically generate requested reports from the Skyline document and can automatically
upload and (optionally) import the Skyline document into PanoramaWeb and ProteomeXchange.
This workflow supports three search engines: DIA-NN, EncyclopeDIA, and Cascadia (which performs *de novo* searches).
Each search engine works as a drop-in replacement for the others, supporting all the same pre- and post-analysis steps.
In all cases, the workflow converts RAW files, integrates with PanoramaWeb (ProteomeXchange) and the Proteomic Data Commons,
and generates a Skyline document suitable for visualization and analysis in Skyline.
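Selecting the engine is a one-line configuration change. A minimal sketch of a user ``pipeline.config`` for the Cascadia workflow (parameter names as documented in this commit; the data path is hypothetical):

```groovy
// Minimal sketch: run the Cascadia de novo workflow.
// The quant_spectra_dir path below is a hypothetical example.
params {
    search_engine      = 'cascadia'
    quant_spectra_dir  = '/data/my_experiment/raw'
    quant_spectra_glob = '*.raw'
}
```

Note that no ``fasta`` or ``spectral_library`` is set here; as described below, Cascadia requires neither.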

Cascadia workflow (coming soon):
Cascadia workflow:
===================================
The workflow will perform *de novo* identification of peptides using user-supplied DIA RAW (or mzML) files.
The workflow will generate a Skyline document where users may visualize the *de novo* results and export
22 changes: 22 additions & 0 deletions docs/source/results.rst
@@ -106,6 +106,28 @@ The files present in this directory will be:
- ``wide.stdout`` - This is the command line output of EncyclopeDIA during this step.
- ``wide.stderr`` - This is the error output of EncyclopeDIA during this step.

``cascadia`` Subdirectory
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
This directory contains the output from Cascadia. There will be a set of files for each scan file that was searched,
where all files in the set share the base name of the scan file. E.g., if the scan file was named
``my_scan_file.raw``, each file in the set would begin with ``my_scan_file``.

The files present for each scan file will be:

- ``my_scan_file.ssl`` - The ssl file containing the search results reported by Cascadia. More about the ssl format: https://skyline.ms/wiki/home/software/BiblioSpec/page.view?name=BiblioSpec%20input%20and%20output%20file%20formats
- ``my_scan_file.fixed.ssl`` - A processed ssl file where scan numbers have been corrected to align with the input mzML.
- ``my_scan_file.stderr`` - Any output to standard error generated by Cascadia when searching this file.
- ``my_scan_file.out`` - Any output to standard out generated by Cascadia when searching this file.
- ``output_file_stats_my_scan_file.txt`` - A text file containing the MD5 hashes and file sizes of the input mzML and the output ssl file generated by Cascadia for this search.

In addition, the following files will be present:

- ``cascadia-utils_version.txt`` - The version of the cascadia-utils image used in the workflow. This Docker image contains utility scripts that transform Cascadia output.
- ``cascadia_version.txt`` - The version of the cascadia image used in the workflow.
- ``combined.ssl`` - The combined Cascadia results from searching all input raw or mzML files.
- ``combined.fasta`` - A FASTA format file containing the peptides identified by Cascadia.
- ``lib.blib`` - A spectral library containing the Cascadia search results.
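For readers unfamiliar with the ssl format: it is a plain tab-separated table with one identified spectrum per row. A hypothetical illustration follows (column set and values are invented for the example; see the BiblioSpec page linked above for the real specification):

```shell
# Hypothetical ssl content; real Cascadia output columns may differ.
printf 'file\tscan\tcharge\tsequence\tscore\n'         >  example.ssl
printf 'my_scan_file.mzML\t1203\t2\tPEPTIDEK\t0.98\n'  >> example.ssl
printf 'my_scan_file.mzML\t1571\t3\tACDEFGHIK\t0.91\n' >> example.ssl
# Count identified spectra (all rows except the header)
tail -n +2 example.ssl | wc -l
```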

``skyline/add-lib`` Subdirectory
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
The first step to creating the final Skyline document is importing the results of EncyclopeDIA into the Skyline template document. This
11 changes: 7 additions & 4 deletions docs/source/workflow_parameters.rst
@@ -58,10 +58,10 @@ The ``params`` Section
- Description
* -
- ``spectral_library``
- The path to the spectral library to use. May be a ``dlib``, ``elib``, ``blib``, ``speclib`` (DIA-NN), ``tsv`` (DIA-NN), or other formats supported by EncyclopeDIA or DIA-NN. This parameter is required for EncyclopeDIA. If omitted when using DIA-NN, DIA-NN will be run in library-free mode.
* -
- The path to the spectral library to use. May be a ``dlib``, ``elib``, ``blib``, ``speclib`` (DIA-NN), ``tsv`` (DIA-NN), or other formats supported by EncyclopeDIA or DIA-NN. This parameter is required for EncyclopeDIA. If omitted when using DIA-NN, DIA-NN will be run in library-free mode. This parameter is ignored when running Cascadia.
* -
- ``fasta``
- The path to the background FASTA file to use.
- The path to the background FASTA file to use. This parameter is required, except when running Cascadia.
* - ✓
- ``quant_spectra_dir``
- The path to the directory containing the raw data to be quantified. If using narrow window DIA and GPF to generate a chromatogram library, this is the location of the wide-window data to be searched using the chromatogram library.
@@ -76,7 +76,7 @@ The ``params`` Section
- Which files in this directory to use. Default: ``*.raw``
* -
- ``search_engine``
- Must be set to either ``'encyclopedia'`` or ``'diann'``. If set to ``'diann'``, ``chromatogram_library_spectra_dir``, ``chromatogram_library_spectra_glob``, and EncyclopeDIA-specific parameters will be ignored. Default: ``'encyclopedia'``.
- Must be set to one of ``'encyclopedia'``, ``'diann'``, or ``'cascadia'``. If set to ``'diann'`` or ``'cascadia'``, ``chromatogram_library_spectra_dir``, ``chromatogram_library_spectra_glob``, and EncyclopeDIA-specific parameters will be ignored. Default: ``'encyclopedia'``.
* -
- ``pdc.study_id``
- When this option is set, raw files and metadata will be downloaded from the PDC. Default: ``null``.
@@ -116,6 +116,9 @@ The ``params`` Section
* -
- ``diann.params``
- The parameters passed to DIA-NN when it is run. Default: ``'--unimod4 --qvalue 0.01 --cut \'K*,R*,!*P\' --reanalyse --smart-profiling'``
* -
- ``cascadia.use_gpu``
- If set to ``true``, Cascadia will attempt to use the GPU(s) installed on the system where it is running. Do not set this to ``true`` unless a GPU is available; otherwise, an error will be generated. Default: ``false``.
* -
- ``panorama.upload``
- Whether or not to upload results to PanoramaWeb. Default: ``false``.
46 changes: 46 additions & 0 deletions main.nf
@@ -7,6 +7,7 @@ include { get_input_files } from "./workflows/get_input_files"
include { encyclopedia_search as encyclopeda_export_elib } from "./workflows/encyclopedia_search"
include { encyclopedia_search as encyclopedia_quant } from "./workflows/encyclopedia_search"
include { diann_search } from "./workflows/diann_search"
include { cascadia_search } from "./workflows/cascadia_search"
include { get_mzmls as get_narrow_mzmls } from "./workflows/get_mzmls"
include { get_mzmls as get_wide_mzmls } from "./workflows/get_mzmls"
include { skyline_import } from "./workflows/skyline_import"
@@ -166,8 +167,14 @@ workflow {
error "The parameter \'spectral_library\' is required when using EncyclopeDIA."
}

if(!params.fasta) {
error "The parameter \'fasta\' is required when using EncyclopeDIA."
}

all_diann_file_ch = Channel.empty() // will be no diann
all_cascadia_file_ch = Channel.empty()
diann_version = Channel.empty()
cascadia_version = Channel.empty()

// convert blib to dlib if necessary
if(params.spectral_library.endsWith(".blib")) {
@@ -231,6 +238,10 @@ workflow {

} else if(params.search_engine.toLowerCase() == 'diann') {

if(!params.fasta) {
error "The parameter \'fasta\' is required when using DIA-NN."
}

if (params.chromatogram_library_spectra_dir != null) {
log.warn "The parameter 'chromatogram_library_spectra_dir' is set to a value (${params.chromatogram_library_spectra_dir}) but will be ignored."
}
@@ -275,7 +286,10 @@


all_elib_ch = Channel.empty() // will be no encyclopedia
all_cascadia_file_ch = Channel.empty()
encyclopedia_version = Channel.empty()
cascadia_version = Channel.empty()

all_mzml_ch = wide_mzml_ch

diann_search(
@@ -311,6 +325,36 @@
).concat(
diann_search.out.predicted_speclib
)
} else if(params.search_engine.toLowerCase() == 'cascadia') {

if (params.spectral_library != null) {
log.warn "The parameter 'spectral_library' is set to a value (${params.spectral_library}) but will be ignored."
}

all_elib_ch = Channel.empty() // will be no encyclopedia
all_diann_file_ch = Channel.empty() // will be no diann
encyclopedia_version = Channel.empty()
diann_version = Channel.empty()

all_mzml_ch = wide_mzml_ch

cascadia_search(
wide_mzml_ch
)

cascadia_version = cascadia_search.out.cascadia_version
search_file_stats = cascadia_search.out.output_file_stats
final_elib = cascadia_search.out.blib
fasta = cascadia_search.out.fasta

// all files to upload to panoramaweb (if requested)
all_cascadia_file_ch = cascadia_search.out.blib.concat(
cascadia_search.out.fasta
).concat(
cascadia_search.out.stdout
).concat(
cascadia_search.out.stderr
)

} else {
error "'${params.search_engine}' is an invalid argument for params.search_engine!"
@@ -385,6 +429,7 @@

version_files = encyclopedia_version.concat(diann_version,
proteowizard_version,
cascadia_version,
dia_qc_version).splitText()

input_files = fasta.map{ it -> ['Fasta file', it.name] }.concat(
@@ -410,6 +455,7 @@
params.panorama.upload_url,
all_elib_ch,
all_diann_file_ch,
all_cascadia_file_ch,
final_skyline_file,
all_mzml_ch,
fasta,
193 changes: 193 additions & 0 deletions modules/cascadia.nf
@@ -0,0 +1,193 @@
process CASCADIA_SEARCH {
publishDir params.output_directories.cascadia, failOnError: true, mode: 'copy'
label 'process_high_constant'
container params.images.cascadia

containerOptions = {

def options = ''
if (params.cascadia.use_gpu) {
if (workflow.containerEngine == "singularity" || workflow.containerEngine == "apptainer") {
options += ' --nv'
} else if (workflow.containerEngine == "docker") {
options += ' --gpus all'
}
}

return options
}

// don't melt the GPU
if (params.cascadia.use_gpu) {
maxForks = 1
}

input:
path ms_file

output:
path("*.stderr"), emit: stderr
path("*.stdout"), emit: stdout
tuple(path(ms_file), path("${ms_file.baseName}.ssl"), emit: ssl)
path("${ms_file.baseName}.ssl"), emit: published_ssl
path("cascadia_version.txt"), emit: version
path("output_file_stats_${ms_file.baseName}.txt"), emit: output_file_stats

script:

"""
cascadia sequence ${ms_file} /usr/local/bin/cascadia.ckpt --out ${ms_file.baseName} \
> >(tee "${ms_file.baseName}.stdout") 2> >(tee "${ms_file.baseName}.stderr" >&2)
echo "${params.images.cascadia}" | egrep -o '[0-9]+\\.[0-9]+\\.[0-9]+' | xargs printf "cascadia_version=%s\n" > cascadia_version.txt
md5sum '${ms_file.join('\' \'')}' ${ms_file.baseName}.ssl | sed -E 's/([a-f0-9]{32}) [ \\*](.*)/\\2\\t\\1/' | sort > hashes.txt
stat -L --printf='%n\t%s\n' '${ms_file.join('\' \'')}' ${ms_file.baseName}.ssl | sort > sizes.txt
join -t'\t' hashes.txt sizes.txt > output_file_stats_${ms_file.baseName}.txt
"""

stub:
"""
touch stub.ssl
touch stub.stderr stub.stdout
echo "${params.images.cascadia}" | egrep -o '[0-9]+\\.[0-9]+\\.[0-9]+' | xargs printf "cascadia_version=%s\n" > cascadia_version.txt
md5sum '${ms_file.join('\' \'')}' stub.ssl | sed -E 's/([a-f0-9]{32}) [ \\*](.*)/\\2\\t\\1/' | sort > hashes.txt
stat -L --printf='%n\t%s\n' '${ms_file.join('\' \'')}' stub.ssl | sort > sizes.txt
join -t'\t' hashes.txt sizes.txt > output_file_stats_${ms_file.baseName}.txt
"""
}
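The ``cascadia_version.txt`` bookkeeping in the script block above simply scrapes the semantic version out of the container image tag. The same pipeline can be exercised standalone (image tag copied from ``container_images.config``; ``grep -E`` is the modern spelling of ``egrep``):

```shell
# Extract the semantic version from a container image tag, as the script block does
echo "quay.io/protio/cascadia:0.0.7" \
  | grep -Eo '[0-9]+\.[0-9]+\.[0-9]+' \
  | xargs printf "cascadia_version=%s\n" > cascadia_version.txt
cat cascadia_version.txt   # cascadia_version=0.0.7
```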

process CASCADIA_FIX_SCAN_NUMBERS {
publishDir params.output_directories.cascadia, failOnError: true, mode: 'copy'
label 'process_medium'
container params.images.cascadia_utils

input:
tuple path(ms_file), path(ssl_file)

output:
path("*.stderr"), emit: stderr
path("*.stdout"), emit: stdout
path("${ssl_file.baseName}.fixed.ssl"), emit: fixed_ssl
path("cascadia-utils_version.txt"), emit: version
path("output_file_stats.txt"), emit: output_file_stats

script:

"""
python3 /usr/local/bin/fix_scan_numbers.py ${ssl_file} ${ms_file} ${ssl_file.baseName}.fixed.ssl \
> >(tee "fix_scan_numbers.stdout") 2> >(tee "fix_scan_numbers.stderr" >&2)
echo "${params.images.cascadia_utils}" | egrep -o '[0-9]+\\.[0-9]+\\.[0-9]+' | xargs printf "cascadia-utils_version=%s\n" > cascadia-utils_version.txt
md5sum ${ms_file} ${ssl_file} ${ssl_file.baseName}.fixed.ssl | sed -E 's/([a-f0-9]{32}) [ \\*](.*)/\\2\\t\\1/' | sort > hashes.txt
stat -L --printf='%n\t%s\n' ${ms_file} ${ssl_file} ${ssl_file.baseName}.fixed.ssl | sort > sizes.txt
join -t'\t' hashes.txt sizes.txt > output_file_stats.txt
"""

stub:
"""
touch stub.fixed.ssl
touch fix_scan_numbers.stdout fix_scan_numbers.stderr
echo "${params.images.cascadia_utils}" | egrep -o '[0-9]+\\.[0-9]+\\.[0-9]+' | xargs printf "cascadia-utils_version=%s\n" > cascadia-utils_version.txt
md5sum ${ms_file} ${ssl_file} stub.fixed.ssl | sed -E 's/([a-f0-9]{32}) [ \\*](.*)/\\2\\t\\1/' | sort > hashes.txt
stat -L --printf='%n\t%s\n' ${ms_file} ${ssl_file} stub.fixed.ssl | sort > sizes.txt
join -t'\t' hashes.txt sizes.txt > output_file_stats.txt
"""
}
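The ``md5sum``/``stat``/``join`` pattern that builds ``output_file_stats.txt`` in these script blocks produces a tab-separated name/hash/size table. A standalone sketch with throwaway files (GNU coreutils assumed):

```shell
# Build a <name><TAB><md5><TAB><size> table, mirroring the script blocks above
printf 'hello\n' > a.txt
printf 'world\n' > b.txt
# md5sum prints "<hash>  <name>"; swap the columns into "<name><TAB><hash>"
md5sum a.txt b.txt | sed -E 's/([a-f0-9]{32}) [ *](.*)/\2\t\1/' | sort > hashes.txt
# "<name><TAB><size-in-bytes>" for the same files
stat -L --printf='%n\t%s\n' a.txt b.txt | sort > sizes.txt
# Join on the file name (field 1) to get name, hash, size
join -t "$(printf '\t')" hashes.txt sizes.txt > output_file_stats.txt
cut -f1,3 output_file_stats.txt   # file names and byte sizes
```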

process CASCADIA_CREATE_FASTA {
publishDir params.output_directories.cascadia, failOnError: true, mode: 'copy'
label 'process_medium'
container params.images.cascadia_utils

input:
path ssl_file

output:
path("*.stderr"), emit: stderr
path("*.stdout"), emit: stdout
path("${ssl_file.baseName}.fasta"), emit: fasta
path("cascadia-utils_version.txt"), emit: version
path("output_file_stats.txt"), emit: output_file_stats

script:

"""
python3 /usr/local/bin/create_fasta_from_ssl.py ${ssl_file} ${ssl_file.baseName}.fasta \
> >(tee "create_fasta.stdout") 2> >(tee "create_fasta.stderr" >&2)
echo "${params.images.cascadia_utils}" | egrep -o '[0-9]+\\.[0-9]+\\.[0-9]+' | xargs printf "cascadia-utils_version=%s\n" > cascadia-utils_version.txt
md5sum ${ssl_file} ${ssl_file.baseName}.fasta | sed -E 's/([a-f0-9]{32}) [ \\*](.*)/\\2\\t\\1/' | sort > hashes.txt
stat -L --printf='%n\t%s\n' ${ssl_file} ${ssl_file.baseName}.fasta | sort > sizes.txt
join -t'\t' hashes.txt sizes.txt > output_file_stats.txt
"""

stub:
"""
touch stub.ssl stub.fasta
touch create_fasta.stdout create_fasta.stderr
echo "${params.images.cascadia_utils}" | egrep -o '[0-9]+\\.[0-9]+\\.[0-9]+' | xargs printf "cascadia-utils_version=%s\n" > cascadia-utils_version.txt
md5sum stub.ssl stub.fasta | sed -E 's/([a-f0-9]{32}) [ \\*](.*)/\\2\\t\\1/' | sort > hashes.txt
stat -L --printf='%n\t%s\n' stub.ssl stub.fasta | sort > sizes.txt
join -t'\t' hashes.txt sizes.txt > output_file_stats.txt
"""
}
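``create_fasta_from_ssl.py`` is a repo-internal utility whose source is not shown in this commit, but the core idea (one FASTA entry per identified peptide) can be sketched. Entry names below are invented for illustration:

```shell
# Hypothetical sketch only: wrap each peptide sequence in a FASTA entry.
# The real create_fasta_from_ssl.py may name and format entries differently.
printf 'PEPTIDEK\nACDEFGHIK\n' \
  | awk '{ printf(">peptide_%d\n%s\n", NR, $0) }' > peptides.fasta
head -n 2 peptides.fasta
```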

process CASCADIA_COMBINE_SSL_FILES {
publishDir params.output_directories.cascadia, failOnError: true, mode: 'copy'
label 'process_medium'
container params.images.cascadia_utils

input:
path ssl_files

output:
path("combined.ssl"), emit: ssl
path("output_file_stats.txt"), emit: output_file_stats

script:

"""
python3 /usr/local/bin/combine_ssl_files.py *.ssl > combined.ssl
md5sum '${ssl_files.join('\' \'')}' combined.ssl | sed -E 's/([a-f0-9]{32}) [ \\*](.*)/\\2\\t\\1/' | sort > hashes.txt
stat -L --printf='%n\t%s\n' '${ssl_files.join('\' \'')}' combined.ssl | sort > sizes.txt
join -t'\t' hashes.txt sizes.txt > output_file_stats.txt
"""

stub:
"""
touch combined.ssl
md5sum '${ssl_files.join('\' \'')}' combined.ssl | sed -E 's/([a-f0-9]{32}) [ \\*](.*)/\\2\\t\\1/' | sort > hashes.txt
stat -L --printf='%n\t%s\n' '${ssl_files.join('\' \'')}' combined.ssl | sort > sizes.txt
join -t'\t' hashes.txt sizes.txt > output_file_stats.txt
"""
}
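``combine_ssl_files.py`` is likewise repo-internal. Conceptually, combining ssl files means keeping a single header row and concatenating the data rows from every input file; a hedged sketch:

```shell
# Hypothetical sketch: merge two ssl files, keeping one header row.
# The real combine_ssl_files.py may handle headers and columns differently.
printf 'file\tscan\nrun1.mzML\t10\n' > a.ssl
printf 'file\tscan\nrun2.mzML\t20\n' > b.ssl
head -n 1 a.ssl > combined.ssl
for f in a.ssl b.ssl; do tail -n +2 "$f" >> combined.ssl; done
wc -l < combined.ssl   # 3 (one header + two data rows)
```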

process BLIB_BUILD_LIBRARY {
publishDir params.output_directories.cascadia, failOnError: true, mode: 'copy'
label 'process_medium'
container params.images.bibliospec

input:
path ssl
path mzml_files

output:
path('lib.blib'), emit: blib

script:
"""
BlibBuild "${ssl}" lib_redundant.blib
BlibFilter -b 1 lib_redundant.blib lib.blib
"""

stub:
"""
touch lib.blib
"""
}
3 changes: 3 additions & 0 deletions nextflow.config
@@ -53,6 +53,9 @@ params {
// the generated chromatogram library (elib) will always be saved, regardless of this setting
encyclopedia.save_output = true

// options for Cascadia (de novo DIA search)
cascadia.use_gpu = false // whether or not to use an available GPU; must be false if no GPU is available

// optional user-supplied parameters
email = null // email to notify of workflow outcome, leave null to send no email
skyline.template_file = null // the skyline template, if null use default_skyline_template_file
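Since ``cascadia.use_gpu`` defaults to ``false``, GPU use is strictly opt-in. A hedged sketch of a user-side override (only sensible when a GPU and a GPU-aware container engine are actually present):

```groovy
// Hypothetical override in a user pipeline.config.
// Requires an NVIDIA GPU; enables '--nv' (Singularity/Apptainer)
// or '--gpus all' (Docker) via the containerOptions logic in modules/cascadia.nf.
params {
    cascadia.use_gpu = true
}
```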