-
Notifications
You must be signed in to change notification settings - Fork 81
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
dsl2: metagenomics uncollapsed paired end #1098
base: dev
Are you sure you want to change the base?
Conversation
Warning Newer version of the nf-core template is available. Your pipeline is using an old version of the nf-core template: 3.1.2. For more documentation on how to update your pipeline, please see the nf-core documentation and Synchronisation documentation. |
|
dsl-2 metagenomics-pairedend
Ready for review: Basic Overview: We can now retain PE information from unmerged input fastq files (when PE samples going into metagenomics will be separately input into metaphlan, kraken2, and krakenuniq. Malt does not accept PE reads so it is not allowed with the param combo. Also fixed a small error in the parsing of metagenomics input data from bamfiltering --> metagnomics. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Based on visual inspection, it looks good to me!
I would remove the one map-channel manipulation in metagenomics (see comment) as it is not necessary anymore
Also I have not tested anything, so I trust your tests for now :P
subworkflows/local/bamfiltering.nf
Outdated
} | ||
|
||
// Routing for metagenomic screening -> first accounting for paired-end mapping, then merged mapping, then no metagenomics | ||
if ( ( params.run_metagenomics && params.metagenomics_input == 'unmapped' ) && params.preprocessing_skippairmerging ) { | ||
ch_fastq_for_metagenomics = CAT_FASTQ_UNMAPPED.out.reads | ||
ch_fastq_for_metagenomics = CAT_FASTQ_UNMAPPED.out.reads.mix(ch_paired_fastq_for_cat) | ||
} else if ( ( params.run_metagenomics && ( params.metagenomics_input == 'mapped' || params.metagenomics_input == 'all' ) ) && params.preprocessing_skippairmerging ) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
if input now is 'all', the unmapped reads are not taken into account.
I think that needs one more if-condition, with something like
if metagenomics_input == 'all':
CAT_FASTQ_MAPPED.out.reads.mix(CAT_FASTQ_UNMAPPED.out.reads).mix(ch_paired_fastq_for_cat)
but I might miss something here?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It looks like the unmapped reads are not filtered out? then it makes sense...
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
the splitting of all vs only mapped reads is dealt with at the modules.config level.
withName: SAMTOOLS_FASTQ_MAPPED { tag = { "${meta.reference}|${meta.sample_id}_${meta.library_id}" } ext.args = [ params.metagenomics_input == 'all' ? '' : '-F 4',
We could split the module/import into an additional SAMTOOLS_FASTQ_ALL, for clarity for anyone looking into this section of the pipeline... @jfy133 what are your thoughts?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I adjusted this whole section to imporve clarity. please see latest commit.
essentially the splitting of the fastq and cat into unmapped and mapped versions is unnecessary. all parameters can be parsed fine within modules config.
i confirmed that the md5sum of the final files sent to the ch_fastq_for_metagenomics channel are consistent across all the parameter combos that we were there.
|
||
So: needs to be fixed up higher (eg in bamfiltering.nf, likely with a new adjustment to the SAMTOOLS_FASTQ_UNMAPPED, SAMTOOLS_FASTQ_MAPPED, and SAMTOOLS_VIEW_BAM_FILTERING modules ) | ||
|
||
ISSUE FOUND: while the outputting of PE reads is OK in bamfiltering.nf (fastq_mapped & fastq_unmapped) when overlap merging is not done cat_fastq weirdly merges singletons to one PE file and other to the other PE file, so then everything gets fucked up |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
That sounds like a hackathon-thing to do
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
let me double check this behavior and add more context of what we might need to do in a hackathon... i this was more for my own notes reference but was not edited prior to opening the PR.
@@ -137,20 +137,32 @@ workflow METAGENOMICS_PROFILING { | |||
} | |||
|
|||
else if ( params.metagenomics_profiling_tool == 'krakenuniq' ) { | |||
// run krakenuniq once for all samples | |||
// run krakenuniq once for all samples, unless non-merged PE vs SE data | |||
|
|||
ch_krakenuniq_input = ch_reads | |||
.map{ meta, file -> | |||
[ |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
if the meta already contains the informatino, this whole map can go away. I created this map only to add the 'single_end' back to the meta
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
but we do want to have 2 groups of files for input into krakenuniq (which runs all samples in a single execution): one single_end group and one paired end group. Therefore for groupTuple() we need to have a new meta-like field (not in meta explicitly) that keeps just the SE vs PE info -> sort
@@ -118,10 +118,6 @@ workflow PIPELINE_INITIALISATION { | |||
} | |||
[ meta, r1, r2, bam ] | |||
} | |||
.groupTuple() |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
just to make sure - this was the weird input-validation that did nothing and randomly failed?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
correct - the .groupTuple() is nonsense here since each meta will be unique per sample, and then the validateInputSamplesheet is not necessary for our check (the validation fails when a single sample has both PE and SE libraries... but we deal with that. therefore removed.
…tering --> fastq for metagenomics. improves clarity (IMO)
ch_fastq_for_metagenomics = SAMTOOLS_FASTQ_MAPPED.out.other | ||
} else if ( !params.run_metagenomics ) { | ||
// No length/quality filtering applies to metagenomic bam files (could be extension) | ||
// All bam -> fastq filtering options (-F 4, -f 4 or none) will be dealt with within modules.config |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@TCLamnidis please take a quick look -- i restructured everything into a single call, since the modules.config is easily robust enough to deal with the splitting. and it allows for much cleaner code in this section which should help going forward
PR for metagenomics paired-end merging off, retaining that info for metagenomic tools that have PE modes.
This is a QOL improvement for tools that can use paired end information (kraken2, krakenuniq), and now allows for variable inputs into these tools (by splitting inputs into either PE or SE data and running separate instances).
Main issues resolved:
--preprocessing_skippairmerging
used.PR checklist
scrape_software_versions.py
nf-core lint .
).nextflow run . -profile test,docker
).docs/usage.md
is updated.docs/output.md
is updated.CHANGELOG.md
is updated.README.md
is updated (including new tool citations and authors/contributors).