-
Notifications
You must be signed in to change notification settings - Fork 22
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Add InterProScan to Pipeline and integrate in AMPcombi #428
base: dev
Are you sure you want to change the base?
Conversation
|
Warning Newer version of the nf-core template is available. Your pipeline is using an old version of the nf-core template: 3.1.0. For more documentation on how to update your pipeline, please see the nf-core documentation and Synchronisation documentation. |
@nf-core-bot fix linting |
@nf-core-bot fix linting |
Also fixes issue number #434 |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Main issue is I don't like the use of function
, we already use functon
in funcscan
in a broad sense... can you refine what exactly we are using interproscan for and then we can adjust the naming
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Almost there!
Also missing README update
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Great work overall 💪
Now that you introduce the protein_annotation
workflow, I wonder if we should rename the DNA-level annotation
workflow (of pyrodigal, bakta etc.). Maybe to contig_annotation
, cds_annotation
, or orf_annotation
?
withName: SEQKIT_SEQ_FILTER { | ||
ext.prefix = { "${meta.id}_cleaned.faa" } | ||
publishDir = [ | ||
path: { "${params.outdir}/protein_annotation/interproscan/" }, |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Are we sure we want the output in ${params.outdir}/protein_annotation/interproscan/
and not in ${params.outdir}/annotation/interproscan/
? I'd prefer the latter, to have it all in one place regardless of DNA (pyrodigal etc.) or protein annotation (interproscan). I think it's more intuitive to search for any annotation results in a single folder.
If not, what do you think of renaming the annotation
output folder to contig_annotation
?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The other annotation tools in the annotation workflow are annotating CDS which also can be proteins. I would leave the annotation workflow as is because thats the baseline annotation step of the pipeline. This annotation step is more of an accessory annotation to the pipeline if the user wants more information from diff DB and (1) its technically not correct protein_annotation because those are nnot necessary proteins and (2) the plan is that we add more to this workflow (e.g. the functional annotation of those CDS) so protein annotation will no longer be valid.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The ouput im not sure if its a good idea to add it in the same folder because those are two diff workflows
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Okay. No strong opinion from my side. I still don't find the naming super intuitive, but it can be. Or why not cds_annotation
for the CDS tools (prokka etc.) versus protein_annotation
(interproscan)?
ch_versions = ch_versions.mix( INTERPROSCAN.out.versions ) | ||
ch_interproscan_tsv = ch_interproscan_tsv.mix( INTERPROSCAN.out.tsv ) | ||
|
||
// Current INTERPROSCAN version 5.59_91.0 only includes 13 columns and not 15 which ampcombi expects, so we added them here |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Isn't this something to solve upstream on AMPcombi side? 😬 Is ok for now I guess, but better to have this column number check done by AMPcombi instead of pipeline level.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
And thats where updating the database version is not a good idea before testing.
AMPcombi was developed on an updated version of InterProScan to the one available as a nf-core module. So its rather updating the module on nf-core than downgrading AMPcombi.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Okay! Fair point if AMPcombi is tested to work with a specific interproscan version 👍 Just need to remember to edit this code when interproscan nf-core module gets updated one day. We could do it 🤓
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'll check if it's feasible to test the most recent interproscan db on our server in order to update it here. If not, then go ahead as is. 👍
"--applications ${params.protein_annotation_interproscan_applications}", | ||
params.protein_annotation_interproscan_enableprecalc ? '' : '--disable-precalc', | ||
params.protein_annotation_interproscan_enableresidueannot ? '' : '--disable-residue-annot', | ||
params.protein_annotation_interproscan_disableresidueannottsv ? '--enable-tsv-residue-annot' : '', |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Hmm no, I changed to enable it by default, i.e. if the param is false (which is default), then activate --enable-tsv-residue-annot
. Or am I missing some logic of this?
curl -L https://ftp.ebi.ac.uk/pub/software/unix/iprscan/5/5.67-99.0/interproscan-5.67-99.0-64-bit.tar.gz -o interproscan_db/interproscan-5.67-99.0-64-bit.tar.gz | ||
tar -xzf interproscan_db/interproscan-5.67-99.0-64-bit.tar.gz -C interproscan_db/ |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
No, I didn't. But you're right, then I will download it to our server and test it; would be nice to have a more recent version towards the pipeline release.
}, | ||
"protein_annotation_interproscan_disableresidueannottsv": { | ||
"type": "boolean", | ||
"help_text": "This disables the addition of the annotations assigned according to the databases activated to the final `<sample>_interproscan.tsv` file. Turning this on will remove the annotations from the final table. It is not recommended to use this option, as it will cause a run failure because the format of the resulting files will no longer be adequate for integration in the final summary tables. Currently, only applicable for AMPcombi2. \n\nFor more information about this flag see the tool [documentation](https://interproscan-docs.readthedocs.io/en/latest/HowToRun.html).\n\n> Modifies tool parameter(s):\n> - InterProScan: `--enable-tsv-residue-annot`\n", |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
They cannot add it by ext.arg :D Everything that should be ext.arg is a pipeline parameter. If we don't offer it, the user cannot use it. That's why we add as much as necessary and leave out everything that shouldn't be touched (or is too irrelevant).
ch_versions = ch_versions.mix( INTERPROSCAN.out.versions ) | ||
ch_interproscan_tsv = ch_interproscan_tsv.mix( INTERPROSCAN.out.tsv ) | ||
|
||
// Current INTERPROSCAN version 5.59_91.0 only includes 13 columns and not 15 which ampcombi expects, so we added them here |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Okay! Fair point if AMPcombi is tested to work with a specific interproscan version 👍 Just need to remember to edit this code when interproscan nf-core module gets updated one day. We could do it 🤓
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Things to do before merge:
- address the database update comments from my previous review since I tested the new interpro scan db version today (see comment here)
- either fix ampcombi/parsetables error as reported in AMP screening doesn't run when only a single tool is enabled #444 or outsource this to a different PR
PR checklist
This PR adds InterProScan to FUNCSCAN. It also integrates it into AMPcombi v2.0.1, which can parse its output as an optional flag.
This PR also closes issue #434
🚨 🚨 As interproscan requires a large database, i have not added it to any of the CI tests as that would require 4 hours for just downloading the database!!!!
👀 👀 👀 👀 👀 👀 Still TODO once AMPcombi 2.0.1 is updated in nf-core: DONE!!nf-core lint
).nextflow run . -profile test,docker --outdir <OUTDIR>
).nextflow run . -profile debug,test,docker --outdir <OUTDIR>
).docs/usage.md
is updated.docs/output.md
is updated.CHANGELOG.md
is updated.README.md
is updated (including new tool citations and authors/contributors).