Add InterProScan to Pipeline and integrate in AMPcombi #428

Open
wants to merge 22 commits into
base: dev
Changes from 17 commits
12 changes: 7 additions & 5 deletions CHANGELOG.md
@@ -9,6 +9,7 @@ and this project adheres to [Semantic Versioning](https://semver.org/spec/v2.0.0

- [#421](https://github.com/nf-core/funcscan/pull/421) Updated to nf-core template 3.0.2. (by @jfy133)
- [#427](https://github.com/nf-core/funcscan/pull/427) AMPcombi now can use multiple other databases for classifications. (by @darcy220606)
- [#428](https://github.com/nf-core/funcscan/pull/428) Added InterProScan annotation workflow to the pipeline. The results are integrated into the final AMPcombi table. (by @darcy220606)
- [#429](https://github.com/nf-core/funcscan/pull/429) Updated to nf-core template 3.1.0. (by @jfy133 and @jasmezz)

### `Fixed`
@@ -18,11 +19,12 @@ and this project adheres to [Semantic Versioning](https://semver.org/spec/v2.0.0

### `Dependencies`

| Tool | Previous version | New version |
| -------- | ---------------- | ----------- |
| AMPcombi | 0.2.2 | 2.0.1 |
| Macrel | 1.2.0 | 1.4.0 |
| MultiQC | 1.24.0 | 1.25.1 |
| Tool | Previous version | New version |
| ------------ | ---------------- | ----------- |
| AMPcombi | 0.2.2 | 2.0.1 |
| Macrel | 1.2.0 | 1.4.0 |
| MultiQC | 1.24.0 | 1.25.1 |
| InterProScan | - | 5.59_91.0 |

### `Deprecated`

8 changes: 8 additions & 0 deletions CITATIONS.md
@@ -70,6 +70,14 @@

> Eddy S. R. (2011). Accelerated Profile HMM Searches. PLoS computational biology, 7(10), e1002195. [DOI: 10.1371/journal.pcbi.1002195](https://doi.org/10.1371/journal.pcbi.1002195)

- [InterPro](https://doi.org/10.1093/nar/gkaa977)

> Blum, M., Chang, H-Y., Chuguransky, S., Grego, T., Kandasaamy, S., Mitchell, A., Nuka, G., Paysan-Lafosse, T., Qureshi, M., Raj, S., Richardson, L., Salazar, G.A., Williams, L., Bork, P., Bridge, A., Gough, J., Haft, D.H., Letunic, I., Marchler-Bauer, A., Mi, H., Natale, D.A., Necci, M., Orengo, C.A., Pandurangan, A.P., Rivoire, C., Sigrist, C.A., Sillitoe, I., Thanki, N., Thomas, P.D., Tosatto, S.C.E, Wu, C.H., Bateman, A., Finn, R.D. (2021) The InterPro protein families and domains database: 20 years on, Nucleic Acids Research, 49(D1), D344–D354. [DOI: 10.1093/nar/gkaa977](https://doi.org/10.1093/nar/gkaa977)

- [InterProScan](https://doi.org/10.1093/bioinformatics/btu031)

> Jones, P., Binns, D., Chang, H-Y., Fraser, M., Li, W., McAnulla, C., McWilliam, H., Maslen, J., Mitchell, A., Nuka, G., Pesseat, S., Quinn, A.F., Sangrador-Vegas, A., Scheremetjew, M., Yong, S-Y., Lopez, R., Hunter, S. (2014) InterProScan 5: genome-scale protein function classification, Bioinformatics, 30(9), 1236–1240. [DOI: 10.1093/bioinformatics/btu031](https://doi.org/10.1093/bioinformatics/btu031)

- [Macrel](https://doi.org/10.7717/peerj.10555)

> Santos-Júnior, C. D., Pan, S., Zhao, X. M., & Coelho, L. P. (2020). Macrel: antimicrobial peptide screening in genomes and metagenomes. PeerJ, 8, e10555. [DOI: 10.7717/peerj.10555](https://doi.org/10.7717/peerj.10555)
7 changes: 7 additions & 0 deletions conf/base.config
@@ -230,4 +230,11 @@ process {
memory = { 6.GB * task.attempt }
time = { 2.h * task.attempt }
}

withName: INTERPROSCAN_DATABASE {
memory = { 6.GB * task.attempt }
time = { 4.h * task.attempt }
cpus = { 6 * task.attempt }
}

}
43 changes: 41 additions & 2 deletions conf/modules.config
@@ -83,7 +83,7 @@ process {
]
}

withName: SEQKIT_SEQ {
withName: SEQKIT_SEQ_LENGTH {
ext.prefix = { "${meta.id}_long" }
publishDir = [
path: { "${params.outdir}/bgc/seqkit/" },
@@ -96,6 +96,45 @@ process {
].join(' ').trim()
}

withName: SEQKIT_SEQ_FILTER {
ext.prefix = { "${meta.id}_cleaned.faa" }
publishDir = [
path: { "${params.outdir}/protein_annotation/interproscan/" },
Review comment (Collaborator):

Are we sure we want the output in ${params.outdir}/protein_annotation/interproscan/ and not in ${params.outdir}/annotation/interproscan/? I'd prefer the latter, to have it all in one place regardless of DNA (pyrodigal etc.) or protein annotation (interproscan). I think it's more intuitive to search for any annotation results in a single folder.

If not, what do you think of renaming the annotation output folder to contig_annotation?

Reply (Contributor Author):

The other annotation tools in the annotation workflow annotate CDS, which can also be proteins. I would leave the annotation workflow as is because that's the baseline annotation step of the pipeline. This annotation step is more of an accessory annotation to the pipeline, for users who want more information from different databases, and (1) it's technically not correct to call it protein_annotation because those are not necessarily proteins, and (2) the plan is that we add more to this workflow (e.g. the functional annotation of those CDS), so "protein annotation" will no longer be valid.

Reply (Contributor Author):

I'm not sure it's a good idea to put the output in the same folder, because those are two different workflows.

Reply (Collaborator):

Okay. No strong opinion from my side. I still don't find the naming super intuitive, but it can be. Or why not cds_annotation for the CDS tools (prokka etc.) versus protein_annotation (interproscan)?


mode: params.publish_dir_mode,
enabled: { params.run_protein_annotation_interproscan },
saveAs: { filename -> filename.equals('versions.yml') ? null : filename }
]
ext.args = [
"--gap-letters '* \t.' --remove-gaps"
].join(' ').trim()
}

withName: INTERPROSCAN_DATABASE {
publishDir = [
path: { "${params.outdir}/databases/interproscan/" },
mode: params.publish_dir_mode,
enabled: params.save_db,
saveAs: { filename -> filename.equals('versions.yml') ? null : filename }
]
}

withName: INTERPROSCAN {
ext.prefix = { "${meta.id}_interproscan.faa" }
publishDir = [
path: { "${params.outdir}/protein_annotation/interproscan/" },
mode: params.publish_dir_mode,
enabled: params.run_protein_annotation_interproscan,
saveAs: { filename -> filename.equals('versions.yml') ? null : filename }
]
ext.args = [
"--applications ${params.protein_annotation_interproscan_applications}",
params.protein_annotation_interproscan_enableprecalc ? '' : '--disable-precalc',
params.protein_annotation_interproscan_enableresidueannot ? '' : '--disable-residue-annot',
params.protein_annotation_interproscan_disableresidueannottsv ? '--enable-tsv-residue-annot' : '',
Review comment (Collaborator):

Suggested change
params.protein_annotation_interproscan_disableresidueannottsv ? '--enable-tsv-residue-annot' : '',
params.protein_annotation_interproscan_disableresidueannottsv ? '' : '--enable-tsv-residue-annot',

Reply (Contributor Author, @Darcy220606, Feb 6, 2025):

You suggest the opposite, i.e. changing the default to disable it? Then we won't get any output in the TSV like this!?

Reply (Collaborator):

Hmm no, I changed it to enable it by default, i.e. if the param is false (which is the default), then activate --enable-tsv-residue-annot. Or am I missing some logic of this?


"--formats tsv"
].join(' ').trim()
}
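
To make the flag logic discussed in the review thread concrete, here is a minimal shell sketch of how the `ext.args` string would come out if the reviewer's suggestion is adopted, that is, if `--enable-tsv-residue-annot` is passed unless the `disableresidueannottsv` parameter is true. The function name `build_interproscan_args` and its positional arguments are hypothetical stand-ins for the `params.protein_annotation_interproscan_*` values. The function only builds and prints the argument string; it does not call InterProScan.

```bash
#!/bin/sh
# Hypothetical helper (not part of the pipeline): mirrors the ext.args
# assembly for INTERPROSCAN, using the reviewer's suggested ternary.
#   $1: comma-separated applications
#   $2: enableprecalc          ("true"/"false")
#   $3: enableresidueannot     ("true"/"false")
#   $4: disableresidueannottsv ("true"/"false")
build_interproscan_args() {
    args="--applications $1"
    # Each flag is appended only when its parameter is NOT "true".
    [ "$2" = "true" ] || args="$args --disable-precalc"
    [ "$3" = "true" ] || args="$args --disable-residue-annot"
    [ "$4" = "true" ] || args="$args --enable-tsv-residue-annot"
    printf '%s\n' "$args --formats tsv"
}

# All three boolean parameters set to false:
build_interproscan_args "PANTHER,Pfam" false false false
```

With the ternary as originally written in the diff, the fourth flag would instead be appended only when the `disable...` parameter is true, which is the inversion the reviewer points out.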

withName: PROKKA {
ext.prefix = { "${meta.id}_prokka" }
publishDir = [
@@ -676,7 +715,7 @@ process {

withName: AMP_DATABASE_DOWNLOAD {
publishDir = [
path: { "${params.outdir}/databases/${params.amp_ampcombi_db}" },
path: { "${params.outdir}/databases/ampcombi/" },
mode: params.publish_dir_mode,
enabled: params.save_db,
saveAs: { filename -> filename.equals('versions.yml') ? null : filename },
28 changes: 28 additions & 0 deletions docs/output.md
@@ -25,6 +25,8 @@ results/
| ├── prodigal/
| ├── prokka/
| └── pyrodigal/
├── protein_annotation/
| └── interproscan/
├── amp/
| ├── ampir/
| ├── amplify/
@@ -74,6 +76,10 @@ ORF prediction and annotation with any of:
- [Prokka](#prokka) – open reading frame prediction and functional protein annotation.
- [Bakta](#bakta) – open reading frame prediction and functional protein annotation.

CDS domain annotation:

- [InterProScan](#interproscan) (default) – for open reading frame protein and domain predictions.

Antimicrobial Resistance Genes (ARGs):

- [ABRicate](#abricate) – antimicrobial resistance gene detection, based on alignment to one of several databases.
@@ -216,6 +222,23 @@ Output Summaries:

[Bakta](https://github.com/oschwengers/bakta) is a tool for the rapid & standardised annotation of bacterial genomes and plasmids from both isolates and MAGs. It provides dbxref-rich, sORF-including and taxon-independent annotations in machine-readable JSON & bioinformatics standard file formats for automated downstream analysis. The output is used by some of the functional screening tools.

### Protein annotation

[InterProScan](#interproscan)

#### InterProScan

<details markdown="1">
<summary>Output files</summary>

- `interproscan/`
- `<samplename>_cleaned.faa`: cleaned version of the FASTA files (in amino acid format) generated by one of the annotation tools (i.e. Pyrodigal, Prokka, Bakta or Prodigal). These contain sequences with no special characters (e.g. '\*' or '-').
- `<samplename>_interproscan.faa.tsv`: predicted proteins and domains using the InterPro database in TSV format

</details>

[InterProScan](https://academic.oup.com/bioinformatics/article/30/9/1236/237988?login=true) predicts protein function and provides possible domain and motif information for the coding regions. It uses the InterPro database, which consists of multiple member databases such as PANTHER, ProSite, Pfam, etc. More details can be found in the [documentation](https://interproscan-docs.readthedocs.io/en/latest/index.html).

### AMP detection tools

[ampir](#ampir), [AMPlify](#amplify), [hmmsearch](#hmmsearch), [Macrel](#macrel)
@@ -465,6 +488,11 @@ Note that filtered FASTA is only used for BGC workflow for run-time optimisation
- `<sample>/*_ampcombi.tsv`: summarised output in tsv format for each sample
- `<sample>/*_amp.faa*`: fasta file containing the amino acid sequences for all AMP hits for each sample
- `<sample>/*_mmseqs_matches.txt*`: alignment file generated by MMseqs2 for each sample

:::info
In some cases, when the AMP workflow is turned on, only per-sample summary files are created in the output folder, with **NO** `Ampcombi_summary.tsv` and `Ampcombi_summary_cluster.tsv` files, and hence no taxonomic classifications are merged (if the taxonomic classification subworkflow is turned on). This can occur when strict parameter settings lead to AMP hits being found in none of the samples or in only one sample. Look out for `[nf-core/funcscan] AMPCOMBI2: 0/1 file passed. Skipping AMPCOMBI2_COMPLETE, AMPCOMBI2_CLUSTER, and TAXONOMY MERGING steps.` in the stdout or `.nextflow.log` file.
:::

<summary>AMP summary table header descriptions using DRAMP as reference database</summary>

| Table column | Description |
32 changes: 31 additions & 1 deletion docs/usage.md
@@ -111,7 +111,7 @@ We highly recommend performing quality control on input contigs before running t
For example, ideally BGC screening requires contigs of at least 3,000 bp else downstream tools may crash.
:::

## Notes on screening tools and taxonomic classification
## Notes on screening tools, taxonomic and functional classifications

The implementation of some tools in the pipeline may have some particular behaviours that you should be aware of before you run the pipeline.

@@ -133,6 +133,18 @@ MMseqs2 is currently the only taxonomic classification tool used in the pipeline
--taxa_classification_mmseqs_db_id 'Kalamari'
```

### InterProScan

[InterProScan](https://github.com/ebi-pf-team/interproscan) is currently the only protein annotation tool in the pipeline; it gives a snapshot of the protein families and domains for each coding region. When `--run_protein_annotation_interproscan` is given, the [InterPro database](http://ftp.ebi.ac.uk/pub/software/unix/iprscan/5/5.67-99.0/) v5.67-99.0 is downloaded and prepared by default, and the input sequences are screened against it. To stop the pipeline from downloading the database on each run, manually download and extract the files from any [InterPro version](http://ftp.ebi.ac.uk/pub/software/unix/iprscan/5/) and give the resulting directory path to `--protein_annotation_interproscan_db`.

```bash
--protein_annotation_interproscan_db 'path/to/InterPro_directory/'
```

:::info
By default, the databases used to assign the nearest protein domain are set to `PANTHER,ProSiteProfiles,ProSitePatterns,Pfam`. Adding other applications to the list does not guarantee that the results will be integrated correctly within `AMPcombi`.
:::

### antiSMASH

antiSMASH has a minimum contig parameter, in which only contigs of a certain length (or longer) will be screened. In cases where no hits are found in these, the tool ends successfully without hits. However if no contigs in an input file reach that minimum threshold, the tool will end with a 'failure' code, and cause the pipeline to crash.
@@ -258,6 +270,12 @@ The pipeline will automatically run Pyrodigal instead of Prodigal if the paramet
This is due to an incompatibility issue of Prodigal's output `.gbk` file with multiple downstream tools.
:::

:::tip

- If `--run_protein_annotation_interproscan` is given, protein and domain classifications of the coding regions are generated, and the output is then integrated into the `AMPcombi parsetables` table for every sample and into the complete summary files, e.g. `Ampcombi_summary.tsv`.
- In some cases, when the AMP workflow is turned on, only per-sample summary files are created in the output folder, with **NO** `Ampcombi_summary.tsv` and `Ampcombi_summary_cluster.tsv` files, and hence no taxonomic classifications are merged (if the taxonomic classification subworkflow is turned on). This can occur when strict parameter settings lead to AMP hits being found in none of the samples or in only one sample. Look out for `[nf-core/funcscan] AMPCOMBI2: 0/1 file passed. Skipping AMPCOMBI2_COMPLETE, AMPCOMBI2_CLUSTER, and TAXONOMY MERGING steps.` in the stdout or `.nextflow.log` file.
:::
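
As a quick way to check for the skip message mentioned in the tip, the following sketch searches a Nextflow log for part of that message. The helper name `ampcombi_summary_skipped` is hypothetical, and the sketch assumes the log contains the message text verbatim as quoted in the tip.

```bash
# Hypothetical helper: exits 0 if the given Nextflow log contains the
# AMPcombi skip message (matched as a fixed string, not a regex).
ampcombi_summary_skipped() {
    grep -q -F "Skipping AMPCOMBI2_COMPLETE" "$1"
}

# Example: check the log in the launch directory, if one exists.
if [ -f .nextflow.log ] && ampcombi_summary_skipped .nextflow.log; then
    echo "AMPcombi summary steps were skipped: no Ampcombi_summary.tsv expected"
fi
```

Matching only the `Skipping AMPCOMBI2_COMPLETE` fragment avoids depending on how the file count at the start of the message is printed.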

### Abricate

The default ABRicate installation comes with a series of 'default' databases:
@@ -509,6 +527,18 @@ deepbgc_db/
└── myDetectors*.pkl
```

### InterProScan

[InterProScan](https://github.com/ebi-pf-team/interproscan) is used to provide more information about the proteins annotated on the contigs. By default, turning on this subworkflow with `--run_protein_annotation_interproscan` will download and unzip the (as of now) latest [InterPro database](http://ftp.ebi.ac.uk/pub/software/unix/iprscan/5/5.67-99.0/) v5.67-99.0. The downloaded database can be saved in the output directory `<output_directory>/databases/interproscan/*` if `--save_db` is turned on. Note: the download can take up to 4 hours depending on the bandwidth.

A different version of the database can be supplied to the pipeline by passing the InterProScan database directory to `--protein_annotation_interproscan_db path/to/interproscan_db/`. The directory can be created with:

```
curl -L https://ftp.ebi.ac.uk/pub/software/unix/iprscan/5/5.67-99.0/interproscan-5.67-99.0-64-bit.tar.gz -o interproscan_db/interproscan-5.67-99.0-64-bit.tar.gz
tar -xzf interproscan_db/interproscan-5.67-99.0-64-bit.tar.gz -C interproscan_db/
Review comment on lines +534 to +535 (Collaborator):

Suggested change
curl -L https://ftp.ebi.ac.uk/pub/software/unix/iprscan/5/5.67-99.0/interproscan-5.67-99.0-64-bit.tar.gz -o interproscan_db/interproscan-5.67-99.0-64-bit.tar.gz
tar -xzf interproscan_db/interproscan-5.67-99.0-64-bit.tar.gz -C interproscan_db/
curl -L https://ftp.ebi.ac.uk/pub/software/unix/iprscan/5/5.72-103.0/interproscan-5.72-103.0-64-bit.tar.gz -o interproscan_db/interproscan-5.72-103.0-64-bit.tar.gz
tar -xzf interproscan_db/interproscan-5.72-103.0-64-bit.tar.gz -C interproscan_db/

Reply (Contributor Author, @Darcy220606, Feb 6, 2025):

Jasmin, did you test it with this DB version? I would not change this unless we test it first, to make sure that it doesn't break anything.

Reply (Collaborator):

No, I didn't. But you're right, then I will download it to our server and test it; would be nice to have a more recent version towards the pipeline release.

Reply (Collaborator, @jasmezz, Feb 6, 2025):

I tested the database version 5.72-103.0 and can confirm that all output files contain valid results, i.e. results/protein_annotation/interproscan/<samplename>_interproscan.faa.tsv and results/reports/ampcombi2/Ampcombi_summary.tsv. In my test case they are even identical. We can go ahead with the new version, so please commit my database update comments before merge :)
```
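
Note that the `curl` command above writes into `interproscan_db/`, so that directory has to exist first (the local `INTERPROSCAN_DATABASE` module creates it with `mkdir -p`). A minimal sketch of the same download with the release number factored out follows. The helper names `interproscan_url` and `download_interproscan_db` are hypothetical; only the URL layout is taken from the commands above.

```bash
# Hypothetical helpers, not part of the pipeline.
interproscan_url() {
    # $1: InterProScan release, e.g. 5.67-99.0
    printf 'https://ftp.ebi.ac.uk/pub/software/unix/iprscan/5/%s/interproscan-%s-64-bit.tar.gz\n' "$1" "$1"
}

download_interproscan_db() {
    # $1: release, $2: target directory (created if missing)
    url=$(interproscan_url "$1")
    mkdir -p "$2" || return 1
    # -f makes curl fail on HTTP errors instead of saving an error page.
    curl -fL "$url" -o "$2/$(basename "$url")" || return 1
    tar -xzf "$2/$(basename "$url")" -C "$2"
}

# Print the URL that would be fetched (no network access needed):
interproscan_url 5.67-99.0
# To actually download and extract (large download):
# download_interproscan_db 5.67-99.0 interproscan_db
```

The resulting directory can then be passed to `--protein_annotation_interproscan_db` as described above.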

## Updating the pipeline

When you run the above command, Nextflow automatically pulls the pipeline code from GitHub and stores it as a cached version. When running the pipeline after this, it will always use the cached version if available - even if the pipeline has been updated since. To make sure that you're running the latest version of the pipeline, make sure that you regularly update the cached version of the pipeline:
5 changes: 5 additions & 0 deletions modules.json
@@ -140,6 +140,11 @@
"git_sha": "666652151335353eef2fcd58880bcef5bc2928e1",
"installed_by": ["modules"]
},
"interproscan": {
"branch": "master",
"git_sha": "666652151335353eef2fcd58880bcef5bc2928e1",
"installed_by": ["modules"]
},
"macrel/contigs": {
"branch": "master",
"git_sha": "666652151335353eef2fcd58880bcef5bc2928e1",
35 changes: 35 additions & 0 deletions modules/local/interproscan_download.nf
@@ -0,0 +1,35 @@
process INTERPROSCAN_DATABASE {
tag "interproscan_database_download"
label 'process_medium'

conda "conda-forge::sed=4.7"
container "${ workflow.containerEngine == 'singularity' && !task.ext.singularity_pull_docker_container ?
'https://depot.galaxyproject.org/singularity/curl:7.80.0' :
'biocontainers/curl:7.80.0' }"

input:
val database_url

output:
path("interproscan_db/*") , emit: db
path "versions.yml" , emit: versions

when:
task.ext.when == null || task.ext.when

script:
"""
mkdir -p interproscan_db/

filename=\$(basename ${database_url})

curl -L ${database_url} -o interproscan_db/\$filename
tar -xzf interproscan_db/\$filename -C interproscan_db/

cat <<-END_VERSIONS > versions.yml
"${task.process}":
tar: \$(tar --version 2>&1 | sed -n '1s/tar (busybox) //p')
curl: "\$(curl --version 2>&1 | sed -n '1s/^curl \\([0-9.]*\\).*/\\1/p')"
END_VERSIONS
"""
}
5 changes: 5 additions & 0 deletions modules/nf-core/interproscan/environment.yml

Some generated files are not rendered by default. Learn more about how customized files appear on GitHub.

66 changes: 66 additions & 0 deletions modules/nf-core/interproscan/main.nf
