Welcome to ICB!
ICBdocker is a computational pipeline for protein annotation. Using an integrative approach, it allows the user to annotate several protein features such as domains, transmembrane helices (TMHs), disorder and identity through PSIBLAST, HHpred and InterProScan.
PSIBLAST performs a homology search of your protein against UniProt/SwissProt, which provides information on similar hits with their Gene Ontology (GO) terms, keywords and Enzyme Commission (EC) numbers.
HHpred consists of several tools that allow you to map your sequence onto the Protein Data Bank (PDB). This module provides information on proteins that are structurally similar to your proteins of interest, including information on function, GO terms and EC numbers. From the raw data you can obtain the alignments for subsequent modelling with tools like MODELLER.
HHblits maps sequence domains on your proteins by creating a multiple sequence alignment based on the UniProt20 Hidden Markov Model Database. Secondary structure prediction of the sequence is added using PSIPRED, increasing the sensitivity of HHblits.
InterProScan searches for protein signatures on several databases, including PFAM, PANTHER and SUPERFAMILY among others, and results are parsed for KEGG-pathway entries and additional GO terms.
SignalP predicts the presence or absence of a signal peptide in your sequences, indicating the probability that they are secreted.
IUPRED predicts the average level of disorder of your protein, giving information on the globular portions of the input proteins.
TMHMM predicts the presence and location of transmembrane helices (TMHs) in the provided sequences, indicating which portions of the proteins are extracellular or periplasmic.
The input: your sequences in a single multiFASTA file.
UniProt header example:
>tr|A0A015K2T4|A0A015K2T4_9GLOM
GenBank header example:
>gi|497643701|ref|WP_009957885.1| GDP-fucose synthetase [Verrucomicrobium spinosum]
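Before launching a long run, it can help to check that the headers in your multiFASTA file follow one of the two layouts shown above. The following is a minimal sketch, not part of ICB itself: the function name `parse_fasta_ids` is hypothetical, and it only assumes the pipe-separated layouts from the two examples.

```python
def parse_fasta_ids(lines):
    """Yield (database, accession) for every FASTA header line in `lines`."""
    for line in lines:
        if not line.startswith(">"):
            continue  # sequence line, not a header
        fields = line[1:].strip().split("|")
        if fields[0] in ("sp", "tr") and len(fields) >= 3:
            # UniProt layout: >db|accession|entry_name
            yield fields[0], fields[1]
        elif fields[0] == "gi" and len(fields) >= 4:
            # GenBank layout: >gi|number|db|accession| description
            yield fields[2], fields[3]
        else:
            # Layout not covered by the two examples: report the raw header
            yield "unknown", line[1:].strip()
```

Headers reported as "unknown" are not necessarily rejected by the pipeline; the sketch only flags those that differ from the documented examples.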
The output: DataTables to browse your data (example: http://pvcbacteria.org/mywiki/pipeline-tables/Blastopirellula_marina.html),
a tab-separated-values file (can be imported into R or Excel),
and raw data from the predictors.
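Besides R or Excel, the tab-separated file can also be read with Python's standard library. This is a minimal sketch that assumes the first row of the file holds the column names; the actual column names are not documented here, so none are hard-coded, and the function name `load_tsv` is hypothetical.

```python
import csv


def load_tsv(path):
    """Return the rows of a tab-separated file as a list of dicts.

    Assumes the first row contains the column names.
    """
    with open(path, newline="", encoding="utf-8") as handle:
        return list(csv.DictReader(handle, delimiter="\t"))
```

Each row is a dictionary keyed by column name, so `rows[0].keys()` shows which columns your results file actually contains.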
The ICB pipeline runs on every system architecture and OS supported by Docker. The pipeline supports multicore systems.
While the Docker container can run even on a laptop, the pipeline requires a considerable amount of hard disk space (~130GB) due to the size of the databases, and was built with server-side processing in mind.
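Since the image needs roughly 130GB, it may be worth checking the free space before pulling it. The sketch below uses `shutil.disk_usage` from the Python standard library; the function name `has_enough_space` and the choice of path are assumptions, and you should point it at the filesystem where Docker stores its images on your machine.

```python
import shutil

REQUIRED_GB = 130  # approximate figure quoted above


def has_enough_space(path, required_gb=REQUIRED_GB):
    """Return True if the filesystem holding `path` has enough free space."""
    free_gb = shutil.disk_usage(path).free / 1e9
    return free_gb >= required_gb
```

For example, `has_enough_space("/var/lib/docker")` would check the default Docker data directory on many Linux installations; the location differs on other systems.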
The setup is quite easy.
- Install Docker for your system.
- Pull the ICB image from DockerHub.
- Point the container to your data and run it.
- Run ICB.
- Enjoy!
Get Docker for your OS from https://store.docker.com/search?type=edition&offering=community.
To pull the image:
docker pull bordin89/icb
The container already contains all the tools and databases. You'll need to mount the folder containing your data on the container folder "/data/". Run the container using the command
docker run -it -v /path/in/my/computer/to/data:/data/ bordin89/icb
An example (note that the folder containing the FASTA file is mounted, not the file itself):
docker run -it -v /cluster/data/proteomes/:/data/ bordin89/icb
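If you launch the container from a script, the command can be assembled programmatically. The following is a minimal sketch under the assumption stated above that a folder is mounted on "/data/"; the function name `build_docker_run` is hypothetical, and only the `docker run` options already used in this document appear in it.

```python
from pathlib import Path


def build_docker_run(data_dir, image="bordin89/icb"):
    """Return the argv list that mounts `data_dir` on /data/ in the container."""
    path = Path(data_dir)
    if not path.is_dir():
        # The mount source is expected to be a folder, not a FASTA file
        raise NotADirectoryError(f"{data_dir} is not a folder")
    # Docker expects an absolute host path for bind mounts
    return ["docker", "run", "-it", "-v", f"{path.resolve()}:/data/", image]
```

The returned list can be passed to `subprocess.run(...)`; building it as a list rather than a string avoids quoting problems with paths that contain spaces.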
Once inside the container, enter the folder containing your data.
cd /data/
To see the main script's help message, type
python /ICB_docker/icb.py -h
You can run all the modules at once, or just the ones you need.
python /ICB_docker/icb.py -i my_sequences.fasta -all
python /ICB_docker/icb.py -i my_sequences.fasta -tmh -hhblits -hhpred -iup -psiblast
Modules and related command-line parameters
PSIBLAST (-psiblast)
HHPred on PDB (-hhpred)
HHblits on UniProt (-hhblits)
InterProScan (-ipr)
SignalP (-sigp)
IUPRED (-iup)
TMHMM (-tmh)
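The module flags listed above can be combined freely. As an illustration, this minimal sketch builds the `icb.py` command line from a list of module names and rejects names it does not know. The dictionary keys and the function name `build_icb_command` are hypothetical; the flags and the script path are the ones given in this document.

```python
# Module name -> command-line flag, as listed above
MODULE_FLAGS = {
    "psiblast": "-psiblast",
    "hhpred": "-hhpred",
    "hhblits": "-hhblits",
    "interproscan": "-ipr",
    "signalp": "-sigp",
    "iupred": "-iup",
    "tmhmm": "-tmh",
}


def build_icb_command(fasta, modules=None):
    """Return the argv list for icb.py; no modules means run everything."""
    command = ["python", "/ICB_docker/icb.py", "-i", fasta]
    if not modules:
        return command + ["-all"]
    unknown = [m for m in modules if m not in MODULE_FLAGS]
    if unknown:
        raise ValueError(f"unknown module(s): {', '.join(unknown)}")
    return command + [MODULE_FLAGS[m] for m in modules]
```

Inside the container, the result could be executed with `subprocess.run(build_icb_command("my_sequences.fasta", ["tmhmm", "iupred"]), check=True)`.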
A config.txt file is located at /ICB_docker/config.txt. Modify the "cpus = 2" parameter according to your machine.
It sets the number of cores available for PSIBLAST, HHpred and HHblits.
The same value needs to be passed to the container at launch, for example:
docker run -it -v /cluster/data/:/data/ --cpus 10 bordin89/icb
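The "cpus" value in config.txt can also be changed from a script rather than by hand. This is a minimal sketch that assumes only what is stated above, namely that the file contains a line of the form "cpus = 2"; the rest of the file layout is unknown and is left untouched. The function name `set_cpus` is hypothetical.

```python
import re


def set_cpus(config_text, cpus):
    """Return `config_text` with the value on the `cpus = N` line replaced."""
    if cpus < 1:
        raise ValueError("cpus must be a positive integer")
    # Keep the original "cpus = " prefix and swap only the number
    new_text, count = re.subn(
        r"^(\s*cpus\s*=\s*)\d+", rf"\g<1>{cpus}", config_text, flags=re.MULTILINE
    )
    if count == 0:
        raise ValueError("no 'cpus = N' line found")
    return new_text
```

Inside the container, you would read /ICB_docker/config.txt, pass its contents through `set_cpus`, and write the result back, using the same number you gave to `--cpus`.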
Your results are written to the same input folder. There you can find a tsv (tab-separated-values) file and the HTML page with the DataTables. You can also find the raw data from the predictors, organized in archives and folders.