4. Functional annotation with Prodigal and MicrobeAnnotator
Unitigs vary in length, and some are too short to search for gene sequences, so they need to be filtered by size first. You can use seqkit:
seqkit seq --min-length 1000 case_kmers.unitigs.fa > case_unitigs.1000.fa
The mean size of a bacterial gene is about 1,000 bp, hence the threshold. You could lower the threshold, but keep in mind that a unitig may hold only part of a gene sequence. A lower threshold lets through shorter gene fragments, which leads to less precise annotation.
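To make the effect of the threshold concrete, here is a minimal Python sketch of the same length filter. It is not part of the pipeline: the function names `read_fasta` and `filter_by_length` are hypothetical, and the sketch assumes plain, uncompressed FASTA input. Like `seqkit seq --min-length`, it keeps sequences whose length is greater than or equal to the threshold.

```python
import io


def read_fasta(handle):
    """Yield (header, sequence) pairs from an open FASTA file."""
    header, chunks = None, []
    for line in handle:
        line = line.strip()
        if not line:
            continue  # skip blank lines
        if line.startswith(">"):
            if header is not None:
                yield header, "".join(chunks)
            header, chunks = line[1:], []
        else:
            chunks.append(line)
    if header is not None:
        yield header, "".join(chunks)


def filter_by_length(records, min_length=1000):
    """Keep records whose sequence has at least min_length bases."""
    return [(h, s) for h, s in records if len(s) >= min_length]


# Toy input: one 1,200 bp unitig and one 300 bp unitig (made-up sequences).
toy = io.StringIO(
    ">unitig_long\n" + "ACGT" * 300 + "\n>unitig_short\n" + "ACGT" * 75 + "\n"
)
kept = filter_by_length(read_fasta(toy), min_length=1000)
print([header for header, _ in kept])  # ['unitig_long']
```

Trying a few values of `min_length` on your own unitigs shows how many sequences each threshold would discard before you commit to one.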
Prodigal finds genes in sequences. It outputs the gene coordinates (`*.gbk`) as well as the protein translations (`*.faa`) of the detected genes.
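For orientation, a Prodigal run producing these two files could look like `prodigal -i case_unitigs.1000.fa -f gbk -o case_genes.gbk -a case_proteins.faa -p meta`. The output file names are hypothetical, and `-p meta` (Prodigal's mode for metagenomic or short input) is an assumption, not necessarily what this pipeline runs. Each record header in the `*.faa` file encodes the source sequence, the gene coordinates, the strand, and a `partial` flag, so a short Python sketch can recover which unitig each gene came from and whether the gene is truncated at a unitig edge. The function name and the example header below are illustrative.

```python
def parse_prodigal_header(header):
    """Parse one Prodigal protein FASTA header (without the leading '>')."""
    name, start, end, strand, attributes = [f.strip() for f in header.split(" # ")]
    # Prodigal names each gene "<input sequence ID>_<gene number>".
    unitig_id, gene_number = name.rsplit("_", 1)
    attrs = dict(item.split("=", 1) for item in attributes.split(";") if item)
    return {
        "unitig_id": unitig_id,
        "gene_number": int(gene_number),
        "start": int(start),
        "end": int(end),
        "strand": "+" if strand == "1" else "-",
        # partial="00" means both gene boundaries were found; a "1" marks a
        # gene that runs off the left (first digit) or right (second digit) edge.
        "is_partial": attrs.get("partial", "00") != "00",
    }


example = (
    "unitig_12_1 # 2 # 181 # 1 # "
    "ID=1_1;partial=10;start_type=Edge;rbs_motif=None;rbs_spacer=None;gc_cont=0.550"
)
gene = parse_prodigal_header(example)
print(gene["unitig_id"], gene["start"], gene["end"], gene["strand"], gene["is_partial"])
# unitig_12 2 181 + True
```

Counting the genes flagged as partial is one way to judge whether the length threshold chosen above leaves too many truncated genes.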
The protein sequences obtained from Prodigal are then annotated with MicrobeAnnotator, which iteratively searches several databases for annotations. The results are a matrix of pathway completeness, a barplot of the pathways that are more than 50% complete, and a heatmap of pathways clustered by completeness. Together, these outputs make it easier to compare and contrast the conditions.
This step produces three files: `metabolic_summary__heatmap.pdf`, `metabolic_summary__barplot.pdf`, and `metabolic_summary__module_completeness.tab`. `metabolic_summary__barplot.pdf` shows the number of pathways that are at least 50% complete, colored by category. `metabolic_summary__heatmap.pdf` shows, with their names, the pathways that are at least 80% complete for each condition, colored along a gradient by completeness and clustered by completeness. This file is particularly useful for determining which pathways differentiate the conditions the most.
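The tab-separated completeness table can also be queried directly. The Python sketch below lists the modules that reach a completeness threshold in at least one condition and ranks them by the spread between their highest and lowest completeness, as a rough proxy for how strongly they differentiate the conditions. The column layout (three descriptive columns named `module`, `name`, and `pathway group`, followed by one percentage column per condition) is an assumption about the file, so check the header of your own table first. The function name and the toy values are made up.

```python
import csv
import io


def rank_modules(handle, threshold=80.0,
                 meta_columns=("module", "name", "pathway group")):
    """Return (module, spread, values) tuples, largest spread first."""
    reader = csv.DictReader(handle, delimiter="\t")
    # Assumption: every column that is not descriptive holds one condition.
    conditions = [c for c in reader.fieldnames if c not in meta_columns]
    ranked = []
    for row in reader:
        values = {c: float(row[c]) for c in conditions}
        if max(values.values()) >= threshold:
            spread = max(values.values()) - min(values.values())
            ranked.append((row[meta_columns[0]], spread, values))
    return sorted(ranked, key=lambda item: item[1], reverse=True)


# Toy table with invented module IDs and values, for illustration only.
toy = io.StringIO(
    "module\tname\tpathway group\tcase\tcontrol\n"
    "M1\tPathway A\tGroup X\t100.0\t25.0\n"
    "M2\tPathway B\tGroup X\t90.0\t85.0\n"
    "M3\tPathway C\tGroup Y\t40.0\t10.0\n"
)
ranking = rank_modules(toy)
for module, spread, values in ranking:
    print(module, spread, values)
# M1 75.0 {'case': 100.0, 'control': 25.0}
# M2 5.0 {'case': 90.0, 'control': 85.0}
```

Modules at the top of the ranking are nearly complete in one condition and largely missing in another, which is the contrast the heatmap is meant to expose visually.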
To make the output easier to manipulate and inspect in detail, we provide a table for each condition showing which unitig contains which gene, along with the gene's translation, function, and KO number. The tables look like this:
| Unitig ID | Unitig seq | Gene ID | Gene seq | Gene function | KO |
|---|---|---|---|---|---|
| Unitig1 | ACGTCGCT | Gene1 | APWHLE | Glucose transferase | K00001 |
| Unitig1 | ACGTCGCT | Gene2 | WGGH | Protease | K00004 |
| Unitig2 | GTCGATCATG | Gene1 | KLMF | Oxidase | K00761 |
Here, the Unitig1 sequence contains two genes (Gene1 and Gene2). The table gives their translated sequences as well as their respective functions and KO numbers.
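A table of this shape can be assembled by joining three lookups: the unitig sequences, the genes predicted on each unitig, and the annotation of each gene. The Python sketch below illustrates that join using the toy values from the example table. The function name, the argument layout, and the choice to leave unannotated genes with empty function and KO fields are assumptions, not the pipeline's actual implementation.

```python
def build_rows(unitigs, genes, annotations):
    """Join unitigs, predicted genes and annotations into table rows.

    unitigs:     {unitig_id: unitig sequence}
    genes:       [(unitig_id, gene_id, translated sequence), ...]
    annotations: {(unitig_id, gene_id): (function, KO number)}
    """
    rows = []
    for unitig_id, gene_id, protein in genes:
        # Genes without an annotation keep empty function and KO fields.
        function, ko = annotations.get((unitig_id, gene_id), ("", ""))
        rows.append((unitig_id, unitigs[unitig_id], gene_id, protein, function, ko))
    return rows


unitigs = {"Unitig1": "ACGTCGCT", "Unitig2": "GTCGATCATG"}
genes = [
    ("Unitig1", "Gene1", "APWHLE"),
    ("Unitig1", "Gene2", "WGGH"),
    ("Unitig2", "Gene1", "KLMF"),
]
annotations = {
    ("Unitig1", "Gene1"): ("Glucose transferase", "K00001"),
    ("Unitig1", "Gene2"): ("Protease", "K00004"),
    ("Unitig2", "Gene1"): ("Oxidase", "K00761"),
}

rows = build_rows(unitigs, genes, annotations)
for row in rows:
    print(" | ".join(row))
```

Gene IDs are only unique within a unitig in this example (both unitigs have a `Gene1`), which is why the annotation lookup is keyed on the unitig and gene IDs together.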