4. Functional annotation with Prodigal and MicrobeAnnotator

Louis-Mael Gueguen edited this page Sep 6, 2024 · 6 revisions

Filter by size

Unitigs vary in length, and some are too short for gene prediction to be meaningful, so they need to be filtered by size first. You can use seqkit:

seqkit seq --min-len 1000 case_kmers.unitigs.fa > case_unitigs.1000.fa

The mean size of a bacterial gene is about 1,000 bp, hence the threshold. You can lower it, but keep in mind that a unitig may hold only part of a gene: the shorter the unitig, the more fragmentary the gene sequence and the less precise the annotation.
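Before settling on a threshold, it can help to see how many unitigs each candidate value would keep. The following is a minimal sketch using plain awk; the function name count_min_len and the thresholds in the usage comment are illustrative and not part of the pipeline. It sums sequence lines per record, so multi-line FASTA files are handled.

```shell
# Count FASTA records whose total sequence length is >= a minimum length.
# Usage: count_min_len MIN_LEN FASTA_FILE
count_min_len() {
    awk -v min="$1" '
        # On each header, close the previous record and start a new one.
        /^>/ { if (seen && len >= min) n++; len = 0; seen = 1; next }
        # Sequence line: add its length to the current record.
        { len += length($0) }
        # Close the last record and print the count (0 if nothing passed).
        END { if (seen && len >= min) n++; print n + 0 }
    ' "$2"
}

# Example (thresholds are illustrative):
# for t in 250 500 1000; do
#     printf '%s\t%s\n' "$t" "$(count_min_len "$t" case_kmers.unitigs.fa)"
# done
```

Comparing the counts gives a rough idea of how much sequence is lost at 1000 bp versus a lower cutoff.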

Prodigal 🔍

Prodigal predicts protein-coding genes in the filtered unitigs. It outputs the gene coordinates (*.gbk) as well as the protein translations (*.faa) of the detected genes.
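A Prodigal call producing these two files could look like the sketch below. The wrapper name predict_genes and the output prefix are hypothetical. The choice of -p meta is an assumption, made because short fragments give Prodigal little sequence to train on in its default single-genome mode. Adjust it to your data.

```shell
# Predict genes on the filtered unitigs with Prodigal.
# Usage: predict_genes INPUT_FASTA OUTPUT_PREFIX
predict_genes() {
    # -i: input FASTA
    # -o: gene coordinates (GenBank-like format is Prodigal's default)
    # -a: protein translations of the predicted genes
    # -p meta: metagenomic procedure (assumption, see the note above)
    prodigal -i "$1" -o "$2.gbk" -a "$2.faa" -p meta
}

# Example call (file names are illustrative):
# predict_genes case_unitigs.1000.fa case_unitigs.1000
```

Run it once per condition so that each condition gets its own *.faa file for the annotation step.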

MicrobeAnnotator 🏷️

The protein sequences obtained from Prodigal are annotated with MicrobeAnnotator, which searches several databases in turn: proteins left unannotated by one database are passed on to the next. The results are a matrix of pathway completeness, a barplot of the pathways that are at least 50% complete, and a heatmap with pathways clustered by completeness values. Together, these make it easier to contrast the conditions.

This step produces three files: metabolic_summary__barplot.pdf, metabolic_summary__heatmap.pdf, and metabolic_summary__module_completeness.tab. metabolic_summary__barplot.pdf shows the number of pathways that are at least 50% complete, colored by category. metabolic_summary__heatmap.pdf shows, for each condition, the named pathways that are at least 80% complete, colored along a gradient and clustered by completeness. The heatmap is particularly useful for determining which pathways differentiate the conditions the most.
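As a rough sketch, the annotation step could be launched as shown below. The flags follow the usage line given in the MicrobeAnnotator README as best recalled here (-i, -d, -o, -m, -p, -t), so treat them as an assumption and check them against microbeannotator -h for your installed version. The wrapper name annotate_proteins, the search tool and the process and thread counts are illustrative.

```shell
# Annotate Prodigal protein files with MicrobeAnnotator.
# Usage: annotate_proteins DB_DIR OUT_DIR PROTEINS.faa [MORE.faa ...]
annotate_proteins() {
    db_dir=$1
    out_dir=$2
    shift 2
    # -i: one protein FASTA per condition (remaining arguments)
    # -d: MicrobeAnnotator database directory
    # -o: output directory
    # -m: search tool; -p / -t: processes and threads per process
    # (flag names assumed from the README; verify with microbeannotator -h)
    microbeannotator -i "$@" -d "$db_dir" -o "$out_dir" -m diamond -p 1 -t 4
}

# Example call (paths are illustrative):
# annotate_proteins microbeannotator_db annotation_out case.faa control.faa
```

Passing all conditions in a single call is what lets the summary plots place them side by side.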

Summary table

To make the output easier to manipulate and inspect in detail, we provide a table for each condition showing which unitig contains which gene, along with the gene's translation, function and KO number. It looks like this:

| Unitig ID | Unitig seq | Gene ID | Gene seq | Gene function | KO |
| --- | --- | --- | --- | --- | --- |
| Unitig1 | ACGTCGCT | Gene1 | APWHLE | Glucose transferase | K00001 |
| Unitig1 | ACGTCGCT | Gene2 | WGGH | Protease | K00004 |
| Unitig2 | GTCGATCATG | Gene1 | KLMF | Oxidase | K00761 |

Here, the Unitig1 sequence contains two genes (Gene1 and Gene2). The Gene seq column gives their translated sequences, followed by their respective functions and KO numbers.
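These tables lend themselves to quick command-line queries. The sketch below assumes the table is a tab-separated file with a single header line and the six columns in the order shown above. The file layout, the function names and the example file name are assumptions, not something the pipeline guarantees.

```shell
# Count predicted genes per unitig (column 1), skipping the header line.
# Usage: genes_per_unitig TABLE.tsv
genes_per_unitig() {
    awk -F '\t' 'NR > 1 { n[$1]++ } END { for (u in n) print u "\t" n[u] }' "$1" | sort
}

# Print unitig ID, gene ID and function for every row with a given KO number.
# Usage: rows_for_ko KO_NUMBER TABLE.tsv
rows_for_ko() {
    awk -F '\t' -v ko="$1" 'NR > 1 && $6 == ko { print $1 "\t" $3 "\t" $5 }' "$2"
}

# Example calls (file name is illustrative):
# genes_per_unitig case_summary.tsv
# rows_for_ko K00004 case_summary.tsv
```

The first function helps spot unitigs carrying several genes. The second traces a KO of interest, for example one from a pathway highlighted in the heatmap, back to the unitigs that carry it.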