
3. Taxonomic assignment of unitigs with Kraken2

Louis-Mael Gueguen edited this page Oct 10, 2024 · 1 revision

Kraken2 is a tool that assigns sequences to taxa using k-mers. Its advantage is that it assigns each sequence 'as precisely as possible', i.e. to the most specific taxon that its k-mers support.

Its output is split into two files. The first is kraken_[case/control].output:

C    Bacteroides (taxid 816)    41    0:1 2637548:5 0:2
U    unclassified (taxid 0)    35    360807:1

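In this excerpt, each line gives the classification status (C for classified, U for unclassified), the assigned taxon with its taxid, the length of the sequence, and a list of taxid:count pairs showing how many k-mers were mapped to each taxid. This file is used to build the table that links unitigs to functions and clades.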

The second file is the kraken_[case/control].report:

 62.48  10289046        10289046        U       0       unclassified
 37.52  6178690 8427    R       1       root
 37.36  6151918 2537    R1      131567    cellular organisms
 37.30  6142656 107232  D       2           Bacteria
 15.32  2522287 25      D1      1783270       FCB group
 15.32  2522129 66      D2      68336           Bacteroidetes/Chlorobi group
 15.31  2521922 3701    P       976               Bacteroidetes
 15.24  2509784 270     C       200643              Bacteroidia
 15.24  2509043 166311  O       171549                Bacteroidales
  8.17  1345265 101820  F       815                     Bacteroidaceae

The first column is the percentage of unitigs assigned to the clade rooted at this taxon. The second and third columns are the number of unitigs assigned to that clade and the number assigned directly to the taxon itself (see the Kraken wiki). The fourth column is the rank of the taxon (G for genus, S for species, ...), the fifth is the taxid, and the last is the name of the clade.

This file is interesting in itself, but the report format was primarily designed for reads, which are all of roughly 'equal importance/meaning'. Here the sequences are unitigs, and their lengths vary widely (from a single k-mer to ... no limit), so a count of unitigs does not reflect how much sequence each clade accounts for. Thus, a third file is built from the output, [case/control]_clade.tsv:

Unclassified    512640685
Bacteroides (taxid 816) 31310507
Prevotella copri (taxid 165179) 15327700
Phocaeicola plebeius (taxid 310297)     14988657
Eubacteriales (taxid 186802)    14963744
Enterobacteriaceae (taxid 543)  12643762
Bacteroidales (taxid 171549)    11334934
Klebsiella (taxid 570)  10345553
Faecalibacterium prausnitzii (taxid 853)        10234613
Bacteria (taxid 2)      10146249
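
A table like the one above could be derived from the Kraken2 output by summing unitig lengths per clade. The following is only a minimal sketch, not the pipeline's actual script. The function names `clade_lengths` and `format_clade_table` are hypothetical. It assumes that output lines look like the excerpt shown earlier on this page (status, taxon name with taxid, length, k-mer mapping), that fields are separated by tabs or runs of spaces, and that all unclassified unitigs are pooled under 'Unclassified'. Because the taxon field is located by its '(taxid N)' suffix, an extra sequence-ID column before it is tolerated.

```python
import re
from collections import Counter

# Assumption: the taxon field always ends with "(taxid <number>)".
TAXID_FIELD = re.compile(r"\(taxid \d+\)$")
# Assumption: fields are separated by a tab or by two or more spaces.
FIELD_SEP = re.compile(r" *\t *| {2,}")


def clade_lengths(lines):
    """Sum sequence lengths per clade from Kraken2-style output lines."""
    totals = Counter()
    for line in lines:
        line = line.strip()
        if not line:
            continue
        fields = FIELD_SEP.split(line)
        # Find the taxon field; the length is taken from the field after it.
        idx = next(
            (i for i, f in enumerate(fields) if TAXID_FIELD.search(f)), None
        )
        if idx is None or idx + 1 >= len(fields):
            continue  # skip lines that do not match the assumed layout
        clade = "Unclassified" if fields[0] == "U" else fields[idx]
        totals[clade] += int(fields[idx + 1])
    return totals


def format_clade_table(totals):
    """Render the totals as 'clade<TAB>length', largest total first."""
    return "\n".join(f"{clade}\t{n}" for clade, n in totals.most_common())


example = [
    "C\tBacteroides (taxid 816)\t41\t0:1 2637548:5 0:2",
    "U\tunclassified (taxid 0)\t35\t360807:1",
]
print(format_clade_table(clade_lengths(example)))
```

In practice the lines would be read from kraken_[case/control].output rather than from an in-memory list.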

The first column is the name of the clade with its taxid; the second is the total length of the unitigs assigned to it. Aggregating k-mer p-values into unitig p-values and then into clade p-values would have been preferable, but most of them would end up as 0 because of rounding errors, since the k-mer p-values are already very low.
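
The rounding problem can be illustrated with a small sketch. It assumes, purely for illustration, that aggregation involves multiplying p-values. The source does not specify the combination method, and the values and function names below are hypothetical. Multiplying a few dozen very small numbers in double precision falls below the smallest representable value and collapses to exactly 0, whereas the sum of their logarithms remains finite.

```python
import math


def naive_product(p_values):
    """Multiply p-values directly; tiny values underflow to 0.0."""
    result = 1.0
    for p in p_values:
        result *= p
    return result


def log10_product(p_values):
    """Return log10 of the product, computed as a sum of logs."""
    return sum(math.log10(p) for p in p_values)


# Hypothetical input: 30 k-mers, each with a p-value of 1e-12.
kmer_p_values = [1e-12] * 30

print(naive_product(kmer_p_values))  # the direct product underflows
print(log10_product(kmer_p_values))  # the log-space value stays finite
```

This only demonstrates the numerical issue; whether a log-space combination would be statistically appropriate for these data is a separate question that this page does not address.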