3. Taxonomic assignment of unitigs with Kraken2
Kraken2 is a tool that assigns sequences to taxa using k-mers. Its advantage is that it assigns each sequence 'as precisely as possible', that is, to the most specific taxon that its k-mers support.
Its output is split into two files. The first is kraken_[case/control].output:
C Bacteroides (taxid 816) 41 0:1 2637548:5 0:2
U unclassified (taxid 0) 35 360807:1
This file is used to build the table linking unitigs to functions and clades.
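The last field of each line (for example `0:1 2637548:5 0:2`) is Kraken2's LCA mapping: a space-separated list of `taxid:count` pairs, each giving how many consecutive k-mers hit that taxid. As a minimal sketch, and not necessarily how this pipeline reads the file, the field could be decoded as follows. The function names are hypothetical, and taxids are kept as strings because Kraken2 can also emit non-numeric markers such as `A` for ambiguous k-mers.

```python
from collections import Counter


def parse_lca_mapping(field):
    """Decode a Kraken2 LCA mapping field into (taxid, k-mer count) pairs.

    Hypothetical helper: taxids stay as strings so that special markers
    (e.g. 'A' for ambiguous k-mers) do not break the parsing.
    """
    pairs = []
    for token in field.split():
        if token == "|:|":  # separator between mates in paired-end output
            continue
        taxid, count = token.rsplit(":", 1)
        pairs.append((taxid, int(count)))
    return pairs


def kmers_per_taxid(field):
    """Sum the k-mer counts of each taxid over the whole sequence."""
    totals = Counter()
    for taxid, count in parse_lca_mapping(field):
        totals[taxid] += count
    return dict(totals)
```

For the first excerpt line above, `kmers_per_taxid("0:1 2637548:5 0:2")` groups the two unassigned runs (taxid 0) together and keeps the five k-mers that hit taxid 2637548 separate.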
The second file is the kraken_[case/control].report:
62.48 10289046 10289046 U 0 unclassified
37.52 6178690 8427 R 1 root
37.36 6151918 2537 R1 131567 cellular organisms
37.30 6142656 107232 D 2 Bacteria
15.32 2522287 25 D1 1783270 FCB group
15.32 2522129 66 D2 68336 Bacteroidetes/Chlorobi group
15.31 2521922 3701 P 976 Bacteroidetes
15.24 2509784 270 C 200643 Bacteroidia
15.24 2509043 166311 O 171549 Bacteroidales
8.17 1345265 101820 F 815 Bacteroidaceae
The first column is the percentage of unitigs assigned to the clade; the second and third are more technical (see the Kraken wiki). The fourth column is the rank of the taxon (G for genus, S for species, ...), the fifth is the taxid, and the last is the clade name. This file is interesting in itself, but the format was primarily designed for reads, which are all of 'equal importance/meaning'. Here the sequences are unitigs, and their lengths vary widely (from a single k-mer to ... no limit). Thus, a third file is built from the output: [case/control]_clade.tsv
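To make the report layout concrete, here is a minimal sketch of a parser for one report line. It assumes the standard six-column Kraken2 report (percentage, sequences in the clade, sequences assigned directly to the taxon, rank code, taxid, name); reports produced with extra options may carry more columns. `ReportRow` and `parse_report_line` are hypothetical names, not part of Kraken2 or of this pipeline.

```python
from typing import NamedTuple


class ReportRow(NamedTuple):
    percent: float      # percentage of sequences in the clade
    clade_count: int    # sequences in the clade rooted at this taxon
    direct_count: int   # sequences assigned directly to this taxon
    rank: str           # rank code, e.g. 'D', 'P', 'G', 'S'
    taxid: int
    name: str


def parse_report_line(line):
    """Parse one line of a six-column Kraken2 report (assumed layout)."""
    # Split on any whitespace, at most 5 times, so that names containing
    # spaces ('cellular organisms') stay in one piece.
    fields = line.split(None, 5)
    if len(fields) != 6:
        raise ValueError(f"expected 6 columns, got {len(fields)}: {line!r}")
    percent, clade_count, direct_count, rank, taxid, name = fields
    return ReportRow(float(percent), int(clade_count), int(direct_count),
                     rank, int(taxid), name.strip())
```

Splitting on generic whitespace rather than on tabs is a deliberate choice here: it handles both the tab-delimited file written by Kraken2, where names are indented with spaces, and space-flattened excerpts such as the one shown above.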
Unclassified 512640685
Bacteroides (taxid 816) 31310507
Prevotella copri (taxid 165179) 15327700
Phocaeicola plebeius (taxid 310297) 14988657
Eubacteriales (taxid 186802) 14963744
Enterobacteriaceae (taxid 543) 12643762
Bacteroidales (taxid 171549) 11334934
Klebsiella (taxid 570) 10345553
Faecalibacterium prausnitzii (taxid 853) 10234613
Bacteria (taxid 2) 10146249
The first column is the name of the clade with its taxid; the second is the total length of the unitigs assigned to it. Aggregating k-mer p-values into unitig p-values and then into clade p-values would have been preferable, but most would end up as 0 because of rounding errors and because k-mer p-values are already low.
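The length aggregation behind this table can be sketched as below. This is an illustration under assumptions, not the pipeline's actual code: it takes (clade label, unitig length) pairs that have already been extracted from the Kraken2 output, and it assumes a tab separator and a descending sort, as the `.tsv` extension and the excerpt suggest. Both function names are hypothetical.

```python
from collections import defaultdict


def clade_lengths(assignments):
    """Sum unitig lengths per clade.

    `assignments` is an iterable of (clade_label, unitig_length) pairs.
    Returns (clade_label, total_length) pairs, longest total first.
    """
    totals = defaultdict(int)
    for clade, length in assignments:
        totals[clade] += length
    return sorted(totals.items(), key=lambda item: item[1], reverse=True)


def format_clade_tsv(rows):
    """Render the totals as tab-separated text (separator is an assumption)."""
    return "".join(f"{clade}\t{total}\n" for clade, total in rows)
```

With toy values, `clade_lengths([("Bacteroides (taxid 816)", 41), ("unclassified (taxid 0)", 35), ("Bacteroides (taxid 816)", 9)])` puts Bacteroides first with a total of 50, ahead of the unclassified unitig of length 35. Weighting by length rather than by count is what keeps a single very long unitig from counting the same as a one-k-mer unitig.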