Isaac Turner edited this page Aug 27, 2014 · 4 revisions

The core data structure of McCortex is a de Bruijn graph: a graph in which the nodes are kmers (words of length k) and an edge connects any two kmers that overlap by k-1 characters. k is fixed for any given graph. In genome assembly, kmers are made up of bases (i.e. the DNA letters A, C, G, T).

An example with k=5:

ACCAT -> CCATC

which has two nodes (ACCAT,CCATC) and one edge between them. Traversing this graph from left to right we assemble the sequence ACCATC. From right to left we assemble the reverse complement GATGGT.
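
The example above can be reproduced with a few lines of Python. This is only an illustrative sketch, not McCortex code: the function names (`kmers`, `edges`, `spell`, `reverse_complement`) are made up for this page, and the all-pairs edge search is only suitable for tiny inputs.

```python
# Illustrative sketch only; these helper names are hypothetical, not McCortex functions.
COMPLEMENT = str.maketrans("ACGT", "TGCA")

def reverse_complement(seq):
    """Complement every base, then reverse the string."""
    return seq.translate(COMPLEMENT)[::-1]

def kmers(seq, k):
    """Return every word of length k in seq, in order of appearance."""
    return [seq[i:i + k] for i in range(len(seq) - k + 1)]

def edges(nodes):
    """Return (a, b) pairs where the last k-1 bases of a are the first k-1 bases of b."""
    return [(a, b) for a in nodes for b in nodes if a[1:] == b[:-1]]

def spell(path):
    """Assemble the sequence for a path: the first kmer plus the last base of each later kmer."""
    return path[0] + "".join(node[-1] for node in path[1:])

nodes = kmers("ACCATC", 5)           # the two nodes ACCAT and CCATC
print(edges(nodes))                  # the single edge ACCAT -> CCATC
print(spell(nodes))                  # left to right: ACCATC
print(reverse_complement(spell(nodes)))  # right to left: GATGGT
```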

Each node in the McCortex de Bruijn graph represents both a kmer and its reverse complement. We refer to the lexically lower of the two as the kmer-key. A node can then be traversed in one of two orientations: FORWARD (the kmer-key) or REVERSE (the reverse complement of the kmer-key).
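
As a rough sketch of this idea, the snippet below maps any kmer to its kmer-key and the orientation in which the kmer reads the node. The names `kmer_key` and `to_node`, and the values chosen for `FORWARD` and `REVERSE`, are assumptions made for illustration and are not taken from the McCortex source.

```python
# Illustrative sketch only; names and constant values are hypothetical.
FORWARD, REVERSE = 0, 1

COMPLEMENT = str.maketrans("ACGT", "TGCA")

def reverse_complement(seq):
    """Complement every base, then reverse the string."""
    return seq.translate(COMPLEMENT)[::-1]

def kmer_key(kmer):
    """Return the lexically lower of the kmer and its reverse complement."""
    return min(kmer, reverse_complement(kmer))

def to_node(kmer):
    """Return (kmer-key, orientation) for a kmer.

    A kmer that equals its own key is reported as FORWARD.
    """
    key = kmer_key(kmer)
    return key, (FORWARD if kmer == key else REVERSE)

print(to_node("ACCAT"))  # ACCAT is its own key, so it is read FORWARD
print(to_node("ATGGT"))  # reverse complement of ACCAT, so the same node read in REVERSE
```

Both calls resolve to the same node, which is why a sequence and its reverse complement share one set of nodes in the graph.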

Data structures

There are various ways to represent a de Bruijn graph in memory, the most obvious being a hash table where the key is the kmer-key. More compact representations have been developed, each offering a different trade-off between speed and memory use.
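
To make the hash table option concrete, here is a minimal sketch that uses a Python `dict` keyed by kmer-key. It is not how McCortex lays out its table: the per-key occurrence count and the names `build_graph` and `successors` are assumptions for illustration. Edges are not stored; they are found by testing which of the four possible one-base extensions are present in the table.

```python
# Illustrative sketch only; not the McCortex hash table layout.
COMPLEMENT = str.maketrans("ACGT", "TGCA")

def reverse_complement(seq):
    """Complement every base, then reverse the string."""
    return seq.translate(COMPLEMENT)[::-1]

def kmer_key(kmer):
    """Return the lexically lower of the kmer and its reverse complement."""
    return min(kmer, reverse_complement(kmer))

def build_graph(seqs, k):
    """Map each kmer-key to the number of times it occurs in seqs (hypothetical payload)."""
    graph = {}
    for seq in seqs:
        for i in range(len(seq) - k + 1):
            key = kmer_key(seq[i:i + k])
            graph[key] = graph.get(key, 0) + 1
    return graph

def successors(graph, kmer):
    """Return the kmers reachable from kmer by dropping its first base and appending one base."""
    candidates = (kmer[1:] + base for base in "ACGT")
    return [c for c in candidates if kmer_key(c) in graph]

graph = build_graph(["ACCATC"], 5)
print(sorted(graph))                 # the kmer-keys ACCAT and CCATC
print(successors(graph, "ACCAT"))    # forward traversal reaches CCATC
print(successors(graph, "GATGG"))    # reverse traversal reaches ATGGT
```

Because lookups go through `kmer_key`, the same two table entries serve traversal in both directions.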

More information on de Bruijn graphs:
