-
Notifications
You must be signed in to change notification settings - Fork 25
Workflow Examples
- Graph Construction
- Merge Graphs
- Error Cleaning
- Read Threading
- Merge Paths
- Call Variants
- Compare to a Reference
- Assemble Contigs
- Filter Subgraph
- Filter Reads
- Plot Graph
Construct graphs for three samples (using 70GB ram):
ctx31 build -m 70G -k 31 --sample NA12878 --seq input.fq.gz NA12878.ctx
ctx31 build -m 70G -k 31 --sample Mickey --seq2 reads.1.fq.gz reads.2.fq.gz Mickey.ctx
ctx31 build -m 70G -k 31 --sample Minnie --seq data.bam Minnie.ctx
Construct graph for the reference (hg19)
ctx31 build -m 70G -k 31 --sample hg19 --seq hg19.fa.gz hg19.ctx
Merge multiple graphs together. Graphs are kept as separate colours unless the --flatten
option is used. To combine three graphs into refAndSamples.ctx:
ctx31 join --out refAndSamples.ctx NA12878.ctx Mickey.ctx Minnie.ctx
A) 'Clean' graphs to remove sequencing error (per sample, for high coverage samples):
ctx31 clean NA12878.clean.ctx NA12878.ctx
ctx31 clean Mickey.clean.ctx Mickey.ctx
ctx31 clean Minnie.clean.ctx Minnie.ctx
...then merge graphs into file refAndSamples.ctx
. Uses 80GB ram:
ctx31 join -m 80G refAndSamples.clean.ctx hg19.ctx NA12878.clean.ctx Mickey.clean.ctx Minnie.clean.ctx
B) Alternatively merge uncleaned samples then clean on the population (multiple low depth samples). Uses 100GB ram:
ctx31 clean -m 100G --out samples.clean.ctx NA12878.ctx Mickey.ctx Minnie.ctx
ctx31 join -m 80G refAndSamples.ctx hg19.ctx samples.clean.ctx
Now we have cortex graphs of the reference and our samples with sequencing error removed.
The command ctx31 clean
has the option of writing out histograms of supernode lengths and coverages with the --covg-before
, --covg-after
, --len-before
and --len-after
arguments. You can plot the results of these using a pair of scripts bundled with cortex in the scripts/
directory (requires R
and R packages ggplot2
and gridExtra
):
R --vanilla --file=scripts/plot-covg-hist.R --args covg.before_cleaning.csv covg.before_cleaning.pdf
R --vanilla --file=scripts/plot-covg-hist.R --args covg.after_cleaning.csv covg.after_cleaning.pdf
R --vanilla --file=scripts/plot-length-hist.R --args length.before_cleaning.csv length.before_cleaning.pdf
R --vanilla --file=scripts/plot-length-hist.R --args length.after_cleaning.csv length.after_cleaning.pdf
Before threading reads through the graph, you must run inferedges
:
ctx31 inferedges -m 50G in.ctx
This edits in.ctx
to add missing edges between kmers. Later steps assume all edges between kmers already exist.
Thread reads through the graph to assist in assembly, generating file refAndSamples.ctp
:
ctx31 thread -m 80G --seq2 NA12878.1.fq NA12878.2.fq --out NA12878.ctp refAndSamples.ctx:1
ctx31 thread -m 80G --seqi Mickey.bam --out Mickey.ctp refAndSamples.ctx:2
ctx31 thread -m 80G --seqi Minnie.sam --out Minnie.ctp refAndSamples.ctx:3
Now the path files NA12878.ctp
, Mickey.ctp
and Minnie.ctp
are ready to be used alongside their respective graphs, e.g.:
ctx31 contigs -m 100G -p 1:NA12878.ctp -p 2:Mickey.ctp -p 3:Minnie.ctp refAndSamples.ctx
Note: the reference is colour 0 in the above example.
## Merge Path FilesSimilar to the join
command, pjoin
merges path files.
ctx31 pjoin -o refAndNA12878.ctp ref.ctp NA12878.ctp
ctx31 bubbles -m 100G --out denovo.bubbles.gz --haploid 0 -p refAndSamples.ctp refAndSamples.ctx
Uses the graph and paths (refAndSamples.ctx
, refAndSamples.ctp
) to call variants that are saved to the file denovo.bubbles.gz
. We also specify that the first colour is a haploid sample (with --haploid 0
). This is useful in removing bubbles caused by repeats rather than real variants.
ctx31 unique denovo.bubbles.gz denovo
Creates files denovo.vcf
and denovo.5pflank.fa
. We then map the putative variant flanks. We can do this with stampy:
stampy.py -G hg19 hg19.fa # create hg19.stidx
stampy.py -g hg19 -H hg19 # create hg19.sthash
stampy.py -g hg19 -h hg19 --inputformat=fasta -M denovo.5pflank.fa > denovo.5pflank.sam
or BWA:
bwa index hg19.fa
bwa mem hg19.fa denovo.5pflank.fa > denovo.5pflank.sam
Then we align the alleles to the reference and generate a VCF
ctx31 place denovo.vcf denovo.sam hg19.fa > denovo.placed.vcf
This generates a VCF of sites. Genotyping variants is not yet implemented.
## Compare to a ReferenceIf you have an assembled genome, you can compare your sequenced samples to it to identify large variants and re-arrangements:
ctx31 breakpoints -m 100G --seq hg19.fa --out breakpoints.txt.gz -p PlutoAndGoofy.ctp PlutoAndGoofy.ctx
The breakpoints command identifies where your samples diverge from the reference and rejoin it elsewhere. This could be small regions of clustered SNPs, inversions or translocations.
Results can be plotted on a Circos plot using a perl script in scripts/
:
gzip -dc breakpoints.txt.gz | scripts/make-circos.pl out-dir -
cd out-dir
circos
Plot appears in out-dir/circos.png
.
Pull out contigs from an individual in a population. Fills in low coverage gaps in sequence from the population if unconfounded.
ctx31 contigs -p refAndSamples.ctp --colour 1 --ncontigs 1000 --print --out refAndSamples.contigs.fa refAndSamples.ctx
Specify how many contigs you want with --ncontigs 1000
.
The contigs
command uses random kmers to seed contigs, therefore it prints duplicates contigs and some contigs that are substrings of others. To remove duplicate contigs you can use the rmsubstr
command:
ctx31 rmsubstr -m 10G --out refAndSamples.contigs.rmdup.fa refAndSamples.contigs.fa
Pull out a region of interest using a set of reads or kmers. The subgraph command searches out from the seed kmers by dist
kmers (dist >= 0
).
ctx31 subgraph --seq contig.fa contig.ctx 100 graph.ctx
Input is graph.ctx
, output is saved to contig.ctx
. You can specify particular colours on the input to keep in the output. Add the --invert
option to invert the kmers selected - only take those further than dist
kmers.
Filter reads by whether or not they share a kmer with a graph:
ctx31 reads [--fasta|--fastq] --seq in1.fa out.se --seq2 in.1.fq in.2.fq out.pair in.ctx
Will save reads from in1.fa
to out.se.fa.gz
and from in.1.fq
,in.2.fq
to out.pair.1.fa.gz
,out.pair.2.fa.gz
. Output options are --fasta
or --fastq
. Paired end reads (--seq2
) are considered touching the graph if either read shares one or more kmers with the graph. Add the --invert
option to select reads that DO NOT touch the graph.
Requires GraphViz which provides the dot
command. Flattens any colours in the graph:
./scripts/cortex_to_graphviz.pl graph.ctx:0,3 | dot -Tpdf > kmers.pdf
Or to merge adjacent kmers add the --simplify
option:
./scripts/cortex_to_graphviz.pl --simplify graph.ctx:0,3 | dot -Tpdf > contigs.pdf
Or you can just use ./scripts/cortex_to_graphviz.pl
to generate dot files.
You can generate a plot of a de Bruijn graph straight from sequence files using scripts/seq2pdf.sh
:
./scripts/seq2pdf.sh --simplify 3 <(echo ACAACACGT) <(echo CCACACAA) > out.pdf
Usage is: ./scripts/seq2pdf.sh [--simplify] <kmer> <file1> [file2] > out.pdf
. Files can be of types FASTA, FASTQ, SAM, BAM, TXT or even gzipped. Requires graphviz.