DASHit
are programs for automated design of CRISPR guides. There are two main workflows,
- DASHit-reads: guide design from reads (e.g., from a FASTQ file), and,
- DASHit-seq: Guide design by specifying a target sequence.
Note: DASHit
is installed and ready to use on a compute
server. Please read DASHit server to learn how to connect. If
necessary, we can also install DASHit
locally on your laptops.
<<DASHit-reads>>
DASHit-reads
designs CRISPR guides by analyzing the output of an NGS run. Guides are chosen to hit the largest possible number of reads in the input.
The following runs through automated guide design on
NID0092_CSF_ATGTCAG-GGTAATC_L00C_R1_001.fastq
, assuming that this
file has been placed in /scratch/example/
on the DASHit
server.
- Connect to
dashit.czbiohub.org
and enter the examples directorycd /scratch/example/DASHit-reads/
- Convert input FASTQ -> FASTA
seqtk seq -A NID0092_CSF_ATGTCAG-GGTAATC_L00C_R1_001.fastq > NID0092_CSF_ATGTCAG-GGTAATC_L00C_R1_001.fasta
- Run
crispr_sites
to build a map of crispr sites->reads in which they appearcat NID0092_CSF_ATGTCAG-GGTAATC_L00C_R1_001.fasta | crispr_sites -r > NID0092_CSF_ATGTCAG-GGTAATC_L00C_R1_001_sites_to_reads.txt
*Note*: The new option
The format of
NID0092_CSF_ATGTCAG-GGTAATC_L00C_R1_001_sites_to_reads.txt
isAAAAAAAAAAAAAAAAAGAC 13109137 4177172 6275108 12688285 ...
The first column is a crispr site that was found, the remaining numbers on the line are the indices of the reads in which that crispr site appears. The read index is just the order of the read in the input FASTA file.
- Run
optimize_guides
to find the list of guides that best covers the inputoptimize_guides NID0092_CSF_ATGTCAG-GGTAATC_L00C_R1_001_sites_to_reads.txt 100 1 > NID0092_CSF_ATGTCAG-GGTAATC_L00C_R1_001_guides.csv
The integer parameter
100
specifies how many guides to create, and the integer parameter1
specifies how many times to attempt to cover each read with CRISPR sites.You can also optionally tell
optimize_guides
about the original input FASTA file in which case it will also output a random read hit by each site. This is useful to, for example, BLAST the reads that are being hit and sanity check that the guides are hitting what you think they’re hitting.optimize_guides NID0092_CSF_ATGTCAG-GGTAATC_L00C_R1_001_sites_to_reads.txt 100 1 NID0092_CSF_ATGTCAG-GGTAATC_L00C_R1_001.fasta > NID0092_CSF_ATGTCAG-GGTAATC_L00C_R1_001_guides.csv
score_guides
compares a list of guides against a FASTQ file and computes how many reads are hit by the guides. Continuing the example from above, we run
score_guides NID0092_CSF_ATGTCAG-GGTAATC_L00C_R1_001_guides.csv NID0092_CSF_ATGTCAG-GGTAATC_L00C_R1_001.fasta
Scoring guides against NID0092_CSF_ATGTCAG-GGTAATC_L00C_R1_001.fasta 100 guides in NID0092_CSF_ATGTCAG-GGTAATC_L00C_R1_001_guides.csv vs. NID0092_CSF_ATGTCAG-GGTAATC_L00C_R1_001.fasta hit 5239845/16617695 = 31.53% Parsing FASTA took 11m8.358304821s
Note: Guide scoring currently takes a long time, e.g. in the above
example scoring 100 guides against a 3.4 GB FASTA file took 11
minutes. If this becomes burdonsome, we can work on making it faster,
either by adapting the crispr_sites
code or using the
_sites_to_reads.txt
score_guides
can split the input FASTA file, writing out the reads that were hit by guides and those that weren’t to two separate files. To do this, specify the -s
option:
score_guides NID0092_CSF_ATGTCAG-GGTAATC_L00C_R1_001_guides.csv NID0092_CSF_ATGTCAG-GGTAATC_L00C_R1_001.fasta -s
Scoring guides against NID0092_CSF_ATGTCAG-GGTAATC_L00C_R1_001.fasta 100 guides in NID0092_CSF_ATGTCAG-GGTAATC_L00C_R1_001_guides.csv vs. NID0092_CSF_ATGTCAG-GGTAATC_L00C_R1_001.fasta hit 5239845/16617695 = 31.53% Parsing FASTA took 11m14.83192505s Wrote NID0092_CSF_ATGTCAG-GGTAATC_L00C_R1_001_dashed.fasta Wrote NID0092_CSF_ATGTCAG-GGTAATC_L00C_R1_001_undashed.fasta
We can use score_guides
to check that the DASHed and unDASHed FASTA files are correct:
score_guides NID0092_CSF_ATGTCAG-GGTAATC_L00C_R1_001_guides.csv NID0092_CSF_ATGTCAG-GGTAATC_L00C_R1_001_dashed.fasta
Scoring guides against NID0092_CSF_ATGTCAG-GGTAATC_L00C_R1_001_dashed.fasta 100 guides in NID0092_CSF_ATGTCAG-GGTAATC_L00C_R1_001_guides.csv vs. NID0092_CSF_ATGTCAG-GGTAATC_L00C_R1_001_dashed.fasta hit 5239845/5239845 = 100.00% Parsing FASTA took 55.039976945s
score_guides NID0092_CSF_ATGTCAG-GGTAATC_L00C_R1_001_guides.csv NID0092_CSF_ATGTCAG-GGTAATC_L00C_R1_001_undashed.fasta
Scoring guides against NID0092_CSF_ATGTCAG-GGTAATC_L00C_R1_001_undashed.fasta 100 guides in NID0092_CSF_ATGTCAG-GGTAATC_L00C_R1_001_guides.csv vs. NID0092_CSF_ATGTCAG-GGTAATC_L00C_R1_001_undashed.fasta hit 0/11377850 = 0.00% Parsing FASTA took 10m11.982396976s
score_guides
outputs some (hopefully) helpful progress and diagnostic output to stderr
. Only a single summary line is output to stdout
, though, so if you’re scoring many FASTA files, redirecting stdout
will give you a nice output with one single line per FASTA, e.g.
for i in `ls *.fasta`; do score_guides NID0092_CSF_ATGTCAG-GGTAATC_L00C_R1_001_guides.csv $i >> scoring; done
cat scoring
100 guides in NID0092_CSF_ATGTCAG-GGTAATC_L00C_R1_001_guides.csv vs. NID0092_CSF_ATGTCAG-GGTAATC_L00C_R1_001_dashed.fasta hit 5239845/5239845 = 100.00% 100 guides in NID0092_CSF_ATGTCAG-GGTAATC_L00C_R1_001_guides.csv vs. NID0092_CSF_ATGTCAG-GGTAATC_L00C_R1_001.fasta hit 5239845/16617695 = 31.53% 100 guides in NID0092_CSF_ATGTCAG-GGTAATC_L00C_R1_001_guides.csv vs. NID0092_CSF_ATGTCAG-GGTAATC_L00C_R1_001_undashed.fasta hit 0/11377850 = 0.00%
<<DASHit-seq>>
DASHit-seq
designs CRISPR guides by trying to cover
an input sequence with guides, subject to not spacing them too closely
together or too far apart. Before being selected, the list of
candidate guides is filtered down to remove guides for poor structural
reasons and if they match a use specified list of off targets.
As an example, we re-run automated guide design on mouse RN45s, with off targets given by the mouse transcriptome (with RN45s removed).
- Connect to
dashit.czbiohub.org
and enter the examples directorycd /scratch/example/DASHit-seq/
- Run
DASHit-seq
, specifying the input mouse RN45s sequence and the off target mouse transcriptome (with RN45s removed)dashit-seq.py rn45s-long.fa --offtarget mouse_transcriptome_sans_rn45s > mouse_rn45s_guides.csv
DASHit-seq
will output the designed guides to the specified CSV file. The CSV file also contains some metadata and lists the guides that were filtered due to structural reasonsDesigned CRISPR guides TTGCTGCGGAGCATGTGGCT CCCCAGTCAAACTCCCCACC CTCCAACCGGCCGTCCCCGA AACGAAACGAGACACGTGTG TTCACCTTGGAGACCTGCTG GGCAAGACAGTTACTGATAC ...
CRISPR guides that were removed from consideration CRISPR site, why it was excluded AGAGAGGCGACGGAGGGGGG, homopolymer>5 GGTGGGTTCCCACGGGGCAC, hairpin:-----GTTCCC---GGGCAC GACACTCGGGGGGCCGGCGG, gc_frequency; homopolymer>5 AAATGCACGCATCCCCCCCC, homopolymer>5; dinucleotide_repeats>3 TCCCCCCCCCAACCACCACA, homopolymer>5; dinucleotide_repeats>3 AAGACCCGAGCCCGGCGCGC, gc_frequency
The guide design from this example was previously done in a bespoke
fashion by Josh. Josh’s guides are available in
/scratch/example/DASHit-seq/RN45s-classic-guides.txt
, so you can
compare the guides DASHit-seq
creates with the ones that Josh made.
DASHit-seq
will output a BED
file indicating where the designed
guides hit the input file. If you open the original input FASTA plus
the BED file in a sequence viewer, for instance in IGV, you can
visualize how the designed guides tile across the input sequence.
The parameters for offtarget filtering and poor structure filtering can be controlled from the command line. For detailed instructions on how to specify the filtering parameters, simply run
dashit-seq.py -h
You can control how precisely guides need to match off targets to be disqualified by the --offtarget_radius
option:
--offtarget_radius OFFTARGET_RADIUS Radius used for matching an off target. Specify this as L_M_N which means filter a guide for hitting an off target if L, M, N nucleotides in the first 5, 10 and 20 positions of the guide, respectively, match the off target. e.g., 5_10_20 to require perfect matches; 5_9_18 to allow one mismatch in positions 6-10 positions and to allow 2 mismatches in the last 10 positions
To control how guides are disqualified for poor structure reasons, you can specify the following options
filtering options: these options control how guides are filtered for poor structure reasons --gc_freq_min GC_FREQ_MIN filter guide if # of Gs or Cs is strictly less than this number --gc_freq_max GC_FREQ_MAX filter guide if # of Gs or Cs is strictly greater than this number --homopolymer HOMOPOLYMER filter guide if strictly more than this number of a single consecutive nucleotide appears, e.g., AAAAA --dinucleotide_repeats DINUCLEOTIDE_REPEATS filter guide if strictly more than this number of a single dinucleotide repeats occur, e.g. ATATAT --hairpin_min_inner HAIRPIN_MIN_INNER filter guide if a hairpin occurs with >=this inner hairpin spacing, e.g., oooooIIIooooo, where the o are reverse complements and III is the inner hairpin spacing --hairpin_min_outer HAIRPIN_MIN_OUTER filter guide if a hairpin occurs with >=this outer hairpin width, e.g., oooooIIIooooo, where the o are reverse complements and ooooo is the outer hairpin
The off target file specified for DASHit-seq
is simply a list of CRISPR sites that occur in the off target sequence you want to avoid. This file can be prepared using crispr_sites
in special_ops_crispr_tools
.
- Convert the off target sequence into a FASTA file, say
offtarget.fasta
- Run
crispr_sites
on the off target FASTA filecat offtarget.fasta | crispr_sites > offtargets.txt
- Run
Note: Currently crispr_sites
is also outputting additional information about each site which is used by DASHit-reads
. You won’t need this information for use as offtarget, so you should open the offtargets.txt
generated above and delete everything but the first column.
<<server>>
David has setup dashit.czbiohub.org
, an EC2 instance we can use for
guide design that has all the DASHit
software pre-installed. For now
this makes things easier by not requiring you to install development
packages on your local machines. If running DASHit
remotely becomes
too annoying, then talk to David about getting DASHit
setup on your
local machine.
- Get an SSH key
id_dashit
from David or Emily, and place it in~/.ssh/
- Set the permissions on the SSH key appropriately
chmod 600 ~/.ssh/id_dashit
- SSH in
ssh -i ~/.ssh/id_dashit dashit@dashit.czbiohub.org
The /home/dashit
home directory is on a small 8GB partition used for
the operating system. A 500 GB partition is mounted as /scratch
, so
use that for your FASTQ
, etc files.
You can:
- Upload input files to
/scratch
directly, or, - Run
aws cp
on the dashit server to copy directly from S3.- To do this you’ll need to setup your AWS credentials on
dashit.czbiohub.org
- To do this you’ll need to setup your AWS credentials on
DASHit-seq
is implemented as a literate programming file which you can view here here.- You can view the
DASHit-reads
programs,crispr_sites
andoptimize_guides
, here and here.
- Install
go
brew install go
nix-env -i go