This is a demonstration project for CSE185. It implements a smaller, simpler version of bwa-backtrack. See the BWA page for more details. For the materials that I actually turned in summarizing this project, refer to the final-project-files directory.
Installation doesn't require any additional libraries.
Navigate to the directory in which you would like to download this tool and use the following command:
git clone https://github.com/WillardFord/wf-align-CSE185
Change into that directory and install wf-align so that it can be used from the command line:
cd wf-align-CSE185
python setup.py install
Note: if you do not have root access, add the --user option to the install command to install locally:
python setup.py install --user
If the install was successful, typing wf-align --help should show a useful message.
The basic usage of wf-align is:
wf-align reference.fa reads.fq [-o output.sam] [other options]
To run wf-align on a small test example (using files in this repo):
wf-align example-files/test_reference.fa example-files/test_reads.fastq
This should produce the output below:
@HD VN:1.6 SO:unknown
SEQ_ID_1 0 chrTEST 3 255 10M 0 0 10 CTAGCTACGT FFFFFFFFFF
SEQ_ID_2 0 chrTEST2 1 255 10M 0 0 10 TAGCTAGGTT HHHHHHHHHH
SEQ_ID_3 0 chrTEST 57 255 8M 0 0 8 GCTAGCAT HHHHHHHH
There are 2 required inputs to wf-align: a reference FASTA file and a FASTQ file containing reads. Users may additionally specify the options below:
- -o FILE, --output FILE: Write output to file. By default, output is written to stdout.
- -m FILE, --metrics FILE: Write metrics to file. By default, metrics are written to {cur_time}_wf_align_metrics.txt, where cur_time is the result of time.time() at the end of alignment.
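As a rough illustration of the interface described above, the sketch below builds an equivalent argument parser with Python's argparse. It is not taken from the wf-align source: build_parser and default_metrics_path are hypothetical names, and the only details assumed are the ones documented here (two positional inputs, -o/--output, -m/--metrics, and the default metrics filename pattern).

```python
import argparse
import time


def build_parser():
    """Hypothetical parser mirroring the documented wf-align interface."""
    parser = argparse.ArgumentParser(prog="wf-align")
    parser.add_argument("reference", help="reference FASTA file")
    parser.add_argument("reads", help="FASTQ file containing reads")
    parser.add_argument("-o", "--output", metavar="FILE", default=None,
                        help="write output to FILE (default: stdout)")
    parser.add_argument("-m", "--metrics", metavar="FILE", default=None,
                        help="write metrics to FILE")
    return parser


def default_metrics_path(cur_time):
    """Build the documented default name: {cur_time}_wf_align_metrics.txt."""
    return f"{cur_time}_wf_align_metrics.txt"


# Parse an example command line instead of sys.argv so the sketch runs anywhere.
args = build_parser().parse_args(["ref.fa", "reads.fq", "-o", "out.sam"])

# Fall back to the timestamped default when -m/--metrics is not given.
metrics_path = args.metrics or default_metrics_path(time.time())
print(args.output, metrics_path)
```

Because the default name depends on time.time(), each run without -m writes to a new metrics file rather than overwriting an earlier one.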
I've benchmarked wf-align against BWA-MEM using a SARS-CoV-2 reference genome in the benchmark file.
The output is a SAM file, the same format that bwa mem produces. See the SAM specification: https://samtools.github.io/hts-specs/SAMv1.pdf
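For readers who want to inspect the output programmatically, here is a minimal sketch of reading one alignment line into the 11 mandatory SAM fields. SamRecord and parse_sam_line are hypothetical helpers, not part of wf-align. The SAM specification uses tab-separated fields; this sketch splits on any whitespace so that it also works on the example output as displayed in this README.

```python
from typing import NamedTuple, Optional


class SamRecord(NamedTuple):
    """The 11 mandatory SAM fields, in specification order."""
    qname: str
    flag: int
    rname: str
    pos: int      # 1-based leftmost mapping position
    mapq: int
    cigar: str
    rnext: str
    pnext: int
    tlen: int
    seq: str
    qual: str


def parse_sam_line(line: str) -> Optional[SamRecord]:
    """Return a SamRecord, or None for header lines (which start with '@')."""
    if line.startswith("@"):
        return None
    # Assumption: whitespace splitting; real SAM files are tab-delimited.
    f = line.split()
    return SamRecord(f[0], int(f[1]), f[2], int(f[3]), int(f[4]), f[5],
                     f[6], int(f[7]), int(f[8]), f[9], f[10])


rec = parse_sam_line("SEQ_ID_1 0 chrTEST 3 255 10M 0 0 10 CTAGCTACGT FFFFFFFFFF")
print(rec.qname, rec.rname, rec.pos, rec.cigar)
```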
I used a modified version of the BWA-backtrack algorithm that runs in O(length of reference genome * length of reads) time using O(length of reference genome) space. For more details, refer to the methods section of the Project Report, available at final-project-files/Project-Report.pdf.
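To give a feel for the idea underlying BWA-backtrack, the sketch below shows exact-match backward search over a Burrows-Wheeler transform of the reference. This is not the wf-align implementation and does not reproduce its modifications or its complexity bounds: the suffix array is built naively, mismatches are not handled, and build_index and find_exact are hypothetical names chosen for illustration.

```python
def build_index(reference):
    """Build a small BWT-based index (naive construction, toy inputs only)."""
    text = reference + "$"  # '$' sorts before A, C, G, T
    sa = sorted(range(len(text)), key=lambda i: text[i:])
    bwt = "".join(text[i - 1] for i in sa)
    alphabet = sorted(set(text))
    # first[c]: number of characters in text strictly smaller than c.
    first, total = {}, 0
    for c in alphabet:
        first[c] = total
        total += text.count(c)
    # occ[c][i]: number of occurrences of c in bwt[:i].
    occ = {c: [0] * (len(bwt) + 1) for c in alphabet}
    for i, ch in enumerate(bwt):
        for c in alphabet:
            occ[c][i + 1] = occ[c][i] + (ch == c)
    return sa, first, occ


def find_exact(read, index):
    """Return sorted 0-based reference positions where read matches exactly."""
    sa, first, occ = index
    lo, hi = 0, len(sa)  # half-open range of suffix array rows
    for c in reversed(read):  # extend the match one character to the left
        if c not in first:
            return []
        lo = first[c] + occ[c][lo]
        hi = first[c] + occ[c][hi]
        if lo >= hi:
            return []
    return sorted(sa[i] for i in range(lo, hi))


index = build_index("ACGTACGT")
print(find_exact("CGT", index))  # [1, 5]
```

The positions returned here are 0-based, whereas the POS column in SAM output is 1-based. Backtracking aligners build on this search by also trying substitutions at each step within a mismatch budget.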
This repository was generated by Willard Ford, with inspiration from the CSE 185 Example Repository and the work of my fellow students.
Please submit a pull request with any corrections or suggestions.