The DBR Processing Pipeline detects and removed PCR duplicates from paired-end Illumina sequence data that have been generated with DBR tags in read 2.
This pipeline can be used for reference-mapped or de novo assembly; however, for de novo assembly, a pseudoreference must be generated. A pseudoreference sequence is the consensus sequence generated by a de novo assembly. The
There are two main components to the pipeline for processing DBR-tagged sequences:
-
assembled_DBR_filtering.py generates a hash table associating Illumina Hi-Seq sequence IDs with their associated DBRs.
-
integrated_denovo_pipeline.py generates the initial DBR_dict from the raw sequence data, and also performs other optional quality filtering and assembly operations. Users can substitute any other assembly methods as long as they generate SAM files as output, as DBR filtering is dependent on the existence of SAM files.
TODO: Move DBR_dict function from integrated_denovo_pipeline.py to assembled_DBR_filtering.py. This reflects a better organization: optional pipeline components for assembly are in integrative_denovo_pipeline.py and the DBR specific components are in assembled_DBR_filtering.py.
- Run DBR_dict (integrated_denovo_pipeline.py)
- Assemble sequences using any pipeline that produces output in SAM format. Additional functions in integrated_denovo_pipeline.py will produce output in SAM format, but their use is not required.
- Run assembled_DBR_filter.py to identify over-representation of DBR sequences in SAM files and remove those sequences. Output the resulting data to FASTA format.
- Re-run the assembly pipeline on the DBR-filtered dataset to produce a robust assembly with PCR duplicates removed.
- Python 2.7
- Linux (untested on OS X but likely to work; untested on Windows)
- Optional software for running assembly functions in integrative_denovo_pipeline.py
- FastX toolkit
- Pear
- Stacks
- BWA
- Samtools
- Vcftools
These are stand-alone scripts; no separate installation is necessary. The scripts import Python modules that will be part of any basic Python installation.
Please see original documentation for installation instructions for external software.
- PEAR
- FASTX Toolkit
- Stacks
- BWA
- Samtools/Vcftools/Bcftools
- Raw Illumina HiSeq paired-end data with DBR sequences in read 2. Quality scores should be coded in ASCII-33 format.
- Assembled sequence data in SAM format. Each SAM file should have A. The sample ID as part of the file name B. If data from multiple libraries are to be used, the library name (or any unique identifier) in the SAM file name.
- Barcode sequences for demultiplexing, with the library name (or any unique identifier) in the file name. This must match the library name found in the SAM file.
-
DBR_Filter
Function overview
This function performs the main filtering to remove PCR duplicates based on the DBR tags.
The assembled SAM file for a single sample is loaded, and regular expressions are used to find a load the barcode file and DBR dictionary (as produced by DBR_dict) that contain data for that sample. The DBR_dict created from the raw data has the following structure
{Sequence ID : DBR}
The SAM file is read line-by-line, and a set of two additonal dictionaries for that SAM file are created:
A. assembly_dict_2: This dictionary is used to track the number of times each unique DBR occurs as reference mapped to the same location in the reference.
{RNAME : {DBR: count}}
B. assembly_dict_3: This dictionary is used to track the ID and quality for each assembled sequence. Sequences that pass the DBR filtering threshold will be extacted from this dictionary and saved as a new .fasta file (see "n_expected" below for details on the DBR filtering threshold).
{RNAME : {DBR: {Sequence ID, QUAL, SEQ}}}
Glossary
Name Definition Sequence ID Illumina HiSeq ID, listed under "QNAME" in SAM file. RNAME The reference sequence ID to which the individual sequences map. (RNAME is a SAM file header) DBR The DBR sequence retrieved by matching QNAME from the SAM file to the Sequence ID from the DBR_dict. count The number of occurrences of a specific DBR in sequences mapped to the same reference sequence. QUAL The ASCII quality sequence corresponding to the Sequence ID. (QUAL is a SAM file header) SEQ The actual sequence corresponding to the Sequence ID. (SEQ is a SAM file header) Function operation
Argument Help assembled_dir
Full path to directory containing SAM files to filter. out_dir
Full path to output .fasta file for saving filtered sequences and the run logfile. n_expected
Integer. maximum number of replicate DBRs tolerated. barcode_dir
The full file path to the directory containing the unique sample barcodes by which samples are demultiplexed. dict_dir
The full file path to the directory containing the DBR dictionaries created by DBR_dict. barcode_file = None
Deprecated. Used for analyzing only 1 library. test_dict = True
Logical. If True, print samples of dictionary entries to check for proper formatting. phred_dict = phred_dict
Globally defined variable; dictionary type. Converts ASCII-33 quality scores to integers which can be used in a median calculation. ASCII-64 is not supported. samMapLen = None
Integer or None. Expected length of map matches in the SAM file.
Users do not manipulate these functions directly. They are instead called by DBR_Filter to parse input data for the filtering process.
-
qual_median
Calculates the median quality score of a sequence. It is called by DBR_Filter during the filtering process.
Argument Help QUAL
ASCII quality string extracted from SAM file. phred_dict
Globally defined variable; dictionary type. Converts ASCII-33 quality scores to integers which can be used in a median calculation. ASCII-64 is not supported. -
find_SampleID
Extracts the sample ID from a SAM file name using a regular expression. That regular expression is hard-coded as
\d{1,3}T?
This identifies 1-3 digits for the sample ID optionally followed by a "T" indicating samples which are technical replicates.
Future versions of the software will allow for this regular expresson to be user-specified.
Argument Help filename
SAM file name. -
find_LibraryID
Extracts the library ID from a SAM file name using a regular expression. That regular expression is hard-coded as
Library\d{1,3}[A|B]?
This identifies the word "Library" followed by 1-3 digits for the library ID, optionally followed by an "A" or "B" for libraries that were split across multiple lanes.
Future versions of the software will allow for this regular expresson to be user-specified.
Argument Help filename
SAM file name. -
find_BarcodeFile
Argument Help library
The value returned by find_LibraryID. directory
The full file path to the directory containing the unique sample barcodes by which samples are demultiplexed. -
find_DBRdictionary
Argument Help library
The value returned by find_LibraryID. directory
The full file path to the directory containing the DBR dictionaries created by DBR_dict.
-
parallel_DBR_dict
Passes arguments to DBR_dict to create DBR dictionaries for all .fastq files in a directory.
Recommended to run on quality filtered data.
Argument Help in_dir
Full path to directory containing a set of .fastq files. seqType
'read2' or 'pear'. If 'read2', expect only read 2 files in in_dir
. If 'pear', expect only read 1 and read 2 merged with PEAR in "in_dir".dbr_start
Integer. Index of first base in DBR. NOTE: Python indexing starts at 0. dbr_stop
Integer. Index of first base after end of DBR. EXAMPLE: If DBR ends at base position 9, dbr_stop = 10. test_dict = False
Logical. If True, print samples of dictionary entries to check for proper formatting. save = None
Optional text string suffix to append to output file name. DBR dictionaries inherit the name of the file used to create them, plus an additional (optional) user-specified text string, and an automatically added .json extension. If save = None, the DBR dictionary file name will be the original file name with a .json extension. -
DBR_dict
Create a DBR dictionary for a .fastq file.
Recommended to run on quality filtered data.
Argument Help in_file
Full path to a single .fastq file dbr_start
Integer. Index of first base in DBR. NOTE: Python indexing starts at 0. dbr_stop
Integer. Index of first base after end of DBR. EXAMPLE: If DBR ends at base position 9, dbr_stop
= 10.test_dict = False
Logical. If True, print samples of dictionary entries to check for proper formatting. save = None
Optional text string suffix to append to output file name. DBR dictionaries inherit the name of the file used to create them, plus an additional (optional) user-specified text string, and an automatically added .json extension. If save = None, the DBR dictionary file name will be the original file name with a .json extension.
The functions below are wrappers to several external software packages. These wrappers have two purposes: (1) to facilitate batch processing of multiple input files in parallel when appropriate, and (2) to allow sequential execution of multiple steps from a single Python script to call all of the functions. Calls to these wrapper functions can be done in any order the user chooses, provided that the input and output directories are all properly specified.
- Merge overlapping reads with PEAR (only if overlap between read 1 and read 2 is expected).
- Quality filter merged reads with FASTQ Quality Filter.
- Make DBR dictionaries using integrated_denovo_pipeline.py.
- Demultiplex merged, quality filtered reads.
- Trim the demultiplexed sequences to uniform length (only if using Stacks for de novo assembly, which requires uniform read lengths).
- Run ustacks and cstacks scripts from Stacks to complete de novo assembly.
- Extract consensus sequences from Stacks catalog using the GeneratePseudoref function in integrated_denovo_pipeline.py.
- Reference map the trimmed sequences (output of step 5) to the consensus sequence extracted in step 7.
- Filter out over-represented DBRs with the DBR_Filter function in assembled_DBR_filtering.py. The output of this step is a single .fastq per library (un-demultiplexed -- will be updated to output a single .fastq per sample at some point).
- Repeat steps 4-8 using the DBR filtered data.
- Call genotypes using Samtools Mpileup.
These Python wrappers are not necessary to use the external software. Please note that these Python wrappers do not add any functionality to the external software. Furthermore, these wrappers are not intended to cover all possible usages of the external software, and in many cases their specific implementation may not be suitable for all needs.
All external software should be appropriately cited.
Please refer to the developer documentation for further detail about functionality and options.
-
parallel_PEAR_assemble
Passes arguments to PEAR_assemble to merge read pairs for all pairs in a directory.
Argument Help regexR1
Regular expression for finding Read 1 in in_dir
.regexR2
Regular expression for finding Read 2 in in_dir
.regexLibrary
Regular expression for finding the library name within the files. The library name will be extracted and used in the output file name. in_dir
Full path to directory of read 1 and read 2 .fastq files. out_dir
Full path to output directory for merged reads. out_name = 'pear_merged\_'
Prefix to prepend to library name to create output file name. Default is 'pear_merged_' but any text string is acceptable. extra_params = 'None'
Extra parameters to pass to PEAR, e.g. minimum and maximum merged read lengths. Default is no extra parameters. pearPath
Path to PEAR executable. -
Pear_assemble
Assemble overlapping reads using PEAR.
Argument Help R1
Full path to Read 1. R2
Full path to Read 2. out_dir
Full path to output directory for merged reads. out_file
Name for output file. extra_params
Extra parameters to pass to PEAR, e.g. minimum and maximum merged read lengths. Default is no extra parameters. pearPath
Path to PEAR executable. -
parallel_FASTQ_quality_filter
Passes arguments to FASTQ_quality_filter to filter .fastq files using FASTQ Quality Filter from the FASTX Toolkit.
Argument Help in_dir
Full path to directory of .fastq files for quality filtering. out_dir
Full path to output directory for filtered .fastq.gz files. out_name
Suffix to append to output file name. q
Minimum quality score to keep ( -q
in FASTQ Quality Filter documentation).p
Minimum percent of bases that must have -q
quality (-p
in FASTQ Quality Filter documentation).qualityFilter
Path to FASTQ Quality Filter executable. read = '*'
Regex for locating reads to quality filter. Default is all reads in in_dir
-
FASTQ_quality_filter
Quality filters .fastq files using FASTQ Quality Filter from the FASTX Toolkit.
Argument Help in_file
Full path to input .fastq file for quality filtering ( -i
in FASTQ Quality Filter Documentation).out_file
Full path to output file for filtered .fastq.gz files ( -o
in FASTQ Quality Filter Documentation).q
Minimum quality score to keep ( -q
in FASTQ Quality Filter documentation).p
Minimum percent of bases that must have -q
quality (-p
in FASTQ Quality Filter documentation).qualityFilter Path to FASTQ Quality Filter executable. -
parallel_Trim
Passes arguments to Trim for trimming a directory of .fastq files using FASTQ Trimmer from the FASTX Toolkit.
Argument Help in_dir
Full path to directory of .fastq files for trimming with FASTQ Trimmer. out_dir
Full path to output directory for trimmed .fastq files. trimPath
Path to FASTQ Trimmer executable. first_base
First base to keep ( -f
in FASTQ Trimmer, but note that this Python wrapper does not inherit FASTQ Trimmer defaults).last_base = None
Last base to keep. Default is None, meaning all bases after the first base to keep first_base
will be retained. (-l
in FASTQ Trimmer, but note that this Python wrapper does not inherit FASTQ Trimmer defaults).suffix = '_trimmed.fq'
Text string to be appended to output file name. Default is '_trimmed.fq' but any text string is acceptable. -
Trim
Trims a .fastq file using FASTQ Trimmer from the FASTX Toolkit.
Argument Help in_file
Full path to .fastq file for trimming with FASTQ Trimmer. out_file
Full path to output directory for trimmed .fastq files. first_base
First base to keep ( -f
in FASTQ Trimmer, but note that this Python wrapper does not inherit FASTQ Trimmer defaults).last_base = None
Last base to keep. Default is None, meaning all bases after the first base to keep first_base
will be retained. (-l
in FASTQ Trimmer, but note that this Python wrapper does not inherit FASTQ Trimmer defaults).trimPath
Path to FASTQ Trimmer executable. execute = True
Should the call to FASTQ Trimmer be added to the Trim
function's internal processing queue (execute = True
), or added to a thread-limited external processing queue (execute = False
). Selectexecute = True
when there are more available processors than there are files to trim, and all can be processed simultaneously. Selectexecute = False
when there are more files to trim than there are processors available for more efficient handling of parallel processing by external libraries. -
iterative_Demultiplex
Argument Help in_dir
Full path to directory of .fastq files to be demultiplexed. The .fastq files in this directory must have names that correspond to barcode file names such that a regular expression can be used to pair them. Example: Library1_quality_filtered.fastq. barcode_dir
Full path to directory of barcode files corresponding to .fastq files for demultiplexing. Individual barcode files in this directory should follow the format specified in the FASTX Barcode Splitter documentation. The barcode files in this directory must have names that correspond to the .fastq file names such that a regular expression can be used to pair them. Example: Library1_barcodes.txt. out_dir
Full path to directory for outputting demultiplexed files. regexLibrary
Regular expression for finding the library name within the .fastq files. The library name will be extracted and used to look up the corresponding barcode file for that library. *Example: regexLibrary = "Library\d{1}*"
will identify "Library1".out_prefix = 'demultiplexed\_'
Prefix to prepend to demultiplexed file names. Default is "demultiplexed_" but any text string is acceptable. -
Demultiplex
Argument Help in_file
Full path to a .fastq file to be demultiplexed. barcode_file
Full path to a barcode file to use for demultiplexing. The barcode file should follow the format specified in the FASTX Barcode Splitter documentation. out_dir
Full path to directory for outputting demultiplexed files. out_prefix = 'demultiplexed\_'
Prefix to prepend to demultiplexed file names. Default is "demultiplexed_" but any text string is acceptable. -
denovo_Ustacks
Argument Help in_dir
denovo_path
stacks_executables
out_dir
m
n
b
D
-
denovo_Cstacks
Argument Help in_dir
denovo_path
stacks_executables
out_dir
m
n
b
D
-
GeneratePseudoref
Argument Help in_dir
out_file
BWA_path
-
refmap_BWA
Argument Help in_dir
out_dir
BWA_path
pseudoref_full_path
-
callGeno
Argument Help sam_in
pseudoref
BCFout
VCFout
-
configureLogging
Not fully implemented, but in theory this will help with generating logfiles. For the time being, stdout and stderr capture most of the relevant information.
-
checkFile
Not fully implemented, but in theory checks to make sure input files exist and have the appropriate type and raises an exception if false.
-
checkExe
Not fully implemented, but in theory checks to make sure executable files exist and returns True/False.