MOST UPDATED SCRIPTS ARE IN A NEW REPOSITORY:

https://github.com/kellyp2738/RADseqDuplicateFiltering

DRAFT DBR Processing Pipeline

Overview

The DBR Processing Pipeline detects and removed PCR duplicates from paired-end Illumina sequence data that have been generated with DBR tags in read 2.

This pipeline can be used for reference-mapped or de novo assembly; however, for de novo assembly, a pseudoreference must be generated. A pseudoreference sequence is the consensus sequence generated by a de novo assembly. The

There are two main components to the pipeline for processing DBR-tagged sequences:

assembled_DBR_filtering.py generates a hash table associating Illumina Hi-Seq sequence IDs with their associated DBRs.
integrated_denovo_pipeline.py generates the initial DBR_dict from the raw sequence data, and also performs other optional quality filtering and assembly operations. Users can substitute any other assembly methods as long as they generate SAM files as output, as DBR filtering is dependent on the existence of SAM files.

TODO: Move DBR_dict function from integrated_denovo_pipeline.py to assembled_DBR_filtering.py. This reflects a better organization: optional pipeline components for assembly are in integrative_denovo_pipeline.py and the DBR specific components are in assembled_DBR_filtering.py.

General operation of pipeline

Run DBR_dict (integrated_denovo_pipeline.py)
Assemble sequences using any pipeline that produces output in SAM format. Additional functions in integrated_denovo_pipeline.py will produce output in SAM format, but their use is not required.
Run assembled_DBR_filter.py to identify over-representation of DBR sequences in SAM files and remove those sequences. Output the resulting data to FASTA format.
Re-run the assembly pipeline on the DBR-filtered dataset to produce a robust assembly with PCR duplicates removed.

Requirements

Python 2.7
Linux (untested on OS X but likely to work; untested on Windows)
Optional software for running assembly functions in integrative_denovo_pipeline.py
- FastX toolkit
- Pear
- Stacks
- BWA
- Samtools
- Vcftools

Installation

These are stand-alone scripts; no separate installation is necessary. The scripts import Python modules that will be part of any basic Python installation.

Optional software

Please see original documentation for installation instructions for external software.

PEAR
FASTX Toolkit
Stacks
BWA
Samtools/Vcftools/Bcftools

Expected Data

Raw Illumina HiSeq paired-end data with DBR sequences in read 2. Quality scores should be coded in ASCII-33 format.
Assembled sequence data in SAM format. Each SAM file should have A. The sample ID as part of the file name B. If data from multiple libraries are to be used, the library name (or any unique identifier) in the SAM file name.
Barcode sequences for demultiplexing, with the library name (or any unique identifier) in the file name. This must match the library name found in the SAM file.

assembled_DBR_filtering.py

User-controlled functions

DBR_Filter

Function overview

This function performs the main filtering to remove PCR duplicates based on the DBR tags.

The assembled SAM file for a single sample is loaded, and regular expressions are used to find a load the barcode file and DBR dictionary (as produced by DBR_dict) that contain data for that sample. The DBR_dict created from the raw data has the following structure

 {Sequence ID : DBR}

The SAM file is read line-by-line, and a set of two additonal dictionaries for that SAM file are created:

A. assembly_dict_2: This dictionary is used to track the number of times each unique DBR occurs as reference mapped to the same location in the reference.

 	{RNAME : {DBR: count}}

B. assembly_dict_3: This dictionary is used to track the ID and quality for each assembled sequence. Sequences that pass the DBR filtering threshold will be extacted from this dictionary and saved as a new .fasta file (see "n_expected" below for details on the DBR filtering threshold).

 {RNAME : {DBR: {Sequence ID, QUAL, SEQ}}}

Glossary

Name	Definition
Sequence ID	Illumina HiSeq ID, listed under "QNAME" in SAM file.
RNAME	The reference sequence ID to which the individual sequences map. (RNAME is a SAM file header)
DBR	The DBR sequence retrieved by matching QNAME from the SAM file to the Sequence ID from the DBR_dict.
count	The number of occurrences of a specific DBR in sequences mapped to the same reference sequence.
QUAL	The ASCII quality sequence corresponding to the Sequence ID. (QUAL is a SAM file header)
SEQ	The actual sequence corresponding to the Sequence ID. (SEQ is a SAM file header)

Function operation

Argument	Help
`assembled_dir`	Full path to directory containing SAM files to filter.
`out_dir`	Full path to output .fasta file for saving filtered sequences and the run logfile.
`n_expected`	Integer. maximum number of replicate DBRs tolerated.
`barcode_dir`	The full file path to the directory containing the unique sample barcodes by which samples are demultiplexed.
`dict_dir`	The full file path to the directory containing the DBR dictionaries created by DBR_dict.
`barcode_file = None`	Deprecated. Used for analyzing only 1 library.
`test_dict = True`	Logical. If True, print samples of dictionary entries to check for proper formatting.
`phred_dict = phred_dict`	Globally defined variable; dictionary type. Converts ASCII-33 quality scores to integers which can be used in a median calculation. ASCII-64 is not supported.
`samMapLen = None`	Integer or None. Expected length of map matches in the SAM file.

Silently called functions

Users do not manipulate these functions directly. They are instead called by DBR_Filter to parse input data for the filtering process.

qual_median

Calculates the median quality score of a sequence. It is called by DBR_Filter during the filtering process.

Argument	Help
`QUAL`	ASCII quality string extracted from SAM file.
`phred_dict`	Globally defined variable; dictionary type. Converts ASCII-33 quality scores to integers which can be used in a median calculation. ASCII-64 is not supported.

find_SampleID

Extracts the sample ID from a SAM file name using a regular expression. That regular expression is hard-coded as
```
 \d{1,3}T?
```
This identifies 1-3 digits for the sample ID optionally followed by a "T" indicating samples which are technical replicates.

Future versions of the software will allow for this regular expresson to be user-specified.

Argument Help

filename SAM file name.
find_LibraryID

Extracts the library ID from a SAM file name using a regular expression. That regular expression is hard-coded as
```
 Library\d{1,3}[A|B]?
```
This identifies the word "Library" followed by 1-3 digits for the library ID, optionally followed by an "A" or "B" for libraries that were split across multiple lanes.

Future versions of the software will allow for this regular expresson to be user-specified.

Argument Help

filename SAM file name.
find_BarcodeFile

Argument Help

library The value returned by find_LibraryID.

directory The full file path to the directory containing the unique sample barcodes by which samples are demultiplexed.
find_DBRdictionary

Argument Help

library The value returned by find_LibraryID.

directory The full file path to the directory containing the DBR dictionaries created by DBR_dict.

integrated_denovo_pipeline.py

User-controlled functions

Integral parts of DBR filtering

parallel_DBR_dict

Passes arguments to DBR_dict to create DBR dictionaries for all .fastq files in a directory.

Recommended to run on quality filtered data.

Argument	Help
`in_dir`	Full path to directory containing a set of .fastq files.
`seqType`	'read2' or 'pear'. If 'read2', expect only read 2 files in `in_dir`. If 'pear', expect only read 1 and read 2 merged with PEAR in "in_dir".
`dbr_start`	Integer. Index of first base in DBR. NOTE: Python indexing starts at 0.
`dbr_stop`	Integer. Index of first base after end of DBR. EXAMPLE: If DBR ends at base position 9, dbr_stop = 10.
`test_dict = False`	Logical. If True, print samples of dictionary entries to check for proper formatting.
`save = None`	Optional text string suffix to append to output file name. DBR dictionaries inherit the name of the file used to create them, plus an additional (optional) user-specified text string, and an automatically added .json extension. If save = None, the DBR dictionary file name will be the original file name with a .json extension.

DBR_dict

Create a DBR dictionary for a .fastq file.

Recommended to run on quality filtered data.

Argument	Help
`in_file`	Full path to a single .fastq file
`dbr_start`	Integer. Index of first base in DBR. NOTE: Python indexing starts at 0.
`dbr_stop`	Integer. Index of first base after end of DBR. EXAMPLE: If DBR ends at base position 9, `dbr_stop` = 10.
`test_dict = False`	Logical. If True, print samples of dictionary entries to check for proper formatting.
`save = None`	Optional text string suffix to append to output file name. DBR dictionaries inherit the name of the file used to create them, plus an additional (optional) user-specified text string, and an automatically added .json extension. If save = None, the DBR dictionary file name will be the original file name with a .json extension.

Optional Python wrappers to external software

The functions below are wrappers to several external software packages. These wrappers have two purposes: (1) to facilitate batch processing of multiple input files in parallel when appropriate, and (2) to allow sequential execution of multiple steps from a single Python script to call all of the functions. Calls to these wrapper functions can be done in any order the user chooses, provided that the input and output directories are all properly specified.

Sample Workflow

Merge overlapping reads with PEAR (only if overlap between read 1 and read 2 is expected).
Quality filter merged reads with FASTQ Quality Filter.
Make DBR dictionaries using integrated_denovo_pipeline.py.
Demultiplex merged, quality filtered reads.
Trim the demultiplexed sequences to uniform length (only if using Stacks for de novo assembly, which requires uniform read lengths).
Run ustacks and cstacks scripts from Stacks to complete de novo assembly.
Extract consensus sequences from Stacks catalog using the GeneratePseudoref function in integrated_denovo_pipeline.py.
Reference map the trimmed sequences (output of step 5) to the consensus sequence extracted in step 7.
Filter out over-represented DBRs with the DBR_Filter function in assembled_DBR_filtering.py. The output of this step is a single .fastq per library (un-demultiplexed -- will be updated to output a single .fastq per sample at some point).
Repeat steps 4-8 using the DBR filtered data.
Call genotypes using Samtools Mpileup.

These Python wrappers are not necessary to use the external software. Please note that these Python wrappers do not add any functionality to the external software. Furthermore, these wrappers are not intended to cover all possible usages of the external software, and in many cases their specific implementation may not be suitable for all needs.

All external software should be appropriately cited.

Please refer to the developer documentation for further detail about functionality and options.

parallel_PEAR_assemble

Passes arguments to PEAR_assemble to merge read pairs for all pairs in a directory.

Argument	Help
`regexR1`	Regular expression for finding Read 1 in `in_dir`.
`regexR2`	Regular expression for finding Read 2 in `in_dir`.
`regexLibrary`	Regular expression for finding the library name within the files. The library name will be extracted and used in the output file name.
`in_dir`	Full path to directory of read 1 and read 2 .fastq files.
`out_dir`	Full path to output directory for merged reads.
`out_name = 'pear_merged\_'`	Prefix to prepend to library name to create output file name. Default is 'pear_merged_' but any text string is acceptable.
`extra_params = 'None'`	Extra parameters to pass to PEAR, e.g. minimum and maximum merged read lengths. Default is no extra parameters.
`pearPath`	Path to PEAR executable.

Pear_assemble

Assemble overlapping reads using PEAR.

Argument	Help
`R1`	Full path to Read 1.
`R2`	Full path to Read 2.
`out_dir`	Full path to output directory for merged reads.
`out_file`	Name for output file.
`extra_params`	Extra parameters to pass to PEAR, e.g. minimum and maximum merged read lengths. Default is no extra parameters.
`pearPath`	Path to PEAR executable.

parallel_FASTQ_quality_filter

Passes arguments to FASTQ_quality_filter to filter .fastq files using FASTQ Quality Filter from the FASTX Toolkit.

Argument	Help
`in_dir`	Full path to directory of .fastq files for quality filtering.
`out_dir`	Full path to output directory for filtered .fastq.gz files.
`out_name`	Suffix to append to output file name.
`q`	Minimum quality score to keep (`-q` in FASTQ Quality Filter documentation).
`p`	Minimum percent of bases that must have `-q` quality (`-p` in FASTQ Quality Filter documentation).
`qualityFilter`	Path to FASTQ Quality Filter executable.
`read = '*'`	Regex for locating reads to quality filter. Default is all reads in `in_dir`

FASTQ_quality_filter

Quality filters .fastq files using FASTQ Quality Filter from the FASTX Toolkit.

Argument	Help
`in_file`	Full path to input .fastq file for quality filtering (`-i` in FASTQ Quality Filter Documentation).
`out_file`	Full path to output file for filtered .fastq.gz files (`-o` in FASTQ Quality Filter Documentation).
`q`	Minimum quality score to keep (`-q` in FASTQ Quality Filter documentation).
`p`	Minimum percent of bases that must have `-q` quality (`-p` in FASTQ Quality Filter documentation).
qualityFilter	Path to FASTQ Quality Filter executable.

parallel_Trim

Passes arguments to Trim for trimming a directory of .fastq files using FASTQ Trimmer from the FASTX Toolkit.

Argument	Help
`in_dir`	Full path to directory of .fastq files for trimming with FASTQ Trimmer.
`out_dir`	Full path to output directory for trimmed .fastq files.
`trimPath`	Path to FASTQ Trimmer executable.
`first_base`	First base to keep (`-f` in FASTQ Trimmer, but note that this Python wrapper does not inherit FASTQ Trimmer defaults).
`last_base = None`	Last base to keep. Default is None, meaning all bases after the first base to keep `first_base` will be retained. (`-l` in FASTQ Trimmer, but note that this Python wrapper does not inherit FASTQ Trimmer defaults).
`suffix = '_trimmed.fq'`	Text string to be appended to output file name. Default is '_trimmed.fq' but any text string is acceptable.

Trim

Trims a .fastq file using FASTQ Trimmer from the FASTX Toolkit.

Argument	Help
`in_file`	Full path to .fastq file for trimming with FASTQ Trimmer.
`out_file`	Full path to output directory for trimmed .fastq files.
`first_base`	First base to keep (`-f` in FASTQ Trimmer, but note that this Python wrapper does not inherit FASTQ Trimmer defaults).
`last_base = None`	Last base to keep. Default is None, meaning all bases after the first base to keep `first_base` will be retained. (`-l` in FASTQ Trimmer, but note that this Python wrapper does not inherit FASTQ Trimmer defaults).
`trimPath`	Path to FASTQ Trimmer executable.
`execute = True`	Should the call to FASTQ Trimmer be added to the `Trim` function's internal processing queue (`execute = True`), or added to a thread-limited external processing queue (`execute = False`). Select `execute = True` when there are more available processors than there are files to trim, and all can be processed simultaneously. Select `execute = False` when there are more files to trim than there are processors available for more efficient handling of parallel processing by external libraries.

iterative_Demultiplex

Argument	Help
`in_dir`	Full path to directory of .fastq files to be demultiplexed. The .fastq files in this directory must have names that correspond to barcode file names such that a regular expression can be used to pair them. Example: Library1_quality_filtered.fastq.
`barcode_dir`	Full path to directory of barcode files corresponding to .fastq files for demultiplexing. Individual barcode files in this directory should follow the format specified in the FASTX Barcode Splitter documentation. The barcode files in this directory must have names that correspond to the .fastq file names such that a regular expression can be used to pair them. Example: Library1_barcodes.txt.
`out_dir`	Full path to directory for outputting demultiplexed files.
`regexLibrary`	Regular expression for finding the library name within the .fastq files. The library name will be extracted and used to look up the corresponding barcode file for that library. Example: `regexLibrary = "Library\d{1}"` will identify "Library1".
`out_prefix = 'demultiplexed\_'`	Prefix to prepend to demultiplexed file names. Default is "demultiplexed_" but any text string is acceptable.

Demultiplex

Argument	Help
`in_file`	Full path to a .fastq file to be demultiplexed.
`barcode_file`	Full path to a barcode file to use for demultiplexing. The barcode file should follow the format specified in the FASTX Barcode Splitter documentation.
`out_dir`	Full path to directory for outputting demultiplexed files.
`out_prefix = 'demultiplexed\_'`	Prefix to prepend to demultiplexed file names. Default is "demultiplexed_" but any text string is acceptable.

denovo_Ustacks

Argument Help

in_dir

denovo_path

stacks_executables

out_dir

m

n

b

D
denovo_Cstacks

Argument Help

in_dir

denovo_path

stacks_executables

out_dir

m

n

b

D
GeneratePseudoref

Argument Help

in_dir

out_file

BWA_path
refmap_BWA

Argument Help

in_dir

out_dir

BWA_path

pseudoref_full_path
callGeno

Argument Help

sam_in

pseudoref

BCFout

VCFout

Silently called functions

configureLogging

Not fully implemented, but in theory this will help with generating logfiles. For the time being, stdout and stderr capture most of the relevant information.
checkFile

Not fully implemented, but in theory checks to make sure input files exist and have the appropriate type and raises an exception if false.
checkExe

Not fully implemented, but in theory checks to make sure executable files exist and returns True/False.

Name		Name	Last commit message	Last commit date
Latest commit History 298 Commits
BarcodesRound1		BarcodesRound1
BarcodesRound2		BarcodesRound2
DBR_sensitivity		DBR_sensitivity
Notes		Notes
RevisedBarcodes		RevisedBarcodes
Scripts		Scripts
workspaceCopy		workspaceCopy
PearAssemblyTest.Rmd		PearAssemblyTest.Rmd
PlotPrevCI.r		PlotPrevCI.r
README.md		README.md
RelatednessNetwork.rmd		RelatednessNetwork.rmd
barcodes_coverage.r		barcodes_coverage.r

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Repository files navigation

MOST UPDATED SCRIPTS ARE IN A NEW REPOSITORY:

https://github.com/kellyp2738/RADseqDuplicateFiltering

DRAFT DBR Processing Pipeline

Overview

General operation of pipeline

Requirements

Installation

Optional software

Expected Data

assembled_DBR_filtering.py

User-controlled functions

Silently called functions

integrated_denovo_pipeline.py

User-controlled functions

Integral parts of DBR filtering

Optional Python wrappers to external software

Sample Workflow

Silently called functions

About

Releases

Packages

Languages

Argument	Help
`library`	The value returned by find_LibraryID.
`directory`	The full file path to the directory containing the unique sample barcodes by which samples are demultiplexed.

Argument	Help
`in_dir`
`denovo_path`
`stacks_executables`
`out_dir`
`m`
`n`
`b`
`D`

Argument	Help
`sam_in`
`pseudoref`
`BCFout`
`VCFout`

kellypierce/CSU_ChronicWasting

Folders and files

Latest commit

History

Repository files navigation

MOST UPDATED SCRIPTS ARE IN A NEW REPOSITORY:

https://github.com/kellyp2738/RADseqDuplicateFiltering

DRAFT DBR Processing Pipeline

Overview

General operation of pipeline

Requirements

Installation

Optional software

Expected Data

assembled_DBR_filtering.py

User-controlled functions

Silently called functions

integrated_denovo_pipeline.py

User-controlled functions

Integral parts of DBR filtering

Optional Python wrappers to external software

Sample Workflow

Silently called functions

About

Resources

Stars

Watchers

Forks

Releases

Packages 0

Languages

Packages