Skip to content

Scripts for chronic wasting genome-wide association study. Please see repo 'RADseqDuplicateFiltering' for stand-alone filtering scripts.

Notifications You must be signed in to change notification settings

kellypierce/CSU_ChronicWasting

Repository files navigation

MOST UPDATED SCRIPTS ARE IN A NEW REPOSITORY:

DRAFT DBR Processing Pipeline

Overview

The DBR Processing Pipeline detects and removed PCR duplicates from paired-end Illumina sequence data that have been generated with DBR tags in read 2.

This pipeline can be used for reference-mapped or de novo assembly; however, for de novo assembly, a pseudoreference must be generated. A pseudoreference sequence is the consensus sequence generated by a de novo assembly. The

There are two main components to the pipeline for processing DBR-tagged sequences:

  1. assembled_DBR_filtering.py generates a hash table associating Illumina Hi-Seq sequence IDs with their associated DBRs.

  2. integrated_denovo_pipeline.py generates the initial DBR_dict from the raw sequence data, and also performs other optional quality filtering and assembly operations. Users can substitute any other assembly methods as long as they generate SAM files as output, as DBR filtering is dependent on the existence of SAM files.

TODO: Move DBR_dict function from integrated_denovo_pipeline.py to assembled_DBR_filtering.py. This reflects a better organization: optional pipeline components for assembly are in integrative_denovo_pipeline.py and the DBR specific components are in assembled_DBR_filtering.py.

General operation of pipeline

  1. Run DBR_dict (integrated_denovo_pipeline.py)
  2. Assemble sequences using any pipeline that produces output in SAM format. Additional functions in integrated_denovo_pipeline.py will produce output in SAM format, but their use is not required.
  3. Run assembled_DBR_filter.py to identify over-representation of DBR sequences in SAM files and remove those sequences. Output the resulting data to FASTA format.
  4. Re-run the assembly pipeline on the DBR-filtered dataset to produce a robust assembly with PCR duplicates removed.

Requirements

  1. Python 2.7
  2. Linux (untested on OS X but likely to work; untested on Windows)
  3. Optional software for running assembly functions in integrative_denovo_pipeline.py
    • FastX toolkit
    • Pear
    • Stacks
    • BWA
    • Samtools
    • Vcftools

Installation

These are stand-alone scripts; no separate installation is necessary. The scripts import Python modules that will be part of any basic Python installation.

Optional software

Please see original documentation for installation instructions for external software.

  1. PEAR
  2. FASTX Toolkit
  3. Stacks
  4. BWA
  5. Samtools/Vcftools/Bcftools

Expected Data

  1. Raw Illumina HiSeq paired-end data with DBR sequences in read 2. Quality scores should be coded in ASCII-33 format.
  2. Assembled sequence data in SAM format. Each SAM file should have A. The sample ID as part of the file name B. If data from multiple libraries are to be used, the library name (or any unique identifier) in the SAM file name.
  3. Barcode sequences for demultiplexing, with the library name (or any unique identifier) in the file name. This must match the library name found in the SAM file.

assembled_DBR_filtering.py

User-controlled functions

  1. DBR_Filter

    Function overview

    This function performs the main filtering to remove PCR duplicates based on the DBR tags.

    The assembled SAM file for a single sample is loaded, and regular expressions are used to find a load the barcode file and DBR dictionary (as produced by DBR_dict) that contain data for that sample. The DBR_dict created from the raw data has the following structure

     {Sequence ID : DBR}
    

    The SAM file is read line-by-line, and a set of two additonal dictionaries for that SAM file are created:

    A. assembly_dict_2: This dictionary is used to track the number of times each unique DBR occurs as reference mapped to the same location in the reference.

     	{RNAME : {DBR: count}}
    

    B. assembly_dict_3: This dictionary is used to track the ID and quality for each assembled sequence. Sequences that pass the DBR filtering threshold will be extacted from this dictionary and saved as a new .fasta file (see "n_expected" below for details on the DBR filtering threshold).

     {RNAME : {DBR: {Sequence ID, QUAL, SEQ}}}
    

    Glossary

    Name Definition
    Sequence ID Illumina HiSeq ID, listed under "QNAME" in SAM file.
    RNAME The reference sequence ID to which the individual sequences map. (RNAME is a SAM file header)
    DBR The DBR sequence retrieved by matching QNAME from the SAM file to the Sequence ID from the DBR_dict.
    count The number of occurrences of a specific DBR in sequences mapped to the same reference sequence.
    QUAL The ASCII quality sequence corresponding to the Sequence ID. (QUAL is a SAM file header)
    SEQ The actual sequence corresponding to the Sequence ID. (SEQ is a SAM file header)

    Function operation

    Argument Help
    assembled_dir Full path to directory containing SAM files to filter.
    out_dir Full path to output .fasta file for saving filtered sequences and the run logfile.
    n_expected Integer. maximum number of replicate DBRs tolerated.
    barcode_dir The full file path to the directory containing the unique sample barcodes by which samples are demultiplexed.
    dict_dir The full file path to the directory containing the DBR dictionaries created by DBR_dict.
    barcode_file = None Deprecated. Used for analyzing only 1 library.
    test_dict = True Logical. If True, print samples of dictionary entries to check for proper formatting.
    phred_dict = phred_dict Globally defined variable; dictionary type. Converts ASCII-33 quality scores to integers which can be used in a median calculation. ASCII-64 is not supported.
    samMapLen = None Integer or None. Expected length of map matches in the SAM file.

Silently called functions

Users do not manipulate these functions directly. They are instead called by DBR_Filter to parse input data for the filtering process.

  1. qual_median

    Calculates the median quality score of a sequence. It is called by DBR_Filter during the filtering process.

    Argument Help
    QUAL ASCII quality string extracted from SAM file.
    phred_dict Globally defined variable; dictionary type. Converts ASCII-33 quality scores to integers which can be used in a median calculation. ASCII-64 is not supported.
  2. find_SampleID

    Extracts the sample ID from a SAM file name using a regular expression. That regular expression is hard-coded as

     \d{1,3}T?
    

    This identifies 1-3 digits for the sample ID optionally followed by a "T" indicating samples which are technical replicates.

    Future versions of the software will allow for this regular expresson to be user-specified.

    Argument Help
    filename SAM file name.
  3. find_LibraryID

    Extracts the library ID from a SAM file name using a regular expression. That regular expression is hard-coded as

     Library\d{1,3}[A|B]?
    

    This identifies the word "Library" followed by 1-3 digits for the library ID, optionally followed by an "A" or "B" for libraries that were split across multiple lanes.

    Future versions of the software will allow for this regular expresson to be user-specified.

    Argument Help
    filename SAM file name.
  4. find_BarcodeFile

    Argument Help
    library The value returned by find_LibraryID.
    directory The full file path to the directory containing the unique sample barcodes by which samples are demultiplexed.
  5. find_DBRdictionary

    Argument Help
    library The value returned by find_LibraryID.
    directory The full file path to the directory containing the DBR dictionaries created by DBR_dict.

integrated_denovo_pipeline.py

User-controlled functions

Integral parts of DBR filtering

  1. parallel_DBR_dict

    Passes arguments to DBR_dict to create DBR dictionaries for all .fastq files in a directory.

    Recommended to run on quality filtered data.

    Argument Help
    in_dir Full path to directory containing a set of .fastq files.
    seqType 'read2' or 'pear'. If 'read2', expect only read 2 files in in_dir. If 'pear', expect only read 1 and read 2 merged with PEAR in "in_dir".
    dbr_start Integer. Index of first base in DBR. NOTE: Python indexing starts at 0.
    dbr_stop Integer. Index of first base after end of DBR. EXAMPLE: If DBR ends at base position 9, dbr_stop = 10.
    test_dict = False Logical. If True, print samples of dictionary entries to check for proper formatting.
    save = None Optional text string suffix to append to output file name. DBR dictionaries inherit the name of the file used to create them, plus an additional (optional) user-specified text string, and an automatically added .json extension. If save = None, the DBR dictionary file name will be the original file name with a .json extension.
  2. DBR_dict

    Create a DBR dictionary for a .fastq file.

    Recommended to run on quality filtered data.

    Argument Help
    in_file Full path to a single .fastq file
    dbr_start Integer. Index of first base in DBR. NOTE: Python indexing starts at 0.
    dbr_stop Integer. Index of first base after end of DBR. EXAMPLE: If DBR ends at base position 9, dbr_stop = 10.
    test_dict = False Logical. If True, print samples of dictionary entries to check for proper formatting.
    save = None Optional text string suffix to append to output file name. DBR dictionaries inherit the name of the file used to create them, plus an additional (optional) user-specified text string, and an automatically added .json extension. If save = None, the DBR dictionary file name will be the original file name with a .json extension.

Optional Python wrappers to external software

The functions below are wrappers to several external software packages. These wrappers have two purposes: (1) to facilitate batch processing of multiple input files in parallel when appropriate, and (2) to allow sequential execution of multiple steps from a single Python script to call all of the functions. Calls to these wrapper functions can be done in any order the user chooses, provided that the input and output directories are all properly specified.

Sample Workflow

  1. Merge overlapping reads with PEAR (only if overlap between read 1 and read 2 is expected).
  2. Quality filter merged reads with FASTQ Quality Filter.
  3. Make DBR dictionaries using integrated_denovo_pipeline.py.
  4. Demultiplex merged, quality filtered reads.
  5. Trim the demultiplexed sequences to uniform length (only if using Stacks for de novo assembly, which requires uniform read lengths).
  6. Run ustacks and cstacks scripts from Stacks to complete de novo assembly.
  7. Extract consensus sequences from Stacks catalog using the GeneratePseudoref function in integrated_denovo_pipeline.py.
  8. Reference map the trimmed sequences (output of step 5) to the consensus sequence extracted in step 7.
  9. Filter out over-represented DBRs with the DBR_Filter function in assembled_DBR_filtering.py. The output of this step is a single .fastq per library (un-demultiplexed -- will be updated to output a single .fastq per sample at some point).
  10. Repeat steps 4-8 using the DBR filtered data.
  11. Call genotypes using Samtools Mpileup.

These Python wrappers are not necessary to use the external software. Please note that these Python wrappers do not add any functionality to the external software. Furthermore, these wrappers are not intended to cover all possible usages of the external software, and in many cases their specific implementation may not be suitable for all needs.

All external software should be appropriately cited.

Please refer to the developer documentation for further detail about functionality and options.

  1. parallel_PEAR_assemble

    Passes arguments to PEAR_assemble to merge read pairs for all pairs in a directory.

    Argument Help
    regexR1 Regular expression for finding Read 1 in in_dir.
    regexR2 Regular expression for finding Read 2 in in_dir.
    regexLibrary Regular expression for finding the library name within the files. The library name will be extracted and used in the output file name.
    in_dir Full path to directory of read 1 and read 2 .fastq files.
    out_dir Full path to output directory for merged reads.
    out_name = 'pear_merged\_' Prefix to prepend to library name to create output file name. Default is 'pear_merged_' but any text string is acceptable.
    extra_params = 'None' Extra parameters to pass to PEAR, e.g. minimum and maximum merged read lengths. Default is no extra parameters.
    pearPath Path to PEAR executable.
  2. Pear_assemble

    Assemble overlapping reads using PEAR.

    Argument Help
    R1 Full path to Read 1.
    R2 Full path to Read 2.
    out_dir Full path to output directory for merged reads.
    out_file Name for output file.
    extra_params Extra parameters to pass to PEAR, e.g. minimum and maximum merged read lengths. Default is no extra parameters.
    pearPath Path to PEAR executable.
  3. parallel_FASTQ_quality_filter

    Passes arguments to FASTQ_quality_filter to filter .fastq files using FASTQ Quality Filter from the FASTX Toolkit.

    Argument Help
    in_dir Full path to directory of .fastq files for quality filtering.
    out_dir Full path to output directory for filtered .fastq.gz files.
    out_name Suffix to append to output file name.
    q Minimum quality score to keep (-q in FASTQ Quality Filter documentation).
    p Minimum percent of bases that must have -q quality (-p in FASTQ Quality Filter documentation).
    qualityFilter Path to FASTQ Quality Filter executable.
    read = '*' Regex for locating reads to quality filter. Default is all reads in in_dir
  4. FASTQ_quality_filter

    Quality filters .fastq files using FASTQ Quality Filter from the FASTX Toolkit.

    Argument Help
    in_file Full path to input .fastq file for quality filtering (-i in FASTQ Quality Filter Documentation).
    out_file Full path to output file for filtered .fastq.gz files (-o in FASTQ Quality Filter Documentation).
    q Minimum quality score to keep (-q in FASTQ Quality Filter documentation).
    p Minimum percent of bases that must have -q quality (-p in FASTQ Quality Filter documentation).
    qualityFilter Path to FASTQ Quality Filter executable.
  5. parallel_Trim

    Passes arguments to Trim for trimming a directory of .fastq files using FASTQ Trimmer from the FASTX Toolkit.

    Argument Help
    in_dir Full path to directory of .fastq files for trimming with FASTQ Trimmer.
    out_dir Full path to output directory for trimmed .fastq files.
    trimPath Path to FASTQ Trimmer executable.
    first_base First base to keep (-f in FASTQ Trimmer, but note that this Python wrapper does not inherit FASTQ Trimmer defaults).
    last_base = None Last base to keep. Default is None, meaning all bases after the first base to keep first_base will be retained. (-l in FASTQ Trimmer, but note that this Python wrapper does not inherit FASTQ Trimmer defaults).
    suffix = '_trimmed.fq' Text string to be appended to output file name. Default is '_trimmed.fq' but any text string is acceptable.
  6. Trim

    Trims a .fastq file using FASTQ Trimmer from the FASTX Toolkit.

    Argument Help
    in_file Full path to .fastq file for trimming with FASTQ Trimmer.
    out_file Full path to output directory for trimmed .fastq files.
    first_base First base to keep (-f in FASTQ Trimmer, but note that this Python wrapper does not inherit FASTQ Trimmer defaults).
    last_base = None Last base to keep. Default is None, meaning all bases after the first base to keep first_base will be retained. (-l in FASTQ Trimmer, but note that this Python wrapper does not inherit FASTQ Trimmer defaults).
    trimPath Path to FASTQ Trimmer executable.
    execute = True Should the call to FASTQ Trimmer be added to the Trim function's internal processing queue (execute = True), or added to a thread-limited external processing queue (execute = False). Select execute = True when there are more available processors than there are files to trim, and all can be processed simultaneously. Select execute = False when there are more files to trim than there are processors available for more efficient handling of parallel processing by external libraries.
  7. iterative_Demultiplex

    Argument Help
    in_dir Full path to directory of .fastq files to be demultiplexed. The .fastq files in this directory must have names that correspond to barcode file names such that a regular expression can be used to pair them. Example: Library1_quality_filtered.fastq.
    barcode_dir Full path to directory of barcode files corresponding to .fastq files for demultiplexing. Individual barcode files in this directory should follow the format specified in the FASTX Barcode Splitter documentation. The barcode files in this directory must have names that correspond to the .fastq file names such that a regular expression can be used to pair them. Example: Library1_barcodes.txt.
    out_dir Full path to directory for outputting demultiplexed files.
    regexLibrary Regular expression for finding the library name within the .fastq files. The library name will be extracted and used to look up the corresponding barcode file for that library. *Example: regexLibrary = "Library\d{1}*" will identify "Library1".
    out_prefix = 'demultiplexed\_' Prefix to prepend to demultiplexed file names. Default is "demultiplexed_" but any text string is acceptable.
  8. Demultiplex

    Argument Help
    in_file Full path to a .fastq file to be demultiplexed.
    barcode_file Full path to a barcode file to use for demultiplexing. The barcode file should follow the format specified in the FASTX Barcode Splitter documentation.
    out_dir Full path to directory for outputting demultiplexed files.
    out_prefix = 'demultiplexed\_' Prefix to prepend to demultiplexed file names. Default is "demultiplexed_" but any text string is acceptable.
  9. denovo_Ustacks

    Argument Help
    in_dir
    denovo_path
    stacks_executables
    out_dir
    m
    n
    b
    D
  10. denovo_Cstacks

    Argument Help
    in_dir
    denovo_path
    stacks_executables
    out_dir
    m
    n
    b
    D
  11. GeneratePseudoref

    Argument Help
    in_dir
    out_file
    BWA_path
  12. refmap_BWA

    Argument Help
    in_dir
    out_dir
    BWA_path
    pseudoref_full_path
  13. callGeno

    Argument Help
    sam_in
    pseudoref
    BCFout
    VCFout

Silently called functions

  1. configureLogging

    Not fully implemented, but in theory this will help with generating logfiles. For the time being, stdout and stderr capture most of the relevant information.

  2. checkFile

    Not fully implemented, but in theory checks to make sure input files exist and have the appropriate type and raises an exception if false.

  3. checkExe

    Not fully implemented, but in theory checks to make sure executable files exist and returns True/False.

About

Scripts for chronic wasting genome-wide association study. Please see repo 'RADseqDuplicateFiltering' for stand-alone filtering scripts.

Resources

Stars

Watchers

Forks

Releases

No releases published

Packages

No packages published