Importing HTS data
When sequencing aptamers from HT-SELEX experiments, individual selection cycles are typically sequenced together via a technique known as multiplexing. Multiplexing involves extending the aptamers from each round of selection with a unique barcode, a known sequence typically 5-10 nt in length. The aptamers of the individual selection cycles are subsequently mixed, sequenced together, and separated (demultiplexed) downstream using computational tools. While a number of general-purpose demultiplexing tools exist, none of them takes the specific properties and structure of aptamers into account. For instance, aptamers are usually composed of a randomized region of predefined length flanked by primer sites that are required during the amplification stages of SELEX. Furthermore, depending on the read length used during sequencing, the aptamers might be shorter than the reads, which then contain additional nucleotides after the 3' primer that must be removed prior to analysing the data.
AptaSUITE utilizes AptaPLEX, an efficient, multithreaded demultiplexer specifically designed for HT-SELEX data that takes the unique properties of aptamers into account. Given the data from an HT-SELEX experiment in fastq format (both single-end and paired-end are supported), AptaPLEX partitions the reads into the individual selection cycles based on the barcodes used during sequencing. Simultaneously, AptaPLEX identifies the 5' primer and the 3' primer in each read and removes any additional nucleotides on either side that do not belong to the original aptamer. AptaPLEX is capable of fuzzy matching for both the barcodes and the primers, allowing for a user-specified number of mismatches between the barcode/primers and their best match in the read. Additionally, for paired-end data, AptaPLEX automatically corrects mismatches between forward and reverse reads up to a user-defined threshold. AptaPLEX automatically handles gzip-compressed files and makes use of all available processing resources via its multithreaded design while minimizing memory usage. This Java version of AptaPLEX uses components of the excellent MiTool library for its operations.
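To illustrate the fuzzy matching idea described above, the following Python sketch assigns a read to a selection cycle by comparing its leading bases against each barcode and accepting the best match within a mismatch budget. It is a simplified illustration and not AptaPLEX's actual implementation: the function names `hamming` and `assign_cycle` are hypothetical, the barcode is assumed to sit at the very start of the read, only substitutions are counted (no insertions or deletions), and rejecting ties is a design choice made for this example.

```python
def hamming(a: str, b: str) -> int:
    """Count the positions at which two equal-length strings differ."""
    if len(a) != len(b):
        raise ValueError("sequences must have equal length")
    return sum(x != y for x, y in zip(a, b))


def assign_cycle(read: str, barcodes: list[str], tolerance: int = 1):
    """Return the index of the best-matching barcode, or None.

    Assumption (illustration only): the barcode is the prefix of the read.
    A read is rejected if no barcode lies within `tolerance` mismatches
    or if two barcodes tie for the best score.
    """
    scored = []
    for index, barcode in enumerate(barcodes):
        prefix = read[:len(barcode)]
        if len(prefix) < len(barcode):
            continue  # read is too short to contain this barcode
        scored.append((hamming(prefix, barcode), index))
    if not scored:
        return None
    scored.sort()
    best_distance, best_index = scored[0]
    if best_distance > tolerance:
        return None
    if len(scored) > 1 and scored[1][0] == best_distance:
        return None  # ambiguous assignment
    return best_index


barcodes = ["ATGCGT", "GACGAC", "GGTACC"]
print(assign_cycle("ATGCGTAAAC", barcodes))  # exact match to the first barcode
print(assign_cycle("GACGATAAAC", barcodes))  # one mismatch to the second barcode
print(assign_cycle("TTTTTTAAAC", barcodes))  # no barcode within tolerance
```

The `tolerance` argument plays the role of the user-specified number of mismatches; raising it accepts more reads at the cost of a higher risk of assigning a read to the wrong cycle.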
AptaPLEX can be called with the following command within AptaSUITE:
java -jar aptasuite.jar -parse -config /path/to/configuration/file
During parsing, a number of statistics aimed at providing an overview of the data import progress are displayed in real time in both the CLI and the GUI. These are:
- Total Processed Reads: The total number of reads which have been processed including accepted and discarded reads.
- Total Accepted Reads: The number of reads which have passed all the quality controls and parsing checks.
- Contig Assembly Failure: The number of reads which were discarded due to AptaPLEX not being able to assemble the forward and reverse read into a continuous contig (only in paired end mode).
- Invalid Alphabet: The number of reads discarded due to the contig containing letters other than A, C, G, or T.
- 5' Primer Error: The number of reads discarded due to not being able to identify the 5' primer on the contig.
- 3' Primer Error: The number of reads discarded due to not being able to identify the 3' primer on the contig.
- Invalid Cycle: The number of reads discarded because no barcode could be matched to the contig (only in multiplexed mode).
- Total Primer Overlaps: The number of reads discarded due to the identified 5' and 3' primer regions overlapping on the contig.
Use the New Experiment option in the File menu to start the Wizard, which will guide you through importing data into AptaSuite.
AptaPLEX takes as input the sequencing data of one HT-SELEX experiment in fastq, fasta or raw (one line per sequence) format. It supports both single-end and paired-end data in either uncompressed or gzipped form. The reads in these files are expected to contain a barcode followed by a 5' primer, the randomized region, and the 3' primer. In other words, the assembled read should have the following format:
N(A)-BARCODE5-N(B)-PRIMER5-N(C)-PRIMER3-BARCODE3-N(D)
where N(A), N(B), N(C), and N(D) can each be an arbitrary nucleotide sequence of any length (including 0).
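The layout above can be made concrete with a small Python sketch that locates the two primers by exact string search and returns the bases between them. It is only an illustration under simplifying assumptions: matching is exact (AptaPLEX itself allows mismatches), barcodes are ignored, primer overlaps are not detected because the 3' primer is searched only downstream of the 5' primer, and the function name `extract_randomized_region` as well as the example primer sequences are hypothetical.

```python
def extract_randomized_region(read: str, primer5: str, primer3: str):
    """Return (region, status) for the sequence between the two primers.

    The status labels loosely mirror the import statistics listed earlier.
    """
    if set(read) - set("ACGT"):
        return None, "Invalid Alphabet"
    start5 = read.find(primer5)
    if start5 == -1:
        return None, "5' Primer Error"
    end5 = start5 + len(primer5)
    # Search for the 3' primer only downstream of the 5' primer.
    start3 = read.find(primer3, end5)
    if start3 == -1:
        return None, "3' Primer Error"
    return read[end5:start3], "Accepted"


# Build a read following N(A)-BARCODE5-N(B)-PRIMER5-N(C)-PRIMER3-BARCODE3-N(D).
# All sequences below are made up for this example.
read = "TT" + "ATGCGT" + "C" + "GGGAGG" + "ACGTACGT" + "CCTTCC" + "GACGAC" + "AA"
region, status = extract_randomized_region(read, "GGGAGG", "CCTTCC")
print(region, status)
```

Here N(C) is the part that is kept, while everything before the 5' primer and after the 3' primer is trimmed away.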
The following options must be specified in order for AptaPLEX to succeed.
# If the data has previously been de-multiplexed using a third party tool and is
# present as one file per selection cycle, set this value to true. The default is true.
AptaplexParser.isPerFile = true
If AptaplexParser.isPerFile = true, the following parameters specify the input data:
# An equal number of files as there are selection cycles must be specified and
# in the same order
AptaplexParser.forwardFiles = path/to/round_0.fastq
AptaplexParser.forwardFiles = path/to/round_1.fastq
AptaplexParser.forwardFiles = path/to/round_3.fastq
AptaplexParser.forwardFiles = path/to/round_5.fastq
If AptaplexParser.isPerFile = false, the configuration looks slightly different. In addition, the barcodes (indices) must be specified in the same order as the selection cycles.
# One or more input files for the forward reads. If the data is not paired-end,
# specify the single-end data here.
AptaplexParser.forwardFiles = path/to/forward/reads1.fastq
AptaplexParser.forwardFiles = path/to/forward/reads2.fastq
.
.
.
AptaplexParser.forwardFiles = path/to/forward/readsN.fastq
# One or more input files for the reverse reads. The number and order of the files
# must coincide with the forwardFiles.
AptaplexParser.reverseFiles = path/to/reverse/reads1.fastq
AptaplexParser.reverseFiles = path/to/reverse/reads2.fastq
.
.
.
AptaplexParser.reverseFiles = path/to/reverse/readsN.fastq
# The five prime barcodes. Must be comma separated and in the same order
# as SelectionCycles
AptaplexParser.barcodes5Prime = ATGCGT, GACGAC, GGTACC, TCGTAG, CCATGG
# OPTIONAL (specify only if present in the sequencing data), the three
# prime barcodes. Must be in order of SelectionCycles and in 5' to 3' of
# the Forward Read.
AptaplexParser.barcodes3Prime = TAGCCA, ATCGAT, AATCAA, ATCGTA, GGTTAA
AptaPLEX supports a number of different input formats for the sequence reads, the default being fastq. The format can be changed as follows:
# Specifies the reader for the sequences depending on the input format (case sensitive).
# Current options are: FastqReader, RawReader
AptaplexParser.reader = FastqReader
In addition, the stringency of the parser can be modified with the following parameters
# For paired-end data only. The smallest overlap required between the forward and
# reverse read when creating a single contig out of the two.
AptaplexParser.PairedEndMinOverlap = 15
# Maximal number of mutations in the overlapping region for a sequence to be accepted
AptaplexParser.PairedEndMaxMutations = 5
# Highest score of the current quality. 55 for phred model.
AptaplexParser.PairedEndMaxScoreValue = 55
# Maximal number of mutations allowed in the barcodes
AptaplexParser.BarcodeTolerance = 1
# Maximal number of mutations allowed in the primer regions
AptaplexParser.PrimerTolerance = 3
# If DNA aptamers were used during the selection, it is likely that they were sequenced in reverse complement order
# By setting this option to true, AptaPlex will automatically convert the cDNA back into DNA
AptaplexParser.StoreReverseComplement = False
# If set to true, AptaPlex will attempt to demultiplex and extract the randomized region
# of the reverse complement of a contig should the initial attempt have failed. This setting is
# useful if you expect aptamers to be present as a mixture of forward and
# reverse-complement orientations in your sequencing data
AptaplexParser.CheckReverseComplement = False
# If set to true, AptaPlex will assume that the barcodes AND the Primers have already been removed
# by a third party application and will import the sequences without any checks (other than nucleotide validity).
# In addition, it is assumed that the data has already been demultiplexed. In other words, AptaplexParser.isPerFile
# needs to be set to true
AptaplexParser.OnlyRandomizedRegionInData = False
# If set to true, AptaPlex will dump all reads that failed processing for any reason to a fastq file located
# in the export folder of the project. The naming convention for these files is undetermined_name_of_source_file.fastq.[gz]
AptaplexParser.UndeterminedToFile = False
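To give an intuition for how the two paired-end parameters above (AptaplexParser.PairedEndMinOverlap and AptaplexParser.PairedEndMaxMutations) interact, the following Python sketch merges a forward and a reverse read into a single contig. It is a rough illustration, not the algorithm AptaPLEX uses: the function names are hypothetical, quality scores are ignored (so the forward bases are simply kept wherever the two reads disagree), and trying the longest overlap first is an assumption made for this example.

```python
COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(seq: str) -> str:
    """Return the reverse complement of a DNA sequence."""
    return seq.translate(COMPLEMENT)[::-1]


def assemble_contig(forward: str, reverse: str,
                    min_overlap: int = 15, max_mutations: int = 5):
    """Merge a read pair into one contig, or return None on failure."""
    rc = reverse_complement(reverse)
    longest = min(len(forward), len(rc))
    # Try the longest possible overlap first, then shrink it step by step.
    for overlap in range(longest, min_overlap - 1, -1):
        tail, head = forward[-overlap:], rc[:overlap]
        mismatches = sum(a != b for a, b in zip(tail, head))
        if mismatches <= max_mutations:
            # No quality scores here: keep the forward bases in the overlap.
            return forward + rc[overlap:]
    return None  # would count as a contig assembly failure


# A made-up 16 nt fragment sequenced from both ends with 12 nt reads.
fragment = "ACGTTGCAAGCTTAGC"
forward_read = fragment[:12]
reverse_read = reverse_complement(fragment[4:])
contig = assemble_contig(forward_read, reverse_read, min_overlap=6, max_mutations=0)
print(contig == fragment)
```

A larger minimum overlap and a smaller number of allowed mutations make the assembly stricter, so fewer read pairs are accepted but the resulting contigs are more reliable. The `reverse_complement` helper also shows the kind of conversion that the AptaplexParser.StoreReverseComplement option refers to.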
AptaPLEX processes the reads in parallel using a producer-consumer model. The size of the queue containing the items to be processed can be controlled with
AptaplexParser.BlockingQueueSize = 500
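The producer-consumer pattern with a bounded queue can be sketched in a few lines of Python. This is a generic illustration of the pattern, not AptaPLEX's Java implementation: the function names, the number of consumers, and the placeholder "processing" step (upper-casing the read) are all assumptions made for the example. The `queue_size` argument corresponds conceptually to AptaplexParser.BlockingQueueSize.

```python
import queue
import threading

SENTINEL = None  # tells a consumer that no more reads will arrive


def producer(reads, work_queue, n_consumers):
    for read in reads:
        work_queue.put(read)  # blocks while the queue is full
    for _ in range(n_consumers):
        work_queue.put(SENTINEL)  # one stop signal per consumer


def consumer(work_queue, results, lock):
    while True:
        read = work_queue.get()  # blocks while the queue is empty
        if read is SENTINEL:
            break
        processed = read.upper()  # placeholder for the real parsing work
        with lock:
            results.append(processed)


def run_pipeline(reads, queue_size=500, n_consumers=4):
    work_queue = queue.Queue(maxsize=queue_size)
    results, lock = [], threading.Lock()
    workers = [
        threading.Thread(target=consumer, args=(work_queue, results, lock))
        for _ in range(n_consumers)
    ]
    for worker in workers:
        worker.start()
    producer(reads, work_queue, n_consumers)
    for worker in workers:
        worker.join()
    return results


print(sorted(run_pipeline(["acgt", "ttaa", "ggcc"], queue_size=2, n_consumers=2)))
```

Bounding the queue keeps memory usage in check: when the consumers fall behind, the producer pauses instead of reading the entire input into memory. A larger queue can smooth out bursts at the cost of holding more reads in memory at once.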
Depending on the data at hand, the user might wish to only include aptamers within a certain range of the desired randomized region length, e.g. to account for a small number of insertions or deletions that were introduced during the selection. As of AptaSuite v0.8.8, this behavior can be configured using the following parameters.
# Specifies the smallest randomized region size to be accepted (inclusive)
AptaplexParser.randomizedRegionSizeLowerBound = 45
# Specifies the largest randomized region size to be accepted (inclusive)
AptaplexParser.randomizedRegionSizeUpperBound = 55
Note that if Experiment.randomizedRegionSize is specified, the range will be ignored in favor of this parameter.
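The precedence between the exact size and the range can be summarized in a short Python sketch. The function `accepts_length` and its argument names are hypothetical and only mirror the configuration keys conceptually; accepting every length when neither an exact size nor bounds are given is an assumption of this example, not documented behavior.

```python
def accepts_length(region: str, exact_size=None,
                   lower_bound=None, upper_bound=None) -> bool:
    """Decide whether a randomized region has an acceptable length."""
    if exact_size is not None:
        # An exact size takes precedence; the bounds are ignored.
        return len(region) == exact_size
    if lower_bound is not None and len(region) < lower_bound:
        return False
    if upper_bound is not None and len(region) > upper_bound:
        return False
    return True  # both bounds are inclusive


print(accepts_length("A" * 50, lower_bound=45, upper_bound=55))
print(accepts_length("A" * 56, lower_bound=45, upper_bound=55))
print(accepts_length("A" * 50, exact_size=40, lower_bound=45, upper_bound=55))
```

With the bounds from the configuration example (45 and 55), a 50 nt region passes and a 56 nt region does not; once an exact size of 40 is given, the same 50 nt region is rejected regardless of the bounds.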
Starting with version 0.9.7b, a new option called batch mode has been added to AptaSuite. When activated, this option imports every sequence without checking for the presence of primers or barcodes and only checks whether the length of the sequence conforms to either Experiment.randomizedRegionSize or the range AptaplexParser.randomizedRegionSizeLowerBound - AptaplexParser.randomizedRegionSizeUpperBound. Note that for this mode to work, the input files must be demultiplexed and single-end. This mode is useful for importing preprocessed aptamers without any primers attached to them and analyzing them within AptaSuite. To activate batch mode, set
AptaplexParser.BatchMode = true