Exporting Data

drivenbyentropy edited this page Nov 9, 2017 · 13 revisions

Apart from AptaTRACE, which exports its data directly each time it is executed, AptaSUITE offers a range of export options for writing the processed data to file in various formats. Currently, the following can be exported: the pool (i.e. all unique aptamers encountered during the SELEX procedure), the individual selection cycles (all aptamers which were sequenced for a particular round), the structure information for each aptamer, and the clusters as computed by AptaCLUSTER.

In order to export all of these data, AptaSUITE can be called as follows (note that the comma-separated list must not contain any spaces):

java -jar aptasuite.jar -export pool,cycles,structures,clusters -config /path/to/configuration/file 

AptaSUITE supports exporting any combination of these options. For example, to export only the pool and structure information, the program is called as follows:

java -jar aptasuite.jar -export pool,structures -config /path/to/configuration/file 
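When the call is scripted, the no-spaces rule is easy to violate by accident. The following is a minimal sketch, not part of AptaSUITE, that assembles the argument list shown above; the helper name `build_export_command` and the `VALID_OPTIONS` tuple are hypothetical and only mirror the four options named on this page.

```python
import shlex

# The four export options named on this page (hypothetical constant).
VALID_OPTIONS = ("pool", "cycles", "structures", "clusters")


def build_export_command(options, config_path, jar="aptasuite.jar"):
    """Return the argument list for an AptaSUITE export call."""
    cleaned = [opt.strip() for opt in options]
    unknown = [opt for opt in cleaned if opt not in VALID_OPTIONS]
    if unknown:
        raise ValueError(f"unknown export option(s): {unknown}")
    # Join with bare commas: the list must not contain any spaces.
    export_list = ",".join(cleaned)
    return ["java", "-jar", jar, "-export", export_list, "-config", config_path]


cmd = build_export_command(["pool", " structures"], "/path/to/configuration/file")
print(shlex.join(cmd))
```

The returned list can be passed directly to `subprocess.run`, which avoids shell quoting issues in the configuration path.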

Mandatory Configuration File Parameters

When calling the -export option with the cycles or clusters parameter, all selection cycles specified by the SelectionCycle.name entries are exported to disk by default. This behavior can be controlled with the following configuration option:

# Comma separated, specifies the selection rounds to be exported. The cycle 
# identifiers must coincide with the names chosen in SelectionCycle.name
Export.Cycles = Round0, Round5, Round8

Furthermore, when exporting the clusters, the minimum number of members a cluster must contain in order to be exported can also be specified:

# The minimum number of members a cluster must contain in order to be exported
Export.MinimalClusterSize = 1 

Here, the definition of 'size' depends on the user's choice and can be controlled through the Export.ClusterFilterCriteria parameter:

# Defines by which criterion Export.MinimalClusterSize should filter clusters.
# Current criteria are [ClusterDiversity, ClusterSize], default is ClusterSize
# ClusterDiversity: measures the total number of unique sequences in a cluster.
# ClusterSize: measures the sum of aptamer cardinalities over all cluster members. 
Export.ClusterFilterCriteria = ClusterSize
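The difference between the two criteria can be illustrated with a small sketch. It assumes a cluster is represented as a mapping from aptamer sequence to its count, and that a cluster is kept when its measure is greater than or equal to Export.MinimalClusterSize; both the representation and the function names are illustrative assumptions, not AptaSUITE code.

```python
def cluster_diversity(members):
    """ClusterDiversity: number of unique sequences in the cluster."""
    return len(members)


def cluster_size(members):
    """ClusterSize: sum of aptamer cardinalities over all cluster members."""
    return sum(members.values())


def passes_filter(members, minimal_cluster_size, criteria="ClusterSize"):
    """Return True if the cluster would be kept under the chosen criterion.

    Assumption: the threshold is inclusive (measure >= minimal size).
    """
    if criteria == "ClusterSize":
        measure = cluster_size(members)
    elif criteria == "ClusterDiversity":
        measure = cluster_diversity(members)
    else:
        raise ValueError(f"unknown criteria: {criteria}")
    return measure >= minimal_cluster_size


# A cluster with two unique sequences sequenced six times in total.
example = {"ACGTACGT": 5, "ACGTACGA": 1}
print(cluster_diversity(example), cluster_size(example))
print(passes_filter(example, 3, "ClusterDiversity"), passes_filter(example, 3))
```

With a threshold of 3, this example cluster is kept under ClusterSize but dropped under ClusterDiversity, which is why the choice of criterion matters for pools dominated by a few highly abundant sequences.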

Default Parameters

By default, all exported data will be written to disk in fastq format (where applicable), including the primer regions, and in gzip-compressed form. These defaults can be changed with the following parameters:

# Whether the resulting files should be gzip compressed or not
Export.compress = true 

# The output format for nucleotide data [fastq, fasta, raw]
Export.SequenceFormat = fastq 

# If false, the 5' and 3' primers will not be exported
Export.IncludePrimerRegions = true 

In what follows, the exact formats in which the individual options are exported are explained in detail.

Exporting the Pool

When AptaSUITE is called with the -export pool option, every unique species of the selection will be exported together with its cardinality (count or frequency) in each selection cycle. The sequences are sorted by the sum of each row. Furthermore, the export can be controlled using the following configuration file parameters:

# The format in which the cardinalities of the aptamers should be exported
# for each selection cycle. Current options are:
# * count       : exports the raw counts of each aptamer in the corresponding selection cycle
# * frequencies : exports the normalized frequencies, i.e. the raw count divided by the pool size of the particular cycle 
Export.PoolCardinalityFormat = frequencies
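As a rough illustration of the two cardinality formats, the sketch below converts raw counts to frequencies and orders the rows by their sum. It assumes that the pool size of a cycle is the total count of all aptamers in that cycle and that rows are sorted in descending order; neither detail is stated on this page, and the function names are hypothetical.

```python
def to_frequencies(counts):
    """Convert {aptamer: [count per cycle]} into normalized frequencies.

    Assumption: the pool size of a cycle is the sum of all counts in it.
    """
    n_cycles = len(next(iter(counts.values())))
    pool_sizes = [sum(row[i] for row in counts.values()) for i in range(n_cycles)]
    return {
        aptamer: [c / size if size else 0.0 for c, size in zip(row, pool_sizes)]
        for aptamer, row in counts.items()
    }


def sort_by_row_sum(table):
    """Order the rows by the sum of their values (descending is an assumption)."""
    return sorted(table.items(), key=lambda item: sum(item[1]), reverse=True)


# Two aptamers observed in two selection cycles.
raw_counts = {"ACGT": [2, 6], "GGCC": [8, 4]}
for aptamer, row in sort_by_row_sum(to_frequencies(raw_counts)):
    print(aptamer, row)
```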

Exporting Selection Cycles

Exporting the selection cycles will result in a file with the name SelectionCycle.name Export.SequenceFormat [.gz] in which all aptamers of that round are written in the specified format. Each individual species is written as many times as its cardinality in that pool. Headers are constructed as follows: @AptaSuite_APTAMERID SelectionCycle.name length=APTAMERSIZE.
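The repetition rule and the header layout can be sketched as follows. This is an illustration only: it assumes APTAMERSIZE is the length of the exported sequence, and it produces just the header and sequence lines, leaving out the format-specific parts (such as fastq quality lines). The function name and the input layout are hypothetical.

```python
def cycle_records(cycle_name, aptamers):
    """Yield (header, sequence) pairs, one per sequenced copy of each species.

    `aptamers` maps an aptamer id to a (sequence, count) tuple.
    Assumption: APTAMERSIZE is the length of the exported sequence.
    """
    for aptamer_id, (sequence, count) in aptamers.items():
        header = f"@AptaSuite_{aptamer_id} {cycle_name} length={len(sequence)}"
        # A species with cardinality n appears n times in the file.
        for _ in range(count):
            yield header, sequence


for header, sequence in cycle_records("Round5", {7: ("ACGTAC", 2)}):
    print(header)
    print(sequence)
```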


Exporting the Cluster Information

For each selection cycle, one file is created which contains all clusters as identified by AptaCLUSTER. Clusters are sorted in descending order according to the criterion defined in Export.ClusterFilterCriteria, and members within each cluster are sorted by count.

The format of the resulting file is as follows:

>>Cluster_ID Cluster_Size_According_To_Criteria
>Aptamer_ID
APTAMERSEQUENCE APTAMERCOUNT
>Aptamer_ID
APTAMERSEQUENCE APTAMERCOUNT
...
>Aptamer_ID
APTAMERSEQUENCE APTAMERCOUNT

>>Cluster_ID Cluster_Size_According_To_Criteria
>Aptamer_ID
APTAMERSEQUENCE APTAMERCOUNT
>Aptamer_ID
APTAMERSEQUENCE APTAMERCOUNT
...
>Aptamer_ID
APTAMERSEQUENCE APTAMERCOUNT

...

>>Cluster_ID Cluster_Size_According_To_Criteria
>Aptamer_ID
APTAMERSEQUENCE APTAMERCOUNT
>Aptamer_ID
APTAMERSEQUENCE APTAMERCOUNT
...
>Aptamer_ID
APTAMERSEQUENCE APTAMERCOUNT
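A file in the layout above can be read back with a few lines of Python. The sketch below is an unofficial reader, assuming that fields are separated by whitespace, that the cluster size and aptamer count are integers, and that the size field on a cluster header may be absent; the function name `parse_clusters` is hypothetical.

```python
def parse_clusters(lines):
    """Parse cluster export lines into a list of cluster dictionaries."""
    clusters = []
    current = None
    aptamer_id = None
    for raw in lines:
        line = raw.strip()
        if not line or line == "...":
            continue
        if line.startswith(">>"):
            # Cluster header: id, optionally followed by the cluster size.
            fields = line[2:].split()
            size = int(fields[1]) if len(fields) > 1 else None
            current = {"id": fields[0], "size": size, "members": []}
            clusters.append(current)
        elif line.startswith(">"):
            # Aptamer header preceding its sequence line.
            aptamer_id = line[1:]
        else:
            if current is None:
                raise ValueError("sequence line found before any cluster header")
            sequence, count = line.split()
            current["members"].append((aptamer_id, sequence, int(count)))
    return clusters


sample = """>>0 9
>AptaSuite_12
ACGTACGT 6
>AptaSuite_40
ACGTACGA 3

>>1 2
>AptaSuite_3
GGCCGGCC 2
""".splitlines()

for cluster in parse_clusters(sample):
    print(cluster["id"], cluster["size"], len(cluster["members"]))
```

For gzip-compressed exports, the lines can be obtained with `gzip.open(path, "rt")` from the standard library instead of a plain `open`.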

Exporting Structure Information

Structure information will be exported for all species in the selection in the following format:

>AptaSuite_APTAMERID
APTAMER_SEQUENCE
HAIRPIN PROBABILITIES (one per nucleotide in aptamer_sequence)
BULGE LOOP PROBABILITIES (one per nucleotide in aptamer_sequence)
INNER LOOP PROBABILITIES (one per nucleotide in aptamer_sequence)
MULTI LOOP PROBABILITIES (one per nucleotide in aptamer_sequence)
DANGLING END PROBABILITIES (one per nucleotide in aptamer_sequence)
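The seven-line record layout above lends itself to a simple reader. The following sketch is not part of AptaSUITE and rests on two assumptions that this page does not confirm: each record occupies exactly seven non-empty lines, and the probabilities on each line are separated by whitespace. The names `CONTEXTS` and `parse_structures` are hypothetical.

```python
# Structural contexts in the order they appear in each record.
CONTEXTS = ("hairpin", "bulge_loop", "inner_loop", "multi_loop", "dangling_end")


def parse_structures(lines):
    """Parse structure export lines into a list of per-aptamer records."""
    rows = [line.strip() for line in lines if line.strip()]
    records = []
    for start in range(0, len(rows), 7):
        block = rows[start:start + 7]
        if len(block) < 7 or not block[0].startswith(">"):
            raise ValueError(f"malformed record starting at row {start}")
        aptamer_id, sequence = block[0][1:], block[1]
        profile = {}
        for name, row in zip(CONTEXTS, block[2:]):
            values = [float(v) for v in row.split()]
            # One probability is expected per nucleotide.
            if len(values) != len(sequence):
                raise ValueError(f"{aptamer_id}: {name} has {len(values)} values")
            profile[name] = values
        records.append({"id": aptamer_id, "sequence": sequence, "profile": profile})
    return records


sample = """>AptaSuite_1
ACG
0.1 0.2 0.3
0.0 0.0 0.1
0.2 0.1 0.0
0.0 0.3 0.2
0.7 0.4 0.4
""".splitlines()

record = parse_structures(sample)[0]
print(record["id"], record["profile"]["hairpin"])
```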