Clustering aptamers by sequence similarity
AptaSUITE implements a Java version of AptaCluster, which allows efficient clustering of whole HT-SELEX aptamer pools, a task that traditional clustering algorithms cannot accomplish because of the enormous size of such datasets.
Our approach is centered around a randomized dimensionality reduction technique known as locality-sensitive hashing (LSH). In the first step, the pool is reduced to non-redundant species and their corresponding frequency counts. In the second step, a user-defined number of randomized locality-sensitive hash functions is applied to the data set in order to distinguish sequence pairs that are potentially similar from those that are, with very high probability, not similar. Each function selects a small number of nucleotide positions from each aptamer and uses the substring formed by concatenating these bases as input for the hashing procedure. Hence, aptamers with highly similar primary structure are likely to fall into the same group, whereas dissimilar sequences rarely produce identical hash values. In the third step, the actual clustering step, precise sequence distances are computed between aptamers of identical hash value, while the distances between aptamers never encountered in the same group are set to infinity.
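The three steps described above can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the AptaSUITE implementation: the function names (`lsh_candidate_groups`, `hamming`, `candidate_distances`) are hypothetical, the sampled substring itself is used as the hash key, and a plain Hamming distance stands in for the exact distance computation.

```python
import math
import random
from collections import Counter, defaultdict
from itertools import combinations


def lsh_candidate_groups(reads, lsh_dimension, iterations, seed=0):
    """Group equal-length sequences that agree on randomly sampled positions."""
    # Step 1: collapse the pool into unique species with frequency counts.
    species = Counter(reads)
    length = len(next(iter(species)))
    rng = random.Random(seed)
    groups = []
    for _ in range(iterations):
        # Step 2: one hash function = one random subset of positions.
        positions = sorted(rng.sample(range(length), lsh_dimension))
        buckets = defaultdict(list)
        for seq in species:
            key = "".join(seq[i] for i in positions)
            buckets[key].append(seq)
        # Only buckets holding at least two species yield candidate pairs.
        groups.extend(b for b in buckets.values() if len(b) > 1)
    return species, groups


def hamming(a, b):
    """Number of mismatching positions (stand-in for the exact distance)."""
    return sum(x != y for x, y in zip(a, b))


def candidate_distances(groups):
    """Step 3: exact distances only for pairs that shared a bucket."""
    distances = {}
    for group in groups:
        for a, b in combinations(sorted(group), 2):
            distances[(a, b)] = hamming(a, b)
    return distances


reads = ["AAAA", "AAAA", "AAAT", "CCCC"]
species, groups = lsh_candidate_groups(reads, lsh_dimension=2, iterations=3)
distances = candidate_distances(groups)
# Pairs that never shared a bucket are treated as infinitely far apart.
far = distances.get(("AAAA", "CCCC"), math.inf)
```

Because only pairs inside a shared bucket are ever compared, the number of exact distance computations depends on bucket sizes rather than on the square of the pool size.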
Use the Aptamer Family Analysis tab to cluster the data into sets of aptamers related to each other by sequence similarity.
Assuming that data was previously imported (or simulated), AptaCluster can be called with the following command within AptaSUITE:
java -jar aptasuite.jar -cluster -config /path/to/configuration/file
The minimal information required for AptaCLUSTER to run is the randomized region size of the aptamers to be clustered and the locality-sensitive hashing dimension, which defines the within-cluster similarity.
# Defines the aptamers to be clustered. All species in the database which differ in size
# from the specified value will be excluded from the clustering process.
Aptacluster.RandomizedRegionSize = 40
# The size of the locality sensitive hash dimension. It defines how many indices from the
# randomized region will be sampled during the process. Must be smaller than or equal to
# Aptacluster.RandomizedRegionSize
Aptacluster.LSHDimension = 30
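To get a feel for how the hash dimension shapes within-cluster similarity, the sketch below computes the chance that two sequences land in the same group. It assumes that positions are sampled uniformly without replacement and that the two sequences differ only by substitutions; neither assumption is confirmed by the text above, and the function names are hypothetical. Under these assumptions, two sequences of length n that differ at d positions agree on all k sampled positions with probability C(n-d, k) / C(n, k), and repeating the hashing for several independent rounds raises the chance that they meet at least once.

```python
from math import comb


def collision_probability(region_size, lsh_dimension, mismatches):
    """Chance that none of the sampled positions hits a mismatch."""
    return comb(region_size - mismatches, lsh_dimension) / comb(region_size, lsh_dimension)


def detection_probability(region_size, lsh_dimension, mismatches, rounds):
    """Chance that the pair shares a group in at least one of the rounds."""
    p = collision_probability(region_size, lsh_dimension, mismatches)
    return 1.0 - (1.0 - p) ** rounds


# Compare a few hash dimensions for a 40-nt region and 2 mismatches.
for k in (10, 20, 30):
    print(k, collision_probability(40, k, 2), detection_probability(40, k, 2, rounds=5))
```

A larger hash dimension makes the grouping stricter: fewer mismatching pairs collide, so the groups are smaller and more homogeneous, at the price of missing more distant relatives.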
By default, AptaCLUSTER performs 5 rounds of LSH and allows a maximal distance of 5 mutations between the seed sequence of a cluster and its additional members. For the exact distance computation, a kmer size of 3 is used. These parameters can be adjusted as follows:
# The maximal number of nucleotide differences between two sequences to be
# considered members of the same cluster
Aptacluster.EditDistance = 5
# The number of LSH iterations to be performed
Aptacluster.LSHIterations = 5
# The kmer size used for the distance calculations
Aptacluster.KmerSize = 3
# The Aptacluster.EditDistance value will be empirically converted from the edit distance
# space into the kmer distance space. This value controls the number of iterations
# to be performed for computing the kmer cutoff for cluster formation
Aptacluster.KmerCutoffIterations = 10000
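The empirical conversion from edit distance to kmer distance can be pictured as a small simulation. The sketch below is only one plausible reading of the comment above, not the procedure AptaSUITE uses: it assumes the kmer distance is the sum of absolute differences in kmer counts, models mutations as substitutions only (no insertions or deletions), and takes the mean over all iterations as the cutoff. All function names are hypothetical.

```python
import random
from collections import Counter


def kmer_counts(seq, k):
    """Count every overlapping kmer of length k in seq."""
    return Counter(seq[i:i + k] for i in range(len(seq) - k + 1))


def kmer_distance(a, b, k):
    """Sum of absolute differences between the kmer counts of a and b."""
    ca, cb = kmer_counts(a, k), kmer_counts(b, k)
    return sum(abs(ca[m] - cb[m]) for m in ca.keys() | cb.keys())


def estimate_kmer_cutoff(region_size, edit_distance, k, iterations, seed=0):
    """Average kmer distance between random sequences and mutated copies."""
    rng = random.Random(seed)
    alphabet = "ACGT"
    total = 0
    for _ in range(iterations):
        original = [rng.choice(alphabet) for _ in range(region_size)]
        mutated = list(original)
        # Substitute exactly `edit_distance` distinct positions.
        for pos in rng.sample(range(region_size), edit_distance):
            mutated[pos] = rng.choice([c for c in alphabet if c != original[pos]])
        total += kmer_distance("".join(original), "".join(mutated), k)
    return total / iterations


# Values mirror the defaults listed above, with fewer iterations for speed.
cutoff = estimate_kmer_cutoff(region_size=40, edit_distance=5, k=3, iterations=1000)
print(cutoff)
```

More iterations give a more stable estimate of the cutoff at the cost of a longer start-up phase, which is the trade-off a setting such as Aptacluster.KmerCutoffIterations would control in a scheme like this.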