Skip to content

refresh-bio/clusty

Folders and files

NameName
Last commit message
Last commit date

Latest commit

 

History

18 Commits
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 

Repository files navigation

Clusty

Bioconda downloads GitHub downloads Build and tests License

x86-64 ARM Apple M Windows Linux macOS

Clusty is a tool for large-scale clustering. By using sparse distance matrices it allows clustering data sets with millions of objects.

Quick start

git clone --recurse-submodules https://github.com/refresh-bio/clusty
cd clusty
gmake -j

cd ./test

# Run single linkage clustering on the pairwise similarities stored in ictv.ani file, output cluster identifiers
# (two first columns are used as sequence identifiers, third one is assumed to store similarities).
../bin/clusty --algo single --similarity ictv.ani ictv.single

# Run uclust clustering accepting pairwise connectios with values greater or equal 0.70 in the ani column, output cluster representatives.
../bin/clusty --algo uclust --similarity --min ani 0.70 --out-representatives ictv.ani ictv.uclust.70

# Run CD-HIT clustering accepting pairwise connectios with values greater or equal 0.95 in the ani column, output cluster identifiers
# (use id2 and id2 columns as object identifiers and ani column as the similarity).
../bin/clusty --algo cd-hit --similarity --min ani 0.95 --id-cols id2 id1 --distance-col ani vir61.ani vir61.single.95

# Run complete linkage clustering, consider all objects from ictv.list file (including those without pairwise connections).
../bin/clusty --algo complete --objects-file ictv.list --similarity ictv.ani ictv.complete

Installation

Clusty comes with a set of precompiled binaries for Windows, Linux, and macOS. They can be found under Releases tab. The software is also available on Bioconda:

conda install -c bioconda clusty

For detailed instructions how to set up Bioconda, please refer to the Bioconda manual.

The package can be built from the sources distributed as:

  • Visual Studio 2022 solution for Windows,
  • GNU Make project for Linux and macOS (gmake 4.3 and gcc/g++ 10 or newer required)

To compile Clusty under Linux/macOS please run:

gmake -j 

Leiden algorithm

Clusty provides igraph's implementation of the Leiden algorithm. Precompiled binaries as well as bioconda distributions include Leiden algorithm. However, as igraph requires several external dependencies (CMake 3.18, Flex, Bison), it is by default not linked to the Clusty software. To install dependencies under Debian/Ubuntu linux use the following command:

sudo apt-get install cmake flex bison

Then, one needs to build the package with an additional option enabled:

gmake -j LEIDEN=true

Under Windows, Clusty is by default linked against igraph and it requires CMake as the only system dependency. After installing it (https://cmake.org) a user can run build_igraph.bat batch script which downloads Flex and Bison binaries to the appropriate locations and then builds igraph. After that it is possible to build Clusty using Visual Studio (the solution is located in ./src/clusty.sln).

Usage

clusty [options] <distances> <assignments>

Parameters:

  • <distances> - input TSV/CSV table with pairwise distances
  • <assignments> - output TSV/CSV file with cluster assignments

Options:

  • --objects-file <string> - optional TSV/CSV file with object names in the first column sorted decreasingly w.r.t. representativness

  • --algo <single | complete | uclust | set-cover | cd-hit | leiden> - clustering algorithm:

    • single - single linkage (connected component, MMseqs mode 1)
    • complete - complete linkage
    • uclust - UCLUST
    • set-cover - greedy set cover (MMseqs mode 0)
    • cd-hit - CD-HIT (greedy incremental; MMseqs mode 2)
    • leiden - Leiden algorithm
  • --id-cols <column-name1> <column-name2> - names of columns with object identifiers (default: two first columns)

  • --distance-col <column-name> - name of the column with pairwise distances (or similarities; default: third column)

  • --similarity - use similarity instead of distance (has to be in [0,1] interval; default: false)

  • --percent-similarity - use percent similarity (has to be in [0,100] interval; overrides --similarity flag, default: false)

  • --min <column-name> <real-threshold> - accept pairwise connections with values greater or equal a given threshold in a specified column

  • --max <column-name> <real-threshold> - accept only pairwise connections with values lower or equal a given threshold in a specified column

  • --numeric-ids - use when sequences in the distances file are represented by numbers (can be mapped to string ids by the object file)

  • --out-representatives - output representative objects for each cluster instead of cluster numerical identifiers

  • --out-csv -- output a CSV table instead of a default TSV

  • -t - number of threads (default: 4)

Leiden algorithm options:

  • --leiden-resolution - resolution parameter for Leiden algorithm (default: 0.7)
  • --leiden-beta - beta parameter for Leiden algorithm (default: 0.01)
  • --leiden-iterations - number of interations for Leiden algorithm (default: 2)

Examples

Basic use case

The minimum input requirement is a TSV/CSV table with pairwise distances between objects (or similarities, if --use-similarity flag is used). By default, identifiers are assumed to be in the two first columns while distances are expected in the third one. Lack of a distance for a given pair of objects is translated to infinite distance. The example input table is given below:

name1	name2	ani
xxx	xx	0.93
aaa	aa	0.94
aaa	a	0.92
xx	x	0.94
bb	b	0.71
aa	a	0.89
b	bb	0.99

If the table is organized differenty, one can specify columns with ids and distances by using --id-cols and --distance-col parameters. The algorithm produces as a result a TSV table (or CSV if --out-csv flag is specified) with object identifiers followed by 0-based numerical identifiers of clusters. The clusters are ordered decreasingly by size with objects within a cluster outputed with decreasing representativeness (by default: in a lexigraphical order).

object	cluster
x	0
xx	0
xxx	0
a	1
aa	1
aaa	1
b	2
bb	2

Cluster representatives

Instead of numerical identifiers, the package can output a representative object for every cluster using --out-representatives flag. The representative is the first object on the list that was assigned to the cluster:

object	cluster
x	x
xx	x
xxx	x
a	a
aa	a
aaa	a
b	b
bb	b

Objects file

An optional TSV/CSV file specified by --objects-file parameter contains a complete list of analyzed objects. This file can be useful when distance table does not contain all the object (e.g., due to some filtering). Additionally, this file overrides representativeness of object in the ouput table. Thus, it affects ordering of objects within a cluster and their representative when --out-representatives flag is used. For instance, if one specifies the following objects file:

object
aaa
aa
a
bb
b
c
d
e
f
g
xxx
xx
x

the following output will be produced:

object	cluster
aaa	0
aa	0
a	0
xxx	1
xx	1
x	1
bb	2
b	2
c	3
d	4
e	5
f	6
g	7

or with --out-representatives flag:

object	cluster
aaa	aaa
aa	aaa
a	aaa
xxx	xxx
xx	xxx
x	xxx
bb	bb
b	bb
c	c
d	d
e	e
f	f
g	g

Numerical identifiers

Clusty also supports numerical identifiers (--numeric-ids flag), e.g., as in table below. Note: the identifiers are assumed to be ordinal numbers (they do not have to be continous, though - gaps are allowed) rather than numerical hashes. This is because the software allocates memory area proportional to the size of the largest identifier.

id1	id2	ani
10	11	0.93
0	1	0.94
0	2	0.92
11	12	0.94
3	4	0.71
1	2	0.89
4	3	0.99
5	6	0.33

The clustering produces the following output (lowest object id translates to the highest representativeness):

object	cluster
10	0
11	0
12	0
0	1
1	1
2	1
3	2
4	2

An external object file can also be used in this mode. It assumes, that the numerical identifier in the distance table translates to the 0-based index in the object file. The previously mentioned object file combined with --out-representatives flag produces the same table as the one generated without --numeric-ids switch:

object	cluster
aaa	aaa
aa	aaa
a	aaa
xxx	xxx
xx	xxx
x	xxx
bb	bb
b	bb
c	c
d	d
e	e
f	f
g	g

Algorithms

In the following section one can find detailed information on clustering algorithms in Clusty, with n representing the number of objects (vertices) and e the number of distances (edges) in the data set (graph).

clustering-steps

Citation

Zielezinski A, Gudyś A, Barylski J, Siminski K, Rozwalak P, Dutilh BE, Deorowicz S. Ultrafast and accurate sequence alignment and clustering of viral genomes. bioRxiv [doi:10.1101/2024.06.27.601020].