Temporal And diveRsity Distribution Sampler (TARDiS) for Phylogenetics
Download TARDiS and make sure the dependencies are installed. For the quickest start, just run our example:
[path/to/tardis]/./tardis -s
and the TARDiS explorer GUI will open in your default browser. Retrieve the example data in data/example and click on Run Tardis.
TARDiS subsamples genetic data sets optimizing genetic diversity and temporal sampling according to parameters set by the users. The optimization is driven by a genetic algorithm.
A paper describing TARDiS principles and application is
- Bioinformatics 2021. Simone Marini, Carla Mavian, Alberto Riva, Mattia Prosperi, Marco Salemi, Brittany Rife Magalis, Optimizing viral genome subsampling by genetic diversity and temporal distribution (TARDiS) for phylogenetics, Bioinformatics, 2021, btab725, https://doi.org/10.1093/bioinformatics/btab725
- Preprint on bioRxiv
In an early, simplified version, TARDiS principles were also applied in:
To run TARDiS, you will need the following inputs:
- A genetic sequence alignment in fasta format (example: data/example/aln.fa)
- A distance matrix, i.e., a square matrix storing the genetic distances for your genetic sequence pairs. It can be a csv or rds file. Rows and columns should be named as the fasta headers (example: data/example/jc.distance.precalc.csv). Note that this function works for aligned fasta files
- A metadata file in csv format. This file should include two columns: Accession.ID, with the fasta headers, and Collection.date, with the sampling date in the dd/mm/YYYY format (example: data/example/metadata.csv)
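The three inputs have to agree with each other: the same names must appear in the fasta headers, in the rows and columns of the distance matrix, and in the Accession.ID column, and dates must follow dd/mm/YYYY. The following is a minimal sketch of such a consistency check, not part of TARDiS. The function names are hypothetical, and the csv layout of the distance matrix (column names in the first row, row names in the first column) is an assumption.

```python
import csv
import io
from datetime import datetime


def fasta_headers(fasta_text):
    """Return the header of each fasta record (the text after '>')."""
    return [line[1:].strip() for line in fasta_text.splitlines() if line.startswith(">")]


def check_inputs(fasta_text, distance_csv_text, metadata_csv_text):
    """Return a list of problems; an empty list means the inputs agree."""
    problems = []
    headers = fasta_headers(fasta_text)

    # Assumed layout: first row = column names, first column = row names.
    rows = list(csv.reader(io.StringIO(distance_csv_text)))
    col_names = rows[0][1:]
    row_names = [row[0] for row in rows[1:]]
    if len(row_names) != len(col_names) or any(len(r) != len(rows[0]) for r in rows[1:]):
        problems.append("distance matrix is not square")
    if set(col_names) != set(headers) or set(row_names) != set(headers):
        problems.append("distance matrix names do not match the fasta headers")

    reader = csv.DictReader(io.StringIO(metadata_csv_text))
    required = ("Accession.ID", "Collection.date")
    missing = [c for c in required if c not in (reader.fieldnames or [])]
    if missing:
        problems.append("metadata is missing column(s): " + ", ".join(missing))
        return problems
    for row in reader:
        if row["Accession.ID"] not in headers:
            problems.append("unknown accession in metadata: " + row["Accession.ID"])
        try:
            datetime.strptime(row["Collection.date"], "%d/%m/%Y")
        except ValueError:
            problems.append("bad date for " + row["Accession.ID"] + ": " + row["Collection.date"])
    return problems


# Toy data invented for this sketch (not the files in data/example).
FASTA = ">seq1\nACGT\n>seq2\nACGA\n"
DIST = ",seq1,seq2\nseq1,0,0.3\nseq2,0.3,0\n"
META_OK = "Accession.ID,Collection.date\nseq1,01/03/2020\nseq2,15/04/2020\n"
META_BAD = "Accession.ID,Collection.date\nseq1,2020-03-01\nseq2,15/04/2020\n"

print(check_inputs(FASTA, DIST, META_OK))
print(check_inputs(FASTA, DIST, META_BAD))
```

Running a check like this before a long run can catch a mismatched header or an ISO-formatted date early.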
For experimenting with TARDiS, you can run the GUI as explained above. You can retrieve example data in data/example.
Important: this GUI is intended for experimenting with small data sets and small GA populations. All the GUI outputs are stored as shiny_local/output/jc.distance.precalc.rds. If you don't have a distance file, the Jukes-Cantor distance calculates the genetic distance. This distance file is to be used as part of the TARDiS input. For larger data sets, please use the command line version instead.
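For reference, the Jukes-Cantor (JC69) distance between two aligned sequences is d = -3/4 ln(1 - 4p/3), where p is the proportion of sites that differ. The sketch below illustrates that formula only; it is not the TARDiS implementation, the function name is hypothetical, and skipping sites with gaps or ambiguous bases is an assumption of this example.

```python
import math


def jukes_cantor_distance(seq_a, seq_b):
    """JC69 distance between two aligned nucleotide sequences of equal length."""
    if len(seq_a) != len(seq_b):
        raise ValueError("sequences must be aligned to the same length")
    valid = set("ACGT")
    # Assumption: sites where either sequence has a gap or an ambiguous base are skipped.
    pairs = [(a, b) for a, b in zip(seq_a.upper(), seq_b.upper()) if a in valid and b in valid]
    if not pairs:
        raise ValueError("no comparable sites")
    p = sum(a != b for a, b in pairs) / len(pairs)
    if p == 0:
        return 0.0
    if p >= 0.75:
        return math.inf  # the JC69 correction is undefined at saturation
    return -0.75 * math.log(1.0 - 4.0 * p / 3.0)


print(jukes_cantor_distance("ACGT", "ACGT"))
print(jukes_cantor_distance("ACGT", "ACGA"))
```

Computing this for every pair of sequences fills the square matrix described in the inputs above.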
Command line (requires Nextflow)
You can run TARDiS from the command line as follows:
- Run [path/to/tardis]/tardis [path/to/tardis]/example/example.config
- Profit! Your results are in [current/directory]/output/example
Run tardis -h to display all available options. Note that, to ensure reproducibility of the results, the user can specify the randomization seeds to be used in data/seeds.txt.
You must specify the parameters for your run in a config file. Use the format of the following example:
params.data_set = "example"
params.nsamples = 4
params.gensize = 100
params.nbatches = 1
params.ncores = 2
params.ngenerations = 10
params.fracnew = 0.14
params.fracevolved = 0.85
params.fracelite = 0.01
params.wdiv = 1
params.wtem = 1
params.distopt = "max"
Settable parameters in the config file are:
- params.data_set: name of the data set
- params.nsamples: number of genomes in the subsample
- params.gensize: size of the generation per batch (see below; default 1 batch)
- params.nbatches: number of batches (see below)
- params.ncores: number of cores for parallel computing
- params.ngenerations: number of generations
- params.fracnew: fraction of newly generated individuals per generation
- params.fracevolved: fraction of evolved individuals per generation
- params.fracelite: fraction of elite individuals per generation; elite individuals are the ones with the highest fitness, to be copied as they are in the new generation
- params.wdiv: weight of the genetic diversity
- params.wtem: weight of the time distribution
- params.distopt: target of genetic diversity optimization: "max", "median", or "mean" of the initial population
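To make the three fraction parameters concrete, here is a minimal sketch of how a generation of params.gensize individuals could be split into elite, evolved, and new individuals. It is an illustration only: the function name is hypothetical, and both the requirement that the fractions sum to 1 (they do in the example config) and the rounding rule are assumptions, not documented TARDiS behavior.

```python
import math


def generation_composition(gensize, fracnew, fracevolved, fracelite):
    """Split one generation into elite, evolved, and new individuals."""
    # Assumption of this sketch: the three fractions describe the whole generation.
    if not math.isclose(fracnew + fracevolved + fracelite, 1.0, abs_tol=1e-9):
        raise ValueError("fracnew + fracevolved + fracelite should sum to 1")
    n_elite = round(gensize * fracelite)
    n_new = round(gensize * fracnew)
    # The remainder goes to evolved individuals so the counts always sum to gensize.
    n_evolved = gensize - n_elite - n_new
    return {"elite": n_elite, "evolved": n_evolved, "new": n_new}


# Values from the example config file.
print(generation_composition(gensize=100, fracnew=0.14, fracevolved=0.85, fracelite=0.01))
# {'elite': 1, 'evolved': 85, 'new': 14}
```

With the example values, each generation of 100 would carry over 1 elite individual, evolve 85, and draw 14 new ones.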
The default Nextflow execution profile (option -p) is "local", which uses the local machine directly. In an HPC environment, you can use the "small" (5 GB), "medium" (30 GB), or "large" (128 GB) profiles, which assume the presence of the SLURM scheduler.
TARDiS can be run in group mode (option -g) from the command line. This is useful when the user has a large genomes file, with genomes pertaining to different groups, to be subsampled independently and combined into a single output at the end. For example, groups could correspond to different geographical regions. In this case, the config file will be in comma-delimited format, with one row for each group. Note that for group mode config files:
- Column names are parameter names
- A special group column, called group, needs to be present, with a different value in each line
- The execution profile can be specified in the group-mode config file in the "profile" column (this is not possible in a single-run config file)
- NAs are accepted (default values will be used)
- Groups that are NOT listed in the group-mode config file will be included in the final output as a whole, without being subsampled
- The process will also create separate folders, one per group, with the sliced group files: alignment, distance, and metadata.
Also note that the metadata file needs to include a group column to identify the group of each genome.
A working toy example is provided in data/example_group. There are three groups, a, b, and c. While groups a and b will be subsampled, the whole group c (absent from the config file) will be included in the output without being subsampled.
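The rule in the toy example (groups listed in the config are subsampled, all others pass through whole) can be sketched as follows. This is an illustration, not TARDiS code: the function name is hypothetical, the csv contents are invented for the example, and the nsamples column name is an assumption about how parameter names appear as column names. Only the group column in both files comes from the description above.

```python
import csv
import io


def split_groups(config_csv_text, metadata_csv_text):
    """Return (groups to subsample, groups kept whole) as two sorted lists."""
    config_groups = [row["group"] for row in csv.DictReader(io.StringIO(config_csv_text))]
    if len(config_groups) != len(set(config_groups)):
        raise ValueError("each line of the group config needs a different group value")
    metadata_groups = {row["group"] for row in csv.DictReader(io.StringIO(metadata_csv_text))}
    subsampled = sorted(g for g in metadata_groups if g in config_groups)
    kept_whole = sorted(metadata_groups - set(config_groups))
    return subsampled, kept_whole


# Invented toy content; the real files live in data/example_group.
GROUP_CONFIG = "group,nsamples,profile\na,4,NA\nb,4,NA\n"
GROUP_METADATA = (
    "Accession.ID,Collection.date,group\n"
    "s1,01/01/2020,a\n"
    "s2,02/01/2020,b\n"
    "s3,03/01/2020,c\n"
    "s4,04/01/2020,a\n"
)

print(split_groups(GROUP_CONFIG, GROUP_METADATA))
# (['a', 'b'], ['c'])
```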
To run the aforementioned group example, run
./tardis -g data/example_group/parameters.group.csv -m data/example_group/metadata.group.csv -a data/example_group/aln.group.fa
To ease the calculation burden for large populations, data can be split into batches. Remember that params.gensize defines the number of individuals per batch, while params.nbatches defines the number of batches. So to have a 500K population split into 50 batches of 10K individuals each, you can set params.gensize = 10000 and params.nbatches = 50. Note that this will submit 50 jobs (1 per batch) for each generation to your workload manager. When all jobs in the first generation are complete, the 50 jobs for the next generation will be submitted, and so on.
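The batch arithmetic can be written out as a short sketch. The function name is hypothetical, and the total counts batch jobs only (one per batch per generation, as described above); any other jobs the pipeline may submit are not modeled. The value of 10 generations is taken from the example config.

```python
def batch_plan(gensize, nbatches, ngenerations):
    """Summarize population size and batch job counts for a run."""
    return {
        "population": gensize * nbatches,          # individuals across all batches
        "jobs_per_generation": nbatches,           # one job per batch
        "total_batch_jobs": nbatches * ngenerations,
    }


print(batch_plan(gensize=10000, nbatches=50, ngenerations=10))
# {'population': 500000, 'jobs_per_generation': 50, 'total_batch_jobs': 500}
```

Because each generation waits for the previous one to finish, the number of generations multiplies the wall-clock time as well as the job count.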
To run TARDiS, please install
- R >= 3.6.1
- Python >= 3.7 (works with Python 2.7 as well)
- optparse >= 1.6.6
- doRNG >= 1.8.2
- dplyr >= 1.0.0
- ggplot2 > 3.3.1
- gridExtra >= 2.3
For the GUI/explorer version, please install
For the local/hpc command line version, please install
- Nextflow >= 20.01.0 and <= 22.10.6 Important: Due to changes in the newer versions of Nextflow, TARDiS will not work with versions 23+.
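Since the supported Nextflow range is bounded on both sides, it can help to check the installed version before launching a run. The sketch below only compares version strings; the function names are hypothetical, and it assumes plain numeric versions such as 22.10.6 (suffixes like "-edge" are not handled). How the version string is obtained is left to the reader.

```python
MIN_NEXTFLOW = (20, 1, 0)    # 20.01.0
MAX_NEXTFLOW = (22, 10, 6)   # 22.10.6


def parse_version(text):
    """Turn a dotted numeric version such as '22.10.6' into a tuple of ints."""
    return tuple(int(part) for part in text.strip().split("."))


def nextflow_supported(version_text):
    """True if the version lies in the range stated in the requirements above."""
    return MIN_NEXTFLOW <= parse_version(version_text) <= MAX_NEXTFLOW


for version in ("20.01.0", "22.10.6", "23.04.1"):
    print(version, nextflow_supported(version))
```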
To calculate Jukes-Cantor distances
TARDiS has been successfully used on Linux Ubuntu (local, command line), Chrome (GUI), and SLURM (HPC). Please let us know if you are using it on other platforms.
To know more about TARDiS principles and practical application, check out the TARDiS presentation at the VEME Workshop 2021.
TARDiS is free software: you can redistribute it and/or modify it under the terms of the GNU General Public License as published by the Free Software Foundation, either version 3 of the License, or (at your option) any later version.
This program is distributed in the hope that it will be useful, but WITHOUT ANY WARRANTY; without even the implied warranty of MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the GNU General Public License for more details.