-
Notifications
You must be signed in to change notification settings - Fork 187
Fast Parallel UniFrac
A common tool in microbial ecology studies involving many samples is to calculate the "UniFrac" distance between all pairs of samples, and then perform various analyses on the resulting distance matrix.
The phyloseq package includes a native R
implementation of both the original UniFrac algorithm, as well as the better, faster, cleaner "Fast UniFrac" algorithm. Both approaches arrive at the same result. There are also two very different types of UniFrac calculation:
Weighted UniFrac - which does take into account differences in abundance of species between samples, but takes longer to calculate; and
Unweighted UniFrac - which only considers the presence/absence of species between sample pairs.
Both can be useful, and share slightly different insight. Both weighted and unweighted UniFrac are included, and all UniFrac calculations have the option of running "in parallel" for faster results on computers that have multiple cores/processors available.
In phyloseq, all variants of the UniFrac distance can be called with the same generic function:
UniFrac(physeq, weighted=TRUE, normalized=TRUE, parallel=FALSE, fast=TRUE)
where physeq
is your phyloseq-class
experiment-level object, imported or constructed by the phyloseq package.
Here is an example taken from the UniFrac
documentation showing how to calculate it for the 3-sample "esophagus" dataset:
################################################################################
# Perform UniFrac on esophagus data
################################################################################
data("esophagus")
(y <- UniFrac(esophagus, TRUE))
UniFrac(esophagus, TRUE, FALSE)
UniFrac(esophagus, FALSE)
picante::unifrac(as(t(otuTable(esophagus)), "matrix"), tre(esophagus))
The following graphic shows the computation time required to calculate the distance matrix for a 21-sample experiment at varying numbers of species. As you can see, for datasets with less than 1000 species (or more accurately, "OTUs"), the choice of algorithm has only inconsequential effect on computation time, with all approaches returning a result in under 10 seconds. However as more and more species are included (more complex phylogenetic tree) in the dataset, the computation time increases exponentially; and at a much faster rate for variants of the original UniFrac algorithm. As pointed out in the article describing Fast UniFrac, there is much less difference in performance between unweighted/weighted Fast UniFrac, making weighted UniFrac distances much more accessible (than previously) for large datasets. Also note that the computation time for all UniFrac methods will increase combinatorically as the number of samples increases. You can quickly note the number of pairwise UniFrac calculations necessary with the following:
dim(combn(nsamples(physeq), 2))[2]
If you want to test the speed on your system/data, here is some example code for randomly subsampling your experiment:
# Define the number of species/samples in your randomly subsampled experiment
N <- 300 # species
M <- 20 # samples
# Randomly subset N species from the original
ex <- prune_species(sample(species.names(physeq), N), physeq)
# Randomly subset M samples from the original
ex <- prune_species(sample(sample.names(ex), M), ex)
# Pick a random root from the sub-sampled dataset
tree <- root(tre(ex), sample(species.names(ex), 1), resolve.root=TRUE)
OTU <- otuTable(ex)
# Rebuild to make sure you didn't screw up something.
ex <- phyloseq(OTU, tree)
# Calculate the UniFrac dist matrix and print the elapsed system time
system.time(UniFrac(ex, weighted=TRUE))["elapsed"]
Computation time required to calculate the distance matrix for a 21-sample experiment at varying numbers of species. For comparison, the performance of the original unweighted UniFrac algorithm as implemented in the "picante" package is also included.