PhyloPhlAn 3

PhyloPhlAn is an integrated pipeline for large-scale phylogenetic profiling of genomes and metagenomes.

The easiest way to understand how to use PhyloPhlAn in your analysis is probably to work through the examples in the PhyloPhlAn tutorial.

Installation

There are two installation methods available. We recommend the Conda-based one, which guarantees that all PhyloPhlAn dependencies are automatically satisfied.

Conda package [easy]

This requires a working Conda installation.

conda install -c bioconda phylophlan

Note 1: we recommend you install PhyloPhlAn in a new, dedicated environment so that all dependencies will be properly resolved by conda. This can be easily done with:

conda create -n "phylophlan" -c bioconda phylophlan=3.1.1

Note 2: to generate the four default configuration files after the installation, please execute:

phylophlan_write_default_configs.sh [output_folder]

Repository from GitHub [hard]

Step 1: Get PhyloPhlAn from the GitHub repository

This requires git.

git clone https://github.com/biobakery/phylophlan
cd phylophlan
python setup.py install

Step 2: Install the Dependencies and Tools necessary to run PhyloPhlAn

Test PhyloPhlAn installation

To verify that PhyloPhlAn is properly installed, you can execute the following command:

phylophlan --version

which should output:

PhyloPhlAn version 3.1.68 (6 March 2024)

Note: The above version number and date might be different according to the version you have installed.

Citation

If you use PhyloPhlAn, please cite the following paper:

Precise phylogenetic analysis of microbial isolates and genomes from metagenomes using PhyloPhlAn 3.0
Francesco Asnicar, Andrew Maltez Thomas, Francesco Beghini, Claudia Mengoni, Serena Manara, Paolo Manghi, Qiyun Zhu, Mattia Bolzan, Fabio Cumbo, Uyen May, Jon G. Sanders, Moreno Zolfo, Evguenia Kopylova, Edoardo Pasolli, Rob Knight, Siavash Mirarab, Curtis Huttenhower, and Nicola Segata
Nat Commun 11, 2500 (2020)
DOI: https://doi.org/10.1038/s41467-020-16366-7

Basic usage

phylophlan -i <input_folder> \
    -d <database> \
    --diversity <low-medium-high> \
    -f <configuration_file>

where:

<input_folder> is the folder containing your input genomes and/or proteomes; a detailed description is available in the Input Files section
<database> is the name of the database of markers to use; a detailed description is available in the Databases section
--diversity takes a value in {low, medium, high} and automatically tunes the analysis to the type of phylogeny to build; a detailed description is available in the Diversity section
<configuration_file> is the path to the configuration file needed to run PhyloPhlAn 3; a detailed description is available in the Configuration File section
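
For scripted runs, the basic invocation above can be assembled programmatically. The following is a minimal Python sketch and not part of PhyloPhlAn: the helper name `build_phylophlan_command` and the example folder and file names are hypothetical, and only the four options shown above are used. The sketch builds the argument list without executing it.

```python
import shlex

VALID_DIVERSITY = ("low", "medium", "high")


def build_phylophlan_command(input_folder, database, diversity, config_file):
    """Return the argument list for a basic phylophlan run (hypothetical helper)."""
    if diversity not in VALID_DIVERSITY:
        raise ValueError(
            f"--diversity must be one of {VALID_DIVERSITY}, got {diversity!r}"
        )
    return [
        "phylophlan",
        "-i", input_folder,
        "-d", database,
        "--diversity", diversity,
        "-f", config_file,
    ]


# Example values are placeholders, not files shipped with PhyloPhlAn.
cmd = build_phylophlan_command(
    "input_genomes", "phylophlan", "low", "supermatrix_aa.cfg"
)
print(shlex.join(cmd))
```

The resulting list can be handed to `subprocess.run(cmd, check=True)` once PhyloPhlAn is installed.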

Input Files

PhyloPhlAn 3 takes FASTA files as input, optionally compressed with Gzip (.gz) or Bzip2 (.bz2). Inputs can be genomes, proteomes, or a mix of both; by default, genomes and proteomes are distinguished by the .fna and .faa extensions, respectively.

If needed, genomes and proteomes file extensions can be specified using the --genome_extension and --proteome_extension params, respectively.
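
To illustrate the extension rules above, here is a small Python sketch that classifies file names as genomes or proteomes. It is not PhyloPhlAn's code: the function name `classify_input` is hypothetical, and the sketch assumes that compressed files keep the FASTA extension before the compression suffix (for example `.fna.gz`).

```python
from pathlib import Path

COMPRESSION_SUFFIXES = (".gz", ".bz2")


def classify_input(filename, genome_extension=".fna", proteome_extension=".faa"):
    """Return 'genome', 'proteome', or None for a file name (illustrative helper)."""
    name = Path(filename).name
    # Assumption: a compressed input is named like "sample.fna.gz".
    for suffix in COMPRESSION_SUFFIXES:
        if name.endswith(suffix):
            name = name[: -len(suffix)]
            break
    if name.endswith(genome_extension):
        return "genome"
    if name.endswith(proteome_extension):
        return "proteome"
    return None


for example in ("isolate1.fna", "isolate2.faa.bz2", "notes.txt"):
    print(example, classify_input(example))
```

The two keyword arguments mirror the role of --genome_extension and --proteome_extension.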

Nucleotide or Amino acid

When using PhyloPhlAn 3, the user can customize each step of the pipeline to build the tree (marker genes identification, multiple sequence alignment, concatenation or gene trees inference, and phylogeny reconstruction) by specifying the desired tools in the configuration file. These steps should be tuned according to the type of markers present in the database and the input used in the analysis:

when both markers and inputs are nucleotides, the phylogenetic analysis will be done on nucleotides and the configuration file should specify the tools and params to work with nucleotides
when markers are proteins and inputs are a mix of genomes and proteomes, the analysis proceeds in translated sequence space, that is, on amino acids. If the inputs are all genomes, the user can specify the --force_nucleotides parameter to perform the phylogenetic analysis on nucleotides. In that case, the configuration file should be created using the --force_nucleotides parameter of the phylophlan_write_config_file script.

Diversity

The --diversity parameter allows for three pre-defined options to set several parameters at once (e.g., trimming, subsampling, fragmentary removal, etc.) in accordance with the expected diversity of the phylogeny to be built.

The user can choose among three values:

Diversity	Description
`low`	for species- and strain-level phylogenies
`medium`	for genus- and family-level phylogenies
`high`	for tree-of-life and higher-ranked taxonomic levels phylogenies

Accurate or Fast

If not specified, PhyloPhlAn 3 will automatically run with the --accurate option, which will consider more phylogenetic positions and should result in a more accurate phylogenetic reconstruction.

The --fast option can be specified to have a faster phylogenetic pipeline.

Both options will affect several other parameters that depend on the --diversity parameter. A detailed description is available in the Accurate or Fast section under Expert usage.

Output

All files produced by PhyloPhlAn 3 are available in the <input_folder>_<database> folder (or in the folder specified with --output_folder).

Inside there is a temporary folder (<input_folder>_<database>/tmp) that contains all the intermediate and temporary files produced during the analysis.

Depending on the configuration file, and hence on the type of phylogenetic analysis performed, the resulting output files may have different names.

For instance, using the supermatrix_aa.cfg configuration file that can be automatically generated using the phylophlan_write_default_configs.sh script, the output files will be:

Filename	Description
RAxML_bestTree.input_folder_refined.tre	is the final (refined) phylogeny produced by RAxML starting from the FastTree phylogeny
input_folder.tre	is the phylogeny built by FastTree
input_folder.aln	is the multiple sequence alignment used as input for the phylogenies, in FASTA format

Parallel computations

The user can specify the number of CPUs to use with the --nproc parameter:

phylophlan -i <input_folder> \
    -d <database> \
    --diversity <low-medium-high> \
    -f <configuration_file> \
    --nproc <N>

Please note that regardless of the number of CPUs specified with --nproc, PhyloPhlAn 3 will run:

RAxML with no more than 20 CPUs in the case --nproc is greater than 20 as in our experience using more than 20 CPUs with RAxML does not shorten the computational time required for the phylogeny reconstruction.
FastTree with 3 CPUs (as suggested in the FastTree FAQs) but this is not regulated by the --nproc param because FastTree uses the OMP_NUM_THREADS variable, which is defined in the configuration file.

Note: if you specify with --nproc more CPUs than are available on your machine, you will experience a significant drop in performance, as also reported in the RAxML manual.
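
The CPU rules described above can be summarized in a few lines of Python. This is a sketch of the documented behaviour, not PhyloPhlAn's implementation; the function names `effective_cpus` and `clamp_nproc` are hypothetical.

```python
import os

RAXML_MAX_CPUS = 20  # RAxML is run with at most 20 CPUs
FASTTREE_CPUS = 3    # FastTree default, set through OMP_NUM_THREADS in the config


def effective_cpus(nproc):
    """CPUs each tool would use for a given --nproc, per the rules above."""
    return {"raxml": min(nproc, RAXML_MAX_CPUS), "fasttree": FASTTREE_CPUS}


def clamp_nproc(requested):
    """Avoid asking for more CPUs than the machine reports as available."""
    available = os.cpu_count() or 1
    return min(requested, available)


print(effective_cpus(32))
print(clamp_nproc(64))
```

Clamping the requested value before passing it to --nproc avoids the oversubscription problem mentioned in the note.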

Databases

PhyloPhlAn 3 is able to automatically download two databases of universal markers for prokaryotes:

PhyloPhlAn (-d phylophlan, 400 universal marker genes) presented in Segata, N et al. NatComm 4:2304 (2013)
AMPHORA2 (-d amphora2, 136 universal marker genes) presented in Wu M, Scott AJ Bioinformatics 28.7 (2012)

Moreover, in addition to the two databases provided, as explained in the following database setup section, it is possible to retrieve a set of core proteins of a specific species, or even build custom databases starting from either a folder containing marker files or a multi-fasta file containing the marker sequences (e.g., multi-fasta file with the core genes sequences from Roary).

Offline installation

If you wish to download the databases and make them available offline, you can follow one of the following options:

Option 1. The easiest thing to do is to run phylophlan from a machine with an internet connection specifying the database you want to use and the location where to store it using the --databases_folder param.

phylophlan [mandatory_params] -d phylophlan --databases_folder /my/databases/folder --verbose
phylophlan [mandatory_params] -d amphora2 --databases_folder /my/databases/folder --verbose

Note: You can kill the runs above as soon as the database is downloaded and set up.

Option 2. Download the phylophlan_databases.txt file and then download the files listed inside it and put them inside /my/databases/folder:

Note: The following commands assume that the current working directory is /my/databases/folder.

You can verify the md5 checksums of the .tar archives and compare them with those in the .md5 file just downloaded:

diff <(md5sum amphora2.tar) amphora2.md5
diff <(md5sum phylophlan.tar) phylophlan.md5
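
Where `md5sum` is not available, the same check can be done in Python. The sketch below assumes that each `.md5` file follows the usual `md5sum` layout, with the hexadecimal digest as its first whitespace-separated token; the function names are hypothetical.

```python
import hashlib


def md5_of_file(path, chunk_size=1 << 20):
    """Compute the MD5 hex digest of a file, reading it in chunks."""
    digest = hashlib.md5()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def matches_md5_file(archive_path, md5_path):
    """Compare an archive with the digest stored in its .md5 file.

    Assumption: the digest is the first token in the .md5 file.
    """
    with open(md5_path) as handle:
        expected = handle.read().split()[0].lower()
    return md5_of_file(archive_path) == expected
```

For example, `matches_md5_file("amphora2.tar", "amphora2.md5")` returns True when the archive is intact.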

Then you need to untar the tar files and decompress their contents:

tar -xf amphora2.tar
bzcat amphora2/*.bz2 > amphora2/amphora2.faa

tar -xf phylophlan.tar
bunzip2 -k phylophlan/phylophlan.faa.bz2

Finally you can index the databases, but in doing so you should make sure you use the very same version you'll specify in the PhyloPhlAn configuration file when running your phylogenetic analysis. For instance, if you are going to use diamond you can index the databases with the following commands:

diamond makedb --in amphora2/amphora2.faa --db amphora2/amphora2
diamond makedb --in phylophlan/phylophlan.faa --db phylophlan/phylophlan

Note: Thanks to Eric Deveaud for the suggestions in putting this section together.

Expert usage

In this section, we provide as many details as possible for the parameters and configurations available in PhyloPhlAn 3.

Quality control of inputs and phylogenetic markers

When building a phylogeny, PhyloPhlAn 3 makes sure that input genomes/proteomes and markers respect a certain threshold of quality. It is possible to customize these thresholds through the two following parameters:

--min_num_proteins <n>: used to discard proteomes (.faa) with fewer than the specified number of proteins. Default is 1.
--min_len_protein <n>: this parameter is associated with the previous one and it is used to specify the minimum length of a protein in the proteomes. Proteins shorter than this value will not be considered. Default is 50.

The above two parameters have no effect when inputs are only genomes, see this section for more information.

--min_num_markers <n>: input genomes or proteomes that map to fewer than the specified number of markers will be discarded. Default is 1, unless the database specified with -d is phylophlan or amphora2, in which case the default is 100 or 34, respectively.
--min_num_entries <n>: database markers that are found in fewer than the specified number of inputs will be discarded. Default is 4.
--remove_fragmentary_entries: if specified, the multiple sequence alignment (MSA) will be checked and cleaned from fragmentary entries. See --fragmentary_threshold for the threshold values above which an entry will be considered fragmentary. Default is False.
--fragmentary_threshold <n>: used to specify the fraction of gaps for each input in the MSA to be considered fragmentary and hence removed. Default is 0.85.
--remove_only_gaps_entries: if specified, entries in the MSAs composed only of gaps will be removed. This is equivalent to specifying --remove_fragmentary_entries and --fragmentary_threshold 1. Default is False.
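
As an illustration of the fragmentary-entry check, the following Python sketch filters an alignment by gap fraction. It is not PhyloPhlAn's implementation: the function names are hypothetical, and the comparison is assumed to be "greater than or equal to the threshold", which is what makes a threshold of 1 remove exactly the entries composed only of gaps.

```python
def gap_fraction(sequence, gap="-"):
    """Fraction of gap characters in an aligned sequence."""
    if not sequence:
        return 1.0
    return sequence.count(gap) / len(sequence)


def remove_fragmentary(msa, threshold=0.85):
    """Keep entries whose gap fraction is below the threshold.

    msa maps entry names to aligned sequences of equal length.
    """
    return {
        name: seq for name, seq in msa.items() if gap_fraction(seq) < threshold
    }


alignment = {
    "genome_a": "ACGTACGT",
    "genome_b": "AC------",  # 75% gaps
    "genome_c": "A-------",  # 87.5% gaps
    "genome_d": "--------",  # only gaps
}
print(sorted(remove_fragmentary(alignment)))
print(sorted(remove_fragmentary(alignment, threshold=1)))
```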

Accurate or Fast

The following table shows the parameters affected by the combination of the --diversity and --accurate/--fast parameters.

	`--accurate`	`--fast`
`--diversity low`	`--submat pfasum60` `--trim not_variant` `--remove_fragmentary_entries` `--not_variant_threshold 0.99`	`--submat pfasum60` `--trim greedy` `--remove_fragmentary_entries` `--fragmentary_threshold 0.85` `--subsample fivehundred` `--scoring_function trident` `--gap_perc_threshold 0.67`
`--diversity medium`	`--submat pfasum60` `--trim gap_trim` `--remove_fragmentary_entries` `--fragmentary_threshold 0.85` `--subsample onehundred` `--scoring_function trident`	`--submat pfasum60` `--trim greedy` `--remove_fragmentary_entries` `--fragmentary_threshold 0.75` `--subsample fifty` `--scoring_function trident` `--not_variant_threshold 0.97` `--gap_perc_threshold 0.75`
`--diversity high`	`--submat pfasum60` `--trim greedy` `--remove_fragmentary_entries` `--fragmentary_threshold 0.75` `--subsample twentyfive` `--scoring_function trident` `--not_variant_threshold 0.95` `--gap_perc_threshold 0.85`	`--submat pfasum60` `--trim greedy` `--remove_fragmentary_entries` `--fragmentary_threshold 0.67` ( `--subsample phylophlan` or `--subsample tenpercent` ) `--scoring_function trident` `--not_variant_threshold 0.9` `--gap_perc_threshold 0.85`

Note: if you manually specify one or more of the above parameters on the command line, the values you provide override the automatic values set by the chosen combination of --diversity and --accurate/--fast.
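
The override behaviour can be pictured as a dictionary of presets updated with user-provided values. The sketch below transcribes only two cells of the table above (`low` with `--accurate`, and `medium` with `--fast`); the names `PRESETS` and `resolve_parameters` are hypothetical, and the code only models the precedence rule, not how PhyloPhlAn parses its arguments.

```python
# Two cells of the table above, transcribed as dictionaries.
PRESETS = {
    ("low", "accurate"): {
        "submat": "pfasum60",
        "trim": "not_variant",
        "remove_fragmentary_entries": True,
        "not_variant_threshold": 0.99,
    },
    ("medium", "fast"): {
        "submat": "pfasum60",
        "trim": "greedy",
        "remove_fragmentary_entries": True,
        "fragmentary_threshold": 0.75,
        "subsample": "fifty",
        "scoring_function": "trident",
        "not_variant_threshold": 0.97,
        "gap_perc_threshold": 0.75,
    },
}


def resolve_parameters(diversity, mode="accurate", **overrides):
    """Start from the preset and let explicitly given values take precedence."""
    settings = dict(PRESETS[(diversity, mode)])
    settings.update(overrides)
    return settings


print(resolve_parameters("medium", "fast", subsample="onehundred"))
```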

Trimming

You can specify the trimming strategy to use with the --trim parameter. The user can choose between four different options:

`--trim`	Description
`gap_trim`	performs what is specified in the `trim` section of the configuration file, which by default is trimAl with the `-gappyout` parameter, as presented in Capella-Gutiérrez S, et al. Bioinformatics 25.15 (2009) and on the trimAl website
`gap_perc`	removes columns with a fraction of gaps above a certain threshold, set by the `--gap_perc_threshold` parameter, whose default value is 0.67
`not_variant`	removes columns of the multiple sequence alignment in which a single amino acid appears with a frequency above a certain threshold, set by the `--not_variant_threshold` parameter, whose default value is 0.99
`greedy`	performs all the above trimming options

The default is None, in which case the trimming step is not performed.
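
The two threshold-based strategies, `gap_perc` and `not_variant`, can be sketched as column filters. This is an illustration under stated assumptions rather than PhyloPhlAn's code: the function names are hypothetical, a column is removed when its value is strictly above the threshold, and the frequency of the most common residue is computed over all sequences in the column. The `gap_trim` strategy, which delegates to the tool in the `trim` section, is not modelled.

```python
from collections import Counter

GAP = "-"


def gap_fraction(column):
    """Fraction of gaps in one alignment column."""
    return column.count(GAP) / len(column)


def top_residue_fraction(column):
    """Frequency of the most common non-gap residue in one column."""
    counts = Counter(c for c in column if c != GAP)
    return max(counts.values(), default=0) / len(column)


def trim_columns(msa, gap_perc_threshold=0.67, not_variant_threshold=0.99):
    """Drop columns that are too gappy or (almost) invariant."""
    kept = [
        col
        for col in zip(*msa)
        if gap_fraction(col) <= gap_perc_threshold
        and top_residue_fraction(col) <= not_variant_threshold
    ]
    if not kept:
        return ["" for _ in msa]
    return ["".join(row) for row in zip(*kept)]


msa = ["AC-T", "AG-T", "AC-A", "AT-T"]
print(trim_columns(msa))
```

In this toy alignment the first column is invariant and the third contains only gaps, so both are removed.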

Subsampling

The site subsampling strategy retains only a certain number of phylogenetically relevant positions (selected based on the scoring function).

In PhyloPhlAn 3, you can specify the subsample strategy using the --subsample parameter.

Several options are available, each retaining a different number of positions:

`--subsample`	Description
`phylophlan`	uses the formula presented in Segata, N et al. NatComm 4:2304 (2013) to determine how many positions to retain for each of the 400 PhyloPhlAn markers
`onethousand`	retains up to 1,000 positions for each marker
`sevenhundred`	retains up to 700 positions for each marker
`fivehundred`	retains up to 500 positions for each marker
`threehundred`	retains up to 300 positions for each marker
`onehundred`	retains up to 100 positions for each marker
`fifty`	retains up to 50 positions for each marker
`twentyfive`	retains up to 25 positions for each marker
`tenpercent`	retains 10% of the positions for each marker
`twentyfivepercent`	retains 25% of the positions for each marker
`fiftypercent`	retains 50% of the positions for each marker

Note: the --subsample phylophlan option works only when using the PhyloPhlAn database, specified via -d phylophlan

The default is None. In this case, the subsampling will not be performed and the full-length alignment will be used.
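
A rough Python sketch of score-based subsampling follows. It is not PhyloPhlAn's implementation: the function name `subsample_positions` is hypothetical, the per-column scores are taken as given, and two assumptions are made, namely that a higher score means a more relevant position and that retained positions keep their original order in the alignment.

```python
def subsample_positions(scores, max_positions=None, fraction=None):
    """Return the indices of the columns to retain, in alignment order.

    Use max_positions for options such as `fivehundred`, or fraction for
    options such as `tenpercent`. With neither, all positions are kept.
    """
    total = len(scores)
    if fraction is not None:
        keep = int(total * fraction)
    elif max_positions is not None:
        keep = min(total, max_positions)
    else:
        keep = total
    # Assumption: higher score means more phylogenetically relevant.
    ranked = sorted(range(total), key=lambda i: scores[i], reverse=True)
    return sorted(ranked[:keep])


column_scores = [0.1, 0.9, 0.5, 0.7]
print(subsample_positions(column_scores, max_positions=2))
print(subsample_positions(column_scores, fraction=0.5))
```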

Scoring function

In PhyloPhlAn 3, a scoring function assigns a phylogenetic score to each column in the MSAs; the scores are then used to rank the MSA positions and retain a subset of them (see Subsampling).

The --scoring_function parameter allows three different scoring functions:

`--scoring_function`	Description
`muscle`	implements the same scoring function defined in Edgar, RC NAR 32.5 (2004), when specifying the `-scorefile` param
`trident`	implements the `trident` scoring function as presented in Valdar, WSJ. Proteins 48.2 (2002), which is a weighted combination of symbol diversity, stereochemical diversity, and gap cost
`random`	assigns random scores to each position in the MSAs (for testing purposes only)

Substitution matrices

Some of the functions for scoring the MSA columns need a substitution matrix to evaluate the expected substitution rates of amino acids.

Substitution matrices can be specified using the --submat param, which accepts one of the following values:

`--submat`	Description
`vtml200`	substitution matrix proposed by Yamada K, Tomii K Bioinformatics 30.3 (2014)
`vtml240`	substitution matrix used in Edgar RC NAR 32.5 (2004)
`miqs`	substitution matrix proposed by Tomii K, Yamada K Humana Press, New York, NY, 1415 (2016)
`pfasum60`	substitution matrix proposed by Keul F et al. BMC Bioinformatics 18.1 (2017)

The substitution matrices presented above are distributed with PhyloPhlAn 3. However, the set of substitution matrices can be extended with user-defined ones. Users can generate their own substitution matrices using the scripts (generate_matrices.sh and serialize_matrix.py) provided in the phylophlan_substitution_matrices folder.

Substitution models

If you are running a gene tree pipeline, you also have to specify the --maas parameter, providing a mapping file that specifies the substitution model to use for each specific marker. Within PhyloPhlAn you can find the phylophlan.tsv file (present inside the phylophlan_substitution_models folder) that lists the substitution models for each of the 400 universal markers of the PhyloPhlAn database.

The format of the file is very simple: it is a two-column, TAB-separated file in which the first column is the name of the marker and the second is the name of the substitution model to use.

For example, the first 6 lines of the phylophlan.tsv file:

p0000	PROTCATLG
p0001	PROTCATLG
p0002	PROTCATLG
p0003	PROTCATLG
p0004	PROTCATLG
p0005	PROTCATCPREVF
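
A mapping file in this format is easy to read or validate with a few lines of Python. The sketch below is only an illustration; the function name `read_substitution_models` is hypothetical.

```python
import csv
import io


def read_substitution_models(handle):
    """Parse a two-column, TAB-separated marker-to-model mapping."""
    models = {}
    for line_number, row in enumerate(csv.reader(handle, delimiter="\t"), 1):
        if not row:
            continue  # skip empty lines
        if len(row) != 2:
            raise ValueError(f"line {line_number}: expected 2 columns, got {len(row)}")
        marker, model = row
        models[marker] = model
    return models


example = io.StringIO("p0000\tPROTCATLG\np0005\tPROTCATCPREVF\n")
models = read_substitution_models(example)
print(models)
```

With a real file, pass an open file handle, for example `open("phylophlan.tsv", newline="")`.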

Mutation rates table

PhyloPhlAn 3 implements the --mutation_rates option that computes the amount of nucleotide or amino acid changes in each aligned marker.

In the output folder <input_folder>_<database>/mutation_rates/, you can find a mutation rate table for all the markers whereas the <input_folder>_<database>/mutation_rates.tsv output file contains the summarized mutation rates table for the complete multiple sequence alignment.

The upper triangle of the mutation rates table contains the decimal value of the mutation rate (e.g., 0.01), while the lower triangle contains the corresponding fraction (e.g., 1/100), which can be used to evaluate whether the value is computed over a meaningful number of positions with respect to the length of the MSA.
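
To make the relation between the decimal value and the fraction concrete, here is a Python sketch that computes both for a pair of aligned sequences. The exact definition used by PhyloPhlAn is not given here, so the sketch assumes that positions with a gap in either sequence are skipped and that the rate is the number of mismatches over the number of compared positions; the function name `mutation_rate` is hypothetical.

```python
def mutation_rate(seq_a, seq_b, gap="-"):
    """Return (decimal rate, 'mismatches/compared') for two aligned sequences.

    Assumption: positions where either sequence has a gap are not compared.
    """
    if len(seq_a) != len(seq_b):
        raise ValueError("aligned sequences must have the same length")
    compared = 0
    mismatches = 0
    for a, b in zip(seq_a, seq_b):
        if a == gap or b == gap:
            continue
        compared += 1
        if a != b:
            mismatches += 1
    decimal = mismatches / compared if compared else 0.0
    return decimal, f"{mismatches}/{compared}"


print(mutation_rate("ACGT-A", "ACCTTA"))
```

A fraction with a small denominator relative to the alignment length signals that the decimal value rests on few compared positions.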

Sorting

Using the --sort parameter it is possible to sort the markers and hence force PhyloPhlAn 3 to consider them in a specific order when concatenating the sequences.

When using the PhyloPhlAn database (-d phylophlan), --sort will be automatically set to True.

Note: the sort preference is used only for the super-matrix approach (concatenation).

Database setup

To build a custom database, we provide the phylophlan_setup_database script to be run with the following syntax:

phylophlan_setup_database -i <input_file_or_folder> \
    -d <database_name> \
    -e <input_extension> \
    -t <database_type>

where:

<input_file_or_folder>: is the folder containing markers' files or a multi-fasta file containing the markers
<database_name>: is the database name chosen by the user (the name to provide to phylophlan when running it)
<input_extension>: is the extension of the input file(s)
<database_type>: has to be n if the user is using a nucleotide database or a if the user is using an amino acids database

The database will be created in the same folder as the input file(s), or in the output folder specified with the -o option.

The phylophlan_setup_database script can also be used to automatically retrieve a set of core proteins of a specific species using the -g option (instead of the -i param). In this case, you need to specify the species name like -g s__<species_name>. This is also going to be the default name of the database if not differently specified with -d.

In this case, a set of UniRef90 species-specific proteins for the species_name provided will be downloaded. As UniRef90 IDs might change over time, you might see failed downloads in the output of the program for some of the proteins. The phylophlan_setup_database script will save them and retry the download by using the UniRef APIs to resolve the old IDs into the new ones. If the second attempt also fails to download some of the UniRef90 proteins, those will be reported in the <species_name>_core_proteins_not_mapped.txt file, saved inside the database folder.

Configuration File

PhyloPhlAn 3 relies on the configuration file for handling the external software and their parameters.

A configuration file can be specified in phylophlan with -f <config_file>.

A configuration file is composed of different sections. Some are mandatory, because they ensure that the minimum steps needed to complete a phylogenetic analysis are executed, and some are optional. Each section refers to a specific step in the phylogenetic pipeline and contains all the details for the external software to be correctly executed.

In PhyloPhlAn 3 you can find the phylophlan_write_default_configs.sh script that will generate four ready-to-use configuration files:

supermatrix_aa.cfg
supermatrix_nt.cfg
supertree_aa.cfg
supertree_nt.cfg

More information about the supermatrix and supertree approaches is available in the following section.

Custom configuration file

If you want to generate your own configuration file, you can use the phylophlan_write_config_file script.

Below is an example of the command used to create a customized configuration file where diamond is used instead of blastn and muscle instead of mafft, with respect to the supermatrix_nt.cfg configuration file generated by the phylophlan_write_default_configs.sh script:

python phylophlan_write_config_file \
    -o custom_config_nt.cfg \
    -d n \
    --db_dna makeblastdb \
    --map_dna diamond \
    --msa muscle \
    --trim trimal \
    --tree1 fasttree \
    --tree2 raxml

where:

-o: is the output filename
-d: indicates the type of database this configuration file is tailored for (a detailed description is available here)
--db_dna, --map_dna, --msa, --trim, --tree1, --tree2: indicate the sections the configuration file will contain

Note 1: Please note that if you are going to use MAFFT and in your system either /local-storage or /tmp is available, it will be used for the temporary files by exporting the TMPDIR variable. If you want to change the temporary folder for MAFFT please add to (or edit) your config file under the [msa] section where MAFFT is specified:

environment = TMPDIR=/path/to/temp/folder

Note 2: Please note that if you specified fasttree in your configuration file the number of CPUs will be set to 3 as suggested in the FastTree FAQs. If you want to change the number of CPUs for fasttree you can add (or edit) your config file under the [tree1] section where fasttree is specified:

environment = OMP_NUM_THREADS=3

Note 3: Please, if you are going to use DIAMOND in your analysis, be aware that there are known issues.

Note 4: Please, if you are going to use MAFFT in your analysis, be aware that there are known issues.

Mandatory sections

The following sections are strictly required in any configuration file:

Mandatory section	Description
`--db_dna` and/or `--db_aa`	specify the command to use for creating an indexed database; choices for `db_dna`: `makeblastdb`; choices for `db_aa`: `usearch`, `diamond`
`--map_dna` and/or `--map_aa`	specify the software for mapping the database against genomes and proteomes, respectively; choices for `map_dna`: `blastn`, `tblastn`, `diamond`; choices for `map_aa`: `usearch`, `diamond`
`--msa`	specify the software for performing the multiple-sequence alignment; choices are: `muscle`, `mafft`, `opal`, `upp`
`--tree1`	specify the software for inferring the phylogeny; choices are: `fasttree`, `raxml`, `iqtree`, `astral`, `astrid`

Optional sections

Optional section	Description
`--trim`	specify the software `trimal` for performing the trimming of gappy regions
`--gene_tree1`	specify the software to use for building the single-gene trees. Choices are `fasttree`, `raxml`, `iqtree`
`--gene_tree2`	specify the software `raxml` for refining the phylogenies built at the `gene_tree1` step
`--tree2`	specify the software `raxml` for refining the phylogeny built at the `tree1` step

Integrating new tools in the framework

PhyloPhlAn 3 allows users to integrate new tools that are not available in the framework, as well as their parameters, for each of the different steps. This is done by manually editing the configuration file or creating a new configuration file with the desired tools/parameters.

Important: The only requirement for this integration is that the input and output files of the tool to be integrated are in the same format used by PhyloPhlAn.

Here is an example section of a default supermatrix configuration file that uses MAFFT for MSA:

[msa]
program_name = mafft
params = --quiet --anysymbol --thread 1 --auto
version = --version
command_line = #program_name# #params# #input# > #output#

And here is the same section manually modified to use Clustal Omega, which is not a default option in PhyloPhlAn 3 for the MSA:

[msa]
program_name = clustalo
input = -i
output = -o
params = --threads 1 --auto
version = --version
command_line = #program_name# #params# #input# #output#

Configuration variables explained

A configuration file can be composed of several different sections, but there is a minimum set of sections that has to be present to complete a phylogenetic analysis.

The mandatory sections are:

map_dna and/or map_aa
msa
tree1

The complete list of available sections is:

map_dna
map_aa
msa
trim
gene_tree1
gene_tree2
tree1
tree2

Each of the above sections can have several different options specified.

These are required in order to compose a command line that can run an external tool.

The set of mandatory options that each of the sections in a configuration file has to specify are:

program_name
command_line

The complete list of available options is:

program_name
params
threads
input
database
output_path
output
version
environment
command_line

In particular, the command_line option specifies how the other options should be arranged in order to build a running command line. For instance, taking the following section of a configuration file as an example:

[msa]
program_name = mafft
params = --quiet --anysymbol --thread 1 --auto
version = --version
command_line = #program_name# #params# #input# > #output#

In the command_line option it is specified that there should be the information provided in the program_name option as the first element, followed by the information in the params option and the information about the input option. After the input option, there is the output redirect sign (>) followed by the output option.

Note 1: if no input option is specified, PhyloPhlAn 3 will read the input from the standard input.

Note 2: if no output option is specified, PhyloPhlAn 3 will redirect the output to the output file.
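
The placeholder substitution described above can be sketched with Python's standard `configparser` module. This is a simplified illustration, not PhyloPhlAn's implementation: the function name `compose_command` and the file names are hypothetical, only the `program_name`, `params`, `input`, and `output` options are handled, and the standard input and output handling mentioned in the notes is not modelled (the `>` redirect is simply left in the resulting string).

```python
import configparser

CONFIG_TEXT = """
[msa]
program_name = mafft
params = --quiet --anysymbol --thread 1 --auto
version = --version
command_line = #program_name# #params# #input# > #output#
"""


def compose_command(section, input_file, output_file):
    """Fill the #...# placeholders of command_line with the section's options.

    When an `input` or `output` option is defined (for example `-i`),
    it is placed in front of the corresponding file name.
    """
    values = {
        "program_name": section["program_name"],  # mandatory option
        "params": section.get("params") or "",
        "input": " ".join(filter(None, [section.get("input"), input_file])),
        "output": " ".join(filter(None, [section.get("output"), output_file])),
    }
    command = section["command_line"]  # mandatory option
    for key, value in values.items():
        command = command.replace(f"#{key}#", value)
    return " ".join(command.split())  # normalize whitespace


parser = configparser.ConfigParser(interpolation=None)
parser.read_string(CONFIG_TEXT)
print(compose_command(parser["msa"], "marker.faa", "marker.aln"))
```

Applied to the Clustal Omega section shown earlier, where `input = -i` and `output = -o`, the same function yields a command of the form `clustalo --threads 1 --auto -i marker.faa -o marker.aln`.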

Supermatrix or Supertree approach

PhyloPhlAn 3 allows performing either a Supermatrix (or concatenation) pipeline or a Supertree (or gene trees) pipeline.

The type of phylogenetic pipeline that will be executed is determined based on the settings present in the configuration file.

Supermatrix (or concatenation)

The Supermatrix pipeline is the default in PhyloPhlAn 3, determined also by the mandatory sections.

In other words, when neither gene_tree1 nor gene_tree2 sections are present in the configuration file, PhyloPhlAn 3 will perform a concatenation pipeline.

Supertree (or gene trees)

This approach is to be preferred when building a large phylogeny. The section that must be present in the configuration file is gene_tree1.

In order to use a gene trees pipeline, the user has to manually edit the [tree1] section in the configuration file in which the paths to the ASTRAL jar file and the example file for the version option (needed to verify the correct installation of ASTRAL) need to be specified.

Below is the [tree1] section example template that needs to be edited:

[tree1]
command_line = #program_name# #input# #output#
program_name = java -jar /../path_to_astral/../astral.4.11.1.jar
input = -i
output = -o
version = -i /../path_to_astral/../astral-4.11.1/test_data/song_mammals.424.gene.tre

Note: the order of the options for the [tree1] section can differ from the above example when the config is automatically generated.

PhyloPhlAn assign SGBs

PhyloPhlAn 3 allows you to assign to each bin that comes from a metagenomic assembly analysis its closest species-level genome bins (SGBs, as defined in Pasolli, E et al. Cell (2019)).

The only mandatory parameter is -i, followed by the name of the input directory that contains the bins, for example:

phylophlan_assign_sgbs -i <input_folder>

Other parameters that can be specified are:

-o: allows you to decide the output prefix that will be used for the two output directories and the output file. If not specified, the prefix used is <input_folder>, so the two output folders will be <input_folder>_dists and <input_folder>_sketches, and the output file will be <input_folder>.tsv
-n: allows you to decide how many SGBs (sorted by increasing average genomic distance) will be reported for each input bin in the output file, the keyword all is accepted. If not specified, default is 10
--nproc: allows you to set how many CPUs can be used. Default is 1

A practical example of its usage is given in example 3, Metagenomic analysis of the Ethiopian cohort.

Output description

The phylophlan_assign_sgbs script has three different types of outputs: (1) list of the top -n/--how_many SGBs sorted by their average Mash distance, (2) closest SGB, GGB, FGB, and reference genomes, and (3) "all vs. all" matrix of all pairwise Mash distances.

Output 1

Each line reports the bin name and the list of the closest SGBs (sorted by increasing average Mash distance) in a tab-separated fashion. The fields describing each SGB are separated by a colon (:). For example:

my_bin	(k|u)SGB_ID:taxa_level:taxonomy:average_mash_distance	[(k|u)SGB_ID:taxa_level:taxonomy:average_mash_distance]

Where:

my_bin: is the input bin name
(k|u)SGB_ID: is the SGB ID and starts with either k or u to indicate whether it is a known or an unknown SGB
taxa_level: can be either Species, Genus, Family, or Phylum, depending on the taxonomic level at which the SGB has been assigned
taxonomy: is the full taxonomic label assigned to the SGB
average_mash_distance: is the average Mash distance of the input bin w.r.t. all the genomes in the SGB.
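
Lines in this format can be parsed with a short Python function. The sketch below is illustrative only: the function names are hypothetical, and the example line uses made-up identifiers, taxonomy, and distances rather than real output.

```python
def parse_sgb_field(field):
    """Split 'ID:taxa_level:taxonomy:average_mash_distance' into a dict."""
    sgb_id, taxa_level, rest = field.split(":", 2)
    taxonomy, distance = rest.rsplit(":", 1)
    return {
        "sgb_id": sgb_id,
        "known": sgb_id.startswith("k"),  # k = known SGB, u = unknown SGB
        "taxa_level": taxa_level,
        "taxonomy": taxonomy,
        "average_mash_distance": float(distance),
    }


def parse_assignment_line(line):
    """Return (bin name, list of parsed SGB fields) for one output line."""
    bin_name, *fields = line.rstrip("\n").split("\t")
    return bin_name, [parse_sgb_field(f) for f in fields if f]


# Made-up example line, for illustration only.
line = "my_bin\tkSGB_1:Species:k__Bacteria|p__Firmicutes:0.03\tuSGB_2:Genus:k__Bacteria:0.12\n"
bin_name, sgbs = parse_assignment_line(line)
print(bin_name, sgbs[0]["sgb_id"], sgbs[0]["average_mash_distance"])
```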

Output 2

Similar to Output 1, except that the information is reported for the closest SGB, then the closest GGB, followed by the closest FGB, and finally the closest reference genomes, according to their respective Mash distances.

Output 3

In this case, phylophlan_assign_sgbs produces a square matrix of all pairwise distances among the input bins only.

Getting reference genomes of a specified species

This feature is used for retrieving reference genomes of a specified taxonomy.

This is particularly useful when you need to build a tree to phylogenetically compare your genomes with those available in public databases.

The only mandatory parameter is -g <label> used to specify the taxonomic label for which you need to download the reference genomes.

The <label> must represent any valid taxonomic level or the special case all:

-g s__<species_name>: an example is given in 1. High-resolution phylogeny of 135 Staphylococcus aureus isolate genomes
-g all: an example is given in 2. Build the tree of life and insert newly sequenced genomes into it

Finding strains in trees

The phylophlan_strain_finder script can be used to automatically detect subtrees in a phylogeny that are likely representing a strain, based on two measures that can be computed during the PhyloPhlAn 3 phylogenetic analysis: the phylogenetic distance and the mutation rates between all nodes of a subtree.

These threshold values for these two measures can be tuned using:

--p_threshold <num>: the maximum normalized phylogenetic distance between any pair of nodes in the same subtree
--m_threshold <num>: the maximum mutation rate between any pair of nodes in the same subtree

When both of these conditions are satisfied for all nodes of a subtree, the nodes are considered to belong to the same strain.
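
The criterion can be sketched as a check over every pair of nodes in a candidate subtree. This is not the script's implementation, only an illustration under assumptions: the pairwise values are supplied as dictionaries keyed by frozenset pairs (a hypothetical layout), the comparison is inclusive, and both default thresholds are 0.05, as reported in the help text of phylophlan_strain_finder.py.

```python
from itertools import combinations
from typing import Dict, FrozenSet, Iterable


def is_same_strain(nodes: Iterable[str],
                   phylo_dist: Dict[FrozenSet[str], float],
                   mut_rate: Dict[FrozenSet[str], float],
                   phylo_threshold: float = 0.05,
                   mutrate_threshold: float = 0.05) -> bool:
    """Return True when every pair of nodes is within both thresholds (inclusive)."""
    for a, b in combinations(sorted(nodes), 2):
        pair = frozenset((a, b))
        if phylo_dist[pair] > phylo_threshold or mut_rate[pair] > mutrate_threshold:
            return False
    return True


# Hypothetical pairwise values for three leaves.
dist = {frozenset("ab"): 0.01, frozenset("ac"): 0.05, frozenset("bc"): 0.02}
rate = {frozenset("ab"): 0.00, frozenset("ac"): 0.03, frozenset("bc"): 0.06}
print(is_same_strain(["a", "b"], dist, rate))
print(is_same_strain(["a", "b", "c"], dist, rate))
```

A single pair exceeding either threshold is enough to reject the whole subtree, which is why the second call fails even though two of the three pairs pass.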

The phylophlan_strain_finder script requires as input the phylogenetic tree (-i param) and the mutation rates table with the -m param:

phylophlan_strain_finder -i <input_tree> -m <mutation_rates.tsv>

Note: PhyloPhlAn 3 outputs the <mutation_rates.tsv> table only if the parameter --mutation_rates is specified when executing phylophlan, as explained here.

Drawing heatmaps to visualize the output from phylophlan_assign_sgbs

The phylophlan_draw_metagenomic script can be used to visualize the results obtained from phylophlan_assign_sgbs. Its basic usage is:

phylophlan_draw_metagenomic -i <output_metagenomic> --map <bin2meta.tsv>

where:

<output_metagenomic>: is the output file generated by phylophlan_assign_sgbs as detailed above
<bin2meta.tsv>: is a mapping file that links each bin to the metagenome it has been reconstructed from. It is a tab-separated file where the input bins are in the first column and metagenomes in the second column

Note: when building the mapping file, make sure the names used for bins are consistent with the ones used as inputs with phylophlan_assign_sgbs
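
A mapping file in this layout can be generated with the Python csv module. The sketch below assumes a header-less file (the description above does not mention a header) and uses hypothetical bin and metagenome names; write_bin2meta is an illustrative helper, not a PhyloPhlAn function.

```python
import csv
import io
from typing import Dict, TextIO


def write_bin2meta(mapping: Dict[str, str], handle: TextIO) -> None:
    """Write one 'bin<TAB>metagenome' row per input bin."""
    writer = csv.writer(handle, delimiter="\t", lineterminator="\n")
    for bin_name, metagenome in sorted(mapping.items()):
        writer.writerow([bin_name, metagenome])


# Hypothetical names: bins in the first column, metagenomes in the second.
example = {
    "sampleA_bin.1": "sampleA",
    "sampleA_bin.2": "sampleA",
    "sampleB_bin.1": "sampleB",
}
buffer = io.StringIO()
write_bin2meta(example, buffer)
print(buffer.getvalue(), end="")
```

To write the real file, pass a handle opened with open("bin2meta.tsv", "w", newline="") instead of the in-memory buffer.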

A usage example of phylophlan_draw_metagenomic is given in the example 3. Metagenomic analysis of the Ethiopian cohort.

Requirements

Dependencies

Python (version >=3.0)
NumPy (version >=1.12.1)
Biopython (version >=1.70)
DendroPy (version >=4.2.0)

External Tools

PhyloPhlAn 3 also needs the following tools:

At least one phylogenetic inference software tool: RAxML, FastTree, IQ-TREE
At least one multiple sequence alignment tool: MUSCLE, MAFFT, Opal, UPP
trimAl for the trimming of the multiple sequence alignment (optional)
blast+ for database building and mapping of nucleotide databases
USEARCH and/or DIAMOND for database building and mapping of nucleotide and/or amino acid databases

Known Issues

In general, because PhyloPhlAn 3 is a pipeline that relies on external software, the failure of one of the external tools may occasionally interrupt the execution.

If you use DIAMOND or MAFFT, be aware that they might occasionally crash, most likely because of temporary files that were not correctly removed.

If PhyloPhlAn 3 crashes during the execution of either DIAMOND or MAFFT, we advise trying the following steps, in this order, to continue the analysis:

Remove the last directory that was generated in the output/tmp folder and re-launch the command. PhyloPhlAn will restart from where it failed, so the computation done up to that point is not lost.
If the previous solution does not work, restart PhyloPhlAn replacing the -i parameter with -c in order to clean the output and output/tmp folders.
If that does not work either, restart PhyloPhlAn with --clean_all, which removes all installation and database files that are automatically generated at the first run of PhyloPhlAn 3.
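
For the first step, a small helper can point at the directory to inspect. This is a hedged sketch: it interprets "the last directory that has been generated" as the sub-directory with the most recent modification time, which is an assumption, and it only reports the candidate without deleting anything. The function name is hypothetical.

```python
from pathlib import Path
from typing import Optional


def newest_subdirectory(tmp_dir: str) -> Optional[Path]:
    """Return the most recently modified sub-directory of tmp_dir, if any."""
    subdirs = [p for p in Path(tmp_dir).iterdir() if p.is_dir()]
    # Assumption: modification time is a good proxy for "last generated".
    return max(subdirs, key=lambda p: p.stat().st_mtime, default=None)
```

Calling newest_subdirectory("output/tmp") returns the candidate path, or None when the folder has no sub-directories. It is worth checking the result by hand before removing it, for example with shutil.rmtree, and then re-launching the same phylophlan command.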

`phylophlan.py`

This is the main PhyloPhlAn 3 script. Other information is available here.

usage: phylophlan.py [-h] [-i PROJECT_NAME | -c CLEAN] [-o OUTPUT]
                     [-d DATABASE] [-t {n,a}] [-f CONFIG_FILE] --diversity
                     {low,medium,high} [--accurate | --fast] [--clean_all]
                     [--database_list] [-s SUBMAT] [--submat_list]
                     [--submod_list] [--nproc NPROC]
                     [--min_num_proteins MIN_NUM_PROTEINS]
                     [--min_len_protein MIN_LEN_PROTEIN]
                     [--min_num_markers MIN_NUM_MARKERS]
                     [--trim {gap_trim,gap_perc,not_variant,greedy}]
                     [--gap_perc_threshold GAP_PERC_THRESHOLD]
                     [--not_variant_threshold NOT_VARIANT_THRESHOLD]
                     [--subsample {phylophlan,onethousand,sevenhundred,fivehundred,threehundred,onehundred,fifty,twentyfive,tenpercent,twentyfivepercent,fiftypercent}]
                     [--unknown_fraction UNKNOWN_FRACTION]
                     [--scoring_function {trident,muscle,random}] [--sort]
                     [--remove_fragmentary_entries]
                     [--fragmentary_threshold FRAGMENTARY_THRESHOLD]
                     [--min_num_entries MIN_NUM_ENTRIES] [--maas MAAS]
                     [--remove_only_gaps_entries] [--mutation_rates]
                     [--force_nucleotides] [--input_folder INPUT_FOLDER]
                     [--data_folder DATA_FOLDER]
                     [--databases_folder DATABASES_FOLDER]
                     [--submat_folder SUBMAT_FOLDER]
                     [--submod_folder SUBMOD_FOLDER]
                     [--configs_folder CONFIGS_FOLDER]
                     [--output_folder OUTPUT_FOLDER]
                     [--genome_extension GENOME_EXTENSION]
                     [--proteome_extension PROTEOME_EXTENSION] [--update]
                     [--verbose] [-v]

PhyloPhlAn is an accurate, rapid, and easy-to-use method for large-scale
microbial genome characterization and phylogenetic analysis at multiple levels
of resolution. PhyloPhlAn can assign finished, draft, or metagenome-assembled
genomes (MAGs) to species-level genome bins (SGBs). For individual clades of
interest (e.g. newly sequenced genome sets), PhyloPhlAn reconstructs strain-
level phylogenies from among the closest species using clade-specific
maximally informative markers. At the other extreme of resolution, PhyloPhlAn
scales to very-large phylogenies comprising >17,000 microbial species

optional arguments:
  -h, --help            show this help message and exit
  -i PROJECT_NAME, --input PROJECT_NAME
  -c CLEAN, --clean CLEAN
                        Clean the output and partial data produced for the
                        specified project (default: None)
  -o OUTPUT, --output OUTPUT
                        Output folder name, otherwise it will be the name of
                        the input folder concatenated with the name of the
                        database used (default: None)
  -d DATABASE, --database DATABASE
                        The name of the database of markers to use (default:
                        None)
  -t {n,a}, --db_type {n,a}
                        Specify the type of the database of markers, where "n"
                        stands for nucleotides and "a" for amino acids. If not
                        specified, PhyloPhlAn will automatically detect the
                        type of database (default: None)
  -f CONFIG_FILE, --config_file CONFIG_FILE
                        The configuration file to load. Four ready-to-use
                        configuration files can be generated using the
                        "write_default_configs.sh" script present in the
                        "configs" folder (default: None)
  --diversity {low,medium,high}
                        Specify the expected diversity of the phylogeny to
                        automatically adjust some parameters: "low": for
                        genus-/species-/strain-level phylogenies; "medium":
                        for class-/order-level phylogenies; "high": for
                        phylum-/tree-of-life size phylogenies (default: None)
  --accurate            Use more phylogenetic signal, which can result in more
                        accurate phylogeny; affected parameters depend on the
                        "--diversity" level (default: False)
  --fast                Perform a faster phylogeny reconstruction by
                        reducing the phylogenetic positions to be used; affected
                        parameters depend on the "--diversity" level (default:
                        False)
  --clean_all           Remove all installation and database files
                        automatically generated (default: False)
  --database_list       List of all the available databases that can be
                        specified with the -d/--database option (default:
                        False)
  -s SUBMAT, --submat SUBMAT
                        Specify the substitution matrix to use. Available
                        substitution matrices can be listed with
                        "--submat_list" (default: None)
  --submat_list         List of all the available substitution matrices that
                        can be specified with the -s/--submat option (default:
                        False)
  --submod_list         List of all the available substitution models that can
                        be specified with the --maas option (default: False)
  --nproc NPROC         The number of cores to use (default: 1)
  --min_num_proteins MIN_NUM_PROTEINS
                        Proteomes with less than this number of proteins will
                        be discarded (default: 1)
  --min_len_protein MIN_LEN_PROTEIN
                        Proteins in proteomes shorter than this value will be
                        discarded (default: 50)
  --min_num_markers MIN_NUM_MARKERS
                        Input genomes or proteomes that map to less than the
                        specified number of markers will be discarded
                        (default: 1)
  --trim {gap_trim,gap_perc,not_variant,greedy}
                        Specify which type of trimming to perform: "gap_trim":
                        execute what is specified in the "trim" section of the
                        configuration file; "gap_perc": remove columns with a
                        percentage of gaps above a certain threshold (see the
                        "--gap_perc_threshold" parameter); "not_variant":
                        remove columns with at least one nucleotide/amino acid
                        appearing above a certain threshold (see the
                        "--not_variant_threshold" parameter); "greedy":
                        performs all the above trimming steps; if not
                        specified, no trimming will be performed (default:
                        None)
  --gap_perc_threshold GAP_PERC_THRESHOLD
                        Specify the fraction of gaps above which a column is
                        removed when "--trim gap_perc" is specified
                        (default: 0.67)
  --not_variant_threshold NOT_VARIANT_THRESHOLD
                        Specify the value used to consider a column not
                        variant when "--trim not_variant" is specified
                        (default: 0.99)
  --subsample {phylophlan,onethousand,sevenhundred,fivehundred,threehundred,onehundred,fifty,twentyfive,tenpercent,twentyfivepercent,fiftypercent}
                        The number of positions to retain from each single
                        marker. Available options are: "phylophlan": specific
                        number of positions for each PhyloPhlAn marker (only
                        when "--database phylophlan" is specified);
                        "onethousand": return the top 1000 positions;
                        "sevenhundred": return the top 700 positions;
                        "fivehundred": return the top 500 positions;
                        "threehundred": return the top 300 positions;
                        "onehundred": return the top 100 positions; "fifty":
                        return the top 50 positions; "twentyfive": return the
                        top 25 positions; "fiftypercent": return the top 50%
                        of positions; "twentyfivepercent": return the top 25%
                        of positions; "tenpercent": return the top 10% of
                        positions; if not specified, the complete alignment
                        will be used (default: None)
  --unknown_fraction UNKNOWN_FRACTION
                        Define the amount of unknowns ("X" and "-") allowed in
                        each column of the MSA of the markers (default: 0.3)
  --scoring_function {trident,muscle,random}
                        Specify which scoring function to use to evaluate
                        columns in the MSA results (default: None)
  --sort                If specified, the markers will be ordered. When using
                        the PhyloPhlAn database, it will be automatically set
                        to "True" (default: False)
  --remove_fragmentary_entries
                        If specified, the MSAs will be checked and cleaned from
                        fragmentary entries. See --fragmentary_threshold for
                        the threshold values above which an entry will be
                        considered fragmentary (default: False)
  --fragmentary_threshold FRAGMENTARY_THRESHOLD
                        The fraction of gaps in the MSA to be considered
                        fragmentary and hence discarded (default: 0.85)
  --min_num_entries MIN_NUM_ENTRIES
                        The minimum number of entries to be present for each
                        of the markers in the database (default: 4)
  --maas MAAS           Select a mapping file that specifies the amino acid
                        substitution model to be used for each of the markers
                        for the gene tree reconstruction. The file must be tab-
                        separated (default: None)
  --remove_only_gaps_entries
                        If specified, entries in the MSAs composed only of
                        gaps ("-") will be removed. This is equivalent to
                        specifying "--remove_fragmentary_entries
                        --fragmentary_threshold 1" (default: False)
  --mutation_rates      If specified, will produce a mutation rates table for
                        each of the aligned markers and a summary table for
                        the concatenated MSA. This operation can take a long
                        time to finish (default: False)
  --force_nucleotides   If specified, force PhyloPhlAn to use nucleotide
                        sequences for the phylogenetic analysis, even in the
                        case of an amino acid database (default: False)
  --update              Update the databases file (default: False)
  --verbose             Make PhyloPhlAn verbose (default: False)
  -v, --version         Print the current PhyloPhlAn version and exit

Folder paths:
  Parameters for setting folder locations

  --input_folder INPUT_FOLDER
                        Path to the folder containing the input data (default:
                        input/)
  --data_folder DATA_FOLDER
                        Path to the folder where to store the intermediate
                        files. Default is "tmp" inside the project's output
                        folder (default: None)
  --databases_folder DATABASES_FOLDER
                        Path to the folder containing the database files
                        (default: phylophlan_databases/)
  --submat_folder SUBMAT_FOLDER
                        Path to the folder containing the substitution
                        matrices to be used to compute the column score for
                        the subsampling step (default:
                        phylophlan_substitution_matrices/)
  --submod_folder SUBMOD_FOLDER
                        Path to the folder containing the mapping file with
                        substitution models for each marker for the gene tree
                        building (default: phylophlan_substitution_models/)
  --configs_folder CONFIGS_FOLDER
                        Path to the folder containing the configuration files
                        (default: phylophlan_configs/)
  --output_folder OUTPUT_FOLDER
                        Path to the output folder where to save the results
                        (default: )

Filename extensions:
  Parameters for setting the extensions of the input files

  --genome_extension GENOME_EXTENSION
                        Extension for input genomes (default: .fna)
  --proteome_extension PROTEOME_EXTENSION
                        Extension for input proteomes (default: .faa)
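
The column-wise rules behind "--trim gap_perc" and "--trim not_variant" in the help text above can be illustrated with a short Python sketch. This is not the PhyloPhlAn implementation: it assumes the alignment is given as a dictionary of equal-length strings, that frequencies are computed over all entries of a column, and that gaps are ignored when looking for the dominant residue. The default values 0.67 and 0.99 are the ones listed in the help text; trim_alignment is a hypothetical name.

```python
from collections import Counter
from typing import Dict


def trim_alignment(msa: Dict[str, str], mode: str,
                   gap_perc_threshold: float = 0.67,
                   not_variant_threshold: float = 0.99) -> Dict[str, str]:
    """Drop alignment columns with a simplified 'gap_perc' or 'not_variant' rule."""
    if mode not in ("gap_perc", "not_variant"):
        raise ValueError("this sketch only covers 'gap_perc' and 'not_variant'")
    names = list(msa)
    n_rows = len(names)
    kept = []
    for column in zip(*(msa[name] for name in names)):
        if mode == "gap_perc":
            # Remove columns whose fraction of gaps exceeds the threshold.
            drop = column.count("-") / n_rows > gap_perc_threshold
        else:
            # Remove columns dominated by a single residue (gaps ignored).
            residues = Counter(c for c in column if c != "-")
            top = residues.most_common(1)[0][1] if residues else 0
            drop = top / n_rows > not_variant_threshold
        if not drop:
            kept.append(column)
    # Rebuild one trimmed sequence per entry from the surviving columns.
    return {name: "".join(col[i] for col in kept) for i, name in enumerate(names)}


# Hypothetical toy alignment with four entries and four columns.
toy = {"g1": "AC-T", "g2": "AC-T", "g3": "AG-A", "g4": "A--T"}
print(trim_alignment(toy, "gap_perc"))
print(trim_alignment(toy, "not_variant"))
```

In the toy alignment, the all-gap third column is removed by the gap_perc rule, while the invariant first column is removed by the not_variant rule.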

`phylophlan_setup_database.py`

This script is used to build a custom database and it should be used if the user decides not to use one of the two databases provided. The output is a folder containing the markers ready to be used in phylophlan through the option -d, followed by the name of the said folder. Other information here.

usage: phylophlan_setup_database.py [-h] [-i INPUT | -g GET_CORE_PROTEINS]
                                    [--database_update] [-o OUTPUT]
                                    [-d DB_NAME] [-e INPUT_EXTENSION]
                                    [-t {n,a}] [-x OUTPUT_EXTENSION]
                                    [--overwrite] [--verbose] [-v]

The phylophlan_setup_database.py script can be used to either format an input
folder or multi-fasta file to be used as database in phylophlan.py, or to
automatically download a pre-identified set of core UniRef90 proteins for the
taxonomic label of a given species

optional arguments:
  -h, --help            Show this help message and exit
  -i INPUT, --input INPUT
                        Specify the path to either the folder containing the
                        marker files or the file of markers, in (multi-)fasta
                        format (default: None)
  -g GET_CORE_PROTEINS, --get_core_proteins GET_CORE_PROTEINS
                        Specify the taxonomic label for which to download the set
                        of core proteins. The label must represent a species:
                        "--get_core_proteins s__Escherichia_coli" (default:
                        None)
  --database_update     Update the databases file (default: False)
  -o OUTPUT, --output OUTPUT
                        Specify path to the output folder where to save the
                        database (default: None)
  -d DB_NAME, --db_name DB_NAME
                        Specify the name of the output database (default:
                        None)
  -e INPUT_EXTENSION, --input_extension INPUT_EXTENSION
                        Specify the extension of the input file(s) specified
                        via -i/--input (default: None)
  -t {n,a}, --db_type {n,a}
                        Specify the type of the database, where "n" stands for
                        nucleotides and "a" for amino acids (default: None)
  -x OUTPUT_EXTENSION, --output_extension OUTPUT_EXTENSION
                        Set the database output extension (default: None)
  --overwrite           If specified and the output file exists, it will be
                        overwritten (default: False)
  --verbose             Print more stuff (default: False)
  -v, --version         Print the current phylophlan_setup_database.py
                        version and exit

`phylophlan_write_config_file.py`

This script lets the user customize the phylogenetic analysis by creating a personalized configuration file that selects, among the available options, the software to use for each mandatory section, as seen above. The output is a plain-text file: to tune the parameters of the selected software for a specific analysis, open the generated configuration file in a text editor and add or remove options as needed. Other information here.

usage: phylophlan_write_config_file.py [-h] -o OUTPUT -d {n,a}
                                       (--db_dna {makeblastdb} | --db_aa {usearch,diamond})
                                       [--map_dna {blastn,tblastn,diamond}]
                                       [--map_aa {usearch,diamond}] --msa
                                       {muscle,mafft,opal,upp}
                                       [--trim {trimal}]
                                       [--gene_tree1 {fasttree,raxml,iqtree}]
                                       [--gene_tree2 {raxml}] --tree1
                                       {fasttree,raxml,iqtree,astral,astrid}
                                       [--tree2 {raxml}] [-a]
                                       [--force_nucleotides] [--overwrite]
                                       [--verbose] [-v]

The phylophlan_write_config_file.py script generates a configuration file to
be used with the phylophlan.py script. It implements some standard parameters
for the software integrated, but if needed, the parameters of the selected
software can be added/modified/removed by editing the generated configuration
file using a text editor

optional arguments:
  -h, --help            Show this help message and exit
  -o OUTPUT, --output OUTPUT
                        Specify the output file where to write the
                        configurations (default: None)
  -d {n,a}, --db_type {n,a}
                        Specify the type of the database, where "n" stands for
                        nucleotides and "a" for amino acids (default: None)
  --db_dna {makeblastdb}
                        Add the "db_dna" section of the selected software that
                        will be used for building the indexed database
                        (default: None)
  --db_aa {usearch,diamond}
                        Add the "db_aa" section of the selected software that
                        will be used for building the indexed database
                        (default: None)
  --map_dna {blastn,tblastn,diamond}
                        Add the "map_dna" section of the selected software
                        that will be used for mapping the database against the
                        input genomes (default: None)
  --map_aa {usearch,diamond}
                        Add the "map_aa" section of the selected software that
                        will be used for mapping the database against the
                        input proteomes (default: None)
  --msa {muscle,mafft,opal,upp}
                        Add the "msa" section of the selected software that
                        will be used for producing the MSAs (default: None)
  --trim {trimal}       Add the "trim" section of the selected software that
                        will be used for the removal of the gappy regions of
                        the MSAs (default: None)
  --gene_tree1 {fasttree,raxml,iqtree}
                        Add the "gene_tree1" section of the selected software
                        that will be used for building the phylogenies for the
                        markers in the database (default: None)
  --gene_tree2 {raxml}  Add the "gene_tree2" section of the selected software
                        that will be used for refining the phylogenies
                        previously built with what is specified in the
                        "gene_tree1" section (default: None)
  --tree1 {fasttree,raxml,iqtree,astral,astrid}
                        Add the "tree1" section of the selected software that
                        will be used for building the first phylogeny
                        (default: None)
  --tree2 {raxml}       Add the "tree2" section of the selected software that
                        will be used for refining the phylogeny previously
                        built with what is specified in the "tree1" section
                        (default: None)
  -a, --absolute_path   Write the absolute path to the executable instead of
                        the executable name as found in the system path
                        environment (default: False)
  --force_nucleotides   If specified, sets parameters for phylogenetic analysis
                        software so that they use nucleotide sequences, even
                        in the case of a database of amino acids (default:
                        None)
  --overwrite           Overwrite output file if it exists (default: False)
  --verbose             Print more stuff (default: False)
  -v, --version         Print the current phylophlan_write_config_file.py
                        version and exit

`phylophlan_assign_sgbs.py`

For each bin that comes from a metagenomic assembly analysis, this script reports the closest species-level genome bins (SGBs). This is particularly useful when the user needs to analyze bins assembled from metagenomes. The main output file to consider will be a tsv file containing, for each bin of interest, information about the SGB it has been assigned to. Other information here.

usage: phylophlan_assign_sgbs.py [-h] [-i INPUT] [-o OUTPUT_PREFIX]
                                 [-d DATABASE] [--database_list]
                                 [--database_update] [-e INPUT_EXTENSION]
                                 [-n HOW_MANY] [--nproc NPROC]
                                 [--database_folder DATABASE_FOLDER]
                                 [--only_input] [--add_ggb_fgb]
                                 [--overwrite] [--verbose] [-v]

The phylophlan_assign_sgbs.py script assigns SGB and taxonomy to a given set of
input genomes. Outputs can be of three types: (1) for each input genome,
returns the list of the closest -n/--how_many SGBs sorted by average Mash
distance; (2) for each input genome, returns the closest SGB, GGB, FGB, and
reference genomes; (3) returns an all vs. all matrix with all the pairwise mash
distances

optional arguments:
  -h, --help            Show this help message and exit
  -i INPUT, --input INPUT
                        Input folder containing the metagenomic bins to be
                        indexed (default: None)
  -o OUTPUT_PREFIX, --output_prefix OUTPUT_PREFIX
                        Prefix used for the output folders: indexed bins,
                        distance estimations. If not specified, the input
                        folder will be used (default: None)
  -d DATABASE, --database DATABASE
                        Database name. Available options can be listed using
                        the --database_list parameter (default: None)
  --database_list       List of all the available databases that can be
                        specified with the -d/--database option (default:
                        False)
  --database_update     Update the databases file (default: False)
  -e INPUT_EXTENSION, --input_extension INPUT_EXTENSION
                        Specify the extension of the input file(s) specified
                        via -i/--input. If not specified, PhyloPhlAn will try
                        to infer it from the input files (default: None)
  -n HOW_MANY, --how_many HOW_MANY
                        Specify the number of SGBs to report in the output;
                        "all" is a special value to report all the SGBs; this
                        param is not used when "--only_input" is specified
                        (default: 10)
  --nproc NPROC         The number of CPUs to use (default: 1)
  --database_folder DATABASE_FOLDER
                        Path to the folder that contains the database file
                        (default: phylophlan_databases/)
  --only_input          If specified, provide a distance matrix between only
                        the input genomes provided (default: False)
  --add_ggb_fgb         If specified, adds GGB and FGB assignments. A column
                        is added for each, reporting the closest reference
                        genome; -n/--how_many will be set to 1 (default: False)
  --overwrite           If specified, overwrite the output file if it exists
                        (default: False)
  --verbose             Print more stuff (default: False)
  -v, --version         Print the current phylophlan_assign_sgbs.py version
                        and exit

`phylophlan_get_reference.py`

This script is used to get reference genomes of a specified species. This is particularly useful when the user needs to build a tree to compare their genomes with those available in public databases. When using the -g parameter, the output will be a directory with the requested genomes. Other information here.

usage: phylophlan_get_reference.py [-h] [-g GET | -l] [--database_update]
                                   [-e OUTPUT_FILE_EXTENSION] [-o OUTPUT]
                                   [-n HOW_MANY] [-m GENBANK_MAPPING]
                                   [--verbose] [-v]

The phylophlan_get_reference.py script downloads a specified number
(-n/--how_many) of reference genomes from the GenBank repository. The special
case "all" downloads the specified number of reference genomes for all
available taxonomic species. With the -l/--list_clades parameter, the
phylophlan_get_reference.py script returns the list of all species in the
database

optional arguments:
  -h, --help            Show this help message and exit
  -g GET, --get GET     Specify the taxonomic label for which to download the
                        set of reference genomes. The label must represent a
                        valid taxonomic level or the special case "all"
                        (default: None)
  -l, --list_clades     Print for all taxa the total number of species and
                        reference genomes available (default: False)
  --database_update     Update the databases file (default: False)
  -e OUTPUT_FILE_EXTENSION, --output_file_extension OUTPUT_FILE_EXTENSION
                        Specify extension of the output files
                        (default: .fna.gz)
  -o OUTPUT, --output OUTPUT
                        Specify the path to the output folder where to save
                        the files, required when -g/--get is specified
                        (default: None)
  -n HOW_MANY, --how_many HOW_MANY
                        Specify how many reference genomes to download, where
                        -1 stands for "all available" (default: 4)
  -m GENBANK_MAPPING, --genbank_mapping GENBANK_MAPPING
                        The local GenBank mapping file. If not found, it will
                        be automatically downloaded (default:
                        assembly_summary_genbank.txt)
  --verbose             Print more stuff (default: False)
  -v, --version         Print the current phylophlan_get_reference.py version
                        and exit

`phylophlan_strain_finder.py`

This script can be used to perform analysis on trees built with phylophlan. The output is a table that contains the subtrees and information about the minimum, mean, and maximum distance between nodes in the subtree, the minimum, mean and maximum mutation rate between nodes in the subtree, and the distance and mutation rate between each node in the subtree. Other information here.

usage: phylophlan_strain_finder.py [-h] -i INPUT -m MUTATION_RATES
                                   [--p_threshold P_THRESHOLD]
                                   [--m_threshold M_THRESHOLD]
                                   [--tree_format {newick,nexus,phyloxml,cdao,nexml}]
                                   [-o OUTPUT] [--overwrite] [-s {;,,,	}]
                                   [--verbose] [-v]

The phylophlan_strain_finder.py script analyzes the phylogeny and the mutation
rates table generated from the phylophlan.py script and returns sub-trees
representing the same strain, according to both a phylogenetic threshold
(computed on the normalized pairwise phylogenetic distances) and a mutation
rate threshold (computed on the aligned sequences of the markers used in the
phylogenetic analysis)

optional arguments:
  -h, --help            Show this help message and exit
  -i INPUT, --input INPUT
                        Specify the file of the phylogenetic tree as generated
                        from phylophlan.py (default: None)
  -m MUTATION_RATES, --mutation_rates MUTATION_RATES
                        Specify the file of the mutation rates as generated
                        from phylophlan.py (default: None)
  --p_threshold P_THRESHOLD
                        Maximum phylogenetic distance threshold for every pair
                        of nodes in the same subtree (inclusive) (default:
                        0.05)
  --m_threshold M_THRESHOLD
                        Maximum mutation rate ratio for every pair of nodes in
                        the same subtree (inclusive) (default: 0.05)
  --tree_format {newick,nexus,phyloxml,cdao,nexml}
                        Specify the format of the input tree (default: newick)
  -o OUTPUT, --output OUTPUT
                        Specify the output filename. If not specified, it will
                        be stdout (default: None)
  --overwrite           Overwrite the output file if exists (default: False)
  -s {;,,,	}, --separator {;,,,	}
                        Specify the separator to use in the output (default: tab)
  --verbose             Print more stuff (default: False)
  -v, --version         Print the current phylophlan_strain_finder.py version
                        and exit
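
As a rough illustration of how the two thresholds combine, the following Python sketch checks whether a set of leaves would qualify as one strain-level subtree: every pair must have a normalized phylogenetic distance and a mutation rate at or below the respective threshold. The function name, the dictionary-based inputs, and the toy values are hypothetical and chosen only for illustration; this is not the script's actual implementation, which reads the tree and the mutation rates table generated by phylophlan.py.

```python
from itertools import combinations
from statistics import mean


def summarize_subtree(leaves, dist, mut, p_threshold=0.05, m_threshold=0.05):
    """Hypothetical helper: summarize pairwise values and apply both thresholds.

    dist and mut map frozenset({a, b}) to the pairwise phylogenetic distance
    and mutation rate (an assumed in-memory layout, not PhyloPhlAn's file format).
    """
    pairs = [frozenset(p) for p in combinations(sorted(leaves), 2)]
    if not pairs:
        raise ValueError("need at least two leaves")
    d = [dist[p] for p in pairs]
    m = [mut[p] for p in pairs]
    return {
        "dist_min": min(d), "dist_mean": mean(d), "dist_max": max(d),
        "mut_min": min(m), "mut_mean": mean(m), "mut_max": max(m),
        # Thresholds are inclusive: a pair exactly at the threshold still passes.
        "same_strain": max(d) <= p_threshold and max(m) <= m_threshold,
    }


# Toy values, invented for the example.
dist = {
    frozenset({"A", "B"}): 0.01,
    frozenset({"A", "C"}): 0.05,
    frozenset({"B", "C"}): 0.04,
}
mut = {
    frozenset({"A", "B"}): 0.002,
    frozenset({"A", "C"}): 0.03,
    frozenset({"B", "C"}): 0.02,
}
summary = summarize_subtree(["A", "B", "C"], dist, mut)
print(summary["same_strain"])
```

Lowering `p_threshold` below the largest pairwise distance (0.05 in the toy data) makes the same set of leaves fail the check, which mirrors the "every pair of nodes" wording in the help text above.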

`phylophlan_draw_metagenomic.py`

This script visualizes the results obtained with phylophlan_assign_sgbs. It produces two heatmaps: one showing the presence/absence of the top SGBs (customizable through --top) in the metagenomes, the other showing the number of kSGBs and uSGBs in each metagenome. It also writes two accompanying output files containing the data used to build the heatmaps. See the "Drawing heatmaps to visualize the output from phylophlan_assign_sgbs" section for more information.

usage: phylophlan_draw_metagenomic.py [-h] -i INPUT -m MAP [--top TOP]
                                      [-o OUTPUT] [-s SEPARATOR] [--dpi DPI]
                                      [-f F] [--verbose] [-v]

The phylophlan_draw_metagenomic.py script takes as input the output table
generated from the phylophlan_assign_sgbs.py script and produces two heatmap
figures: (1) presence/absence heatmap of the SGBs in the metagenomic samples;
and (2) heatmap showing the amount of kSGBs and uSGBs in each metagenome.

optional arguments:
  -h, --help            Show this help message and exit
  -i INPUT, --input INPUT
                        The input file generated from
                        phylophlan_assign_sgbs.py (default: None)
  -m MAP, --map MAP     A mapping file that maps each bin to its metagenome
                        (default: None)
  --top TOP             The number of SGBs to display in the figure (default:
                        20)
  -o OUTPUT, --output OUTPUT
                        Prefix of the output files (default: output_heatmap)
  -s SEPARATOR, --separator SEPARATOR
                        The separator used in the mapping file (default: tab)
  --dpi DPI             Dpi resolution of the images (default: 200)
  -f F                  Images output format (default: svg)
  --verbose             Print more stuff (default: False)
  -v, --version         Print the current phylophlan_draw_metagenomic.py
                        version and exit
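
To make the first heatmap more concrete, the sketch below builds a presence/absence matrix of the most widespread SGBs across metagenomes from a bin-to-metagenome mapping and a bin-to-SGB assignment. The plain-dictionary data layout, the function name, the sample and SGB labels, and the ranking of "top" SGBs by the number of metagenomes in which they occur are all assumptions made for illustration; the script's real input parsing and ranking may differ.

```python
from collections import defaultdict


def presence_absence(bin_to_metagenome, bin_to_sgb, top=20):
    """Hypothetical helper: rows are SGBs, columns are metagenomes, values 0/1."""
    found = defaultdict(set)  # SGB -> set of metagenomes in which it was found
    for bin_id, sgb in bin_to_sgb.items():
        found[sgb].add(bin_to_metagenome[bin_id])
    metagenomes = sorted(set(bin_to_metagenome.values()))
    # Assumed ranking: most widespread SGBs first, ties broken by name.
    ranked = sorted(found, key=lambda s: (-len(found[s]), s))[:top]
    matrix = {s: [int(m in found[s]) for m in metagenomes] for s in ranked}
    return metagenomes, matrix


# Toy data, invented for the example.
bin_to_metagenome = {"bin1": "sampleA", "bin2": "sampleA", "bin3": "sampleB"}
bin_to_sgb = {"bin1": "SGB_1", "bin2": "SGB_2", "bin3": "SGB_1"}

metagenomes, matrix = presence_absence(bin_to_metagenome, bin_to_sgb, top=2)
print(metagenomes)
for sgb, row in matrix.items():
    print(sgb, row)
```

Each row of `matrix` corresponds to one row of a presence/absence heatmap, with one column per metagenome in the order given by `metagenomes`.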

PhyloPhlAn 1.0

The wiki of the first PhyloPhlAn implementation and its source archives (zip or tar.bz2) are available separately. That version is described in:

PhyloPhlAn is a new method for improved phylogenetic and taxonomic placement of microbes
Nicola Segata, Daniela Börnigen, Xochitl C. Morgan, and Curtis Huttenhower
Nature Communications, vol. 4, p. 2304, Jul. 2013
DOI: 10.1038/ncomms3304
