Skip to content

7. guide to parameter selections during midway break

Rauf Salamzade edited this page Sep 6, 2024 · 14 revisions

The best way to get a sense on how to perform parameter selection for population delineations and BGC clustering into GCFs is to check out the 3 tutorials (Wiki pages: 4, 5 (main tutorial), and 6).

lsaBGC-Cluster and its report for parameter selection were modified slightly in lsaBGC-Pan, see the belong section for additional details on lsaBGC-Cluster and interpreting the report.

Here are some general recommendations and good to know info however:

  • Population designations should generally be in agreement with phylogenetic clades.
  • While not demonstrated in the tutorials, which use the --population-identity-cutoff parameter to determine populations, the populations/clades can also be specified manually via the argument --manual-populations-file.
  • lsaBGC-Cluster aims to identify gene cluster families (GCFs) that are evolutionarily related but might lead to the production of different secondary metabolites. If you are interested in identifying GCFs which coincide to production of the exact same metabolite, we recommend the software BiG-SCAPE which specifically aims to do this.
  • Minimizing the number of singleton populations or GCFs where there is only one genome or one BGC will lead to more informative analyses.
  • Increasing the number of distinct GCFs which do not feature paralogous copies will be most informative. That is for the grey/black barplot, having the bars be more gray (single-copy) is desired. Note, for zol analysis, only one representative BGC per GCF per sample/genome is chosen for population genetic calculation purposes by default anyways. Note, this becomes more difficult to interpret when using draft quality assemblies where BGCs might be fragmented and thus might elevate the amount of multi-copy BGCs from samples observed across GCFs (especially those which are larger > 100 kb).
  • GCFs should largely correspond to one BGC type - but as can be seen in the Streptomyces and Aspergillus flavus tutorials - dense genomes with lots of potentially independent BGCs in close proximity to each other can lead to difficulties with achieving clean partitions.
  • It should be desired that GCFs have an evident core set of orthogroups.
  • If for some reason you do not care for carefully setting these parameters (which you really should) - you can skip the mid-way break by using the argument --no-break.

lsaBGC-Cluster Overview

lsaBGC-Cluster's algorithm is similar to and inspired by BiG-SCAPE (Munoz, Selem-Mjoica, Mullowney, et al. 2020), which has revolutionized clustering of individual biosynthetic gene clusters (BGCs) into gene cluster families (GCFs) for lineage specific exploration and mapping of biosynthetic potential; however, it defers in some key ways:

  1. Most importantly, lsaBGC-Cluster aims to group orthologous + in-paralogous BGCs into a single GCF. This is done through manual validation of clustering quality via the report on parameter effects that is produced. BiG-SCAPE parameters can also be adjusted but typically people run it using default parameters that the BiG-SCAPE authors established by attempting to maximize congruence between BGC clustering and metabolomics data.
  2. lsaBGC-Cluster assesses BGC pairwise similarity at the protein level rather than at the domain level like BiG-SCAPE, which has its advantages and disadvantages. The advantage is more data to determine orthology when working with closely related lineaages or species, the disadvantage is that multi-domain proteins are not well handled by many orthology inference software. nevertheless, OrthoFinder should still be able to resolve orthology between more distantly related taxonomic groups as well.
  3. lsaBGC uses 3rd party software such as OrthoFinder and Panaroo to infer orthology at a genome-wide scale, these protein ortholog groups are then used to cluster BGCs. BiG-SCAPE does not account for genomic context to our knowledge and thus relies on direct assessment of sequence / syntenic similarity between BGCs directly.

Algorithm for clustering

As mentioned, the algorithm of lsaBGC-Cluster is similar to BiG-SCAPE and both programs cluster BGCs into GCFs based on both sequence and syntenic conservation.

lsaBGC-Cluster takes the comprehensive set of BGCs across all genomes and first identifies pairs of similar BGCs if they share some percentage of ortholog groups (the Jaccard similarity index) and they share some global syntenic similarity (measured using correlation of homologous ortholog groups, see below). These correspond to parameters that are adjustable by users (via the arguments --cluster-jaccard and --cluster-syntenic-correlation in lsaBGC-Pan). Afterwards, the pairs of similar BGCs are clustered using MCL (Markov clustering) for which the granularity of is controlled by an "inflation" parameter - which is also adjustable by the user (via the option --cluster-inflation).

lsaBGC-Cluster was originally designed to work with BGCs from complete genomes only and did not aim to handle impartial BGCs. This is why we integrated BiG-SCAPE in lsaBGC-Easy.py, because it did a better job of accounting for such partial BGC instances. However, more recently, in lsaBGC-Pan, we introduced the --cluster-containment argument which automatically considers pairs of BGCs as similar prior to clustering if one BGC in the pair being considered is near a scaffold/contig edge and at least some percentage of its genes are found in the other BGC.

The default values for these options are:

  • --cluster-jaccard: 50.0
  • --cluster-inflation: 0.8
  • --cluster-syntenic-correlation: 0.4
  • --cluster-containment: 70.0

Selection of Parameter Values for Jaccard Similarity Cutoff and MCL Inflation

We strongly advocate for users to assess the report produced by lsaBGC-Cluster and manually curate parameters for their dataset. This is because the scope of the lineage might influence which clustering parameters are best suited for the analyses of interest for you (e.g. if working in the context of just a single species or sub-species, perhaps a more course clustering configuration could lead to reduced singleton BGCs left unassigned to a GCF while still preserving the assumption of orthology between BGC instances within such clusters).

Here is an example report:

Example Plots_Depicting_Parameter_Influence_on_GCF_Clustering.pdf

Here is a description of the figures the report will feature:

Title Page:

The plot on the title page shows the number of GCFs which resulted (generally want to maximize as this will lead to more clean delineation of orthologous BGCs from paralogous BGCs) vs. the number of singleton BGCs which remain un-clustered(want to minimize to be inclusive of orthologous variants of BGCs which have significantly diverged/altered) when using different values of the Jaccard Index and MCL Inflation parameters. Each point corresponds to results from clustering the BGCs into GCFs at a specific Jaccard Index and MCL Inflation combination. A jitter is added to prevent overlapping points, so interpret the plot carefully.

2nd Page:

The following 2nd page in the report contains four heatmaps, colored in different shades of the primary colors. Each cell in the heatmap corresponds to statistic assessing the quality of the GCF clustering performed using some specific MCL Inflation value (x-axis) and a specific Jaccard Similarity cutoff (y-axis).

  • Blue Heatmap (upper left) - The shading represents the ratio of the number of BGCs which are singletons relative to the number of BGCs which were grouped into GCFs. The larger this ratio, the darker the shading.
  • Green Heatmap (upper right) - The shading represents the number of GCFs formed from clustering with specific parameters. The larger this count, the darker the shading.
  • Yellow Heatmap (lower left) - The shading represents the proportion of GCFs formed from clustering with specific parameters which contain BGCs with distinct annotations from AntiSMASH (e.g. an ectoine BGC and a terpene BGC being grouped together). BGCs which are not annotated are accounted for and considered as belonging to a distinct annotation category "unknown". The larger this proportion, the darker the shading.
  • Red Heatmap (lower right) - The shading represents the proportion of GCFs formed from clustering with specific parameters which contain BGCs with distinct annotations from AntiSMASH (e.g. an ectoine BGC and a terpene BGC being grouped together). BGCs which are not annotated are ignored and not treated as conflicting evidence for multiple annotation categories being present in a single GCF. The larger this proportion, the darker the shading.

3rd Page:

The 3rd page of the report features another set of overview plots, this time in the form of adjacent grids of boxplots. Each grid corresponds to a specific Jaccard Similarity cutoff values, whereas the individual boxplots correspond to different MCL Inflation parameter values. The boxplots depict a range of statistics across each GCF resulting from the respective clustering.

  • First Row - Number of core homolog groups which are found in all samples with the GCF.
  • Second Row - Number of samples which have each GCF.
  • Third Row - Average number of genes observed in each GCF.
  • Fourth Row - Standard deviation for number of genes observed in each GCF.

As will often be apparent from these plots, the Jaccard Similarity cutoff parameter is by far more critical in impacting GCF clustering results than the MCL Inflation parameter which can be used more for fine tuning clustering results.

Clustering Parameter Combination Specific Views 4th Page +

The 4th page and onwards, the bulk of the report, shows the individual GCF level details for each clustering attempt with a parameter combination. These pages comprise of 4 subplots.

The top row features two distinct plots. The left plot is a histogram of sample counts for each GCF (how many samples each GCF has), colored according to whether the GCF has a core gene set. BGCs which are singletons are colored in grey as the notion of possessing a core gene set doesn't apply to them. The right plot is a simple histogram of the standard deviation of gene counts per GCF. If GCFs have many small and large BGCs, this might imply the clustering had some issues and might be under-clustering (clustering too coarsely).

The middle row plot is a bar chart, where each bar corresponds to a GCF, and the coloring indicates whether it is a primary instance of the GCF in a sample's genome or a secondary instance. This is to aid identifying whether GCFs should be further split because they include paralogous BGCs. A complication in interpreting this plot is that when many of the input genomes are partial/draft quality, then multiple BGCs might be grouped into the same GCF simply because they are chopped up pieces of the same BGC that assembly was unable to put together.

The bottom row plot is a bar chart which shows a subset of the largest GCFs, ranked in order. The coloring corresponds to the BGC-type/product-type of the individual BGCs. The details can be found in the spreadsheet later, but what is essential is that you probably want to maximize homogeneity - grouping similar BGC-types into a single GCF - sometimes with hybrid BGC regions however this gets a little more complicated. Heterogeneity can also arise do to a GCF being a mixture of GECCO and antiSMASH BGCs if running a hybrid analysis. Sometimes, for lineages with lots of BGCs and GCFs.

Final Remarks for Assessing Parameters

Once again, the purpose of these plots is to help guide which values to select for the two parameters controlling clustering for your specific lineage of interest. However, it might be that during exploration of the individual GCFs, one of them really fascinates you, in such case please check out our software suite zol & fai which can perform targeted detection and aid further exploration of the GCF.

Optional Additional Parameters Controlling Clustering

Annotation Category

Similar to BiG-SCAPE, lsaBGC also supports guiding clustering based on general annotation classes; however, this is implemented at the level of assessing whether pairs of BGCs can be orthologous. However, currently this is not on and not adjustable in lsaBGC-Pan - if there is strong desire to bring this option to lsaBGC-Pan, open a ticket and let us know!

Global Syntenic Correlation

Unlike BiG-SCAPE, lsaBGC-Cluster assesses global syntenic similarity between BGCs. In BiG-SCAPE however an adjacency index is computed based on a measure of pairs of domains which are commonly found adjacent to one another.

lsaBGC-Cluster's global syntenic similarity metric is based on ranking shared homolog groups and their order of occurrence along the BGC by assessing the correlation in ranked order. The reverse order of the secondary BGC is also tested against the order of the primary BGC and the maximum correlation value is retained.

Only homolog groups which occur in a single copy in the context of each BGC are used to compute the correlation in ranked order and the correlation is only computed for pairs of BGCs which share at least three such homolog groups.

image