Skip to content

5. tutorial on joint AntiSMASH & GECCO analysis of BGCs from Streptomyces olivaceus

Rauf Salamzade edited this page Aug 5, 2024 · 13 revisions

Overview

This tutorial applies lsaBGC-Pan to investigates BGCs from Streptomyces olivaceus. It is inspired by the cool study by Wang et al. 2022: Habitat Adaptation Drives Speciation of a Streptomyces Species with Distinct Habitats and Disparate Geographic Origins where the authors found that the species is phylogenetically diverged corresponding to habitat and geography.

For the analysis we used skDER to automatically download and dereplicate genomes belonging to the S. olivaceus species based on GTDB classification. This resulted in a set of 23 genomes.

Download the input dataset

The input for this tutorial are pre-computed antiSMASH (v7.0.0) results for S. olivaceus that can be downloaded off FigShare:

# download the dataset
wget https://figshare.com/ndownloader/files/48144775

# uncompress it 
tar -zxvf 48144775

You should have a folder called AntiSMASH_Results/ in your current workspace now.

Note, running this tutorial on a laptop - make sure you are using the smaller sized database build and be prepared for a long run that uses a lot of your machines threads. I don't have a MacBook Pro - but most appear to have 8 threads available - which means you probably would run this with 6 threads which is not a lot and because Streptomyces are so rich in BGCs, this can take upwards to 10 hours to run to completion and require nearly 16 GB (if not slightly more) of memory (which should be standard for most laptops).

Step 1:

As in the other tutorials, let's run lsaBGC-Pan with a general (not customized) command for the first part, since we don't know what parameters are most appropriate:

lsaBGC-Pan -a AntiSMASH_Results/ -o Pan_Results/ --threads 6 --run-gecco --use-panaroo

Here we specify the argument --use-panaroo because Panaroo is generally one of the most accurate approaches for orthology inference when working within the context of a single bacterial species, which we are doing.

When providing antiSMASH results as input, you will see the following message describing the expected input folder:

You have provided a directory with antiSMASH results - within which
we expect discrete subdirectories that each correspond to an individual 
sample. Furthermore, within each such subdirectory, multiple GenBank 
files are expected. Specifically, at least one GenBank file with 
'.region' substring in its name (corresponding to a BGC region) and 
one GenBank without the substring (corresponding to the full genome) 
are expected. If these expectations are not met, the subdirectory will 
be ignored. Note, the name of the subdirectory will be regarded as the 
sample name. GenBank files must also end with the suffix '.gbk'. In
addition, lsaBGC uses locus tags - which are automatically renamed 
to ensure seamless computing - if retention of original locus tags
is preferred, users can issue the --keep_locus_tags option.

Critically, the argument --run-gecco specified also signifies that we want to both process the antiSMASH predicted BGCs and also predict and incorporate GECCO BGCs. BGCs from both prediction methods will be ranked by length and smaller BGCs which overlap >10% of coordinates will be dropped - thus leaving distinct BGCs for downstream analysis.

If all goes well, lsaBGC-Pan should come to break to allow users to adjust parameters for defining populations/clades and controlling the granularity of BGC clustering into GCFs.

Selecting population stratification parameters

Based on Wang et al. there were two clear partitions corresponding to different habitats and geographies for the species. We can take a look in the folder Pan_Results/Delineate_Populations/ to see if we similarly see a major split in the phylogeny and find a value for the -pic argument which best achieves such a clading. This parameter is the protein % identity cutoff to define genomes as belonging to the same population. We see that a value of 99.0 works pretty well at achieving this - with one genome treated as a singleton, not belonging to either of the two major populations/clades:

Also, this figure is not great because of label overlap and clades being cutoff, but as described in greater detail on page 4, you can easily customize Rscripts and recreate figures from lsaBGC-Pan to get them to publication quality.

Selecting parameters controlling clustering of BGCs into GCFs

To assess which parameters to apply for clustering BGCs into GCFs, we can check out the file GCF_Clustering/Plots_Depicting_Parameter_Influence_on_GCF_Clustering.pdf. The first page in the report will show the number of GCFs (x-axis) vs. the proportion of total BGCs that are singletons (sole members of their GCFs). It depends on what analyses one aims to perform, but one could aim to maximize the number of distinct GCFs while reducing the number of singleton BGCs. Based on this first plot, a Jaccard Index cutoff of 20 and an MCL inflation paramter of 5 might be appropriate.

We can also take a look at the second page of the report for additional insight:

Here, based on the red heatmap, we see that perhaps increasing the Jaccard Index cutoff to 30 might be more appropriate because it limits the number of multi-annotation GCFs.

Down below in the report, we can go to the page which shows more details on the impact of selecting a Jaccard Index cutoff of 30 and an MCL clustering inflation parameter of 5.0:

We see that paralogous GCFs are relatively limited based on the grey/black bar in comparison to other parameter configurations.

We also see that a core set of orthogroups exists for all non-singleton GCFs - based on the upper left plot. And also that, besides the largest GCF, most of them correspond to one to three BGC types maximum - suggesting appropriate delineation of GCFs.

Step 2

Now that we have selected values for parameters controlling how populations are defined and GCFs clustered, we can simply restart lsaBGC-Pan as we did before. Usage of checkpoints within lsaBGC-Pan will avoiding re-running steps which had already completed successfully, however, GCF clustering and population delineation will always be re-run!

lsaBGC-Pan -a AntiSMASH_Results/ -o Pan_Results/ --threads 6 --run-gecco --use-panaroo -cj 30.0 -ci 5.0 -pic 99.0

Precomputed Results:

The Final_Results/ subdirectory of lsaBGC-Pan - run using v1.0.5 - can be found on this Google Drive folder.

We highlight the results from this particular tutorial to describe the different types of output lsaBGC-Pan produces on this wiki page.