6. Tutorial on antiSMASH analysis of BGCs from Aspergillus flavus

Rauf Salamzade edited this page Aug 8, 2024 · 10 revisions

Overview

This tutorial applies lsaBGC-Pan to investigate BGCs from Aspergillus flavus. It is inspired by the study by Drott et al. 2021, "Microevolution in the pansecondary metabolome of Aspergillus flavus and its potential macroevolutionary implications for filamentous fungi", in which the authors identified 3 populations within the species and assessed the conservation of different BGCs across them.

For this tutorial, we downloaded all 13 genomes belonging to the species with CDS features available on NCBI and processed them through antiSMASH.

Download the input dataset

The input for this tutorial is a set of pre-computed antiSMASH (v7.0.0) results for A. flavus, which can be downloaded from FigShare:

# download the dataset
wget https://figshare.com/ndownloader/files/48144640

# uncompress it 
tar -zxvf 48144640

You should have a folder called AntiSMASH_Results/ in your current workspace now.
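Before moving on, it can help to confirm that the extraction worked. The following is a minimal sketch; `check_dir` is a hypothetical helper name introduced here for illustration and is not part of lsaBGC-Pan. It makes no assumption about how the folder is organized internally and only counts top-level entries.

```shell
# Report whether a directory exists and how many top-level entries it holds.
# check_dir is a hypothetical helper written for this tutorial, not an lsaBGC-Pan command.
check_dir() {
    dir=$1
    if [ -d "$dir" ]; then
        # tr strips the padding that some wc implementations add around the count.
        n_entries=$(find "$dir" -mindepth 1 -maxdepth 1 | wc -l | tr -d ' ')
        echo "${dir} found with ${n_entries} top-level entries"
    else
        echo "${dir} not found; re-check the download and extraction steps"
        return 1
    fi
}

# Run the check on the folder produced by the tar command above.
check_dir AntiSMASH_Results || true
```

If the folder is reported as missing, re-run the wget and tar commands from the same working directory.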

Note: if you are running this tutorial on a laptop, make sure you are using the smaller database build, and be prepared for a long run that uses most of your machine's threads. I don't have a MacBook Pro, but most appear to have 8 threads available, which means you would probably run this with 6 threads. That is not a lot, and because Aspergillus genomes are so rich in BGCs, the run can take up to 10 hours to complete and require nearly 16 GB of memory, if not more (16 GB should be standard for most laptops).
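To pick a thread count for your own machine, one option is to query the number of logical CPUs and leave a couple free for everything else. This is only a rough sketch: the "leave two threads free" rule is an assumption that mirrors the 8-to-6 example above, and `getconf _NPROCESSORS_ONLN` is widely available on Linux and macOS but not guaranteed everywhere.

```shell
# Count the logical CPUs currently online; fall back to 1 if getconf is unavailable.
total_threads=$(getconf _NPROCESSORS_ONLN 2>/dev/null || echo 1)

# Assumption for illustration: leave two threads free for the rest of the
# system, but never suggest fewer than one.
run_threads=$(( total_threads > 2 ? total_threads - 2 : 1 ))

echo "Detected ${total_threads} threads; a reasonable choice might be --threads ${run_threads}"
```

The resulting number can then be passed to the --threads argument in the commands below in place of 6.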

Step 1

As in the other tutorials, let's run lsaBGC-Pan with a general command for the first part, since we don't know what parameters are most appropriate:

lsaBGC-Pan -a AntiSMASH_Results/ -o Pan_Results/ --threads 6 --fungal 

We do not specify arguments for using GECCO for additional BGC predictions because it is designed for bacteria. Similarly, we do not request Panaroo-based orthology inference because that method is also specific to bacteria.

As described in the Streptomyces tutorial, you will see a message about what is expected in the input directory of antiSMASH results. If all goes well, lsaBGC-Pan should pause to allow users to adjust the parameters for defining populations/clades and for controlling the granularity of BGC clustering into GCFs.

Selecting population stratification parameters

We can assess the PDFs in the folder Pan_Results/Delineate_Populations/ to look for a value of the -pic argument that best partitions the genomes into 3 populations (as determined by Drott et al. 2021), or close to that number. The -pic parameter is the protein percent identity cutoff across the core genome used to define samples/genomes as belonging to the same population. None of the cutoffs lead to particularly great partitions, but we see that multiple populations do at least form when we use a value of 99.5.

Selecting parameters controlling clustering of BGCs into GCFs

To assess which parameters to apply for clustering BGCs into GCFs, we can check out the file GCF_Clustering/Plots_Depicting_Parameter_Influence_on_GCF_Clustering.pdf. The first page of the report shows the number of GCFs (x-axis) vs. the proportion of total BGCs that are singletons (sole members of their GCFs). We see that singleton BGCs are not all that common, so we can be fairly stringent with the criteria for grouping BGCs together without producing lots of singleton GCFs. This makes sense for fungi, which, as eukaryotes, likely have much less flexible genomes than bacteria.
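To make the singleton metric concrete, here is a small worked example on made-up data. The two-column table (BGC id, GCF id) and all identifiers in it are hypothetical and are not an lsaBGC-Pan output format; the sketch only illustrates how the proportion of singleton BGCs relates to the number of GCFs.

```shell
# Hypothetical BGC-to-GCF assignments (tab-separated: BGC id, GCF id).
# This toy table is made up for illustration; it is not an lsaBGC-Pan file.
assignments=$(printf '%s\t%s\n' \
    bgc_01 GCF_1 \
    bgc_02 GCF_1 \
    bgc_03 GCF_1 \
    bgc_04 GCF_2 \
    bgc_05 GCF_2 \
    bgc_06 GCF_3)

# Count the members of each GCF, then report the number of GCFs and the
# proportion of all BGCs that are the sole member of their GCF.
summary=$(printf '%s\n' "$assignments" | LC_ALL=C awk -F'\t' '
    { size[$2]++; total++ }
    END {
        for (g in size) { gcfs++; if (size[g] == 1) singletons++ }
        printf "GCFs=%d singleton_proportion=%.3f\n", gcfs, singletons / total
    }')

echo "$summary"
```

In this toy table only one of the six BGCs sits alone in its GCF. Stricter clustering parameters would tend to split GCFs and push that proportion up, which is the trade-off the first page of the report visualizes.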

Looking at the second page of the report for additional insight, we find that (i) the MCL inflation parameter makes little difference to GCF clustering and (ii) in terms of annotation consistency (red heatmap), there is not much difference between Jaccard Index cutoffs from 20.0 to 75.0.

Next, we can have a look at the third page, where we see that the standard deviation of gene counts for BGCs belonging to the same GCF is lowest when a Jaccard Index cutoff of 30.0 is used.
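As a rough illustration of the metric on the third page, the sketch below computes the standard deviation of gene counts within each GCF for a made-up table. The input values and identifiers are hypothetical, and the use of the population standard deviation (dividing by n) is an assumption; the report may compute it differently.

```shell
# Hypothetical table (tab-separated: GCF id, gene count of one member BGC).
# Values are made up for illustration; this is not an lsaBGC-Pan file.
gene_counts=$(printf '%s\t%s\n' \
    GCF_1 10 \
    GCF_1 12 \
    GCF_1 14 \
    GCF_2 20 \
    GCF_2 20)

# Per GCF: accumulate the count, sum, and sum of squares, then derive the
# population standard deviation (assumption: divide by n, not n - 1).
sd_table=$(printf '%s\n' "$gene_counts" | LC_ALL=C awk -F'\t' '
    { n[$1]++; sum[$1] += $2; sumsq[$1] += $2 * $2 }
    END {
        for (g in n) {
            mean = sum[g] / n[g]
            var = sumsq[g] / n[g] - mean * mean
            if (var < 0) var = 0   # guard against floating-point round-off
            printf "%s n=%d sd=%.3f\n", g, n[g], sqrt(var)
        }
    }' | sort)

echo "$sd_table"
```

A low standard deviation means the BGCs grouped into a GCF have similar gene counts, which is one sign that the clustering is grouping comparable clusters together.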

We thus elect to use an MCL inflation parameter of 5.0 and a Jaccard Index cutoff of 30.0.

Step 2

Now that we have selected values for the parameters controlling how populations are defined and GCFs are clustered, we can simply restart lsaBGC-Pan with the same command as before plus the selected values. Checkpoints within lsaBGC-Pan avoid re-running steps that have already completed successfully; however, GCF clustering and population delineation will always be re-run!

lsaBGC-Pan -a AntiSMASH_Results/ -o Pan_Results/ --threads 6 --fungal -cj 30.0 -ci 5.0 -pic 99.5

Precomputed Results:

The Final_Results/ subdirectory from an lsaBGC-Pan run (using v1.0.5) can be found in this Google Drive folder.