8. explanation of final spreadsheet and visual reports

We will use results from the tutorial on lsaBGC-Pan analysis of the pan-BGC-ome of Streptomyces olivaceus. The Final_Results/ subdirectory for the analysis can be found in this Google Drive folder.

Explanation of Consolidated Spreadsheet Tables

Spreadsheet location: Pan_Results/Final_Results/Consolidated_Spreadsheet.xlsx

Example spreadsheet

Tab 1: "Explanation of Results"

This tab contains a link to this wiki page, which explains the contents of the other sheets.

Tab 2: "BGC Overview"

This tab features an overview of BGCs across all genomes, along with the population each sample/genome is assigned to and the GCF each BGC is grouped into.

Column Descriptions:

Column	Description
sample	Sample/genome identifier.
population	The population/clade that sample was assigned to.
method	The BGC prediction method (either antiSMASH or GECCO).
genome_path	The path to the full genome in GenBank format.
bgc_id	The BGC identifier.
bgc_path	The path to the BGC in GenBank format.
gcf_id	The GCF identifier the BGC belongs to.
scaffold	The scaffold identifier the BGC is found on.
start	The start coordinate for the BGC.
end	The end coordinate for the BGC.
bgc_length	The length of the BGC in bp.
dist_to_edge	The minimal distance of the BGC to the start/end of the scaffold/contig it is on.
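
As a minimal sketch of how this tab might be used, the snippet below keeps only BGCs that lie far from a scaffold edge and counts the remaining BGCs per GCF. The rows, the identifiers in them, and the 5000 bp cutoff are hypothetical values chosen for illustration; only the column names come from the table above.

```python
from collections import Counter

# Hypothetical rows mimicking the "BGC Overview" tab (made-up values).
# Real rows could be loaded from the spreadsheet, for example with
# pandas.read_excel(path, sheet_name="BGC Overview").to_dict("records").
rows = [
    {"sample": "G1", "bgc_id": "G1_BGC-1", "gcf_id": "GCF_1", "dist_to_edge": 15000},
    {"sample": "G1", "bgc_id": "G1_BGC-2", "gcf_id": "GCF_2", "dist_to_edge": 120},
    {"sample": "G2", "bgc_id": "G2_BGC-1", "gcf_id": "GCF_1", "dist_to_edge": 8000},
]


def far_from_edge(rows, min_dist=5000):
    """Keep BGCs at least min_dist bp from the nearest scaffold edge.

    min_dist is an arbitrary illustrative cutoff, not an lsaBGC-Pan default.
    """
    return [r for r in rows if int(r["dist_to_edge"]) >= min_dist]


def bgcs_per_gcf(rows):
    """Count how many BGCs belong to each GCF."""
    return Counter(r["gcf_id"] for r in rows)


kept = far_from_edge(rows)
print(bgcs_per_gcf(kept))
```

BGCs close to a scaffold edge are more likely to be fragmented, so a filter like this can help when only complete-looking instances are of interest.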

Tab 3: "zol Results"

This tab provides gene-resolution information on the conservation, evolutionary trends, and functional annotation of orthogroups across GCFs. It uses zol to compute the information and replaces lsaBGC-PopGene.

Column Descriptions:

Column	Description
GCF ID	The GCF identifier.
Ortholog Group (OG) ID	The orthogroup identifier.
OG is Single Copy?	Is the orthogroup single copy?
Proportion of Total Gene Cluster Instances with OG	The proportion of total GCF instances which feature the orthogroup. Note, by default non-representative paralogous BGC instances are still filtered out (when two or more BGC instances are found in the same sample). See option `--zol-keep-multi-copy` in lsaBGC-Pan.
Proportion of Complete Gene Cluster Instances with OG	The proportion of complete GCF instances (not near contig edges) which feature the orthogroup. Again, paralogous instances are filtered out by default.
columns F ... onwards	Descriptions of columns F onwards can be found on this zol wiki page. Note, these data reflect comprehensive analysis - not just complete instances.

Tab 4: "lsaBGC-MIBiGMapper Results"

This tab shows mapping information of GCFs to reference/characterized BGCs in the ever-so-useful MIBiG database. By default lsaBGC-MIBiGMapper requires 5 proteins from the focal GCF mapping to proteins from a single reference MIBiG BGC at >=80% identity and >=70% coverage of the reference BGC.
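
The default thresholds described above can be illustrated with a small sketch. This is not the lsaBGC-MIBiGMapper implementation: the hit tuple layout, the function name, and the use of distinct orthogroups as a stand-in for distinct GCF proteins are all assumptions made for illustration.

```python
from collections import defaultdict


def mibig_matches(hits, min_proteins=5, min_identity=80.0, min_coverage=70.0):
    """Return MIBiG BGCs supported by enough qualifying GCF orthogroups.

    hits: iterable of (gcf_og_id, mibig_bgc_id, percent_identity, percent_coverage).
    This tuple layout is hypothetical, not an lsaBGC file format.
    """
    supporting = defaultdict(set)
    for og_id, mibig_bgc, identity, coverage in hits:
        # A hit only counts if it passes both the identity and coverage cutoffs.
        if identity >= min_identity and coverage >= min_coverage:
            supporting[mibig_bgc].add(og_id)
    # Require min_proteins distinct orthogroups hitting the same reference BGC.
    return {
        bgc: sorted(ogs)
        for bgc, ogs in supporting.items()
        if len(ogs) >= min_proteins
    }


# Made-up hits: BGC_A has five qualifying orthogroups, BGC_B only four.
hits = [(f"OG{i}", "BGC_A", 90.0, 85.0) for i in range(5)]
hits += [(f"OG{i}", "BGC_B", 90.0, 85.0) for i in range(4)]
hits.append(("OG9", "BGC_B", 60.0, 95.0))  # fails the identity cutoff

result = mibig_matches(hits)
print(result)
```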

If you would like to have these options accessible in lsaBGC-Pan - open up a GitHub ticket and just give us a nudge to do it!

Column Descriptions:

Column	Description
GCF ID	GCF identifier.
MIBiG BGC ID	The matching MIBiG reference BGC identifier.
GCF OG ID	The GCF orthogroup ID.
MIBiG Protein Matching	The matching protein in the MIBiG reference BGC.
MIBiG Compound(s)	The compounds associated with the MIBiG reference BGC.

Tab 5: "lsaBGC-Reconcile Results"

This tab depicts an overview of BGC associated orthogroups and metrics to help identify those that might have been horizontally transferred.

Column Descriptions:

Column	Description
orthogroup	Orthogroup identifier.
GCF count	The number of distinct GCFs the orthogroup is found within.
found in non-BGC context	Whether the orthogroup is found in a non-BGC context.
population count total	The number of distinct populations the orthogroup is found within.
population count in BGC context	The number of distinct populations in which the orthogroup is found within a BGC context specifically.
GCFs	The list of GCFs the orthogroup is found in.
conservation total	The proportion of genomes the orthogroup is found within.
conservation in BGC context	The proportion of genomes in which the orthogroup is found within a BGC context.
norm_max_bd	max_bd / mean_bd
mean_bd	The average phylogenetic branch distance ratio between leaves in the gene tree and species tree.
max_bd	The maximum phylogenetic branch distance ratio between leaves in the gene tree and species tree.
population/clade specific conservation metrics...	The proportion of a single population/clade's genomes the orthogroup has been found in.
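
The relationship between the three branch distance columns can be shown with a short worked example. The input ratios below are made up, and how lsaBGC-Reconcile derives the per-pair ratios is not covered here; only the formula norm_max_bd = max_bd / mean_bd comes from the table above.

```python
def branch_distance_summary(ratios):
    """Summarize gene-tree to species-tree branch distance ratios.

    ratios: one ratio per pair of leaves (hypothetical input for illustration).
    Returns (mean_bd, max_bd, norm_max_bd).
    """
    if not ratios:
        raise ValueError("at least one ratio is required")
    mean_bd = sum(ratios) / len(ratios)
    max_bd = max(ratios)
    norm_max_bd = max_bd / mean_bd
    return mean_bd, max_bd, norm_max_bd


# Three pairs agree with the species tree, one pair is strongly discordant.
mean_bd, max_bd, norm_max_bd = branch_distance_summary([1.0, 1.0, 1.0, 5.0])
print(mean_bd, max_bd, norm_max_bd)  # 2.0 5.0 2.5
```

A norm_max_bd well above 1 means a single pair of leaves deviates far more than the average pair does.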

Tab 6: "BGC OG by Sample Matrix"

A two-header matrix where the first row gives the genome/sample identifiers and the second row gives the populations they belong to. The columns are sorted primarily by population identifier. The rows after these two header rows give the copy count of each BGC-associated orthogroup across the samples/genomes.
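
Because of the two header rows, this tab needs slightly different handling than an ordinary table. The sketch below parses a tab-separated export of such a matrix with the standard library. The example text, the identifiers, and the assumption that the first cell of each header row is empty are all hypothetical.

```python
import csv
import io

# Hypothetical tab-separated export of the matrix (made-up identifiers and counts).
matrix_text = (
    "\tG1\tG2\tG3\n"
    "\tPopA\tPopA\tPopB\n"
    "OG_1\t1\t1\t0\n"
    "OG_2\t2\t0\t1\n"
)


def parse_two_header_matrix(text):
    """Return (sample -> population, orthogroup -> {sample: copy count})."""
    reader = csv.reader(io.StringIO(text), delimiter="\t")
    samples = next(reader)[1:]      # first header row: sample identifiers
    populations = next(reader)[1:]  # second header row: population labels
    sample_to_pop = dict(zip(samples, populations))
    counts = {
        row[0]: dict(zip(samples, map(int, row[1:])))
        for row in reader
        if row
    }
    return sample_to_pop, counts


def presence_by_population(sample_to_pop, counts, og):
    """Count samples per population carrying at least one copy of og."""
    tally = {}
    for sample, n in counts[og].items():
        pop = sample_to_pop[sample]
        tally[pop] = tally.get(pop, 0) + (1 if n > 0 else 0)
    return tally


sample_to_pop, counts = parse_two_header_matrix(matrix_text)
print(presence_by_population(sample_to_pop, counts, "OG_2"))
```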

Tab 7: "lsaBGC-Sociate Results"

This tab shows results from performing genome-wide association testing (GWAS) to identify orthogroups and alternate GCFs associated or de-associated with focal GCFs.

⚠️ GWAS generally benefits heavily from including more samples and from traits being interspersed phylogenetically. While we use the "lmm" model in pyseer to adjust p-values for phylogenetic dispersion of associated orthogroups/GCFs with focal GCFs and apply Bonferroni multiple testing correction, you can still end up with false positives when working with a small number of samples. Do not assess these results if you have fewer than 20 samples, and ideally include at least 100 samples if this module is your primary interest.
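
For readers unfamiliar with Bonferroni correction, here is a minimal sketch of the general technique. It is not taken from the lsaBGC-Sociate code: which p-values the pipeline corrects and how many tests it counts are not stated here, and the example p-values and the 0.05 cutoff are hypothetical.

```python
def bonferroni(pvalues):
    """Multiply each p-value by the number of tests, capping the result at 1."""
    m = len(pvalues)
    return [min(1.0, p * m) for p in pvalues]


# Made-up p-values for three tested features.
pvals = [0.001, 0.02, 0.5]
adjusted = bonferroni(pvals)

# Hypothetical significance cutoff of 0.05 applied after correction.
significant = [p_adj < 0.05 for p_adj in adjusted]
print(adjusted, significant)
```

With few samples, even corrected p-values rest on very little data, which is why the sample size guidance above still applies.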

Annotations only require an E-value < 1e-5; the best annotation for the consensus sequence of an orthogroup is then selected based on score (HMMER3) or bitscore (DIAMOND).

Column Descriptions:

Column	Description
focal GCF	The focal GCF for which we are looking for co-occurrence (de-)associations.
associated GCF/OG	The associated GCF or orthogroup identifier with the focal GCF.
allele frequency	The allele frequency of the associated GCF or orthogroup.
pvalue	The un-adjusted pvalue.
phylogenetically corrected pvalue	The phylogenetically corrected p-value based on the lmm model.
beta	The effect size/slope of the associated feature.
beta-std-err	"the standard error of the fit on beta" - pyseer documentation.
variant_h2	"the variance in phenotype [focal GCF presence] explained by the variant" - pyseer documentation.
notes	"Notes about the fit" from the pyseer run.
KO Annotation (E-value)	Best KEGG ortholog annotation(s) (the HMMER3 E-value associated with the best score)
PGAP Annotation (E-value)	Best PGAP annotation(s) (the HMMER3 E-value associated with the best score)
PaperBLAST Annotation (E-value)	Best PaperBLAST annotation(s) (the DIAMOND E-value associated with the best bitscore). To find associated papers, search for the consensus sequence or the ID listed here on the PaperBLAST webpage.
CARD Annotation (E-value)	Best CARD annotation(s) of antimicrobial resistance genes (the DIAMOND E-value associated with the best bitscore)
IS Finder (E-value)	Best ISFinder annotation(s) of IS elements / transposons (the DIAMOND E-value associated with the best bitscore)
MIBiG Annotation (E-value)	Best MIBiG annotation(s) for genes in characterized BGCs (the DIAMOND E-value associated with the best bitscore)
VOG Annotation (E-value)	Best VOG annotation(s) for viral/phage ortholog groups (the HMMER3 E-value associated with the best score)
Pfam Domains	Pfam domains with E-value < 1e-5 and meeting the "trusted" score thresholds.

Visual Results

All visuals from lsaBGC-Pan are stored alongside the R scripts used to create them. Users familiar with R can easily adjust these scripts, for example to change scaling (such as the size of PDFs), to make the figures better suited for publication. They would then simply re-run the script, e.g. `Rscript some_rscript.R`, to recreate the figure. For more details see the tutorial for analysis of the pan-BGC-ome of two Cutibacterium species.