8. explanation of final spreadsheet and visual reports
We will use results from the tutorial on lsaBGC-Pan analysis of the pan-BGC-ome of Streptomyces olivaceus. The Final_Results/ subdirectory for the analysis can be found in this Google Drive folder.
Spreadsheet location: Pan_Results/Final_Results/Consolidated_Spreadsheet.xlsx
The spreadsheet includes a link to this wiki page, which explains the contents of the other sheets.
This tab features an overview of BGCs across all genomes, along with the population each sample/genome is grouped into and the GCF each BGC is grouped into.
Column Descriptions:
Column | Description |
---|---|
sample | Sample/genome identifier. |
population | The population/clade that sample was assigned to. |
method | The BGC prediction method (either antiSMASH or GECCO). |
genome_path | The path to the full genome in GenBank format. |
bgc_id | The BGC identifier. |
bgc_path | The path to the BGC in GenBank format. |
gcf_id | The GCF identifier the BGC belongs to. |
scaffold | The scaffold identifier the BGC is found on. |
start | The start coordinate for the BGC. |
end | The end coordinate for the BGC. |
bgc_length | The length of the BGC in bp. |
dist_to_edge | The minimal distance of the BGC to the start/end of the scaffold/contig it is on. |
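As a minimal sketch of how the dist_to_edge column might be used, the snippet below splits overview rows into BGCs that sit well inside their scaffold and BGCs close to a scaffold edge (which may be fragmented). The function name, the made-up rows, and the 5000 bp cutoff are assumptions chosen for illustration only; they are not values used by lsaBGC-Pan.

```python
def split_by_edge_distance(rows, min_dist_to_edge=5000):
    """Partition overview rows into (far from edge, near edge) lists.

    `rows` are dicts keyed by the column names of the overview tab.
    The 5000 bp default is an arbitrary illustrative cutoff.
    """
    far_from_edge, near_edge = [], []
    for row in rows:
        if int(row["dist_to_edge"]) >= min_dist_to_edge:
            far_from_edge.append(row)
        else:
            near_edge.append(row)
    return far_from_edge, near_edge


# Made-up rows that only reuse the column names from the table above.
example_rows = [
    {"sample": "genome_A", "bgc_id": "bgc_1", "gcf_id": "GCF_1", "dist_to_edge": 120000},
    {"sample": "genome_B", "bgc_id": "bgc_2", "gcf_id": "GCF_1", "dist_to_edge": 350},
]

far_from_edge, near_edge = split_by_edge_distance(example_rows)
print([row["bgc_id"] for row in far_from_edge])
print([row["bgc_id"] for row in near_edge])
```

In practice the rows would be loaded from the spreadsheet with whichever reader you prefer; that step is left out here to keep the sketch free of third-party dependencies.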
This tab provides gene-resolution information on the conservation, evolutionary trends, and functional annotation of orthogroups across GCFs. The information is computed with zol, which replaces lsaBGC-PopGene for this purpose.
Column Descriptions:
Column | Description |
---|---|
GCF ID | The GCF identifier. |
Ortholog Group (OG) ID | The orthogroup identifier. |
OG is Single Copy? | Is the orthogroup single copy? |
Proportion of Total Gene Cluster Instances with OG | The proportion of total GCF instances which feature the orthogroup. Note, by default non-representative paralogous BGC instances are still filtered out (when two or more BGC instances are found in the same sample). See option --zol-keep-multi-copy in lsaBGC-Pan. |
Proportion of Complete Gene Cluster Instances with OG | The proportion of complete GCF instances (not near contig edges) which feature the orthogroup. Again, paralogous instances are filtered out by default. |
columns F ... onwards | Descriptions of columns F onwards can be found on this zol wiki page. Note, these data reflect comprehensive analysis - not just complete instances. |
This tab shows mapping information of GCFs to reference/characterized BGCs in the ever-so-useful MIBiG database. By default lsaBGC-MIBiGMapper requires 5 proteins from the focal GCF mapping to proteins from a single reference MIBiG BGC at >=80% identity and >=70% coverage of the reference BGC.
If you would like to have these options accessible in lsaBGC-Pan - open up a GitHub ticket and just give us a nudge to do it!
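The default matching rule described above can be sketched roughly as follows. This is not the lsaBGC-MIBiGMapper implementation: the hit dictionaries, their keys, and the function name are hypothetical, and the sketch assumes that identity and coverage are evaluated per protein hit and that the 5-protein requirement is a minimum.

```python
from collections import defaultdict


def mibig_matches(hits, min_proteins=5, min_identity=80.0, min_coverage=70.0):
    """Return reference BGC IDs supported by enough distinct GCF orthogroups.

    Each hit is a dict with hypothetical keys:
    'gcf_og', 'mibig_bgc', 'identity' (percent), 'coverage' (percent).
    """
    supporting_ogs = defaultdict(set)
    for hit in hits:
        # Keep only hits passing both thresholds.
        if hit["identity"] >= min_identity and hit["coverage"] >= min_coverage:
            supporting_ogs[hit["mibig_bgc"]].add(hit["gcf_og"])
    # A reference BGC matches when enough distinct orthogroups support it.
    return sorted(
        bgc for bgc, ogs in supporting_ogs.items() if len(ogs) >= min_proteins
    )


# Made-up hits: ref_A has five passing proteins, ref_B only four.
example_hits = [
    {"gcf_og": f"OG_{i}", "mibig_bgc": "ref_A", "identity": 90.0, "coverage": 85.0}
    for i in range(5)
]
example_hits += [
    {"gcf_og": f"OG_{i}", "mibig_bgc": "ref_B", "identity": 90.0, "coverage": 85.0}
    for i in range(4)
]
example_hits.append(
    {"gcf_og": "OG_9", "mibig_bgc": "ref_B", "identity": 60.0, "coverage": 85.0}
)

matched = mibig_matches(example_hits)
print(matched)
```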
Column Descriptions:
Column | Description |
---|---|
GCF ID | GCF identifier. |
MIBiG BGC ID | The matching MIBiG reference BGC identifier. |
GCF OG ID | The GCF orthogroup ID. |
MIBiG Protein Matching | The matching protein in the MIBiG reference BGC. |
MIBiG Compound(s) | The compounds associated with the MIBiG reference BGC. |
This tab depicts an overview of BGC associated orthogroups and metrics to help identify those that might have been horizontally transferred.
Column Descriptions:
Column | Description |
---|---|
orthogroup | Orthogroup identifier. |
GCF count | The number of distinct GCFs the orthogroup is found within. |
found in non-BGC context | Whether the orthogroup is found in a non-BGC context. |
population count total | The number of distinct populations the orthogroup is found within. |
population count in BGC context | The number of distinct populations the orthogroup is found within a BGC context specifically. |
GCFs | The list of GCFs the orthogroup is found in. |
conservation total | The proportion of genomes the orthogroup is found within. |
conservation in BGC context | The proportion of genomes in which the orthogroup is found within a BGC context. |
norm_max_bd | max_bd / mean_bd |
mean_bd | The average phylogenetic branch distance ratio between leaves in the gene tree and species tree. |
max_bd | The maximum phylogenetic branch distance ratio between leaves in the gene tree and species tree. |
population/clade specific conservation metrics... | The proportion of a single population/clade's genomes the orthogroup has been found in. |
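To make the relationship between mean_bd, max_bd, and norm_max_bd concrete, here is a small worked example. It assumes the branch distance ratios have already been computed (one value per compared pair of leaves); how lsaBGC-Pan derives those ratios is not covered here, and the input values are made up.

```python
def branch_distance_summary(ratios):
    """Summarize a list of gene-tree/species-tree branch distance ratios."""
    mean_bd = sum(ratios) / len(ratios)
    max_bd = max(ratios)
    return {
        "mean_bd": mean_bd,
        "max_bd": max_bd,
        # Definition from the table above: norm_max_bd = max_bd / mean_bd.
        "norm_max_bd": max_bd / mean_bd,
    }


# Made-up ratios: most pairs agree with the species tree, one stands out.
summary = branch_distance_summary([1.0, 1.0, 2.0, 4.0])
print(summary)
```

A norm_max_bd well above 1 indicates that at least one comparison deviates strongly from the average for that orthogroup.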
A two-header matrix file where the first row corresponds to the genome/sample identifiers and the second row indicates the populations they belong to. The columns are sorted primarily by population identifier. The rows of the matrix after these two header rows give the copy counts of BGC-associated orthogroups across the different samples/genomes.
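A matrix with this layout could be parsed along the following lines. The sketch assumes a tab-delimited file whose first column holds the orthogroup identifier (and is empty or a label in the two header rows); the delimiter, the function name, and the example content are assumptions, so check them against the actual file.

```python
import csv
import io


def read_og_matrix(handle, delimiter="\t"):
    """Parse a two-header orthogroup copy-count matrix.

    Returns (sample_to_population, counts) where counts maps
    orthogroup -> {sample: copy count}.
    """
    reader = csv.reader(handle, delimiter=delimiter)
    samples = next(reader)[1:]        # header row 1: sample identifiers
    populations = next(reader)[1:]    # header row 2: population per sample
    sample_to_population = dict(zip(samples, populations))
    counts = {}
    for row in reader:
        if not row:
            continue  # skip blank lines
        values = [int(float(value)) for value in row[1:]]
        counts[row[0]] = dict(zip(samples, values))
    return sample_to_population, counts


# Made-up content following the described layout.
example = (
    "\tgenome_A\tgenome_B\tgenome_C\n"
    "\tpop_1\tpop_1\tpop_2\n"
    "OG_1\t1\t0\t2\n"
    "OG_2\t0\t0\t1\n"
)
sample_to_population, counts = read_og_matrix(io.StringIO(example))
print(sample_to_population)
print(counts["OG_1"])
```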
This tab shows results from performing genome-wide association testing (GWAS) to identify orthogroups and alternate GCFs associated or de-associated with focal GCFs.
⚠️ Doing GWAS generally benefits heavily from the inclusion of more samples and from traits being interspersed phylogenetically. While we use the "lmm" model in pyseer to adjust p-values for phylogenetic dispersion of associated orthogroups/GCFs with focal GCFs and apply Bonferroni multiple testing correction, you can still end up with false positives if working with a small number of samples. Do not assess these results if you have fewer than 20 samples, and ideally incorporate at least 100 samples if this module is your primary interest.
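For readers unfamiliar with Bonferroni correction, the idea is to multiply each p-value by the number of tests performed and cap the result at 1. The snippet below is a generic illustration with made-up p-values; how lsaBGC-Pan counts the number of tests is not specified here, so the number of p-values passed in is used as a stand-in.

```python
def bonferroni(pvalues):
    """Return Bonferroni-adjusted p-values (p * number of tests, capped at 1)."""
    n_tests = len(pvalues)
    return [min(1.0, p * n_tests) for p in pvalues]


# Made-up p-values for four tested features.
raw = [0.01, 0.125, 0.25, 0.5]
adjusted = bonferroni(raw)
significant = [p <= 0.05 for p in adjusted]
print(adjusted)
print(significant)
```

With only a handful of samples, few features can reach a small enough raw p-value to survive this correction, which is one reason larger sample sets are recommended.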
Annotations simply require an E-value < 1e-5 but the best annotation for the consensus sequence of an orthogroup is selected based on score or bitscore.
Column Descriptions:
Column | Description |
---|---|
focal GCF | The focal GCF that we are looking for co-occurrence (de-)associations with. |
associated GCF/OG | The associated GCF or orthogroup identifier with the focal GCF. |
allele frequency | The allele frequency of the associated GCF or orthogroup. |
pvalue | The un-adjusted pvalue. |
phylogenetically corrected pvalue | The phylogenetically corrected p-value based on the lmm model. |
beta | The effect size/slope of the associated feature. |
beta-std-err | "the standard error of the fit on beta" - pyseer documentation. |
variant_h2 | "the variance in phenotype [focal GCF presence] explained by the variant" - pyseer documentation. |
notes | "Notes about the fit" from the pyseer run. |
KO Annotation (E-value) | Best KEGG ortholog annotation(s) (the HMMER3 E-value associated with the best score) |
PGAP Annotation (E-value) | Best PGAP annotation(s) (the HMMER3 E-value associated with the best score) |
PaperBLAST Annotation (E-value) | Best PaperBLAST annotation(s) (the DIAMOND E-value associated with the best bitscore). To find associated papers, search the consensus sequence or the ID listed here on the PaperBLAST webpage. |
CARD Annotation (E-value) | Best CARD annotation(s) of antimicrobial resistance genes (the DIAMOND E-value associated with the best bitscore) |
IS Finder (E-value) | Best ISFinder annotation(s) of IS elements / transposons (the DIAMOND E-value associated with the best bitscore) |
MIBiG Annotation (E-value) | Best MIBiG annotation(s) for genes in characterized BGCs (the DIAMOND E-value associated with the best bitscore) |
VOG Annotation (E-value) | Best VOG annotation(s) for viral/phage ortholog groups (the HMMER3 E-value associated with the best score) |
Pfam Domains | Pfam domains with E-value < 1e-5 and meeting the "trusted" score thresholds. |
All visuals from lsaBGC-Pan have the R scripts used to create them stored near the plots. Users familiar with R can easily adjust these scripts to redo scaling (e.g. size of PDFs) and make the figures better suited for publication. Users would then simply re-run them, e.g.
Rscript some_rscript.R
to recreate the figures. For more details see the tutorial for analysis of the pan-BGC-ome of two Cutibacterium species.
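If many scripts have been edited, re-running them one at a time can be tedious. The helper below is a hypothetical convenience sketch (it is not part of lsaBGC-Pan): it collects every .R file under a directory and either prints or executes the corresponding Rscript command. It assumes Rscript is on the PATH and that the scripts can be run from the current working directory.

```python
import subprocess
from pathlib import Path


def rscript_commands(visualizations_dir):
    """Build one ['Rscript', path] command per .R file found recursively."""
    scripts = sorted(Path(visualizations_dir).rglob("*.R"))
    return [["Rscript", str(script)] for script in scripts]


def rerun_rscripts(visualizations_dir, dry_run=True):
    """Print the commands (dry run) or execute them one after another."""
    for command in rscript_commands(visualizations_dir):
        if dry_run:
            print(" ".join(command))
        else:
            # check=True stops at the first script that fails.
            subprocess.run(command, check=True)
```

For example, `rerun_rscripts("Pan_Results/Final_Results/Visualizations/")` would list the commands without running anything; pass `dry_run=False` to execute them.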
GSeeF produces a plot showing the presence of GCFs across the species phylogeny.
- Script location: Pan_Results/Final_Results/Visualizations/GSeeF_Results/gseef_rscript.R
- Plot location: Pan_Results/Final_Results/Visualizations/GSeeF_Results/Final_Results/Phylogenetic_Heatmap.png
- Legend location: Pan_Results/Final_Results/Visualizations/GSeeF_Results/Final_Results/Annotation_Legend.png
Example from Streptomyces olivaceus tutorial:
These plots are made for each individual GCF. Each shows a schematic of the BGCs belonging to a GCF across a species tree. Additional information on how lsaBGC-See works can be found on its original lsaBGC wiki page.
- Script location: Pan_Results/Final_Results/Visualizations/lsaBGC_See_Results/GCF_X/plot_with_species_phylo.R
- Plot location: Pan_Results/Final_Results/Visualizations/lsaBGC_See_Results/GCF_X/BGC_Visualization.species_phylogeny.pdf
Example from Streptomyces olivaceus tutorial:
lsaBGC-ComprehenSeeIve Plots - genome-wide orthogroup presence/absence for BGCs across a species phylogeny
These plots are made for each individual GCF. Each shows a heatmap of the presence of orthogroups associated with the GCF across the entire genome of each sample. Additional information on how lsaBGC-ComprehenSeeIve works can be found on its original lsaBGC wiki page.
- Script location: Pan_Results/Final_Results/Visualizations/lsaBGC_ComprehenSeeIve_Results/GCF_X/plot_with_species_phylo.R
- Plot location: Pan_Results/Final_Results/Visualizations/lsaBGC_ComprehenSeeIve_Results/GCF_X/BGC_Visualization.species_phylogeny.pdf
Example from Streptomyces olivaceus tutorial:
These plots are made for each individual orthogroup that is found within a BGC context. They show a gene phylogeny of the orthogroup constructed using MUSCLE super5 alignment and FastTree2 alongside tracks indicating the context the gene instance is found within (which GCF or whether it is not in a GCF context) and the population the genome with the gene belongs to. The "reconcile" part of the name is because the overlay of population information on the gene tree allows users to see indications of horizontal transfer.
- Script locations: Pan_Results/Final_Results/Visualizations/lsaBGC_Reconcile_Results/BGC_OG_PhyloViz_Scripts/
- Plot locations: Pan_Results/Final_Results/Visualizations/lsaBGC_Reconcile_Results/BGC_OG_Phylogenetic_Visualizations/
- Population legend location: Pan_Results/Final_Results/Visualizations/population_coloring.pdf
- Context legend location: Pan_Results/Final_Results/Visualizations/gcf_coloring.pdf
Example from Streptomyces olivaceus tutorial:
Note that in the GCF track (the first column), gray specifically means non-BGC context, whereas in the population track gray is used as just another clade color.
These plots are made for each GCF. cgc is a program in the zol suite (a dependency of lsaBGC-Pan) which visualizes zol results. More information on cgc can be found on this zol wiki page. Note, it is probably easier to rerun cgc rather than update the Rscript associated with a GCF.
- Script locations: Pan_Results/Final_Results/Visualizations/cgc_Results/GCF_X/cgc_script.R
- Plot locations: Pan_Results/Final_Results/Visualizations/cgc_Results/GCF_X/cgc_plot.png
Example from Streptomyces olivaceus tutorial:
These plots are made for each GCF which has associated/de-associated features (orthogroups or other GCFs) across the pangenome. The figure is a phylogenetic heatmap where the first track (in black) is the presence of the focal GCF. The following tracks, ordered from lowest phylogenetically corrected p-value (left) to highest (right), are the associated features (orthogroups or other GCFs; red = negative effect size, blue = positive effect size).
- Script locations: Pan_Results/Final_Results/Visualizations/lsaBGC_Sociate_Visual_Results/Rscripts/
- Plot locations: Pan_Results/Final_Results/Visualizations/lsaBGC_Sociate_Visual_Results/Plots/
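The track ordering and coloring described for these plots can be mimicked on rows of the association tab, for instance to preview which features will appear leftmost. This is only an illustrative sketch with made-up rows; the function name is hypothetical, and the handling of an effect size of exactly zero is an assumption since the description does not cover that case.

```python
def order_tracks(rows):
    """Order features by corrected p-value and assign the plot's track color."""
    ordered = sorted(rows, key=lambda row: row["phylogenetically corrected pvalue"])
    tracks = []
    for row in ordered:
        if row["beta"] > 0:
            color = "blue"   # positive effect size
        elif row["beta"] < 0:
            color = "red"    # negative effect size
        else:
            color = None     # zero effect size: not covered by the description
        tracks.append((row["associated GCF/OG"], color))
    return tracks


# Made-up rows that reuse the column names of the association tab.
example_rows = [
    {"associated GCF/OG": "OG_12", "phylogenetically corrected pvalue": 0.03, "beta": -1.2},
    {"associated GCF/OG": "GCF_4", "phylogenetically corrected pvalue": 0.001, "beta": 0.8},
]
tracks = order_tracks(example_rows)
print(tracks)
```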
Example from Streptomyces olivaceus tutorial: