Skip to content

Commit

Permalink
Add documentation on extending PopGLen
Browse files Browse the repository at this point in the history
Adds a small example case of how to replace a tool in PopGLen with another, which also is how one can extend and add rules to continue the workflow.
  • Loading branch information
zjnolen committed Feb 11, 2025
1 parent bb5c8ae commit dab1062
Show file tree
Hide file tree
Showing 4 changed files with 122 additions and 3 deletions.
117 changes: 117 additions & 0 deletions docs/extending-workflows.md
Original file line number Diff line number Diff line change
@@ -0,0 +1,117 @@
# Extending or modifying analyses in PopGLen

Using PopGLen as a Snakemake module makes it easy to incorporate extensions to
it in the Snakefile, enabling cusomization of the workflow. This means that you
can extend upon the workflow with your own rules that use PopGLen outputs as
their input, keeping the project as one contained workflow that can be easily
reproduced. This functionality could also allow for replacement of some of
PopGLen's analyses with alternative options, such as estimating the covariance
matrix for PCA with single read sampling in ANGSD rather than the likelihood
based approach of PCAngsd. Most of this is covered in Snakemake's documentation
on [modularization](https://snakemake.readthedocs.io/en/stable/snakefiles/modularization.html),
and all methods described there should be compatible with PopGLen.

As an example, we will replace the PCA produced using PCAngsd in PopGLen
with one produced using a covariance matrix from single read sampling using
ANGSD. This should only really add one step, estimating the covariance matrix
in ANGSD instead of PCAngsd. We start with a bare bones Snakefile that imports
PopGLen as a module:

```py title="workflow/Snakefile" linenums="1"from snakemake.utils import min_version


min_version("6.10.0")


configfile: "config/config.yaml"


# declare https://github.com/zjnolen/PopGLen as a module
module popglen:
snakefile:
github("zjnolen/PopGLen", path="workflow/Snakefile", tag="v0.4.1")
config:
config


# use all rules from https://github.com/zjnolen/PopGLen
use rule * from popglen
```

As is, it will run PopGLen as described in the documentation. To utilize a
covariance matrix estimated by something other than PCAngsd when generating the
PCA, we need to do two things: (1) add a rule to generate the covariance matrix
with the other tool and (2) overwrite the input for the rule that runs the
principal component analysis with this new covariance matrix.

We will do this by adding the code block below to the end of the Snakefile. It
first defines a rule, `angsd_doCov`, which generates the covariance matrix. This
is very similar to the `angsd_doIBS` rule already in PopGLen, so much of that
rule can be repurposed into this one. After we add the rule, we then simply
tell Snakemake that when it imports the rule `plot_pca` from PopGLen, it
should use this covariance matrix made by `angsd_doCov` instead of the one from
PCAngsd defined by default.

```py title="workflow/Snakefile linenums="21"
# Add a rule which estimates the covariance matrix, but from single read
# sampling in ANGSD rather than genotype likelihoods in PCAngsd


rule angsd_doCov:
"""
Estimates covariance matrix for all individuals.
"""
input:
bamlist="results/datasets/{dataset}/bamlists/{dataset}.{ref}_{population}{dp}.bamlist",
bams=popglen.get_bamlist_bams,
bais=popglen.get_bamlist_bais,
sites="results/datasets/{dataset}/filters/snps/{dataset}.{ref}_{population}{dp}_{sites}-filts_snps.sites",
idx="results/datasets/{dataset}/filters/snps/{dataset}.{ref}_{population}{dp}_{sites}-filts_snps.sites.idx",
output:
ibs=temp("results/datasets/{dataset}/analyses/covar/{dataset}.{ref}_{population}{dp}_{sites}-filts.ibs.gz"),
covmat="results/datasets/{dataset}/analyses/covar/{dataset}.{ref}_{population}{dp}_{sites}-filts.covMat",
arg="results/datasets/{dataset}/analyses/covar/{dataset}.{ref}_{population}{dp}_{sites}-filts.arg",
container:
popglen.angsd_container
params:
mapQ=config["mapQ"],
baseQ=config["baseQ"],
trans=popglen.get_trans,
out=lambda w, output: os.path.splitext(output.arg)[0],
threads: 2
resources:
runtime="1d",
shell:
"""
angsd -doIBS 1 -doCov 1 -bam {input.bamlist} -nThreads {threads} \
-doCounts 1 -minMapQ {params.mapQ} -minQ {params.baseQ} \
-sites {input.sites} -rmTrans {params.trans} -doMajorMinor 3 \
-out {params.out}
"""


# Update plot_pca rule to use the covariance matrix from angsd_doCov rather than
# from PCAngsd


use rule plot_pca from popglen as plot_pca with:
input:
"results/datasets/{dataset}/analyses/covar/{dataset}.{ref}_{population}{dp}_{sites}-filts.covMat",
"results/datasets/{dataset}/poplists/{dataset}_{population}{dp}.indiv.list",

```

For a substitution like this, this is the only changes we need to make.
You can try running this Snakefile using the [tutorial](tutorial.md) data to see
it in action. These two methods actually produce quite similar results with the
tutorial dataset:

| PCAngsd covariance matrix | ANGSD doCov covariance matrix |
|---------------------------|-------------------------------|
|![pcangsd pca](images/pca-norel.png)|![docov pca](images/pca-docov.png)|

Extending workflows like this allows you to use PopGLen as a base for your
project, adding additional analyses as you need them. If you have an analysis
(or alternative tool for an analysis) that you feel suits PopGLen (i.e. is
suited for low-coverage population genomics), but is not yet implemented, please
feel free to open an issue and request its addition.
Binary file added docs/images/pca-docov.png
Loading
Sorry, something went wrong. Reload?
Sorry, we cannot display this file.
Sorry, this file is invalid so it cannot be displayed.
7 changes: 4 additions & 3 deletions docs/index.md
Original file line number Diff line number Diff line change
Expand Up @@ -24,7 +24,8 @@ detailed [configuration](config.md) summary, a [tutorial](tutorial.md) to walk
users through running all analyses in the pipeline, as well as more specialized
documentation describing how to
[configure the pipeline to run on a HPC cluster](cluster-execution.md), how to
[modify the requested resources for jobs](resources.md), and the
[modify the requested resources for jobs](resources.md), how to
[extend the workflow with additional rules](extending-workflow.md), and the
[file paths for all major output files](outfile-summ.md) for reference.

??? question "Questions? Feature requests? Just ask!"
Expand Down Expand Up @@ -174,7 +175,7 @@ low-depth sequencing data. Molecular Ecology, 28(1), 35–48.

ANGSD: Korneliussen, T. S., Albrechtsen, A., & Nielsen, R. (2014). ANGSD:
Analysis of Next Generation Sequencing Data. BMC Bioinformatics, 15(1), Article
1. <https://doi.org/10.1186/s12859-014-0356-4> (Depth calculation, genotype
\1. <https://doi.org/10.1186/s12859-014-0356-4> (Depth calculation, genotype
likelihood estimation, SNP calling, allele frequencies, SFS, diverity and
neutrality statistics, heterozygosity, $F_{ST}$)

Expand All @@ -201,4 +202,4 @@ per-base mappability scores)
BEDTools: Quinlan, A. R. & Hall, I. M. (2010). BEDTools: a flexible suite of
utilities for comparing genomic features. Bioinformatics, 26(6), 841-842.
<https://doi.org/10.1093/bioinformatics/btq033> (Transforming and intersecting
filter files)
filter files)
1 change: 1 addition & 0 deletions mkdocs.yml
Original file line number Diff line number Diff line change
Expand Up @@ -7,6 +7,7 @@ nav:
- Tutorial: tutorial.md
- Cluster execution: cluster-execution.md
- Setting job resources: resources.md
- Extending PopGLen: extending-workflows.md
- Output file path overview: outfile-summ.md
theme:
name: material
Expand Down

0 comments on commit dab1062

Please sign in to comment.