From dbe1e9f32791df4e4a5f6439112484b906db3862 Mon Sep 17 00:00:00 2001
From: Docs Deploy
Date: Mon, 24 Jun 2024 14:12:28 +0000
Subject: [PATCH] Deployed 0edb4d7 to develop with MkDocs 1.6.0 and mike 2.1.1

---
 develop/index.html               | 4 ++--
 develop/search/search_index.json | 2 +-
 2 files changed, 3 insertions(+), 3 deletions(-)

diff --git a/develop/index.html b/develop/index.html
index afbbfa1..151db27 100644
--- a/develop/index.html
+++ b/develop/index.html
@@ -491,10 +491,10 @@

 Pipeline Summary

   • Relatedness (NGSRelate, IBSrelate)
   • Identity by state matrix (ANGSD)
   • Site frequency spectrum (ANGSD)
-  • Watterson's estimator (θ~w~), Nucleotide diversity (π), Tajima's D (ANGSD)
+  • Watterson's estimator (θw), Nucleotide diversity (π), Tajima's D (ANGSD)
   • Individual heterozygosity with bootstrapped confidence intervals (ANGSD)
-  • Pairwise F~ST~ (ANGSD)
+  • Pairwise FST (ANGSD)
   • These all can be enabled and processed independently, and the pipeline will generate genotype likelihood input files using ANGSD and share them across

diff --git a/develop/search/search_index.json b/develop/search/search_index.json
index b62ef6e..4a4a00f 100644
--- a/develop/search/search_index.json
+++ b/develop/search/search_index.json
@@ -1 +1 @@
-{"config":{"lang":["en"],"separator":"[\\s\\-]+","pipeline":["stopWordFilter"]},"docs":[{"location":"","title":"Welcome to the documentation for PopGLen","text":"

    PopGLen is aimed at enabling users to run population genomic analyses on their data within a genotype likelihood framework in an automated and reproducible fashion. Genotype likelihood based analyses avoid genotype calling, instead performing analyses on the likelihoods of each possible genotype, incorporating uncertainty about the true genotype into the analysis. This makes them especially suited for datasets with low coverage or that vary in coverage.

    This pipeline was developed in large part to make my own analyses easier. I work with many species within the same project, each mapped to its own reference. I developed this pipeline so that I could ensure standardized processing for datasets within the same project and automate the many steps that go into performing these analyses. As it needed to fit many datasets, it is generalizable and customizable through a single configuration file and follows a workflow commonly used by ANGSD users, so it is available for others to use, should it suit their needs.

    Questions? Feature requests? Just ask!

    I'm glad to answer questions on the GitHub Issues page for the project, as well as take suggestions for features or improvements!

    "},{"location":"#pipeline-summary","title":"Pipeline Summary","text":"

    The pipeline aims to follow the general path many users take when working with ANGSD and other GL-based tools. Raw sequencing data is processed into BAM files (with optional configuration for historical, degraded samples), or BAM files are provided directly. From there, several quality control reports are generated to help determine which samples should be included. The pipeline then builds a 'sites' file to perform analyses with. This sites file is made from several user-configured filters, which are intersected to output a single list of sites that analyses are performed on across all samples. This can also be extended with user-provided filter lists (e.g. to limit to neutral sites, genic regions, etc.).

    After the samples have been processed, the quality control reports generated, and the sites file produced, the pipeline can continue to the analyses.

    These all can be enabled and processed independently, and the pipeline will generate genotype likelihood input files using ANGSD and share them across analyses as appropriate, deleting temporary intermediate files when they are no longer needed.

    At any point after a portion of the pipeline has completed successfully, a report can be generated that contains tables and figures summarizing the results for the currently enabled parts of the pipeline.

    If you're interested in using this, head to the Getting Started page!

    "},{"location":"config/","title":"Configuring the workflow","text":"

    Running the workflow requires configuring three files: config.yaml, samples.tsv, and units.tsv. config.yaml is used to configure the analyses, samples.tsv categorizes your samples into groups, and units.tsv connects sample names to their input data files. The workflow will use config/config.yaml automatically, but you can name this whatever you want (good for separating datasets in the same working directory) and point to it when running snakemake with --configfile <path>.
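
    For example, a run pointed at a renamed configuration file might look like the following sketch (the file name and core count are placeholders; substitute your own):

        snakemake --configfile config/dataset1_config.yaml --cores 4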

    "},{"location":"config/#samplestsv","title":"samples.tsv","text":"

    This file contains your sample list and has four tab-separated columns:

    sample  population  time        depth
    hist1   Hjelmseryd  historical  low
    hist2   Hjelmseryd  historical  low
    hist3   Hjelmseryd  historical  low
    mod1    Gotafors    modern      high
    mod2    Gotafors    modern      high
    mod3    Gotafors    modern      high
    "},{"location":"config/#unitstsv","title":"units.tsv","text":"

    This file connects your samples to input files and can have up to eight tab-separated columns:

    sample        unit          lib           platform  fq1                                 fq2                                 bam                sra
    hist1         BHVN22DSX2.2  hist1         ILLUMINA  data/fastq/hist1.r1.fastq.gz        data/fastq/hist1.r2.fastq.gz
    hist1         BHVN22DSX2.3  hist1         ILLUMINA  data/fastq/hist1.unit2.r1.fastq.gz  data/fastq/hist1.unit2.r2.fastq.gz
    hist2         BHVN22DSX2.2  hist2         ILLUMINA  data/fastq/hist2.r1.fastq.gz        data/fastq/hist2.r2.fastq.gz
    hist3         BHVN22DSX2.2  hist2         ILLUMINA  data/fastq/hist3.r1.fastq.gz        data/fastq/hist3.r2.fastq.gz
    mod1          AHW5NGDSX2.3  mod1          ILLUMINA  data/fastq/mod1.r1.fastq.gz         data/fastq/mod1.r2.fastq.gz
    mod2          AHW5NGDSX2.3  mod2          ILLUMINA                                                                          data/bam/mod2.bam
    mod3          AHW5NGDSX2.3  mod3          ILLUMINA  data/fastq/mod3.r1.fastq.gz         data/fastq/mod3.r2.fastq.gz
    SAMN13218652  SRR10398077   SAMN13218652  ILLUMINA                                                                                             SRR10398077

    Mixing samples with different starting points

    It is possible to have different samples start from different inputs (e.g. some from BAM, others from FASTQ, others from SRA). It is best to provide only fq1+fq2, bam, or sra for each sample, so it is clear where each sample starts. If multiple are provided for the same sample, the bam entry will override fastq or SRA entries, and the fastq entries will override SRA entries. Note that this means it is not currently possible to have multiple starting points for the same sample (i.e. FASTQ reads that would be processed and then merged into an existing BAM).

    "},{"location":"config/#configuration-file","title":"Configuration file","text":"

    config.yaml contains the configuration for the workflow; this is where you specify which analyses, filters, and options you want. Below, I describe the configuration options. The config.yaml in this repository serves as a template, but it includes some 'default' parameters that may be good starting points for some users. If --configfile is not specified in the snakemake command, the workflow will default to config/config.yaml.

    "},{"location":"config/#configuration-options","title":"Configuration options","text":""},{"location":"config/#dataset-configuration","title":"Dataset Configuration","text":"

    Required configuration of the 'dataset'.

    Here, dataset means a set of samples and configurations that the workflow will be run with. Each dataset should have its own samples.tsv and config.yaml, but the same units.tsv can be used for multiple datasets if you prefer. Essentially, the dataset identifier keeps your outputs organized into projects, so that the same BAM files can be used in multiple datasets without having to be remade.

    So, say you have dataset1_samples.tsv and dataset2_samples.tsv, with corresponding dataset1_config.yaml and dataset2_config.yaml. The sample files contain different samples, though some are shared between the datasets. The workflow for dataset1 can be run, and then dataset2 can be run. When dataset2 runs, it maps any new samples, but won't re-map samples already processed in dataset1. Each will perform downstream analyses independently with its own sample set and configuration file, storing the results in dataset-specific folders.

    "},{"location":"config/#reference-configuration","title":"Reference Configuration","text":"

    Required configuration of the reference.

    Reference genomes should be uncompressed, and contig names should be clear and concise. Currently, there are some issues parsing contig names with underscores, so please change these in your reference before running the pipeline. Alphanumeric characters, as well as . in contig names, have been tested to work so far; other symbols have not been tested.

    Support for bgzipped genomes may be added eventually; I just need to check that it works with all underlying tools. Currently, it definitely will not work, as calculating chunks is hard-coded to work on an uncompressed genome.
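
    If your assembly is distributed compressed, you can decompress it and do a quick check for underscores in contig names before running the pipeline. A minimal sketch, assuming a gzipped (or bgzipped) FASTA named reference.fa.gz:

        gzip -d reference.fa.gz    # bgzipped files are gzip-compatible, so this works for both
        grep "^>" reference.fa | grep "_" \
          && echo "contig names contain underscores - consider renaming them" \
          || echo "contig names look OK"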

    "},{"location":"config/#sample-set-configuration","title":"Sample Set Configuration","text":""},{"location":"config/#analysis-selection","title":"Analysis Selection","text":"

    Here, you will define which analyses you will perform. It is useful to start with only a few, and add more in subsequent workflow runs, just to ensure you catch errors before you use compute time running all analyses. Most are set with (true/false) or a value, described below. Modifications to the settings for each analysis are set in the next section.

    "},{"location":"config/#subsampling-section","title":"Subsampling Section","text":"

    As this workflow is aimed at low-coverage samples, it is likely there will be considerable variance in sample depth. For this reason, it may be good to subsample all your samples to a similar depth to examine whether variation in depth is influencing results. To do this, set an integer value here that all samples will be subsampled down to, and run specific analyses on the subsampled data. This subsampling can be done in reference to the unfiltered sequencing depth, the mapping- and base-quality-filtered sequencing depth, or the filtered sites sequencing depth. The latter is recommended, as it ensures that sequencing depth is uniform at the analysis stage, since it is these filtered sites that analyses are performed on.
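
    Conceptually, this is similar to the following samtools sketch, which downsamples a roughly 10x BAM to roughly 4x by keeping 40% of reads. This is only an illustration of the idea; the pipeline performs subsampling internally, and the exact commands, fractions, and file names it uses may differ:

        samtools view -b -s 0.4 sample.bam > sample.subsampled.bam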

    "},{"location":"config/#filter-sets","title":"Filter Sets","text":"

    By default, this workflow will perform all analyses requested in the analysis selection section on all sites that pass the configured filters. These outputs will contain allsites-filts in the filename and in the report. However, it is often useful to perform an analysis on different subsets of sites, for instance to compare results for genic vs. intergenic regions, neutral sites, exons vs. introns, etc. Here, users can set an arbitrary number of additional filters using BED files. For each BED file supplied, its contents will be intersected with the sites passing the configured filters, and all analyses will additionally be performed using those sites.

    For instance, given a BED file containing putatively neutral sites, one could set the following:

    filter_beds:\n  neutral-sites: \"resources/neutral_sites.bed\"\n

    In this case, for each requested analysis, in addition to the allsites-filts output, a neutral-sites-filts output (named after the key assigned to the BED file in config.yaml) will also be generated, containing the results for sites within the specified BED file that also passed any set filters.

    More than one BED file can be set, up to an arbitrary number:

    filter_beds:\n  neutral: \"resources/neutral_sites.bed\"\n  intergenic: \"resources/intergenic_sites.bed\"\n  introns: \"resources/introns.bed\"\n

    It may also sometimes be desirable to skip analyses on allsites-filts, say, if you only want to generate diversity estimates or an SFS for a set of neutral sites you supply.

    To skip running any analyses for allsites-filts and only perform them for the BED files you supply, you can set only_filter_beds: true in the config file. This may also be useful in the event that you have a set of already filtered sites and want to run the workflow on those, ignoring any of the built-in filter options by setting them to false.
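
    Putting the two options together, a configuration that skips allsites-filts and only analyzes a supplied set of neutral sites might look like this sketch (the key name and BED path are just examples, as above):

        filter_beds:
          neutral: "resources/neutral_sites.bed"
        only_filter_beds: true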

    "},{"location":"config/#software-configuration","title":"Software Configuration","text":"

    These are software specific settings that can be user configured in the workflow. If you are missing a configurable setting you need, open up an issue or a pull request and I'll gladly put it in.

    "},{"location":"getting-started/","title":"Getting Started","text":""},{"location":"getting-started/#tutorial","title":"Tutorial","text":"

    Note

    A tutorial is in progress, but not yet available. The pipeline can still be used by following the rest of the guide.

    Once available, the tutorial will use a small(ish) dataset with which biologically meaningful results can be produced. This can help in getting an understanding of a good workflow for using the different modules. You can also follow along with your own data and simply skip the analyses you don't want. If you prefer to just jump in instead, the sections below describe how to quickly get a new project up and running.

    "},{"location":"getting-started/#requirements","title":"Requirements","text":"

    This pipeline can be run on Linux systems with Conda and Apptainer/Singularity installed. All other dependencies will be handled by the workflow, and thus sufficient storage space is needed for these installations (~10GB, but this needs verification). It can be run on a local workstation with sufficient resources and storage space (dataset dependent), but it is aimed at execution on high-performance computing systems with job queuing systems.
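
    A quick way to confirm a host meets these requirements before deploying:

        conda --version
        apptainer --version || singularity --version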

    Data-wise, you'll need a reference genome (uncompressed) and some sequencing data for your samples. The latter can be either raw fastq files, bam alignments to the reference, or accession numbers for already published fastq files.

    "},{"location":"getting-started/#deploying-the-workflow","title":"Deploying the workflow","text":"

    The pipeline can be deployed in two ways: (1) using Snakedeploy, which deploys the pipeline as a module (recommended); or (2) cloning the repository at the version/branch you prefer (recommended if you will change any workflow code).

    Both methods require a Snakemake environment to run the pipeline in.

    "},{"location":"getting-started/#preparing-the-environment","title":"Preparing the environment","text":"

    First, create an environment for Snakemake, including Snakedeploy if you intend to deploy that way:

    mamba create -c conda-forge -c bioconda --name snakemake snakemake snakedeploy\n

    If you already have a Snakemake environment, you can use that, so long as it has snakemake (not just snakemake-minimal) installed. Snakemake versions >=7.25 are likely to work, but most testing is on 7.32.4. The pipeline is compatible with Snakemake v8, but you may need to install additional plugins due to the new executor plugin system. See the Snakemake docs for which executor plugin your cluster system requires.
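
    For example, on a SLURM cluster running Snakemake v8, something like the following would likely be needed (this assumes SLURM; other schedulers have their own executor plugins):

        mamba install -n snakemake -c conda-forge -c bioconda snakemake-executor-plugin-slurm
        snakemake --executor slurm --jobs 100   # plus your usual options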

    Activate the Snakemake environment:

    conda activate snakemake\n
    "},{"location":"getting-started/#deploying-with-snakedeploy","title":"Deploying with Snakedeploy","text":"

    Make your working directory:

    mkdir -p /path/to/work-dir\ncd /path/to/work-dir\n

    And deploy the workflow, using the tag for the version you want to deploy:

    snakedeploy deploy-workflow https://github.com/zjnolen/PopGLen . --tag v0.2.0\n

    This will generate a simple Snakefile in a workflow folder that loads the pipeline as a module. It will also download the template config.yaml, samples.tsv, and units.tsv into the config folder.
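
    The generated Snakefile typically looks something like the sketch below; the exact contents may differ between Snakedeploy versions, and the module name is arbitrary:

        configfile: "config/config.yaml"

        # load PopGLen as a module at the chosen tag
        module PopGLen:
            snakefile:
                github("zjnolen/PopGLen", path="workflow/Snakefile", tag="v0.2.0")
            config:
                config

        # use all rules from the module
        use rule * from PopGLen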

    "},{"location":"getting-started/#cloning-from-github","title":"Cloning from GitHub","text":"

    Go to the folder you would like your working directory to be created in and clone the GitHub repo:

    git clone https://github.com/zjnolen/PopGLen.git\n

    If you would like, you can change the name of the directory:

    mv PopGLen work-dir-name\n

    Move into the working directory (PopGLen or work-dir-name if you changed it) and checkout the version you would like to use:

    git checkout v0.2.0\n

    This can also be used to checkout specific branches or commits.
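
    For example, to use the development branch or a specific commit instead of a release tag (the refs shown are illustrative):

        git checkout develop    # a branch
        git checkout 0edb4d7    # a specific commit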

    "},{"location":"getting-started/#configuring-the-workflow","title":"Configuring the workflow","text":"

    Now you are ready to configure the workflow; see the documentation for that here.

    "},{"location":"high-memory-rules/","title":"Rules using large amounts of RAM","text":"

    NOTE: This is a work-in-progress list; I am still figuring out which rules need elevated memory allocations and how much they require.

    The biggest challenge with using this pipeline with other datasets is ensuring RAM is properly allocated. Many rules require very little RAM, so the default per-thread allocations on your cluster will likely do fine. However, some rules require considerably more RAM. These are:
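
    Until that list is complete, a practical stopgap when a job is killed for exceeding its memory allocation is to raise that rule's allocation on the command line, for example (the rule name is a placeholder; use the name Snakemake reports for the failing job):

        snakemake --set-resources <rule_name>:mem_mb=64000   # plus your usual options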

    "}]} \ No newline at end of file