From dbe1e9f32791df4e4a5f6439112484b906db3862 Mon Sep 17 00:00:00 2001
From: Docs Deploy
Date: Mon, 24 Jun 2024 14:12:28 +0000
Subject: [PATCH] Deployed 0edb4d7 to develop with MkDocs 1.6.0 and mike 2.1.1

---
 develop/index.html               | 4 ++--
 develop/search/search_index.json | 2 +-
 2 files changed, 3 insertions(+), 3 deletions(-)

diff --git a/develop/index.html b/develop/index.html
index afbbfa1..151db27 100644
--- a/develop/index.html
+++ b/develop/index.html
@@ -491,10 +491,10 @@

 Pipeline Summary

   • Relatedness (NGSRelate, IBSrelate)
   • Identity by state matrix (ANGSD)
   • Site frequency spectrum (ANGSD)
-  • Watterson's estimator (θ~w~), Nucleotide diversity (π), Tajima's D (ANGSD)
+  • Watterson's estimator (θw), Nucleotide diversity (π), Tajima's D (ANGSD)
   • Individual heterozygosity with bootstrapped confidence intervals (ANGSD)
-  • Pairwise F~ST~ (ANGSD)
+  • Pairwise FST (ANGSD)
   • These all can be enabled and processed independently, and the pipeline will generate genotype likelihood input files using ANGSD and share them across

diff --git a/develop/search/search_index.json b/develop/search/search_index.json
index b62ef6e..4a4a00f 100644
--- a/develop/search/search_index.json
+++ b/develop/search/search_index.json
@@ -1 +1 @@
-{"config":{"lang":["en"],"separator":"[\\s\\-]+","pipeline":["stopWordFilter"]},"docs":[{"location":"","title":"Welcome to the documentation for PopGLen","text":"

    PopGLen is aimed at enabling users to run population genomic analyses on their data within a genotype likelihood framework in an automated and reproducible fashion. Genotype likelihood based analyses avoid genotype calling, instead performing analyses on the likelihoods of each possible genotype, incorporating uncertainty about the true genotype into the analysis. This makes them especially suited for datasets with low coverage or that vary in coverage.

    This pipeline was developed in large part to make my own analyses easier. I work with many species within the same project, each mapped to its own reference. I developed this pipeline so that I could ensure standardized processing for datasets within the same project and automate the many steps that go into performing these analyses. As it needed to fit many datasets, it is generalizable and customizable through a single configuration file and follows a workflow commonly used by ANGSD users, so it is available for others to use, should it suit their needs.

    Questions? Feature requests? Just ask!

    I'm glad to answer questions on the GitHub Issues page for the project, as well as take suggestions for features or improvements!

    "},{"location":"#pipeline-summary","title":"Pipeline Summary","text":"

    The pipeline aims to follow the general path many users take when working with ANGSD and other GL-based tools. Raw sequencing data is processed into BAM files (with optional configuration for historical, degraded samples), or BAM files are provided directly. From there, several quality control reports are generated to help determine which samples should be included. The pipeline then builds a 'sites' file to perform analyses with. This sites file is made from several user-configured filters, which are intersected to output a single list of sites that analyses are performed on across all samples. This can also be extended with user-provided filter lists (e.g. to limit to neutral sites, genic regions, etc.).

    After the samples have been processed, the quality control reports generated, and the sites file produced, the pipeline can continue to the analyses.

    These all can be enabled and processed independently, and the pipeline will generate genotype likelihood input files using ANGSD and share them across analyses as appropriate, deleting temporary intermediate files when they are no longer needed.

    At any point after a portion of the pipeline has completed successfully, a report can be generated that contains tables and figures summarizing the results for the currently enabled parts of the pipeline.

    If you're interested in using this, head to the Getting Started page!

    "},{"location":"config/","title":"Configuring the workflow","text":"

    Running the workflow requires configuring three files: config.yaml, samples.tsv, and units.tsv. config.yaml is used to configure the analyses, samples.tsv categorizes your samples into groups, and units.tsv connects sample names to their input data files. The workflow will use config/config.yaml automatically, but you can name this whatever you want (good for separating datasets in the same working directory) and point to it when running snakemake with --configfile <path>.
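
    For example, a run pointed at a renamed configuration file might look like the following sketch (the file name and core count are placeholders; substitute your own):

        snakemake --configfile config/dataset1_config.yaml --cores 4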

    "},{"location":"config/#samplestsv","title":"samples.tsv","text":"

    This file contains your sample list and has four tab-separated columns:

    sample  population  time        depth
    hist1   Hjelmseryd  historical  low
    hist2   Hjelmseryd  historical  low
    hist3   Hjelmseryd  historical  low
    mod1    Gotafors    modern      high
    mod2    Gotafors    modern      high
    mod3    Gotafors    modern      high
    "},{"location":"config/#unitstsv","title":"units.tsv","text":"

    This file connects your samples to input files and can have up to eight tab-separated columns:

    sample        unit          lib           platform  fq1                                 fq2                                 bam                sra
    hist1         BHVN22DSX2.2  hist1         ILLUMINA  data/fastq/hist1.r1.fastq.gz        data/fastq/hist1.r2.fastq.gz
    hist1         BHVN22DSX2.3  hist1         ILLUMINA  data/fastq/hist1.unit2.r1.fastq.gz  data/fastq/hist1.unit2.r2.fastq.gz
    hist2         BHVN22DSX2.2  hist2         ILLUMINA  data/fastq/hist2.r1.fastq.gz        data/fastq/hist2.r2.fastq.gz
    hist3         BHVN22DSX2.2  hist2         ILLUMINA  data/fastq/hist3.r1.fastq.gz        data/fastq/hist3.r2.fastq.gz
    mod1          AHW5NGDSX2.3  mod1          ILLUMINA  data/fastq/mod1.r1.fastq.gz         data/fastq/mod1.r2.fastq.gz
    mod2          AHW5NGDSX2.3  mod2          ILLUMINA                                                                          data/bam/mod2.bam
    mod3          AHW5NGDSX2.3  mod3          ILLUMINA  data/fastq/mod3.r1.fastq.gz         data/fastq/mod3.r2.fastq.gz
    SAMN13218652  SRR10398077   SAMN13218652  ILLUMINA                                                                                             SRR10398077

    Mixing samples with different starting points

    It is possible to have different samples start from different inputs (e.g. some from BAM, others from FASTQ, others from SRA). It is best to provide only fq1+fq2, bam, or sra for each sample, so it is clear where each sample starts. If multiple are provided for the same sample, the bam entry will override fastq or SRA entries, and the fastq entries will override SRA entries. Note that this means it is not currently possible to have multiple starting points for the same sample (i.e. FASTQ reads that would be processed and then merged into an existing BAM).

    "},{"location":"config/#configuration-file","title":"Configuration file","text":"

    config.yaml contains the configuration for the workflow; this is where you specify which analyses, filters, and options you want. Below, I describe the configuration options. The config.yaml in this repository serves as a template, but it includes some 'default' parameters that may be good starting points for some users. If --configfile is not specified in the snakemake command, the workflow will default to config/config.yaml.

    "},{"location":"config/#configuration-options","title":"Configuration options","text":""},{"location":"config/#dataset-configuration","title":"Dataset Configuration","text":"

    Required configuration of the 'dataset'.

    Here, dataset means a set of samples and configurations that the workflow will be run with. Each dataset should have its own samples.tsv and config.yaml, but the same units.tsv can be used for multiple datasets if you prefer. Essentially, the dataset identifier keeps your outputs organized into projects, so that the same BAM files can be used in multiple datasets without having to be remade.

    So, say you have dataset1_samples.tsv and dataset2_samples.tsv, with corresponding dataset1_config.yaml and dataset2_config.yaml. The sample files contain different samples, though some are shared between the datasets. The workflow for dataset1 can be run, and then dataset2 can be run. When dataset2 runs, it maps any new samples, but won't re-map samples already processed in dataset1. Each will perform downstream analyses independently with its own sample set and configuration file, storing the results in dataset-specific folders.

    "},{"location":"config/#reference-configuration","title":"Reference Configuration","text":"

    Required configuration of the reference.

    Reference genomes should be uncompressed, and contig names should be clear and concise. Currently, there are some issues parsing contig names with underscores, so please change these in your reference before running the pipeline. Alphanumeric characters, as well as . in contig names, have been tested to work so far; other symbols have not been tested.

    Support for bgzipped genomes may be added eventually; I just need to check that it works with all underlying tools. Currently, it definitely will not work, as calculating chunks is hard-coded to work on an uncompressed genome.
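
    If your assembly is distributed compressed, you can decompress it and do a quick check for underscores in contig names before running the pipeline. A minimal sketch, assuming a gzipped (or bgzipped) FASTA named reference.fa.gz:

        gzip -d reference.fa.gz    # bgzipped files are gzip-compatible, so this works for both
        grep "^>" reference.fa | grep "_" \
          && echo "contig names contain underscores - consider renaming them" \
          || echo "contig names look OK"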

    "},{"location":"config/#sample-set-configuration","title":"Sample Set Configuration","text":""},{"location":"config/#analysis-selection","title":"Analysis Selection","text":"

    Here, you will define which analyses you will perform. It is useful to start with only a few, and add more in subsequent workflow runs, just to ensure you catch errors before you use compute time running all analyses. Most are set with (true/false) or a value, described below. Modifications to the settings for each analysis are set in the next section.

    "},{"location":"config/#subsampling-section","title":"Subsampling Section","text":"

    As this workflow is aimed at low-coverage samples, it is likely there will be considerable variance in sample depth. For this reason, it may be good to subsample all your samples to a similar depth to examine whether variation in depth is influencing results. To do this, set an integer value here that all samples will be subsampled down to, and run specific analyses on the subsampled data. This subsampling can be done in reference to the unfiltered sequencing depth, the mapping- and base-quality-filtered sequencing depth, or the filtered sites sequencing depth. The latter is recommended, as it ensures that sequencing depth is uniform at the analysis stage, since it is these filtered sites that analyses are performed on.
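
    Conceptually, this is similar to the following samtools sketch, which downsamples a roughly 10x BAM to roughly 4x by keeping 40% of reads. This is only an illustration of the idea; the pipeline performs subsampling internally, and the exact commands, fractions, and file names it uses may differ:

        samtools view -b -s 0.4 sample.bam > sample.subsampled.bam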

    "},{"location":"config/#filter-sets","title":"Filter Sets","text":"

    By default, this workflow will perform all analyses requested in the analysis selection section on all sites that pass the configured filters. These outputs will contain allsites-filts in the filename and in the report. However, it is often useful to perform an analysis on different subsets of sites, for instance to compare results for genic vs. intergenic regions, neutral sites, exons vs. introns, etc. Here, users can set an arbitrary number of additional filters using BED files. For each BED file supplied, its contents will be intersected with the sites passing the configured filters, and all analyses will additionally be performed using those sites.

    For instance, given a BED file containing putatively neutral sites, one could set the following:

    filter_beds:\n  neutral-sites: \"resources/neutral_sites.bed\"\n

    In this case, for each requested analysis, in addition to the allsites-filts output, a neutral-sites-filts output (named after the key assigned to the BED file in config.yaml) will also be generated, containing the results for sites within the specified BED file that also passed any set filters.

    More than one BED file can be set, up to an arbitrary number:

    filter_beds:\n  neutral: \"resources/neutral_sites.bed\"\n  intergenic: \"resources/intergenic_sites.bed\"\n  introns: \"resources/introns.bed\"\n

    It may also sometimes be desirable to skip analyses on allsites-filts, say, if you only want to generate diversity estimates or an SFS for a set of neutral sites you supply.

    To skip running any analyses for allsites-filts and only perform them for the BED files you supply, you can set only_filter_beds: true in the config file. This may also be useful in the event that you have a set of already filtered sites and want to run the workflow on those, ignoring any of the built-in filter options by setting them to false.
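
    Putting the two options together, a configuration that skips allsites-filts and only analyzes a supplied set of neutral sites might look like this sketch (the key name and BED path are just examples, as above):

        filter_beds:
          neutral: "resources/neutral_sites.bed"
        only_filter_beds: true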

    "},{"location":"config/#software-configuration","title":"Software Configuration","text":"

    These are software specific settings that can be user configured in the workflow. If you are missing a configurable setting you need, open up an issue or a pull request and I'll gladly put it in.

    "},{"location":"getting-started/","title":"Getting Started","text":""},{"location":"getting-started/#tutorial","title":"Tutorial","text":"

    Note

    A tutorial is in progress, but not yet available. The pipeline can still be used by following the rest of the guide.

    Once available, the tutorial will use a small(ish) dataset with which biologically meaningful results can be produced. This can help in getting an understanding of a good workflow for using the different modules. You can also follow along with your own data and simply skip the analyses you don't want. If you prefer to just jump in instead, the sections below describe how to quickly get a new project up and running.

    "},{"location":"getting-started/#requirements","title":"Requirements","text":"

    This pipeline can be run on Linux systems with Conda and Apptainer/Singularity installed. All other dependencies will be handled by the workflow, and thus sufficient storage space is needed for these installations (~10GB, but this needs verification). It can be run on a local workstation with sufficient resources and storage space (dataset dependent), but it is aimed at execution on high-performance computing systems with job queuing systems.
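
    A quick way to confirm a host meets these requirements before deploying:

        conda --version
        apptainer --version || singularity --version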

    Data-wise, you'll need a reference genome (uncompressed) and some sequencing data for your samples. The latter can be either raw fastq files, bam alignments to the reference, or accession numbers for already published fastq files.

    "},{"location":"getting-started/#deploying-the-workflow","title":"Deploying the workflow","text":"

    The pipeline can be deployed in two ways: (1) using Snakedeploy, which deploys the pipeline as a module (recommended); or (2) cloning the repository at the version/branch you prefer (recommended if you will change any workflow code).

    Both methods require a Snakemake environment to run the pipeline in.

    "},{"location":"getting-started/#preparing-the-environment","title":"Preparing the environment","text":"

    First, create an environment for Snakemake, including Snakedeploy if you intend to deploy that way:

    mamba create -c conda-forge -c bioconda --name snakemake snakemake snakedeploy\n

    If you already have a Snakemake environment, you can use that, so long as it has snakemake (not just snakemake-minimal) installed. Snakemake versions >=7.25 are likely to work, but most testing is on 7.32.4. The pipeline is compatible with Snakemake v8, but you may need to install additional plugins due to the new executor plugin system. See the Snakemake docs for which executor plugin your cluster system requires.
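
    For example, on a SLURM cluster running Snakemake v8, something like the following would likely be needed (this assumes SLURM; other schedulers have their own executor plugins):

        mamba install -n snakemake -c conda-forge -c bioconda snakemake-executor-plugin-slurm
        snakemake --executor slurm --jobs 100   # plus your usual options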

    Activate the Snakemake environment:

    conda activate snakemake\n
    "},{"location":"getting-started/#deploying-with-snakedeploy","title":"Deploying with Snakedeploy","text":"

    Make your working directory:

    mkdir -p /path/to/work-dir\ncd /path/to/work-dir\n

    And deploy the workflow, using the tag for the version you want to deploy:

    snakedeploy deploy-workflow https://github.com/zjnolen/PopGLen . --tag v0.2.0\n

    This will generate a simple Snakefile in a workflow folder that loads the pipeline as a module. It will also download the template config.yaml, samples.tsv, and units.tsv into the config folder.
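
    The generated Snakefile typically looks something like the sketch below; the exact contents may differ between Snakedeploy versions, and the module name is arbitrary:

        configfile: "config/config.yaml"

        # load PopGLen as a module at the chosen tag
        module PopGLen:
            snakefile:
                github("zjnolen/PopGLen", path="workflow/Snakefile", tag="v0.2.0")
            config:
                config

        # use all rules from the module
        use rule * from PopGLen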

    "},{"location":"getting-started/#cloning-from-github","title":"Cloning from GitHub","text":"

    Go to the folder you would like your working directory to be created in and clone the GitHub repo:

    git clone https://github.com/zjnolen/PopGLen.git\n

    If you would like, you can change the name of the directory:

    mv PopGLen work-dir-name\n

    Move into the working directory (PopGLen or work-dir-name if you changed it) and checkout the version you would like to use:

    git checkout v0.2.0\n

    This can also be used to checkout specific branches or commits.
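
    For example, to use the development branch or a specific commit instead of a release tag (the refs shown are illustrative):

        git checkout develop    # a branch
        git checkout 0edb4d7    # a specific commit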

    "},{"location":"getting-started/#configuring-the-workflow","title":"Configuring the workflow","text":"

    Now you are ready to configure the workflow; see the documentation for that here.

    "},{"location":"high-memory-rules/","title":"Rules using large amounts of RAM","text":"

    NOTE: This is a work-in-progress list; I am still figuring out which rules need elevated memory allocations and how much they require.

    The biggest challenge with using this pipeline with other datasets is ensuring RAM is properly allocated. Many rules require very little RAM, so the default per-thread allocations on your cluster will likely do fine. However, some rules require considerably more RAM. These are:
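
    Until that list is complete, a practical stopgap when a job is killed for exceeding its memory allocation is to raise that rule's allocation on the command line, for example (the rule name is a placeholder; use the name Snakemake reports for the failing job):

        snakemake --set-resources <rule_name>:mem_mb=64000   # plus your usual options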

    "}]} \ No newline at end of file