Merge pull request #21 from zjnolen/bam-start
Set up so the workflow can start with BAM files and only perform popgen analyses
zjnolen authored Jan 12, 2024
2 parents 296990b + 8fd9edd commit 5969361
Showing 13 changed files with 295 additions and 173 deletions.
3 changes: 3 additions & 0 deletions .test/config/config.yaml
@@ -48,6 +48,7 @@ analyses:
  # quality control
  qualimap: true
  damageprofiler: true
  mapdamage_rescale: true
  # population genomic analyses
  estimate_ld: true
  ld_decay: true
@@ -79,6 +80,8 @@ mapQ: 30
baseQ: 30

params:
  clipoverlap:
    clip_user_provided_bams: true
  genmap:
    K: "100"
    E: "2"
16 changes: 8 additions & 8 deletions .test/config/units.tsv
@@ -1,8 +1,8 @@
sample unit lib platform fq1 fq2 realname
hist1 BHVN22DSX2.2 hist1 ILLUMINA data/fastq/hist1.r1.fastq.gz data/fastq/hist1.r2.fastq.gz MZLU153040
hist1 BHVN22DSX2.3 hist1 ILLUMINA data/fastq/hist1.unit2.r1.fastq.gz data/fastq/hist1.unit2.r2.fastq.gz MZLU153040
hist2 BHVN22DSX2.2 hist2 ILLUMINA data/fastq/hist2.r1.fastq.gz data/fastq/hist2.r2.fastq.gz MZLU153042
hist3 BHVN22DSX2.2 hist2 ILLUMINA data/fastq/hist3.r1.fastq.gz data/fastq/hist3.r2.fastq.gz MZLU153045
mod1 AHW5NGDSX2.3 mod1 ILLUMINA data/fastq/mod1.r1.fastq.gz data/fastq/mod1.r2.fastq.gz MZLU107434
mod2 AHW5NGDSX2.3 mod2 ILLUMINA data/fastq/mod2.r1.fastq.gz data/fastq/mod2.r2.fastq.gz MZLU107435
mod3 AHW5NGDSX2.3 mod3 ILLUMINA data/fastq/mod3.r1.fastq.gz data/fastq/mod3.r2.fastq.gz MZLU107436
sample unit lib platform fq1 fq2 bam realname
hist1 BHVN22DSX2.2 hist1 ILLUMINA data/fastq/hist1.r1.fastq.gz data/fastq/hist1.r2.fastq.gz MZLU153040
hist1 BHVN22DSX2.3 hist1 ILLUMINA data/fastq/hist1.unit2.r1.fastq.gz data/fastq/hist1.unit2.r2.fastq.gz MZLU153040
hist2 BHVN22DSX2.2 hist2 ILLUMINA data/fastq/hist2.r1.fastq.gz data/fastq/hist2.r2.fastq.gz MZLU153042
hist3 BHVN22DSX2.2 hist2 ILLUMINA data/fastq/hist3.r1.fastq.gz data/fastq/hist3.r2.fastq.gz MZLU153045
mod1 AHW5NGDSX2.3 mod1 ILLUMINA data/fastq/mod1.r1.fastq.gz data/fastq/mod1.r2.fastq.gz MZLU107434
mod2 AHW5NGDSX2.3 mod2 ILLUMINA data/fastq/mod2.r1.fastq.gz data/fastq/mod2.r2.fastq.gz data/bam/mod2.bam MZLU107435
mod3 AHW5NGDSX2.3 mod3 ILLUMINA data/fastq/mod3.r1.fastq.gz data/fastq/mod3.r2.fastq.gz MZLU107436
Binary file added .test/data/bam/mod2.bam
195 changes: 86 additions & 109 deletions README.md
@@ -1,15 +1,77 @@
# Genotype likelihood population genomics pipeline

This workflow is aimed at processing raw sequencing reads and calculating
population genomic statistics within a genotype likelihood framework. As it is
focused on GL methods, it has options to adapt the workflow for processing data
with low or variable coverage and/or contains historical/ancient samples with
degraded DNA. It is under active development, so new features will be added.
This workflow is aimed at processing sequencing data and calculating population
genomic statistics within a genotype likelihood framework. As a primary use
case of genotype likelihood based methods is the analysis of samples sequenced
to low depth or from degraded material, processing can optionally be adapted to
account for DNA damage. The workflow targets single- and multi-population
analyses of samples mapped to the same reference, so it is appropriate for
datasets containing individuals from one or several populations, and allows for
examining population structure, genetic diversity, genetic differentiation,
allele frequencies, linkage disequilibrium, and more.

The workflow is designed with two entry points in mind. Users with raw
sequencing data in FASTQ format can perform alignment and read processing,
followed by the population genomic analyses. Alternatively, users who have
already mapped and processed their reads into BAM files can use these to start
directly at the population genomic analyses (for instance, if you bring BAM
files from GenErode, nf-core/eager, or your own custom processing).

## Features

### Sequence data processing

If starting with raw sequencing data in FASTQ format, the pipeline will handle
mapping and processing of the reads, with options specific to historical DNA.

- Trimming and collapsing of overlapping paired end reads with fastp
- Mapping of reads using bwa mem
- Estimation of read mapping rates (for endogenous content calculation)
- Removal of duplicates using Picard for paired reads and DeDup for collapsed
  overlapping reads
- Realignment around indels using GATK
- Optional base quality recalibration using MapDamage2 in historical samples to
correct post-mortem damage
- Clipping of overlapping mapped paired end reads using bamUtil
- Quality control information from fastp, Qualimap, DamageProfiler, and
MapDamage2 (Qualimap is also available for users starting with BAM files)

### Population Genomics

The primary goal of this pipeline is to perform analyses in ANGSD and related
software in a repeatable and automated way. These analyses are available
whether you start with raw sequencing data or with processed BAM files.

- Estimation of linkage disequilibrium across genome and LD decay using ngsLD
- Linkage pruning where relevant with ngsLD
- PCA with PCAngsd
- Admixture with NGSAdmix
- Relatedness using NgsRelate and IBSrelate
- 1D and 2D Site frequency spectrum production with ANGSD
- Neutrality statistics per population (Watterson's theta, pairwise nucleotide
diversity, Tajima's D) in user defined sliding windows with ANGSD
- Estimation of heterozygosity per sample from 1D SFS with ANGSD
- Pairwise $F_{ST}$ between all populations or individuals in user defined
sliding windows with ANGSD
- Inbreeding coefficients and runs of homozygosity per sample with ngsF-HMM
  (**NOTE** This is currently only possible for samples that are part of a
  population sample; lone samples will always return an inbreeding coefficient
  of 0)
- Identity by state (IBS) matrix between all samples using ANGSD

Additionally, several data filtering options are available:

- Identification and removal of repetitive regions
- Removal of regions with low mappability for fragments of a specified size
- Removal of regions with extreme high or low depth
- Removal of regions with a certain amount of missing data

## Getting Started

To run this workflow, you'll need paired-end raw sequencing data and an
uncompressed reference genome to map it to.
The workflow will require data in either FASTQ or BAM format, as well as a
single (uncompressed) reference genome that the reads will be or have been
mapped to. Analyses are intended to be between samples and populations of
samples mapped to the same reference.

To run the workflow, you will need to be working on a machine with the
following:
@@ -25,6 +87,8 @@ queues, setting default resources).

### Notes on inputs

#### Reference Genomes

Reference genomes should be uncompressed, and contig names should be clear and
concise. Currently, there are some issues parsing contig names with
underscores, so please change these in your reference before running the
@@ -35,6 +99,21 @@ Potentially the ability to use bgzipped genomes will be added, I just need to
check that it works with all underlying tools. Currently, it will for sure not
work, as calculating chunks is hard-coded to work on an uncompressed genome.

#### BAM Files

BAM files will receive no processing, aside from optionally clipping
overlapping read pairs that are otherwise double counted in ANGSD. Ensure any
processing (duplicate removal, damage correction, realignment) you wish to
include has already been performed. Note: Filtering is handled in ANGSD, so
there is no need to filter out reads based on mapping or base quality,
multi-mapping, etc., unless you wish to filter on a parameter not available in
ANGSD.

Some QC will not be available for users starting with BAM files. No read
processing QC can be produced by the workflow, so this should be done
beforehand. While mapping percentages are calculated, they may not entirely
represent the truth, as they cannot account for anything already removed, such
as duplicates or unmapped reads.
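
If you want the workflow to clip overlapping read pairs in the BAMs you supply,
this is toggled in the configuration file. A minimal sketch of the relevant
keys, as added to `config.yaml` in this pull request:

```yaml
params:
  clipoverlap:
    # Clip overlapping read pairs in user-provided BAMs so ANGSD does not
    # double count them; set to false if your BAMs are already clipped.
    clip_user_provided_bams: true
```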

### Running on a cluster

Development was done on UPPMAX's Rackham cluster, and a simple profile is
@@ -61,108 +140,6 @@ this. Right now, memory is only ever defined through threads, so you may need
to lower the threads and add a memory resource to some rules using this method
in order to optimize them for your system.

## Features

Currently, the pipeline performs the following tasks:

### Reference genome preparation

- Indexing of reference for subsequent analyses

### Raw read preparation

- Trimming of paired-end reads from high quality libraries
- Collapsing of paired-end reads from fragmented (aDNA/historical DNA)
libraries

### Read mapping

- Mapping prepared raw reads to reference using bwa-mem and clipping of
overlapping reads, combining of multiple sample runs/libraries
- **NOTE**: Reads marked as historical (degraded) will only map short reads
  that overlap and collapse, to reduce mapping of likely contaminants.
- Removal of PCR and sequencing duplicates separately for fresh (Picard) and
fragmented (DeDup) DNA reads
- Realignment around indels
- Optional recalibration of base quality scores on degraded DNA bam files with
[MapDamage2](https://ginolhac.github.io/mapDamage/)
- Indexing of deduplicated, realigned, mapped, and recalibrated reads

### Sample quality control

- Assess post-mortem DNA damage with DamageProfiler
- Assess mapping quality stats with Qualimap
- Assess endogenous content using mapping proportion before duplicate reads are
removed

### Data quality filtering

- Analyses can be set with minimum mapping and base quality thresholds
- Exclusion of entire scaffolds (i.e. sex-linked, low quality) through user
config (both list and contig size based)
- Exclusion of repeat regions from analyses using RepeatModeler/RepeatMasker
- Exclusion of low mappability regions with GenMap
- Exclusion of sites with extreme global depth values (determined separately
for the entire dataset, and subsets at certain coverage ranges, then merged)
- Exclusion of sites based on data missingness across dataset and/or per
population
- Filter using any number of additional user-defined BED files

### GL based population genomic analyses

To speed up the pipeline, many of these analyses are done for part of the
genome at a time, then later merged. This is only done for analyses where
possible and where the time saved is appreciable. These chunks are made to be a
user configured size to allow tuning of run-times (i.e. more jobs, shorter
runtimes vs. fewer jobs, longer runtimes).

SAF based analyses are done on variable and non-variable sites passing quality
filters. This set is the same across all populations in the dataset and is
based on the positions passing all the requested filters. Beagle (SNP) based
analyses are done on a SNP set that is constant across all populations,
determined from the output of the Beagle file for the whole dataset, and major
and minor alleles are inferred from the whole dataset. When relevant, pruned
SNPs are used. Pruning is done on the whole dataset beagle file and the same
pruned sites are used for all populations.

Additionally, all analyses can be repeated with samples subsampled to a lower
user configured depth. This helps to ensure results are not simply due to
variance in depth between groups.

**Analyses:**

- Estimation of linkage disequilibrium across genome and LD decay using ngsLD
- Linkage pruning where relevant with ngsLD
- PCA with PCAngsd
- Admixture with NGSAdmix
- Relatedness using NgsRelate and IBSrelate
- 1D and 2D Site frequency spectrum production with ANGSD
- Neutrality statistics per population (Watterson's theta, pairwise pi,
Tajima's D) in user defined sliding windows with ANGSD
- Estimation of heterozygosity per sample from 1D SFS with ANGSD
- Pairwise $F_{ST}$ between all populations or individuals in user defined
sliding windows with ANGSD
- Inbreeding coefficients and runs of homozygosity per sample with ngsF-HMM
  (**NOTE** This is currently only possible for samples that are part of a
  population sample; lone samples will always return an inbreeding coefficient
  of 0)
- Identity by state (IBS) matrix between all samples using ANGSD

### Planned

Some additional components to the pipeline are planned, the order below roughly
corresponding to priority:

- Add calculation of bootstrapped SFS
- Manhattan plots in report for sliding window results
- Allow starting with bam files - for those that want to process raw reads in
their own way before performing analyses
- Add calculation of Dxy
- Add schema for configuration files to improve incorrect format handling and
to enable defaults

## Workflow directed acyclic graph

Below is a graph of the workflow with most of the analyses enabled. This is
39 changes: 28 additions & 11 deletions config/README.md
@@ -37,19 +37,27 @@ Each sample must have five columns filled in the units sheet. The columns are
tab separated:

- `sample` - The sample name, same as in `samples.tsv`.
- `unit` - This is used to fill out the `ID` read group in the bam file. It
must be unique to each read group, so the same sample shouldn't have the
same unit for more than one sequencing run. A good format might be
`sequencer_barcode.lane`. Optical duplicates will be removed within units.
- `lib` - This is used to fill out the `LB` read group. This should be a unique
identifier for each sample library. Sequencing runs from the same library,
but different runs, will have the same value in `lib`, but different in
`unit`.
- `unit` - This describes the sequencing platform unit. The expected format is
  `sequencerbarcode.lane`. It will be used to fill the `PU` read group, and
  combined with the `lib` column to fill the `ID` read group.
- `lib` - This is used to fill out the `LB` read group and combined with `unit`
to fill out the `ID` read group. This should be a unique identifier for each
library. Sequencing runs from the same library, but different runs, will have
the same value in `lib`, but different in `unit`.
- `platform` - This is used to fill out the `PL` read group. Put what you'd
  want to appear there; usually `ILLUMINA` for Illumina platforms.
- `fq1` and `fq2` - The absolute or relative paths from the working directory
  to the raw fastq files for the sample. Currently the pipeline only supports
  paired-end sequencing; single-end support may be added down the line.
- `bam` - If you do not want to map the raw reads, provide a pre-processed
BAM file path here. Samples with a BAM file may only appear once in the units
list, so multiple sequencing runs should be merged beforehand. If a BAM file
is listed, `fq1` and `fq2` are ignored, and the BAM file is used for that
sample.

Note: Your dataset can include both samples that start from FASTQ files and
samples that start from BAM files. If a BAM is listed for a sample, it will be
used instead of mapping; if not, the FASTQ files will be mapped.
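
As an illustration (hypothetical sample names, libraries, and paths), a units
sheet mixing both entry points might look like:

```tsv
sample	unit	lib	platform	fq1	fq2	bam
samp1	AHW5NGDSX2.1	samp1_lib1	ILLUMINA	data/fastq/samp1.r1.fastq.gz	data/fastq/samp1.r2.fastq.gz
samp2	AHW5NGDSX2.1	samp2_lib1	ILLUMINA	data/fastq/samp2.r1.fastq.gz	data/fastq/samp2.r2.fastq.gz	data/bam/samp2.bam
```

Here, `samp1` would be mapped from its FASTQ files, while `samp2`'s BAM would
be used directly and its FASTQ paths ignored.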

## Configuration file

@@ -181,9 +189,9 @@ settings for each analysis are set in the next section.
(`true`/`false`)
- `damageprofiler:` Estimate post-mortem DNA damage on historical samples
with DamageProfiler (`true`/`false`) NOTE: This just adds
DamageProfiler output on top of the default MapDamage output. DNA damage will
always be estimated and rescaled by MapDamage for samples marked as
'historical'
DamageProfiler output on top of the default MapDamage output.
- `mapdamage_rescale:` Rescale base quality scores using MapDamage2 to help
account for post-mortem damage in analyses (`true`/`false`) [docs](https://ginolhac.github.io/mapDamage/)
- `estimate_ld:` Estimate pairwise linkage disequilibrium between sites with
  ngsLD for each population and the whole dataset. Note, only set this if
  you want to generate the LD estimates for use in downstream analyses
@@ -302,6 +310,15 @@ or a pull request and I'll gladly put it in.
filtered out. (integer)

- `params:`
- `clipoverlap:`
- `clip_user_provided_bams:` Determines whether overlapping read pairs will
  be clipped in BAM files supplied by users. This is useful, as many variant
  callers account for overlapping reads in their processing, but ANGSD will
  double count overlapping reads. If BAMs were prepped without this in mind,
  it is good to apply clipping before running them through ANGSD. However, it
  essentially creates a BAM file of nearly equal size for every sample, so you
  may wish to turn it off if you don't need this correction or have already
  applied it to the BAMs you supply. (`true`/`false`)
- `genmap:` Parameters for mappability analysis, see [GenMap's documentation](https://github.com/cpockrandt/genmap/)
for more details.
- `K:`
3 changes: 3 additions & 0 deletions config/config.yaml
@@ -50,6 +50,7 @@ analyses:
  # quality control
  qualimap: true
  damageprofiler: true
  mapdamage_rescale: true
  # population genomic analyses
  estimate_ld: false
  ld_decay: false
@@ -80,6 +81,8 @@ mapQ: 30
baseQ: 30

params:
  clipoverlap:
    clip_user_provided_bams: true
  genmap:
    K: "100"
    E: "2"
2 changes: 1 addition & 1 deletion config/units.tsv
@@ -1 +1 @@
sample unit lib platform fq1 fq2
sample unit lib platform fq1 fq2 bam
5 changes: 3 additions & 2 deletions environment.yaml
@@ -7,9 +7,10 @@ channels:

dependencies:
# core packages for workflow
- snakemake-minimal =7.32.*
- python =3.11.*
- snakemake-minimal =7.25.*
- pandas =2.0.1
- mamba
- graphviz
- pygments
- pygments
- pulp =2.7.*
12 changes: 11 additions & 1 deletion workflow/Snakefile
@@ -60,7 +60,17 @@ all_outputs = [
]

if config["analyses"]["qualimap"]:
all_outputs.append("results/mapping/qc/qualimap/{sample}.{ref}/report.pdf")
all_outputs.extend(
expand(
"results/mapping/qc/qualimap/{sample}.{{ref}}/report.pdf", sample=pipebams
)
)
all_outputs.extend(
expand(
"results/datasets/{{dataset}}/qc/user-provided-bams/qualimap/{sample}.{{ref}}/report.pdf",
sample=userbams,
)
)

if config["analyses"]["damageprofiler"]:
all_outputs.append(
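
The new Qualimap targets rely on Snakemake's `expand()` escaping: doubled
braces like `{{ref}}` are preserved as wildcards for later rule matching,
while `sample` is filled in immediately. A standalone sketch of that behavior
(hypothetical sample names):

```python
from snakemake.io import expand

# 'sample' is expanded now; '{{ref}}' survives as the literal wildcard
# '{ref}', to be resolved later when Snakemake matches rules to targets.
targets = expand(
    "results/mapping/qc/qualimap/{sample}.{{ref}}/report.pdf",
    sample=["samp1", "samp2"],
)
print(targets)
# ['results/mapping/qc/qualimap/samp1.{ref}/report.pdf',
#  'results/mapping/qc/qualimap/samp2.{ref}/report.pdf']
```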