Change where user input bam links go
This will make sure that multiple datasets can share user-provided bams.
zjnolen committed Jan 19, 2024
1 parent d41b61b commit 765eecf
Showing 3 changed files with 29 additions and 14 deletions.
27 changes: 21 additions & 6 deletions README.md
@@ -105,15 +105,30 @@ work, as calculating chunks is hard-coded to work on an uncompressed genome.
BAM files will receive no processing, aside from optionally clipping
overlapping read pairs that are otherwise double counted in ANGSD. Ensure any
processing (duplicate removal, damage correction, realignment) you wish to
-include has already been performed. Note: Filtering is handled in ANGSD, so
-there is no need to filter out reads based on mapping or base quality, multi-
-mapping, etc. unless you wish to filter on a parameter not available to ANGSD.
+include has already been performed. Ideally, if you are supplying your own
+bams, clip the overlapping reads before running the workflow and turn off
+clipping of user-provided bams in the pipeline config. This avoids doubling
+the storage usage, as otherwise all of your bams get duplicated with clipped
+reads. Doing this beforehand with [bamutil](https://genome.sph.umich.edu/wiki/BamUtil)
+is straightforward: `bam clipOverlap --in {input.bam} --out {output.bam}`.
+After this, you can remove the original bams to save storage.
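
If you want to handle this clipping up front for a whole set of bams, a minimal sketch of that pre-processing might look like the following (the directory layout and filenames are placeholders, and bamUtil's `bam` executable is assumed to be on the PATH):

```python
import glob
import subprocess

# Clip overlapping read pairs in each user-supplied bam with bamUtil's
# clipOverlap, writing *.clip.bam next to the originals. "my_bams/" is a
# placeholder for wherever your pre-processed bams live.
for bam in sorted(glob.glob("my_bams/*.bam")):
    clipped = bam.replace(".bam", ".clip.bam")
    subprocess.run(["bam", "clipOverlap", "--in", bam, "--out", clipped], check=True)
```

The clipped bams can then be listed in the `bam` column of the units table, presumably with `clip_user_provided_bams` disabled in the config so the pipeline does not clip them a second time.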

+This pipeline is written with reuse of samples across datasets in mind. For
+samples starting at fastq, this should work seamlessly by reusing samples in
+different datasets (as set in the config file) processed in the same working
+directory. In most cases, this will also be true for bam input. However, if
+different bam files will be used as input for a sample with the same sample
+name and reference name in both dataset configs (for instance, to run the
+same samples with and without MapDamage rescaling), the second dataset will
+overwrite the bams of the first, and Snakemake will suggest rerunning the
+first dataset. In this case, it is best to use different sample names for
+the different input bams, or to run the two datasets in different working
+directories.
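
The overwrite described above follows from the new link path no longer containing the dataset name, so two datasets that reuse the same sample and reference names resolve to the same file. A small illustration (the sample and reference names are made up):

```python
# The per-dataset prefix was dropped from the user-bam link path in this
# commit, so the path depends only on the sample and reference names.
template = "results/mapping/user-provided-bams/{sample}.{ref}.user-processed.bam"

path_a = template.format(sample="ind1", ref="ref1")  # dataset A's link
path_b = template.format(sample="ind1", ref="ref1")  # dataset B's link, possibly a different bam
assert path_a == path_b  # same target, so the second dataset replaces the first
```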

Some QC will not be available for users starting at BAM files. No read
-processing QC can be produced, and should be done beforehand. While mapping
+processing QC can be produced and should be done beforehand. While mapping
percentages are calculated, these may not entirely represent the truth, as they
-can't account for anything already removed, such as duplicates or unmapped
-reads.
+may not account for anything already fully removed from the bam.

### Running on a cluster

8 changes: 4 additions & 4 deletions workflow/rules/2.0_mapping.smk
@@ -264,7 +264,7 @@ rule symlink_userbams:
input:
bam=lambda w: units.loc[units["sample"] == w.sample, "bam"].values[0],
output:
bam="results/datasets/{dataset}/bams/user_bams/{sample}.{ref}.user-processed.bam",
bam="results/mapping/user-provided-bams/{sample}.{ref}.user-processed.bam",
log:
"logs/{dataset}/symlink_bams/{sample}.{ref}.user-processed.log",
conda:
@@ -278,11 +278,11 @@
rule bam_clipoverlap_userbams:
"""Clip overlapping reads in paired end bam files provided by users"""
input:
bam="results/datasets/{dataset}/bams/user_bams/{sample}.{ref}.user-processed.bam",
bam="results/mapping/user-provided-bams/{sample}.{ref}.user-processed.bam",
ref="results/ref/{ref}/{ref}.fa",
output:
bam="results/datasets/{dataset}/bams/clipped_user_bams/{sample}.{ref}.clip.bam",
log="results/datasets/{dataset}/bams/clipped_user_bams/{sample}.{ref}.clipoverlap.stats",
bam="results/mapping/user-provided-bams/{sample}.{ref}.clip.bam",
log="results/mapping/user-provided-bams/{sample}.{ref}.clipoverlap.stats",
log:
"logs/mapping/bamutil/clipoverlap/{dataset}.{sample}.{ref}.log",
benchmark:
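
For context, the symlinking that `symlink_userbams` performs presumably amounts to something like the sketch below; the rule's shell directive is collapsed out of this diff, so the details here (absolute-path linking, replacing a stale link) are assumptions rather than the pipeline's actual command:

```python
import os

# Link a user-supplied bam into the shared user-provided-bams directory.
# Source and destination paths are placeholders matching the output pattern above.
src = "/data/my_bams/ind1.ref1.bam"
dst = "results/mapping/user-provided-bams/ind1.ref1.user-processed.bam"

os.makedirs(os.path.dirname(dst), exist_ok=True)
if os.path.lexists(dst):
    os.remove(dst)  # drop any stale link from a previous run
os.symlink(os.path.abspath(src), dst)  # absolute target keeps the link valid from any cwd
```
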
8 changes: 4 additions & 4 deletions workflow/rules/common.smk
@@ -379,12 +379,12 @@ def get_final_bam(wildcards):
)
if config["params"]["clipoverlap"]["clip_user_provided_bams"]:
return {
"bam": "results/datasets/{dataset}/bams/clipped_user_bams/{sample}.{ref}.clip.bam",
"bai": "results/datasets/{dataset}/bams/clipped_user_bams/{sample}.{ref}.clip.bam.bai",
"bam": "results/mapping/user-provided-bams/{sample}.{ref}.clip.bam",
"bai": "results/mapping/user-provided-bams/{sample}.{ref}.clip.bam.bai",
}
return {
"bam": "results/datasets/{dataset}/bams/user_bams/{sample}.{ref}.user-processed.bam",
"bai": "results/datasets/{dataset}/bams/user_bams/{sample}.{ref}.user-processed.bam.bai",
"bam": "results/mapping/user-provided-bams/{sample}.{ref}.user-processed.bam",
"bai": "results/mapping/user-provided-bams/{sample}.{ref}.user-processed.bam.bai",
}
if (s in samples.index[samples.time == "historical"]) and (
config["analyses"]["mapdamage_rescale"]
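
Condensed, the user-bam branch of `get_final_bam` shown above now picks between the clipped and as-provided links under the shared directory. A simplified sketch (the real function takes wildcards and handles further cases, such as MapDamage rescaling for historical samples):

```python
# Simplified sketch of the user-bam branch in the diff above: choose the
# clipped link when clip_user_provided_bams is enabled, otherwise the plain link.
def user_bam_paths(clip_user_provided_bams: bool) -> dict:
    base = "results/mapping/user-provided-bams/{sample}.{ref}"
    ext = ".clip.bam" if clip_user_provided_bams else ".user-processed.bam"
    return {"bam": base + ext, "bai": base + ext + ".bai"}
```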
