Change where user input bam links go
This will make sure that multiple datasets can share user-provided bams.
zjnolen committed Jan 19, 2024
1 parent d41b61b commit 765eecf
Showing 3 changed files with 29 additions and 14 deletions.
27 changes: 21 additions & 6 deletions README.md
@@ -105,15 +105,30 @@ work, as calculating chunks is hard-coded to work on an uncompressed genome.
BAM files will receive no processing, aside from optionally clipping
overlapping read pairs that are otherwise double counted in ANGSD. Ensure any
processing (duplicate removal, damage correction, realignment) you wish to
-include has already been performed. Note: Filtering is handled in ANGSD, so
-there is no need to filter out reads based on mapping or base quality, multi-
-mapping, etc. unless you wish to filter on a parameter not available to ANGSD.
+include has already been performed. Ideally, if you are supplying your own
+bams, clip the overlapping reads before running the workflow and turn off
+clipping of user-provided bams in the pipeline config. This avoids doubling
+the storage usage, as otherwise all of your bams get duplicated with clipped
+reads. Doing this beforehand with [bamutil](https://genome.sph.umich.edu/wiki/BamUtil)
+is straightforward: `bam clipOverlap --in {input.bam} --out {output.bam}`.
+After this, you can remove the original bams to save storage.
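
If you want to handle this clipping up front for a whole set of bams, a minimal sketch of that pre-processing might look like the following (the directory layout and filenames are placeholders, and bamUtil's `bam` executable is assumed to be on the PATH):

```python
import glob
import subprocess

# Clip overlapping read pairs in each user-supplied bam with bamUtil's
# clipOverlap, writing *.clip.bam next to the originals. "my_bams/" is a
# placeholder for wherever your pre-processed bams live.
for bam in sorted(glob.glob("my_bams/*.bam")):
    clipped = bam.replace(".bam", ".clip.bam")
    subprocess.run(["bam", "clipOverlap", "--in", bam, "--out", clipped], check=True)
```

The clipped bams can then be listed in the `bam` column of the units table, presumably with `clip_user_provided_bams` disabled in the config so the pipeline does not clip them a second time.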

+This pipeline is written with reuse of samples across datasets in mind. For
+samples starting at fastq, this should work seamlessly by reusing samples in
+different datasets (as set in the config file) processed in the same working
+directory. In most cases, this will also be true for bam input. However, if
+different bam files will be used as input for a sample with the same sample
+name and reference name in both dataset configs (for instance, to run the
+same samples with and without MapDamage rescaling), the second dataset will
+overwrite the bams of the first, and Snakemake will suggest rerunning the
+first dataset. In this case, it is best to use different sample names for
+the different input bams, or to run the two datasets in different working
+directories.
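
The overwrite described above follows from the new link path no longer containing the dataset name, so two datasets that reuse the same sample and reference names resolve to the same file. A small illustration (the sample and reference names are made up):

```python
# The per-dataset prefix was dropped from the user-bam link path in this
# commit, so the path depends only on the sample and reference names.
template = "results/mapping/user-provided-bams/{sample}.{ref}.user-processed.bam"

path_a = template.format(sample="ind1", ref="ref1")  # dataset A's link
path_b = template.format(sample="ind1", ref="ref1")  # dataset B's link, possibly a different bam
assert path_a == path_b  # same target, so the second dataset replaces the first
```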

Some QC will not be available for users starting at BAM files. No read
-processing QC can be produced, and should be done beforehand. While mapping
+processing QC can be produced and should be done beforehand. While mapping
percentages are calculated, these may not entirely represent the truth, as they
-can't account for anything already removed, such as duplicates or unmapped
-reads.
+may not account for anything already fully removed from the bam.

### Running on a cluster

8 changes: 4 additions & 4 deletions workflow/rules/2.0_mapping.smk
@@ -264,7 +264,7 @@ rule symlink_userbams:
input:
bam=lambda w: units.loc[units["sample"] == w.sample, "bam"].values[0],
output:
bam="results/datasets/{dataset}/bams/user_bams/{sample}.{ref}.user-processed.bam",
bam="results/mapping/user-provided-bams/{sample}.{ref}.user-processed.bam",
log:
"logs/{dataset}/symlink_bams/{sample}.{ref}.user-processed.log",
conda:
@@ -278,11 +278,11 @@
rule bam_clipoverlap_userbams:
"""Clip overlapping reads in paired end bam files provided by users"""
input:
bam="results/datasets/{dataset}/bams/user_bams/{sample}.{ref}.user-processed.bam",
bam="results/mapping/user-provided-bams/{sample}.{ref}.user-processed.bam",
ref="results/ref/{ref}/{ref}.fa",
output:
bam="results/datasets/{dataset}/bams/clipped_user_bams/{sample}.{ref}.clip.bam",
log="results/datasets/{dataset}/bams/clipped_user_bams/{sample}.{ref}.clipoverlap.stats",
bam="results/mapping/user-provided-bams/{sample}.{ref}.clip.bam",
log="results/mapping/user-provided-bams/{sample}.{ref}.clipoverlap.stats",
log:
"logs/mapping/bamutil/clipoverlap/{dataset}.{sample}.{ref}.log",
benchmark:
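
For context, the symlinking that `symlink_userbams` performs presumably amounts to something like the sketch below; the rule's shell directive is collapsed out of this diff, so the details here (absolute-path linking, replacing a stale link) are assumptions rather than the pipeline's actual command:

```python
import os

# Link a user-supplied bam into the shared user-provided-bams directory.
# Source and destination paths are placeholders matching the output pattern above.
src = "/data/my_bams/ind1.ref1.bam"
dst = "results/mapping/user-provided-bams/ind1.ref1.user-processed.bam"

os.makedirs(os.path.dirname(dst), exist_ok=True)
if os.path.lexists(dst):
    os.remove(dst)  # drop any stale link from a previous run
os.symlink(os.path.abspath(src), dst)  # absolute target keeps the link valid from any cwd
```
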
8 changes: 4 additions & 4 deletions workflow/rules/common.smk
@@ -379,12 +379,12 @@ def get_final_bam(wildcards):
)
if config["params"]["clipoverlap"]["clip_user_provided_bams"]:
return {
"bam": "results/datasets/{dataset}/bams/clipped_user_bams/{sample}.{ref}.clip.bam",
"bai": "results/datasets/{dataset}/bams/clipped_user_bams/{sample}.{ref}.clip.bam.bai",
"bam": "results/mapping/user-provided-bams/{sample}.{ref}.clip.bam",
"bai": "results/mapping/user-provided-bams/{sample}.{ref}.clip.bam.bai",
}
return {
"bam": "results/datasets/{dataset}/bams/user_bams/{sample}.{ref}.user-processed.bam",
"bai": "results/datasets/{dataset}/bams/user_bams/{sample}.{ref}.user-processed.bam.bai",
"bam": "results/mapping/user-provided-bams/{sample}.{ref}.user-processed.bam",
"bai": "results/mapping/user-provided-bams/{sample}.{ref}.user-processed.bam.bai",
}
if (s in samples.index[samples.time == "historical"]) and (
config["analyses"]["mapdamage_rescale"]
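
Condensed, the user-bam branch of `get_final_bam` shown above now picks between the clipped and as-provided links under the shared directory. A simplified sketch (the real function takes wildcards and handles further cases, such as MapDamage rescaling for historical samples):

```python
# Simplified sketch of the user-bam branch in the diff above: choose the
# clipped link when clip_user_provided_bams is enabled, otherwise the plain link.
def user_bam_paths(clip_user_provided_bams: bool) -> dict:
    base = "results/mapping/user-provided-bams/{sample}.{ref}"
    ext = ".clip.bam" if clip_user_provided_bams else ".user-processed.bam"
    return {"bam": base + ext, "bai": base + ext + ".bai"}
```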
