"Legacy" refers to the scripts used to generate data for Luquette et al. 2022.
Legacy behavior is not ideal in many ways, often because the legacy code was
restricted to analyzing small subsets of sites (especially hSNPs) for speed.
In some cases the legacy behavior was not intended.
It is difficult to reproduce legacy behavior exactly using the new R package
interface because some files from the SCAN2 pipeline are not used by the R
package. This is OK--the output changes only very slightly--but we still need
to define the "correct" output of legacy mode somehow so that we can unit test.
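One way to pin down a "correct" legacy output is to store a reference table and
compare freshly computed rows against it. The following is only a minimal
sketch in Python, not the package's actual R test code: the column names
(chr, pos, af), the example values, and the rel_tol tolerance are all
hypothetical and chosen for illustration.

```python
import math


def rows_match(expected, actual, rel_tol=1e-6):
    """Return True if two lists of row dicts agree.

    Float fields are compared within rel_tol (an assumed tolerance);
    all other fields must be exactly equal.
    """
    if len(expected) != len(actual):
        return False
    for exp_row, act_row in zip(expected, actual):
        if exp_row.keys() != act_row.keys():
            return False
        for key, exp_val in exp_row.items():
            act_val = act_row[key]
            if isinstance(exp_val, float) and isinstance(act_val, float):
                if not math.isclose(exp_val, act_val, rel_tol=rel_tol):
                    return False
            elif exp_val != act_val:
                return False
    return True


# Hypothetical stored reference rows and rows recomputed by the new code.
reference = [{"chr": "1", "pos": 1000, "af": 0.25}]
recomputed = [{"chr": "1", "pos": 1000, "af": 0.25 + 1e-12}]
changed = [{"chr": "1", "pos": 1001, "af": 0.25}]

same = rows_match(reference, recomputed)
different = rows_match(reference, changed)
```

Setting rel_tol to 0 turns the float comparison into exact equality, which may
be preferable if the regression output is meant to be a byte-for-byte snapshot.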
Here are some notes about differences between real legacy output and the
output included in this package for regression testing.
* hSNP resampling did not call set.seed, so subsampling was not reproducible.
A seed was manually hacked in on 4/5/22 for a single run of h25/hunamp (along
with the other changes described in the next bullet point).
* hSNP resampling used the [phaser]/phased_hsnps.vcf file (which is shared
by all single cells in a run) rather than the single-cell-specific
ab_model/[single cell]/hsnps.tab file. The phased_hsnps.vcf file is not used
by the R package, so we can't reproduce this exact behavior. Instead, I manually
applied the `scansnv-pta` script for hSNP resampling to the hsnps.tab file,
replaced the output in the pipeline, and continued the run.
* CIGAR strings were only extracted at somatic candidate positions (defined
across all cells in a batch) and the downsampled hSNP spikein sites. Now,
CIGAR strings are extracted at all positions in the GATK table. To make a
somewhat comparable single file, I manually combined the hSNP spikein and
somatic candidate CIGAR files.
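
The reproducibility issue in the first bullet comes down to seeding the random
number generator before subsampling. The sketch below shows the idea in Python
as a rough analogue of set.seed; it is not the legacy script or the R package
code, and the function name, site labels, and seed value are made up for
illustration.

```python
import random


def subsample_sites(sites, n, seed):
    """Draw n sites without replacement using a seeded, local RNG."""
    rng = random.Random(seed)  # fixed seed makes the draw repeatable
    return rng.sample(sites, n)


# Hypothetical hSNP site labels; real input would come from a sites table.
sites = [f"chr1:{pos}" for pos in range(1000, 1100)]

first = subsample_sites(sites, 10, seed=42)
second = subsample_sites(sites, 10, seed=42)
```

With the same seed and the same input order, both calls return the same sites.
Without a fixed seed, each run would draw a different subset, which is why the
unseeded legacy output cannot be regenerated exactly.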