LEGACY_README

"Legacy" refers to the scripts used to generate data for Luquette et al. 2022.
Legacy behavior is not ideal in many ways, often because the legacy code was
restricted to analyzing small subsets of sites (especially hSNPs) due to speed.
Sometimes legacy behavior wasn't intended.

It is difficult to exactly reproduce legacy behavior using the new R package
interface because some files from the SCAN2 pipeline are not used by the R
package. This is OK--the output changes very slightly--but we need to define
the "correct" output of legacy mode somehow so that we can unit test.

Here are some notes about differences between real legacy output and the
output included in this package for regression testing.

* hSNP resampling did not set.seed to ensure reproducible subsampling. This
  was manually hacked in on 4/5/22 for a single run of h25/hunamp (with other
  changes in the next bullet point).
* hSNP resampling used the [phaser]/phased_hsnps.vcf file (which is shared
  by all single cells in a run) rather than the single-cell specific
  ab_model/[single cell]/hsnps.tab file. This file is not used by the R
  package, so we can't reproduce this exact behavior. Instead, I manually
  applied the `scansnv-pta` script for hSNP resampling to the hsnps.tab file,
  replaced the output in the pipeline, and continued the run.
* CIGAR strings were only extracted at somatic candidate positions (defined
  across all cells in a batch) and the downsampled hSNP spikein sites. Now,
  CIGAR strings are extracted at all positions in the GATK table. To make a
  somewhat comparable single file, I manually combined the hSNP spikein and
  somatic candidate CIGAR files.