Privacy Leakage through Inference across Genotypic HMM Trajectories
PLIGHT is a computational tool that employs a population-genetics-based Hidden Markov Model of recombination and mutation to find piecewise matches of a sparse query set of SNPs to a reference diploid genotype panel. The premise of the tool is that even limited, noisy and sparsely distributed genotypic information carries with it a certain risk of identification and downstream inference.
Inspired by imputation methods such as IMPUTE2 [1] and Eagle [2], the inference procedure in PLIGHT is based on the Li-Stephens model [3], where an HMM is used to explore the space of underlying pairs of haplotypes in a diploid genome with the possibility of de novo mutations and recombination between haplotypes. A solution to the inference problem consists of a set of best-fit haplotype pairs at each observed locus, each pair being linked to another pair at the next locus, to form a set of piecewise matches to reference haplotypes. If multiple equally likely solutions exist, the method identifies all of them. Collectively, these form a set of genotypic trajectories through reference haplotype space, where a trajectory is defined as a sequence of reference haplotype pairs (for a diploid genome) at each locus that best fit the observations.
For further details about the method and application cases, please refer to:
Emani, P.S.; Geradi, M.N.; Gürsoy, G.; Grasty, M.R.; Miranker, A.; Gerstein, M.B. Assessing and mitigating privacy risks of sparse, noisy genotypes by local alignment to haplotype databases. Genome Res. 2023. 33: 2156-2173. doi: 10.1101/gr.278322.123
We tested the code in the Red Hat Enterprise Linux Server 7.9 (Maipo) environment. We have tested the batch job submission using the Slurm job handler. An example of a Slurm script for submission is included. We highly recommend the deployment of multiple CPUs in a job, as the code is designed to parallelize the HMM optimization.
The following external programs employed by the algorithms and need to be installed before running the code:
bcftools (version 1.10.2)
tabix (version 1.11)
All algorithms are written in Python 3, and the libraries/modules required in the corresponding Python scripts are:
numpy (several versions work, we have used 1.18 and 1.19 at different stages of development)
subprocess
gzip
mmap
multiprocessing
git clone https://github.com/gersteinlab/PLIGHT.git
The typical install time of PLIGHT itself is less than a minute. The dependencies will take longer to install, depending on the current configuration of the user's Python installation.
- The genotype vcf files used as the reference database can be downloaded from the 1000 Genomes Consortium: http://hgdownload.cse.ucsc.edu/gbdb/hg19/1000Genomes/phase3/
- The identification of related individuals is based on the 1000 Genomes Phase 3 Related Samples Panel: https://ftp.1000genomes.ebi.ac.uk/vol1/ftp/release/20130502/supporting/related_samples_vcf/related_samples_panel.20140910.ALL.panel.
- 1000 Genomes Phase 3 related individuals pedigree file https://ftp.1000genomes.ebi.ac.uk/vol1/ftp/technical/working/20130606_sample_info/20130606_g1k.ped.
Typical run times for each algorithm: PLIGHT_Exact: ~20 minutes for 30 SNPs with 400 reference haplotypes on 12 cores.
PLIGHT_InRef: ~ A few seconds for 30 SNPs with up to ~5,000 reference haplotypes on 12 cores.
PLIGHT_Truncated: ~1 hour for 30 SNPs with 400 reference haplotypes on 12 cores.
PLIGHT_Iterative: Several hours/a few days for 20-30 iterations, 30 SNPs with ~5,000 reference haplotypes on 12 cores. The variability depends on the number of bootstrapping iterations and the degree to which a small number of optimal trajectories have been selected. For cases of many very common SNPs or where there is no significantly optimal trajectory, the code may not finish running in a reasonable amount of time. This is because at each step the backtrace will point to many possible haplotype pairs, and accounting for all of them is time- and memory-intensive.
The code consists of a set of three algorithms with special use cases:
- PLIGHT_Exact performs the exact HMM inference process using the Viterbi algorithm [4];
- PLIGHT_InRef is a specific individual-in-the-reference-database instance of PLIGHT_Exact, where the recombination rate is set to 0; that is, all SNPs are assumed to belong to one individual, and the most likely individual(s) in the reference database is(are) found.
- PLIGHT_Truncated phases in a process of truncating the set of all calculated trajectories to only those within a certain probability distance from the maximally optimal ones, resulting in a smaller memory footprint;
- PLIGHT_Iterative iteratively partitions the reference search space into more manageable blocks of haplotypes and runs PLIGHT_Exact on each block, followed by pooling and repetition of the scheme on the resulting, smaller cohort of haplotypes.
Below is an example of all the arguments that can be passed to the PLIGHT_Exact module:
usage: PLIGHT_Exact.py [-h] -c CHROMOSOMEFILE [-O OBSERVEDSAMPLE]
[-I CHROMOSOMEID] [-m METADATA] [-F GENFOLDER]
[-M MUTATIONRATE] [-e EFFPOP] [-r REFPOP]
[-b RECOMBRATE] [-s {distance,custom}] [-f RECOMBFILE]
[-t TOLERANCE] [-C CURRDIR] [--affilter AFFILTER]
[--numproc NUMPROC]
Identify closest related reference haplotypes
optional arguments:
-h, --help show this help message and exit
-c CHROMOSOMEFILE, --chromosomefile CHROMOSOMEFILE
Chromosome file name
-O OBSERVEDSAMPLE, --observedsample OBSERVEDSAMPLE
Observed Sample Genotype File
-I CHROMOSOMEID, --chromosomeID CHROMOSOMEID
Chromosome ID
-m METADATA, --metadata METADATA
Metadata file with ancestry information
-F GENFOLDER, --genfolder GENFOLDER
Genotype folder
-M THETAMUTATIONRATE, --thetamutationrate THETAMUTATIONRATE
Theta (Coalescent) Mutation rate
-L LAMBDAMUTATIONRATE, --lambdamutationrate LAMBDAMUTATIONRATE
Lambda (direct error) Mutation rate
-e EFFPOP, --effpop EFFPOP
Effective population
-r REFPOP, --refpop REFPOP
Reference population
-b RECOMBRATE, --recombrate RECOMBRATE
Recombination rate in cM/Mb (if /distance/ option is
chosen)
-s {distance,custom}, --recombswitch {distance,custom}
Recombination Model Switch (distance = simple linear
increase of recombination rate with distance, custom =
alternative model)
-f RECOMBFILE, --recombfile RECOMBFILE
Custom Recombination Model File
-t TOLERANCE, --tolerance TOLERANCE
Fraction of maximum value used as allowance for
inclusion of a path
-C CURRDIR, --currdir CURRDIR
Current working directory
--numproc NUMPROC Number of processes for parallelization
--posspecific POSSPECIFIC
Position-specific mutation rates included in
observation file? (True/False)
--prefix PREFIX String prefix to append to output Best_trajectories
file, in addition to chromosome number
Additional options for the other modules include
PLIGHT_Truncated.py
--truncation TRUNCATION
Fraction of trajectories carried forward from each
step
PLIGHT_Iterative.py
--subgroup SUBGROUP Number of haplotypes in each iterative scheme subgroup
--niter NITER Number of iterations of bootstrapping process
In the case of PLIGHT_InRef.py
, the recombination rate parameters are all ignored.
-
-c CHROMOSOMEFILE, --chromosomefile CHROMOSOMEFILE : This is the name of the reference database vcf file, either a composite of all chromosomes or separated out by chromosome; note that the code only runs the model for a single chromosome at a time.
-
-O OBSERVEDSAMPLE, --observedsample OBSERVEDSAMPLE : This file contains the set of observed SNPs in tab-delimited format (no header should be included) with columns of
Chromosome_Number Genome_Position Genome_Position Alternate_Allele Observed_Genotype (if POSSPECIFIC=True)Alternative_Genotype_1:Probability of Observing this Genotype (if POSSPECIFIC=True)Alternative_Genotype_2:Probability of Observing this Genotype
For example, let us assume we have a query SNP from chromosome 5 at position 33951693, with an observed alternate allele of
G
and called genotype of1
. If the user chooses to pass a position-specific error rate (as indicated by thePOSSPECIFIC
flag described below), with98%
probability of the called genotype and2%
probability of a genotype2
, they would enter the line as5 33951693 33951693 G 1 0:0.0 2:0.02
Note that in this example, the genotype of
0
is thought to have0
probability, and all the probabilities sum to 1. -
-I CHROMOSOMEID, --chromosomeID CHROMOSOMEID : This is the chromosome being analyzed, with the format 'chr<Chromosome Number>'. This is appended as a prefix to the output files.
-
-F GENFOLDER, --genfolder GENFOLDER : Folder location of the reference genotype vcf file, with a trailing '/' character
-
-M THETAMUTATIONRATE, --thetamutationrate MUTATIONRATE : This is the mutation rate per haplotype as used in the coalescent model. The resulting mutation rate
lambda
at a particular site is given bylambda = theta/(2(N + theta))
whereN
is the number of reference haplotypes. The option exists to omit this and directly pass the value forlambda
. -
-L LAMBDAMUTATIONRATE, --lambdamutationrate LAMBDAMUTATIONRATE : This is the
lambda
(direct error) Mutation rate. If the user intends to associate a general error rate due to genotyping or the possibility of de novo mutation at any site without reference to the coalescent model, this is the parameter to set. -
Note about mutation rates: In the absence of both the mutation rate parameters and when the
posspecific
parameter (see below) is set toFalse
, the mutation rates are set as follows:thetamutrate = (sum(1.0/i for i in range(1,2*refpop-1)))**(-1); lambdamutrate = 0.5*mutrate/(mutrate + 2*refpop)
. Ifposspecific=True
, the above mutation rate parameters are ignored, and the position-specific mutation rates in the observed sample file are used. -
-e EFFPOP, --effpop EFFPOP : Effective size of the human population. Default value = 11,418 [3].
-
-r REFPOP, --refpop REFPOP : Size of the reference population, i.e. the number of genotypes in the reference database (not the number of haplotypes).
-
-s {distance,custom}, --recombswitch {distance,custom} : Choose whether to use a linear growth in recombination rate with genomic distance, or a custom file of recombination values. If
custom
is chosen, provide a file for the recombination rate between adjacent SNPs (for L SNPs, there will be L-1 such recombination values). -
-b RECOMBRATE, --recombrate RECOMBRATE : If the
distance
option is chosen for the-s
flag, then provide the recombination rate incM/Mb
, i.e. centimorgans/Megabase. In the paper, we mainly used a value of 0.5 cM/Mb. This is set as the default value. -
-f RECOMBFILE, --recombfile RECOMBFILE : If the
custom
option is chosen for the-s
flag, provide the name of the file here. In thedistance
option, each value is set to4 * Effective Population Size * distance in Mb * Recombination rate in cM/Mb
. Thus, in thecustom
case, the user should set each recombination values between loci L and L+1 to4 * Effective Population Size * distance in cM between loci L and L+1
. -
-t TOLERANCE, --tolerance TOLERANCE : This is the tolerance factor that determines the cutoff for the sub-optimal paths to be included. That is, if the score of the best-fit trajectory is
S
, all paths with a scoreS - TOLERANCE*S
will be included. For the analyses in the paper, we chose a tolerance of 0.01. This is set as the default value. -
-C CURRDIR, --currdir CURRDIR : The current working directory to run the code. The default is set to './'.
-
--numproc NUMPROC : Number of processes for parallelization. The more the merrier. The default is set to 1.
-
--posspecific POSSPECIFIC : Are position-specific mutation rates included in observation file? (True/False). The default is 'False'.
-
--prefix PREFIX : String prefix to append to output Best_trajectories file, in addition to chromosome number. The default is '' (empty string).
-
--truncation TRUNCATION (Only in PLIGHT_Truncated): Fraction of trajectories carried forward from each step, after the phase-in period (see Supplementary Methods of the PLIGHT paper.
-
--subgroup SUBGROUP (Only in PLIGHT_Iterative): Number of haplotypes in each iterative scheme subgroup, where the subgroups are defined as the partitions into which the overall reference haplotype set is divided for the PLIGHT_Iterative run.
-
--niter NITER (Only in PLIGHT_Iterative): Number of iterations of bootstrapping process. The default is 20, though values of 30 were also considered in the paper.
An example of one run of the code for the 1000 Genomes Phase 3 reference database is:
python3 PLIGHT_Iterative.py --metadata integrated_call_samples_v3.20130502.ALL.panel --genfolder Genotypes/ --lambdamutationrate 0.1 --effpop 11418 --refpop 2504 --recombrate 0.5 --recombswitch distance --tolerance 0.01 --currdir ./ --numproc 20 --subgroup 300 --niter 1 --posspecific False -c ALL.chr3.phase3_shapeit2_mvncall_integrated_v5a.20130502.genotypes.vcf.gz -O chr3_Observed_SNPs.txt --chromosomeID chr3 --prefix Mutrate0.1_Subgroup300
The PLIGHT_Vis module carries out downstream analyses on the output of the HMM algorithms. This includes plotting the trajectories, and calculating the optimal haplotypes for each chromosome, as well as across all chromosomes under consideration. The help menu is as follows:
usage: PLIGHT_Vis.py [-h] [-C CHROMOSOMEIDS [CHROMOSOMEIDS ...]]
[-t TRAJECTORYPATTERN] [-T TRAJECTORYFOLDER]
[-P PLOTFOLDER]
Unravel the trajectories, find the identities of the best-fit reference
haplotypes, and plot the results
optional arguments:
-h, --help show this help message and exit
-C CHROMOSOMEIDS [CHROMOSOMEIDS ...], --chromosomeIDs CHROMOSOMEIDS [CHROMOSOMEIDS ...]
List of Chromosome IDs to be considered: format chr
followed by number, listed singly, separated by spaces
(for eg. -C chr1 chr2 chr19)
-t TRAJECTORYPATTERN, --trajectorypattern TRAJECTORYPATTERN
Pattern for trajectory files, where the chromosome ID
would fit into the curly braces
-T TRAJECTORYFOLDER, --trajectoryfolder TRAJECTORYFOLDER
Folder for storing the trajectory files
-P PLOTFOLDER, --plotfolder PLOTFOLDER
Folder for storing the plots
An example of running PLIGHT_Vis is provided below:
python3 PLIGHT_Vis.py -C chr3 chr6 -t InRef_{}_Best_trajectories.tsv -T Trajectories -P HMM_Plots
The PLIGHT_SanitizeGenotypes module is used to remove the single most informative SNP based on the trajectories from a previous iteration of PLIGHT. After the SNP is removed, a PLIGHT algorithm of the user's choosing is run on the remaining SNPs for comparison. The help menu is as follows:
usage: PLIGHT_SanitizeGenotypes.py [-h] --bftfile BFTFILE --algorithm_type
{Exact,Iterative,InReference} -c
CHROMOSOMEFILE -O OBSERVEDSAMPLE
[-I CHROMOSOMEID] [-m METADATA]
[-F GENFOLDER] [-M THETAMUTATIONRATE]
[-L LAMBDAMUTATIONRATE] [-e EFFPOP]
[-r REFPOP] [-b RECOMBRATE]
[-s {distance,custom}] [-f RECOMBFILE]
[-t TOLERANCE] [-C CURRDIR]
[--numproc NUMPROC] [--subgroup SUBGROUP]
[--niter NITER] [--posspecific POSSPECIFIC]
[--prefix PREFIX]
Remove most informative SNP and rerun PLIGHt algorithm of choice
optional arguments:
-h, --help show this help message and exit
--bftfile BFTFILE Best-fit trajectory file from previous run of PLIGHT
--algorithm_type {Exact,Iterative,InReference}
Version of PLIGHT to run the post-sanitization
protocol on (Truncated option not included)
-c CHROMOSOMEFILE, --chromosomefile CHROMOSOMEFILE
Chromosome file name
-O OBSERVEDSAMPLE, --observedsample OBSERVEDSAMPLE
Observed Sample Genotype File
-I CHROMOSOMEID, --chromosomeID CHROMOSOMEID
Chromosome ID
-m METADATA, --metadata METADATA
Metadata file with ancestry information
-F GENFOLDER, --genfolder GENFOLDER
Genotype folder
-M THETAMUTATIONRATE, --thetamutationrate THETAMUTATIONRATE
Theta (Coalescent) Mutation rate
-L LAMBDAMUTATIONRATE, --lambdamutationrate LAMBDAMUTATIONRATE
Lambda (direct error) Mutation rate
-e EFFPOP, --effpop EFFPOP
Effective population
-r REFPOP, --refpop REFPOP
Reference population
-b RECOMBRATE, --recombrate RECOMBRATE
Recombination rate in cM/Mb (if /distance/ option is
chosen)
-s {distance,custom}, --recombswitch {distance,custom}
Recombination Model Switch (distance = simple linear
increase of recombination rate with distance, custom =
alternative model)
-f RECOMBFILE, --recombfile RECOMBFILE
Custom Recombination Model File
-t TOLERANCE, --tolerance TOLERANCE
Fraction of maximum value used as allowance for
inclusion of a path
-C CURRDIR, --currdir CURRDIR
Current working directory
--numproc NUMPROC Number of processes for parallelization
--subgroup SUBGROUP Number of haplotypes in each iterative scheme subgroup
--niter NITER Number of iterations of bootstrapping process
--posspecific POSSPECIFIC
Position-specific mutation rates included in
observation file? (True/False)
--prefix PREFIX String prefix to append to output Best_trajectories
file, in addition to chromosome number
The parameters unique to this method (as opposed to the others which are the sae as for the PLIGHT HMM algorithms) are:
- --bftfile BFTFILE: This is the best-fit trajectory file from previous run of PLIGHT, which will be used in determining the most informative SNP.
- --algorithm_type {Exact,Iterative,InReference}: This is the algorithm to be run after the SNP has been removed.
The output of the method includes a sanitized SNP list with a .Sanitized
flag appended before the .txt extension.
An example of running PLIGHT_SanitizeGenotypes is provided below:
python3 PLIGHT_SanitizeGenotypes.py --bftfile Exact_chr1_Best_trajectories.tsv --algorithm_type 'Exact' --metadata integrated_call_samples_v3.20130502.ALL.panel --genfolder Genotypes/ --lambdamutationrate 0.1 --effpop 11418 --refpop 200 --recombrate 0.5 --recombswitch distance --tolerance 0.01 --currdir ./ --numproc 12 --posspecific False -c ALL.chr1.phase3_shapeit2_mvncall_integrated_v5a.20130502.genotypes.vcf.gz -O chr1_SNP_list.txt --chromosomeID chr1 --prefix Sanitize_Exact
We include examples of input SNP lists for both cases of (a) non-position-specific and (b) position-specific mutation rates. See the parameter descriptions above for explanations on how these two input files differ.
We provide an example of the best-fit trajectories files produced by the algorithms, where each line represents the pointer from the best-fit pair of haplotypes at the previous query SNP to the corresponding best-fit pair(s) at the current query SNP. Multiple possible best-fit pairs of haplotypes at the current query SNP are separated by semi-colons. The query SNPs are shown in reverse order as this is the manner in which the Viterbi algorithm identifies the optimal trajectories.
[1] Howie, B. N.; Donnelly, P.; Marchini, J. A Flexible and Accurate Genotype Imputation Method for the next Generation of Genome-Wide Association Studies. PLoS Genet. 2009, 5 (6). https://doi.org/10.1371/journal.pgen.1000529.
[2] Loh, P. R.; Danecek, P.; Palamara, P. F.; Fuchsberger, C.; Reshef, Y. A.; Finucane, H. K.; Schoenherr, S.; Forer, L.; McCarthy, S.; Abecasis, G. R.; et al. Reference-Based Phasing Using the Haplotype Reference Consortium Panel. Nat. Genet. 2016, 48 (11), 1443–1448. https://doi.org/10.1038/ng.3679.
[3] Li, N.; Stephens, M. Modeling Linkage Disequilibrium and Identifying Recombination Hotspots Using Single-Nucleotide Polymorphism Data. Genetics 2003, 165, 2213–2233.
[4] Viterbi, A. Error Bounds for Convolutional Codes and an Asymptotically Optimum Decoding Algorithm. IEEE Trans. Inf. Theory 1967, 13 (2), 260–269. https://doi.org/10.1109/TIT.1967.1054010