Skip to content

Commit

Permalink
Merge pull request #34 from zjnolen/ngsf-hmm-pruning
Browse files Browse the repository at this point in the history
ngsF-HMM implementation improvements
  • Loading branch information
zjnolen authored Feb 15, 2024
2 parents 31e67a4 + 5279db6 commit 4a5239a
Show file tree
Hide file tree
Showing 11 changed files with 277 additions and 117 deletions.
17 changes: 12 additions & 5 deletions .test/config/config.yaml
Original file line number Diff line number Diff line change
Expand Up @@ -130,20 +130,27 @@ params:
domajorminor: 1 # Currently only compatible with values 1, 2, 4, 5
domaf: 1
snp_pval: "1e-6"
min_maf: 0.05
min_maf: -1 # Set -1 to disable
mindepthind_heterozygosity: 3
ngsld:
max_kb_dist_est-ld: 200
max_kb_dist_decay: 100
max_kb_dist_pruning: 50
rnd_sample_est-ld: 0.001
max_kb_dist_decay: 100
rnd_sample_decay: 0.9
fit_LDdecay_extra: "--fit_level 3"
fit_LDdecay_n_correction: true
pruning_min-weight: 0.1
max_kb_dist_pruning_dataset: 50 # used for PCA, Admix, NGSrelate, not ngsF-HMM
pruning_min-weight_dataset: 0.1 # used for PCA, Admix, NGSrelate, not ngsF-HMM
ngsf-hmm:
estimate_in_pops: false
prune: false
max_kb_dist_pruning_pop: 50 # used only for ngsf-hmm
pruning_min-weight_pop: 0.8 # used only for ngsf-hmm
min_roh_length: 50000
roh_bins: [ 100000, 250000, 500000, 1000000, 2500000, 5000000 ] # in ascending order, bins are between adjacent values and between min_roh_length and value 1 and the last value and infinity
realsfs:
fold: 1 # Should only be set to 1, unless ancestral reference is given above
sfsboot: 100
sfsboot: 100 # Does nothing for now
fst:
whichFst: 1
win_size: 10000
Expand Down
50 changes: 41 additions & 9 deletions config/README.md
Original file line number Diff line number Diff line change
Expand Up @@ -451,17 +451,16 @@ or a pull request and I'll gladly put it in.
for other options. I have not tested much beyond `1` and `8`, please open
an issue if you have problems.
- `min_maf:` The minimum minor allele frequency required to call a SNP.
(float, [docs](http://www.popgen.dk/angsd/index.php/Allele_Frequencies))
This is set when generating the beagle file, so will filter SNPs for
PCAngsd, NGSadmix, ngsF-HMM, and NGSrelate. If you would like each tool
to handle filtering for maf on its own you can set this to `-1`
(disabled). (float, [docs](http://www.popgen.dk/angsd/index.php/Allele_Frequencies))
- `ngsld:` Settings for ngsLD ([docs](https://github.com/fgvieira/ngsLD))
- `max_kb_dist_est-ld:` For the LD estimates generated when setting
`estimate_ld: true` above, set the maximum distance between sites in kb
that LD will be estimated for (`--max_kb_dist` in ngsLD, integer)
- `max_kb_dist_decay:` The same as `max_kb_dist_est-ld:`, but used when
estimating LD decay when setting `ld_decay: true` above (integer)
- `max_kb_dist_pruning:` The same as `max_kb_dist_est-ld:`, but used when
linkage pruning SNPs as inputs for PCA, Admix, and Inbreeding analyses.
Any positions above this distance will be assumed to be in linkage
equilibrium during the pruning process (integer)
- `rnd_sample_est-ld:` For the LD estimates generated when setting
`estimate_ld: true` above, randomly sample this proportion of pairwise
linkage estimates rather than estimating all (`--rnd_sample` in ngsLD,
Expand All @@ -474,8 +473,41 @@ or a pull request and I'll gladly put it in.
size corrected r^2 model be used? (`true`/`false`, `true` is the
equivalent of passing a sample size to `fit_LDdecay.R` in ngsLD using
`--n_ind`)
- `pruning_min-weight:` The minimum r^2 to assume two positions are in
linkage disequilibrium when pruning (float)
- `max_kb_dist_pruning_dataset:` The same as `max_kb_dist_est-ld:`, but
used when linkage pruning SNPs as inputs for PCAngsd, NGSadmix, and
NGSrelate analyses. Pruning is performed on the whole dataset. Any
positions above this distance will be assumed to be in linkage
equilibrium during the pruning process. (integer)
- `pruning_min-weight_dataset:` The minimum r^2 to assume two positions are
in linkage disequilibrium when pruning for PCAngsd, NGSadmix, and
NGSrelate analyses. (float)
- `ngsf-hmm:` Settings for ngsF-HMM
- `estimate_in_pops:` Set to `true` to run ngsF-HMM separately for each
population in your dataset. Set to `false` to run for whole dataset at
once. ngsF-HMM assumes Hardy-Weinberg Equilibrium (aside from inbreeding)
in the input data, so select the option that most reflects this. You can
use PCA and Admixture analyses to help determine this. (`true`/`false`)
- `prune:` Whether or not to prune SNPs for LD before running the analysis.
ngsF-HMM assumes independent sites, so it is preferred to set this to
`true` to satisfy that expectation. (`true`/`false`)
- `max_kb_dist_pruning_pop:` The maximum distance between sites in kb
that will be treated as in LD when pruning for the ngsF-HMM input. (INT)
- `pruning_min-weight_pop:` The minimum r^2 to assume two positions are in
linkage disequilibrium when pruning for the ngsF-HMM input. Note, that
this likely will be substantially higher for individual populations than
for the whole dataset, as background LD should be higher when no
substructure is present. (float)
- `min_roh_length:` Minimum ROH size in base pairs to include in inbreeding
coefficient calculation. Set if short ROH might be considered low
confidence for your data. (integer)
- `roh_bins:` A list of integers that describe the size classes in base
pairs you would like to partition the inbreeding coefficient by. This can
help visualize how much of the coefficient comes from ROH of certain size
classes (and thus, ages). List should be in ascending order and the first
entry should be greater than `min_roh_length`. The first bin will group
ROH between `min_roh_length` and the first entry, subsequent bins will
group ROH with sizes between adjacent entries in the list, and the final
bin will group all ROH larger than the final entry in the list. (list)
- `realSFS:` Settings for realSFS
- `fold:` Whether or not to fold the produced SFS. Set to 1 if you have not
provided an ancestral-state reference (0 or 1, [docs](http://www.popgen.dk/angsd/index.php/SFS_Estimation))
Expand Down Expand Up @@ -508,6 +540,6 @@ or a pull request and I'll gladly put it in.
assessment. (integer)
- `extra:` Additional arguments to pass to NGSadmix (for instance,
increasing `-maxiter`). (string, [docs](http://www.popgen.dk/software/index.php/NgsAdmix))
- `ibs:` Settings for identity by state calculation with ANGSD
- `-doIBS:` Whether to use a random (1) or consensus (2) base in IBS
- `ibs:` Settings for identity by state calculation with ANGSD
- `-doIBS:` Whether to use a random (1) or consensus (2) base in IBS
distance calculation ([docs](http://www.popgen.dk/angsd/index.php/PCA_MDS))
17 changes: 12 additions & 5 deletions config/config.yaml
Original file line number Diff line number Diff line change
Expand Up @@ -129,20 +129,27 @@ params:
domajorminor: 1 # Currently only compatible with values 1, 2, 4, 5
domaf: 1
snp_pval: "1e-6"
min_maf: 0.05
min_maf: -1 # Set -1 to disable
mindepthind_heterozygosity: 3
ngsld:
max_kb_dist_est-ld: 4000
max_kb_dist_decay: 100
max_kb_dist_pruning: 100
rnd_sample_est-ld: 0.001
max_kb_dist_decay: 100
rnd_sample_decay: 0.001
fit_LDdecay_extra: "--fit_level 100 --fit_boot 100 --plot_size 3,6 --plot_data --plot_no_legend --plot_y_lim 0.5"
fit_LDdecay_n_correction: true
pruning_min-weight: 0.1
max_kb_dist_pruning_dataset: 100 # used for PCA, Admix, NGSrelate, not ngsF-HMM
pruning_min-weight_dataset: 0.1 # used for PCA, Admix, NGSrelate, not ngsF-HMM
ngsf-hmm:
estimate_in_pops: true
prune: true
max_kb_dist_pruning_pop: 100 # used only for ngsf-hmm
pruning_min-weight_pop: 0.8 # used only for ngsf-hmm
min_roh_length: 50000
roh_bins: [ 100000, 250000, 500000, 1000000, 2500000, 5000000 ] # in ascending order, bins are between adjacent values and between min_roh_length and value 1 and the last value and infinity
realsfs:
fold: 1 # Should only be set to 1, unless ancestral reference is given above
sfsboot: 100
sfsboot: 100 # Does nothing for now
fst:
whichFst: 1
win_size: 50000
Expand Down
4 changes: 2 additions & 2 deletions workflow/Snakefile
Original file line number Diff line number Diff line change
Expand Up @@ -203,7 +203,7 @@ if config["analyses"]["heterozygosity_angsd"]:

if config["analyses"]["inbreeding_ngsf-hmm"]:
all_outputs.append(
"results/datasets/{dataset}/plots/inbreeding/{dataset}.{ref}_all_{sites}-filts.froh.pdf"
"results/datasets/{dataset}/plots/inbreeding/{dataset}.{ref}_all_{sites}-filts.ind_froh.tsv"
)

if config["analyses"]["admix_ngsadmix"]:
Expand Down Expand Up @@ -359,7 +359,7 @@ if config["subsample_analyses"]["heterozygosity_angsd"]:
if config["subsample_analyses"]["inbreeding_ngsf-hmm"]:
all_outputs.extend(
expand(
"results/datasets/{{dataset}}/plots/inbreeding/{{dataset}}.{{ref}}_all{dp}_{{sites}}-filts.froh.pdf",
"results/datasets/{{dataset}}/plots/inbreeding/{{dataset}}.{{ref}}_all{dp}_{{sites}}-filts.ind_froh.tsv",
dp=subdp,
)
)
Expand Down
34 changes: 14 additions & 20 deletions workflow/rules/4.1_linkage_pruning.smk
Original file line number Diff line number Diff line change
Expand Up @@ -7,29 +7,23 @@ rule ngsLD_prune_sites:
Prunes SNPs to produce a list of SNPs in linkage equilibrium.
"""
input:
ld=expand(
"results/datasets/{{dataset}}/beagles/pruned/ngsLD/{{dataset}}.{{ref}}_{{population}}{{dp}}_chunk{{chunk}}_{{sites}}-filts.ld_maxkbdist-{maxkb}_rndsample-1.gz",
maxkb=config["params"]["ngsld"]["max_kb_dist_pruning"],
),
ld="results/datasets/{dataset}/beagles/pruned/ngsLD/{dataset}.{ref}_{population}{dp}_chunk{chunk}_{sites}-filts.ld_maxkbdist-{maxkb}_rndsample-1.gz",
output:
sites="results/datasets/{dataset}/beagles/pruned/ngsLD/{dataset}.{ref}_{population}{dp}_chunk{chunk}_{sites}-filts_pruned.sites",
sites="results/datasets/{dataset}/beagles/pruned/ngsLD/{dataset}.{ref}_{population}{dp}_chunk{chunk}_{sites}-filts.pruned_maxkbdist-{maxkb}_minr2-{r2}.sites",
log:
"logs/{dataset}/ngsLD/prune_sites/{dataset}.{ref}_{population}{dp}_chunk{chunk}_{sites}-filts.log",
"logs/{dataset}/ngsLD/prune_sites/{dataset}.{ref}_{population}{dp}_chunk{chunk}_{sites}-filts_maxkbdist-{maxkb}_minr2-{r2}.log",
benchmark:
"benchmarks/{dataset}/ngsLD/prune_sites/{dataset}.{ref}_{population}{dp}_chunk{chunk}_{sites}-filts.log"
"benchmarks/{dataset}/ngsLD/prune_sites/{dataset}.{ref}_{population}{dp}_chunk{chunk}_{sites}-filts_maxkbdist-{maxkb}_minr2-{r2}.log"
container:
ngsld_container
threads: 4
resources:
runtime="10d",
params:
maxdist=lambda w: str(config["params"]["ngsld"]["max_kb_dist_pruning"]) + "000",
minweight=config["params"]["ngsld"]["pruning_min-weight"],
shell:
"""
if [ -s {input.ld} ]; then
zcat {input.ld} | prune_graph --weight-field 'column_7' \
--weight-filter 'column_3 <= {params.maxdist} && column_7 >= {params.minweight}' \
--weight-filter 'column_3 <= {wildcards.maxkb}000 && column_7 >= {wildcards.r2}' \
--out {output.sites}
else
> {output.sites}
Expand All @@ -43,20 +37,20 @@ rule prune_chunk_beagle:
"""
input:
beagle="results/datasets/{dataset}/beagles/chunks/{dataset}.{ref}_{population}{dp}_chunk{chunk}_{sites}-filts.beagle.gz",
sites="results/datasets/{dataset}/beagles/pruned/ngsLD/{dataset}.{ref}_{population}{dp}_chunk{chunk}_{sites}-filts_pruned.sites",
sites="results/datasets/{dataset}/beagles/pruned/ngsLD/{dataset}.{ref}_{population}{dp}_chunk{chunk}_{sites}-filts.pruned_maxkbdist-{maxkb}_minr2-{r2}.sites",
output:
prunedgz="results/datasets/{dataset}/beagles/pruned/chunks/{dataset}.{ref}_{population}{dp}_chunk{chunk}_{sites}-filts_pruned.beagle.gz",
prunedgz="results/datasets/{dataset}/beagles/pruned/chunks/{dataset}.{ref}_{population}{dp}_chunk{chunk}_{sites}-filts.pruned_maxkbdist-{maxkb}_minr2-{r2}.beagle.gz",
log:
"logs/{dataset}/ngsLD/prune_beagle/{dataset}.{ref}_{population}{dp}_chunk{chunk}_{sites}-filts.log",
"logs/{dataset}/ngsLD/prune_beagle/{dataset}.{ref}_{population}{dp}_chunk{chunk}_{sites}-filts.pruned_maxkbdist-{maxkb}_minr2-{r2}.log",
benchmark:
"benchmarks/{dataset}/ngsLD/prune_beagle/{dataset}.{ref}_{population}{dp}_chunk{chunk}_{sites}-filts.log"
"benchmarks/{dataset}/ngsLD/prune_beagle/{dataset}.{ref}_{population}{dp}_chunk{chunk}_{sites}-filts.pruned_maxkbdist-{maxkb}_minr2-{r2}.log"
conda:
"../envs/shell.yaml"
shadow:
"minimal"
threads: lambda wildcards, attempt: attempt
params:
pruned="results/datasets/{dataset}/beagles/pruned/chunks/{dataset}.{ref}_{population}{dp}_chunk{chunk}_{sites}-filts_pruned.beagle",
pruned="results/datasets/{dataset}/beagles/pruned/chunks/{dataset}.{ref}_{population}{dp}_chunk{chunk}_{sites}-filts.pruned_maxkbdist-{maxkb}_minr2-{r2}.beagle",
resources:
runtime=lambda wildcards, attempt: attempt * 120,
shell:
Expand Down Expand Up @@ -84,15 +78,15 @@ rule merge_pruned_beagles:
"""
input:
pruned=lambda w: expand(
"results/datasets/{{dataset}}/beagles/pruned/chunks/{{dataset}}.{{ref}}_{{population}}{{dp}}_chunk{chunk}_{{sites}}-filts_pruned.beagle.gz",
"results/datasets/{{dataset}}/beagles/pruned/chunks/{{dataset}}.{{ref}}_{{population}}{{dp}}_chunk{chunk}_{{sites}}-filts.pruned_maxkbdist-{{maxkb}}_minr2-{{r2}}.beagle.gz",
chunk=chunklist,
),
output:
beagle="results/datasets/{dataset}/beagles/pruned/{dataset}.{ref}_{population}{dp}_{sites}-filts_pruned.beagle.gz",
beagle="results/datasets/{dataset}/beagles/pruned/{dataset}.{ref}_{population}{dp}_{sites}-filts.pruned_maxkbdist-{maxkb}_minr2-{r2}.beagle.gz",
log:
"logs/{dataset}/ngsLD/merge_pruned/{dataset}.{ref}_{population}{dp}_{sites}-filts_merge_pruned.log",
"logs/{dataset}/ngsLD/merge_pruned/{dataset}.{ref}_{population}{dp}_{sites}-filts_merge.pruned_maxkbdist-{maxkb}_minr2-{r2}.log",
benchmark:
"benchmarks/{dataset}/ngsLD/merge_pruned/{dataset}.{ref}_{population}{dp}_{sites}-filts_merge_pruned.log"
"benchmarks/{dataset}/ngsLD/merge_pruned/{dataset}.{ref}_{population}{dp}_{sites}-filts_merge.pruned_maxkbdist-{maxkb}_minr2-{r2}.log"
wildcard_constraints:
population="|".join(
["all"]
Expand Down
6 changes: 5 additions & 1 deletion workflow/rules/5.0_relatedness.smk
Original file line number Diff line number Diff line change
Expand Up @@ -162,7 +162,11 @@ rule ngsrelate:
Estimates inbreeding and relatedness measures using NGSrelate.
"""
input:
beagle="results/datasets/{dataset}/beagles/pruned/{dataset}.{ref}_all{dp}_{sites}-filts_pruned.beagle.gz",
beagle=expand(
"results/datasets/{{dataset}}/beagles/pruned/{{dataset}}.{{ref}}_all{{dp}}_{{sites}}-filts.pruned_maxkbdist-{maxkb}_minr2-{r2}.beagle.gz",
maxkb=config["params"]["ngsld"]["max_kb_dist_pruning_dataset"],
r2=config["params"]["ngsld"]["pruning_min-weight_dataset"],
),
inds="results/datasets/{dataset}/poplists/{dataset}_all.indiv.list",
output:
relate="results/datasets/{dataset}/analyses/kinship/ngsrelate/{dataset}.{ref}_all{dp}_{sites}-filts_relate.tsv",
Expand Down
14 changes: 9 additions & 5 deletions workflow/rules/6.0_pca.smk
Original file line number Diff line number Diff line change
Expand Up @@ -8,13 +8,13 @@ rule remove_excl_pca_admix:
results), while allowing them in all other analyses.
"""
input:
"results/datasets/{dataset}/beagles/pruned/{dataset}.{ref}_all{dp}_{sites}-filts_pruned.beagle.gz",
"results/datasets/{dataset}/beagles/pruned/{dataset}.{ref}_all{dp}_{sites}-filts.pruned_maxkbdist-{maxkb}_minr2-{r2}.beagle.gz",
output:
"results/datasets/{dataset}/beagles/pruned/{dataset}.{ref}_all_excl_pca-admix{dp}_{sites}-filts_pruned.beagle.gz",
"results/datasets/{dataset}/beagles/pruned/{dataset}.{ref}_all_excl_pca-admix{dp}_{sites}-filts.pruned_maxkbdist-{maxkb}_minr2-{r2}.beagle.gz",
log:
"logs/{dataset}/ngsLD/excl_pca_admix_beagle/{dataset}.{ref}_all_excl_pca-admix{dp}_{sites}-filts.log",
"logs/{dataset}/ngsLD/excl_pca_admix_beagle/{dataset}.{ref}_all_excl_pca-admix{dp}_{sites}-filts.pruned_maxkbdist-{maxkb}_minr2-{r2}.log",
benchmark:
"benchmarks/{dataset}/ngsLD/excl_pca_admix_beagle/{dataset}.{ref}_all_excl_pca-admix{dp}_{sites}-filts.log"
"benchmarks/{dataset}/ngsLD/excl_pca_admix_beagle/{dataset}.{ref}_all_excl_pca-admix{dp}_{sites}-filts.pruned_maxkbdist-{maxkb}_minr2-{r2}.log"
conda:
"../envs/shell.yaml"
params:
Expand All @@ -30,7 +30,11 @@ rule pca_pcangsd:
Produces covariance matrix from SNP genotype likelihood data with PCAngsd.
"""
input:
beagle="results/datasets/{dataset}/beagles/pruned/{dataset}.{ref}_{population}{dp}_{sites}-filts_pruned.beagle.gz",
beagle=expand(
"results/datasets/{{dataset}}/beagles/pruned/{{dataset}}.{{ref}}_{{population}}{{dp}}_{{sites}}-filts.pruned_maxkbdist-{maxkb}_minr2-{r2}.beagle.gz",
maxkb=config["params"]["ngsld"]["max_kb_dist_pruning_dataset"],
r2=config["params"]["ngsld"]["pruning_min-weight_dataset"],
),
output:
cov="results/datasets/{dataset}/analyses/pcangsd/{dataset}.{ref}_{population}{dp}_{sites}-filts.cov",
log:
Expand Down
12 changes: 10 additions & 2 deletions workflow/rules/6.1_admixture.smk
Original file line number Diff line number Diff line change
Expand Up @@ -9,7 +9,11 @@ rule ngsAdmix:
replicates have been performed.
"""
input:
beagle="results/datasets/{dataset}/beagles/pruned/{dataset}.{ref}_{population}{dp}_{sites}-filts_pruned.beagle.gz",
beagle=expand(
"results/datasets/{{dataset}}/beagles/pruned/{{dataset}}.{{ref}}_{{population}}{{dp}}_{{sites}}-filts.pruned_maxkbdist-{maxkb}_minr2-{r2}.beagle.gz",
maxkb=config["params"]["ngsld"]["max_kb_dist_pruning_dataset"],
r2=config["params"]["ngsld"]["pruning_min-weight_dataset"],
),
output:
qopt="results/datasets/{dataset}/analyses/ngsadmix/{dataset}.{ref}_{population}{dp}_{sites}-filts_K{kvalue}.qopt",
fopt="results/datasets/{dataset}/analyses/ngsadmix/{dataset}.{ref}_{population}{dp}_{sites}-filts_K{kvalue}.fopt.gz",
Expand Down Expand Up @@ -68,7 +72,11 @@ rule evalAdmix:
Perform evalAdmix analysis to get residuals on best fitting replicate for a given K.
"""
input:
beagle="results/datasets/{dataset}/beagles/pruned/{dataset}.{ref}_{population}{dp}_{sites}-filts_pruned.beagle.gz",
beagle=expand(
"results/datasets/{{dataset}}/beagles/pruned/{{dataset}}.{{ref}}_{{population}}{{dp}}_{{sites}}-filts.pruned_maxkbdist-{maxkb}_minr2-{r2}.beagle.gz",
maxkb=config["params"]["ngsld"]["max_kb_dist_pruning_dataset"],
r2=config["params"]["ngsld"]["pruning_min-weight_dataset"],
),
qopt="results/datasets/{dataset}/analyses/ngsadmix/{dataset}.{ref}_{population}{dp}_{sites}-filts_K{kvalue}.qopt",
fopt="results/datasets/{dataset}/analyses/ngsadmix/{dataset}.{ref}_{population}{dp}_{sites}-filts_K{kvalue}.fopt.gz",
output:
Expand Down
Loading

0 comments on commit 4a5239a

Please sign in to comment.