See methods section for summary of each inbreeding metric, including other aliases, mathematical equations etc.
Vcftools v.1.14 --het calculates a measure of heterozygosity on a per-individual basis. Specfically, the inbreeding coefficient (FH) is estimated for each individual using method of moments
For more details, see:
https://vcftools.github.io/man_latest.html
https://www.cog-genomics.org/plink/1.9/basic_stats#ibc
vcftools --vcf filtered.vcf --het
plink --vcf filtered.vcf --allow-extra-chr --double-id --out filtered_test
plink --bfile filtered_test --recode --tab --allow-extra-chr --out filtered_test
plink --bfile filtered_test --allow-extra-chr --het --ibc --out filtered_test
PLINK --het computes observed and expected autosomal homozygous genotype counts for each sample, and reports method-of-moments F coefficient estimates
PLINK --ibc (ported from GCTA) calculates three inbreeding coefficients for each sample. Briefly, Fhat1 is the usual variance-standardized relationship minus 1, Fhat2 is approximately equal to the --het estimate, and Fhat3 is based on the correlation between uniting gametes.
InbreedR: standardised multilocus heterozygosity (sMLH), identity disequilibrium (g2), and heterozygosity-heterozygosity correlation coefficients
"Using SNP markers, it is possible to measure variance in inbreeding through the strength of correlation in heterozygosity across marker loci, termed identity disequilibrium (ID). ID can be quantified using the measure g2, which is also a central parameter in HFC theory that can be used within a wider framework to estimate the direct impact of inbreeding on both marker heterozygosity and fitness."(Stoffel et al. 2016).
For more details, see:
https://besjournals.onlinelibrary.wiley.com/doi/10.1111/2041-210X.12588
https://cran.r-project.org/web/packages/inbreedR/vignettes/inbreedR_step_by_step.html
library(vcfR)
library(reshape2)
vcf_file <- "filtered.vcf
vcf <- read.vcfR(vcf_file, verbose = FALSE )
gt <- extract.gt(vcf)
gt <- as.data.frame(t(gt), stringsAsFactors = FALSE)
gt[gt == "."] <- NA
snp_geno <- do.call(cbind, apply(gt, 2, function(x) colsplit(x, "/", c("a","b"))))
kakapo_snp_genotypes <- inbreedR::convert_raw(snp_geno)
library(inbreedR)
check_data(kakapo_snp_genotypes)
library(inbreedR)
het <- sMLH(kakapo_snp_genotypes)
het
write.table(het, "sMLH_kakapo.txt", sep="\t")
plot(het)
plot(het, main = "sMLH",
col = "cornflowerblue", cex.axis=0.95)
het_var <- var(het)
het_var
plot(het_var, main = "Variance in sMLH", col = "cornflowerblue", cex.axis=0.95)
g2_kakapo_snps <- g2_snps(kakapo_snp_genotypes, nperm = 100, nboot = 10, CI = 0.95, parallel = FALSE, ncores = NULL)
g2_kakapo_snps
plot(g2_kakapo_snps, main = "Identity disequilibrium", col = "cornflowerblue", cex.axis=0.95)
HHC_kakapo_snps <- HHC(kakapo_snp_genotypes, reps = 100)
HHC_kakapo_snps
plot(HHC_kakapo_snps, main = "Heterozygosity-heterozygosity correlations", col = "cornflowerblue", cex.axis=0.85)
hf <- r2_hf(genotypes = kakapo_snp_genotypes, nboot = 100, type = "snps", parallel = FALSE)
hf
plot(hf)
plot(hf, main = "r2 bootstrapping distribution - estimated r2 with CI", col = "cornflowerblue", cex.axis=0.85)
"Unbiased estimates of relatedness can be obtained by using only those SNPs with genotype calls in both individuals. The expected value of this estimator is independent of the SNP depth in each individual, under a model of genotype calling that includes the special case of the two alleles being read at random" (Dodds et al. 2015)
In contrast, the estimator of self-relatedness (diagonal of GRM = Fgrm) does depend on the SNP depth, providing unbiased estimates of self-relatedness
For more details, see:
https://github.com/AgResearch/KGD
https://bmcgenomics.biomedcentral.com/articles/10.1186/s12864-015-2252-3
python2.7 vcf2ra.py filtered.vcf
genofile <- "samples.vcf.ra.tab"
gform <- "Tassel"
source("GBS-Chip-Gmatrix.R")
Gfull <- calcG()
write.table(Gfull$G5,"G5.txt",sep="\t")
write.table(Gfull_heatmap$G5, "G5_heatmap.txt", sep="\t")
In Excel, extract diagonal of G5 matrix for self-relatedness estimates, Fgrm = diagonal elements of matrix minus one
=INDEX(A1:E1,,ROWS($1:1))
Fgrm = diagonal of matrix - 1
For more details, see:
https://www.ncbi.nlm.nih.gov/pmc/articles/PMC5775499/pdf/EVA-11-123.pdf
https://www.ncbi.nlm.nih.gov/pmc/articles/PMC4815495/pdf/hdy201517a.pdf
1028666.775/12241 = 84.03453762 kb/SNPs, then PLINK parameter density (according to Kardos et al.) is real SNP density x 1.5
plink --vcf filtered.vcf --allow-extra-chr --double-id --out filtered_test
plink --bfile filtered_test --recode --tab --allow-extra-chr --out filtered_test
plink --file filtered_roh_80 --allow-extra-chr --homozyg --homozyg-snp 25 --homozyg-het 1 --homozyg-density 130 --homozyg-gap 1000 --homozyg-window-snp 25 --homozyg-window-het 1 --homozyg-window-missing 5 --homozyg-window-threshold 0.05 --out roh_130d
#SNP Density 130 kb/SNP, >25 SNPs per ROH #--homozyg: Scan complete, found 9372 ROH
####ADDITIONAL PARAMETER TESTING: in addition to calculating SNP density, the total length of RoH (Mb, from .hom.indv output) plotted against frequency of missing data per individual (vcftools --missing-indv), to see effect of missing data on calling of RoH
####130 DENSITY IS MOST SUITABLE: doesn't inflate individuals with larger missing data i.e. individuals with missing data aren't calling more/less RoH than those with more markers
####Allowing up to three heterozygous sites (--homozyg-het 3) as suggested in Ceballos et al.(2018), did not impact the number of RoH found.
############################