KAML
is designed to predict genetic values using genome-wide or chromosome-wide SNPs for either simple traits that controlled by a limited number of major genes or complex traits that influenced by many polygenes with minor effects. In brief, KAML
provides a flexible assumption to accommodate traits of various genetic architectures and incorporates pseudo-QTNs as fixed effect terms and a trait-specific random effect term under the LMM framework. The model parameters are optimized using the information of bootstrap strategy based GWAS results in a parallel accelerated machine learning procedure combing cross-validation, grid search and bisection algorithms. For more statistical methods under the BLUP Framework please see our developed package HIBLUP
(https://www.hiblup.com).
KAML
is developed by Lilin Yin with the help of Haohao Zhang, and Xiaolei Liu* at the Huazhong(Central China) Agricultural University.
If you have any bug reports or questions, please feed back 👉here👈, or send email to xiaoleiliu@mail.hzau.edu.cn
Yin, L., Zhang, H., Zhou, X. et al. KAML: improving genomic prediction accuracy of complex traits using machine learning determined parameters. Genome Biol 21, 146 (2020). https://doi.org/10.1186/s13059-020-02052-w
📫 HIBLUP: Versatile and easy-to-use GS toolbox. | 🍀 SIMER: data simulation for life science and breeding. |
🏊 hibayes: A Bayesian-based GWAS and GS tool. | 🏔️ IAnimal: an omics knowledgebase for animals. |
📮 rMVP: Efficient and easy-to-use GWAS tool. | 📊 CMplot: A drawing tool for genetic analyses. |
KAML
is written in R language, it is recommended to link MKL (Math Kernel Library) or OpenBLAS with R for fast computing with big data (see how to link MKL with R), because the BLAS/LAPACK library can be accelerated automatically in multi-threads, which would significantly reduce computational time.
(1) KAML
can be installed with "devtools" by using the following R codes:
#if "devtools" isn't installed, please "install.packages(devtools)" first.
devtools::install_github("YinLiLin/KAML")
(2) If you get trouble in installing "devtools", try to install it locally as following:
# install required packages first
R> pkg <- c("RcppArmadillo", "RcppEigen", "RcppProgress", "bigmemory", "gaston", "RhpcBLASctl")
R> new.pkg <- pkg[!(pkg %in% installed.packages()[,"Package"])]
R> if(length(new.pkg)) install.packages(new.pkg)
shell$ git clone https://github.com/YinLiLin/KAML
shell$ R CMD INSTALL KAML
After installed successfully, the KAML
package can be loaded by typing
library(KAML)
Typing ?KAML
could get the details of all parameters. Some related functions (Data preparation) for KAML
could be refered to our developed GWAS package rMVP (https://github.com/XiaoleiLiuBio/rMVP).
The example data can be downloaded by typing:
wget https://github.com/YinLiLin/KAML/releases/download/1.0.1/example.zip
unzip example.zip
Or by clicking the example.zip in your browser. After downloaded, unzip the file and change the workplace by setwd("")
.
The file must contain a header row, which may represents the trait names. The missing values should be denoted by NA, which will be treated as candidates. Notice that only the numeric values are allowed and the characters will not be recognized. However, if a phenotype takes only values of 0, 1 (or only two levels), KAML
would consider it to be a case-control trait, and the predicted value could be directly interpreted as the probability of being a case.
mouse.Pheno.txt
Trait1 | Trait2 | Trait3 | Trait4 | Trait5 | Trait6 |
---|---|---|---|---|---|
0.224992 | 0.224991 | NA | 1 | NA | -0.285427 |
-0.974543 | -0.974542 | NA | 0 | NA | -2.333531 |
0.195909 | 0.195909 | NA | 1 | NA | 0.046818 |
NA | NA | NA | NA | NA | NA |
NA | NA | NA | NA | NA | NA |
... | ... | ... | ... | ... | ... |
NA | NA | NA | NA | NA | 0.720009 |
The Covariate file is optional, in order to fit the model for raw phenotype prediction, users can provide the covariates. Please attention that NAs are not allowed in the Covariate file, and all individuals should be in the same order with phenotype file. In KAML
, there are two types of covariates: dcovfile
and qcovfile
.
dcovfile
(optional): discrete covariates, e.g. dcov.txt. Each discrete covariate is recognized as a categorical factor with several levels. Each level can be denoted by a single character, a word, or a numerical number.
qcovfile
(optional): quantitative covariates, e.g. qcov.txt. Each quantitative covariate is recognized as a continuous variable.
NOTE: the design matrix of the mean, which is a vector of all ones, in the model is always a linear combination of the design matrix of a discrete covariate so that not all the effects of the levels (or classes, e.g. male and female) of a discrete covariate are estimable. KAML
will always constrain the effect of the first level to be zero and the effect of any other level represents its difference in effect compared to the first level.
dcov.txt |
qcov.txt |
|||||||||||||||||||||||||||||||||||||||||||||||||
|
|
KAML
requires a n×n matrix that represents the relationship among individuals. It could be also supplied by the users, however, in this case, the order of individuals in either row or column should be the same as phenotype file, the column and row names are not needed.
mouse.Kin.txt
(optional)
0.3032 | -0.0193 | 0.0094 | 0.0024 | 0.0381 | ... | -0.0072 |
-0.0193 | 0.274 | -0.0243 | 0.0032 | -0.0081 | ... | 0.0056 |
0.0094 | -0.0243 | 0.3207 | -0.0071 | -0.0045 | ... | -0.0407 |
0.0024 | 0.0032 | -0.0071 | 0.321 | -0.008 | ... | -0.0093 |
0.0381 | -0.0081 | -0.0045 | -0.008 | 0.3498 | ... | -0.0238 |
... | ... | ... | ... | ... | ... | ... |
-0.0072 | 0.0056 | -0.0407 | -0.0093 | -0.0238 | ... | 0.3436 |
With the increasing number of SNPs, the genotype file becomes very big and it is not possible to read it into the memory directly with a memory-limited machine. Hence, KAML
is integrated with bigmemory
package, which designed a specific data format named big.matrix for saving the memory usage.
A total of two files should be provided: *.geno.bin
and *.geno.desc
, both files should use the same prefix. *.geno.bin
is the numeric m(number of markers) by n(number of individuals) genotype file in big.matrix format, and *.geno.desc
is the description file of *.geno.bin
. Actually, users could manually make those files, but time-consuming and error prone, so KAML
provides a function KAML.Data()
for genotype format transformation. In the released version, KAML
could accept the genotype file in four formats, including the Hapmap format, the VCF format, the PLINK Binary format, and the Numeric format. When transforming, missing genotypes are replaced by the selected methods (Left, Middle, "Right") of a given marker. After transformed, KAML
can read it on-the-fly without a memory attenuation.
NOTE: No matter what type of genotype format, the order of individuals in columns should be the same as phenotype file.
Hapmap is one of the commonly used data format for storing genotype. As the example shown below, the SNP information is stored in rows while the individual information is stored in columns. The first 11 columns showed attributes of the SNPs and the remaining columns show the nucleotides information that genotyped at each SNP for all individuals.
mouse.hmp.txt
rs# | alleles | chrom | pos | strand | assembly# | center | protLSID | assayLSID | panelLSID | QCcode | A048005080 | A048006063 | A048006555 | A048007096 | A048010273 | ... | A084292044 |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
rs3683945 | G/A | 1 | 3197400 | + | NA | NA | NA | NA | NA | NA | AG | AG | GG | AG | GG | ... | AA |
rs3707673 | A/G | 1 | 3407393 | + | NA | NA | NA | NA | NA | NA | GA | GA | AA | GA | AA | ... | GG |
rs6269442 | G/A | 1 | 3492195 | + | NA | NA | NA | NA | NA | NA | AG | GG | GG | AG | GG | ... | AA |
rs6336442 | G/A | 1 | 3580634 | + | NA | NA | NA | NA | NA | NA | AG | AG | GG | AG | GG | ... | AA |
rs13475699 | G | 1 | 3860406 | + | NA | NA | NA | NA | NA | NA | GG | GG | GG | GG | GG | ... | GG |
The genotype in Hapmap format can be transformed to big.matrix format by the following codes:
KAML.Data(hfile="mouse.hmp.txt", out="mouse")
If the genotype in Hapmap format is stored in multiple files for many chromosomes, it can be transformed to big.matrix format by the following codes:
KAML.Data(hfile=c("mouse.chr1.hmp.txt", "mouse.chr2.hmp.txt",...), out="mouse")
VCF (Variant Call Format) file has been developed with the advent of large-scale genotyping and DNA sequencing projects, such as the 1000 Genomes Project, it is one of the most widely used genotype format. An VCF file example is shown below:
##fileformat=VCFv4.2
##fileDate=20171105
##source=PLINKv1.90
##contig=<ID=1,length=2>
##INFO=<ID=PR,Number=0,Type=Flag,Description="Provisional reference allele, may not be based on real reference genome">
##FORMAT=<ID=GT,Number=1,Type=String,Description="Genotype">
#CHROM POS ID REF ALT QUAL FILTER INFO FORMAT -9_CZTB0004 -9_CZTB0006 -9_CZTB0008 -9_CZTB0010 -9_CZTB0011 -9_CZTB0012
1 1 10000235 A C . . PR GT 0/1 0/0 0/0 0/0 0/0 0/1
1 1 10000345 A G . . PR GT 0/0 0/0 0/0 0/0 1/1 1/1
1 1 10004575 G . . . PR GT 0/0 0/0 0/0 0/0 0/0 0/0
1 1 10006974 C T . . PR GT 0/0 0/0 0/1 1/1 0/1 1/1
1 1 10006986 A G . . PR GT 0/0 0/0 0/1 ./. 1/1 1/1
The genotype in VCF format can be transformed to big.matrix format by the following codes:
KAML.Data(vfile="mouse.vcf", out="mouse")
If the genotype in VCF format is stored in multiple files for many chromosomes, it can be transformed to big.matrix format by the following codes:
KAML.Data(vfile=c("mouse1.vcf", "mouse2.vcf",...), out="mouse")
The PLINK Banary format is derived from Plink software. This format requires three files: *.bed, *.bim and
*.fam, all with the same prefix. KAML
only use the *.bed and the *.bim file.
NOTE: the SNP ID must be unique.
mouse.fam
,mouse.bim
,mouse.bed
The genotype in PLINK Banary format can be transformed to big.matrix format by the following codes:
KAML.Data(bfile="mouse", out="mouse")
KAML
also accepts the Numeric format. The homozygote should be coded as 0, 2 while the heterozygote is coded as 1. The SNP information is stored in the rows and individual information is stored in the columns, it means that the dimension of the numeric matrix is m by n, the order of individuals in columns must correspond to the phenotype file in rows. Additionally, this format does not contain the chromosome and position of the SNPs. Therefore, two separate files should be provided, including one file contains the numeric genotype data, and the other file contains the position of each SNP.
NOTE: Row names and column names are not allowed, the number of row and the order of SNPs must be same in the two files.
mouse.Numeric.txt |
mouse.map |
||||||||||||||||||||||||||||||||||||||||||||||||||
|
|
This type of file can be transformed by the following codes:
KAML.Data(numfile="mouse.Numeric.txt", mapfile="mouse.map", out="mouse")
If there are missing values in your genotype, please impute as following:
KAML.Impute("mouse")
After transformed from one of four types of format above, two needed files will be generated, all with the same prefix which is assigned by users in the KAML.Data
function.
The example mouse datasets:
mouse.geno.desc, mouse.geno.bin
To run KAML
, you should provide two basic files: the phenotype file (values for training, NAs for predictors) and the genotype file. By default, the first column of phenotype will be analyzed, if there are more than one trait, please specify which column of trait is to be analyzed with the parameter "pheno=". For example: KAML(..., pheno=3)
means the trait in third column would be analyzed. For the genotype, only the prefix need to be assigned, KAML
could automatically attach the *.geno.bin
and *.geno.desc
files.
Note again: KAML
has no function for adjusting the order of individuals. So please make sure the same order of individuals between phenotype and genotype.
> mykaml <- KAML(pfile="mouse.Pheno.txt", pheno=1, gfile="mouse")
# pfile: phenotype file
# pheno: the column number of predicted phenotype
# gfile: transformed genotype file
# multiple threads computation
> mykaml <- KAML(pfile="mouse.Pheno.txt", pheno=1, gfile="mouse", cpu=30)
# cpu: number pf threads
Run KAML
with the provided covariate file cfile and kinship file kfile:
> mykaml <- KAML(pfile="mouse.Pheno.txt", pheno=1, gfile="mouse", cfile="CV.txt", kfile="mouse.Kin.txt")
# cfile: covariates file
# kfile: kinship file
Set the sample number sample.num and validation number crv.num for cross_validation:
> mykaml <- KAML(pfile="mouse.Pheno.txt", pheno=1, gfile="mouse", sample.num=2, crv.num=5)
# sample.num: the number of replicates on cross-validation
# crv.num: fold of cross-validation
Change the top selected number of SNPs Top.num and GWAS model (the options are "MLM", "GLM", "RR"):
> mykaml <- KAML(pfile="mouse.Pheno.txt", pheno=1, gfile="mouse", Top.num=15, GWAS.model="MLM")
# Top.num: max number of top LD-filtered SNPs before pseudo QTNs optimization
# GWAS.model: select the model of Genome-Wide Association Study
Change the methods of variance components estimation vc.method (the options are "brent", "emma", "he"):
> mykaml <- KAML(pfile="mouse.Pheno.txt", pheno=1, gfile="mouse", GWAS.model="MLM", vc.method="brent")
# vc.method: select the method of variance components estimation
Change the start value of grid search procedure of Kinship optimization Top.perc & Logx:
> mykaml <- KAML(pfile="mouse.Pheno.txt", pheno=1, gfile="mouse", GWAS.model="MLM", vc.method="brent",
Top.perc=c(0.0001,0.001,0.01), Logx=c(0.01,0.05,0.1,0.5,1,5,10,15))
# Top.perc: prior value of the percentage of SNPs which would be given more weights
# Logx: prior value of the base of log function
# Note: More levels of start values will lead to much more calculation burden.
Change the maximum iteration number of bisection algorithm bisection.loop:
> mykaml <- KAML(pfile="mouse.Pheno.txt", pheno=1, gfile="mouse", GWAS.model="MLM", vc.method="brent",
Top.perc=c(0.0001,0.001,0.01), Logx=c(0.01,0.05,0.1,0.5,1,5,10,15), bisection.loop=8)
# bisection.loop: the max number of iteration for bisection algorithm
# Note: if bisection.loop=0, the bisection procedure will not run.
Only to optimize a trait-specific (weighted) kinship matrix:
> mykaml <- KAML(pfile="mouse.Pheno.txt", pheno=1, gfile="mouse", Top.num=NULL)
Only to add the selected pseudo QTNs with big effects as covariates:
> mykaml <- KAML(pfile="mouse.Pheno.txt", pheno=1, gfile="mouse", Top.perc=NULL)
Switch KAML to LMM (GBLUP)
> mykaml <- KAML(pfile="mouse.Pheno.txt", pheno=1, gfile="mouse", Top.num=NULL, Top.perc=NULL)
Integrate some previously validated causal SNPs of the trait as covariates directly:
# directly predict by LM (QTN)
> mykaml <- KAML(pfile="mouse.Pheno.txt", pheno=1, gfile="mouse", prior.QTN=c(9358, 9375), prior.model="QTN")
# predict by MLM with QTNs and a Kinship optimization procedure (QTN + weighted Kinship)
> mykaml <- KAML(pfile="mouse.Pheno.txt", pheno=1, gfile="mouse", prior.QTN=c(9358, 9375), prior.model="QTN+K")
# predict by MLM with QTNs and a Kinship (QTN + standard Kinship)
> mykaml <- KAML(pfile="mouse.Pheno.txt", pheno=1, gfile="mouse", prior.QTN=c(9358, 9375),
prior.model="QTN+K", Top.perc=NULL)
In realistic analysis, we don't know its actual genetic architecture of unknow traits, which could be obtained from a machine learning strategy of KAML
. Although those procedures could be speeded up by parallel computation, it's still time-consuming with limited computation resources. So it would be a better choice to run KAML
within a smaller population to obtain the parameters, and then apply the optimized parameters to greater populations, which has been proved to be more efficient but generate similar prediction performance in our numbers of tests.
mykaml <- KAML(pfile="mouse.Pheno.txt", pheno=1, gfile="mouse", prior.QTN=c(9358, 9375), prior.model="QTN+K",
Top.perc=0.0276, Logx=3.1094)
Integrate KAML to SSBLUP (SSKAML)
The weighted Kinship matrix can be directly applied into SSBLUP model to improve the prediction accuracy for both genotyped and non-genotyped individuals. To run SSBLUP model, please install our developed tool HIBLUP.
library(hiblup)
library(KAML)
# extract the weighted kinship from KAML
G <- mykaml$K
G.ind <- read.table("mouse.Pheno.txt", head=F)[,1]
# fit SSBLUP
fit <- hiblup(pheno=pheno, pedigree=pedigree, geno=NULL, G=G, geno.ind=G.ind)
KAML
returns total 11 lists of results including: predicted phenotype, coefficients of all fixed effects, predicted GEBVs, genetic variance, residual variance, pseudo QTNs, the used model, the weights of all SNPs when constructing GRM, the optimized top percentage, the optimized base value of Log function and the optimized Kinship matrix.
> str(mykaml)
List of 11
$ y : num [1:1631, 1] 1541 1632 1568 1726 1749 ...
..- attr(*, "dimnames")=List of 2
.. ..$ : NULL
.. ..$ : chr "t1"
$ beta : num 1631
$ gebv : num [1:1631, 1] -90.019 0.997 -63.775 94.4 117.305 ...
..- attr(*, "dimnames")=List of 2
.. ..$ : NULL
.. ..$ : chr "t1"
$ vg : num 20665
$ ve : num 1773
$ qtn : NULL
$ model : chr "K"
$ weight : num [1:45426] 0.984 0.984 0.984 0.984 0.984 ...
$ top.perc: num 0.00775
$ logx : num 15.5
$ K : num [1:1631, 1:1631] 0.94824 -0.00781 -0.00558 -0.02665 -0.01755 ...
🆘 Question1: Failing to install "devtools":
ERROR: configuration failed for package ‘git2r’
removing ‘/Users/acer/R/3.4/library/git2r’
ERROR: dependency ‘git2r’ is not available for package ‘devtools’
removing ‘/Users/acer/R/3.4/library/devtools’
😋 Answer: Please type the following codes in terminal.
apt-get install libssl-dev/unstable
🆘 Question2: When installing packages from Github with "devtools", there is a error:
Error in curl::curl_fetch_disk(url, x$path, handle = handle): Problem with the SSL CA cert (path? access rights?)
😋 Answer: Please type the following codes and then try again.
library(httr)
set_config(config(ssl_verifypeer = 0L))
Questions, suggestions, and bug reports are welcome and appreciated. ➡️