---
output: github_document
---

# grur <a href='https://thierrygosselin.github.io/grur/'><img src='man/figures/logo.png' align="right" height="139" /></a>


```{r, echo = FALSE}
description <- readLines("DESCRIPTION")
rvers <- stringr::str_match(grep("R \\(", description, value = TRUE), "[0-9]{1,4}\\.[0-9]{1,4}\\.[0-9]{1,4}")[1,1]
version <- gsub(" ", "", gsub("Version:", "", grep("Version:", description, value = TRUE)))
```

<!-- badges: start -->

[![lifecycle](https://img.shields.io/badge/lifecycle-experimental-orange.svg)](https://tidyverse.org/lifecycle/#experimental)
[![CRAN_Status_Badge](http://www.r-pkg.org/badges/version/grur)](http://cran.r-project.org/package=grur)
[![Project Status: Active – The project has reached a stable, usable state and is being actively developed.](http://www.repostatus.org/badges/latest/active.svg)](http://www.repostatus.org/#active)
[![minimal R version](https://img.shields.io/badge/R%3E%3D-`r rvers`-6666ff.svg)](https://cran.r-project.org/)
[![packageversion](https://img.shields.io/badge/Package%20version-`r version`-orange.svg)](commits/master)
[![Last-changedate](https://img.shields.io/badge/last%20change-`r gsub('-', '--', Sys.Date())`-brightgreen.svg)](/commits/master)
[![R-CMD-checks](https://github.com/thierrygosselin/grur/workflows/R-CMD-check/badge.svg)](https://github.com/thierrygosselin/grur/actions)
[![DOI](https://zenodo.org/badge/87596763.svg)](https://zenodo.org/badge/latestdoi/87596763)
<!-- badges: end -->

```{r setup, include = FALSE}
knitr::opts_chunk$set(
  collapse = TRUE,
  comment = "#>",
  fig.path = "README-"
)
```


https://thierrygosselin.github.io/grur/

The name **grur** |ɡro͞oˈr| was chosen because the missing genotypes dilemma with
RADseq data reminds me of the cheese paradox. 

Here, I don't want to fuel a
[war](http://www.lefigaro.fr/flash-eco/2012/12/07/97002-20121207FILWWW00487-le-gruyere-francais-doit-avoir-des-trous.php)
or the controversy over cheese with holes,
so choose as you like: the *French Gruyère* or the *Swiss Emmental*.
The paradox is that the more cheese you have, the more holes you get.
But the more holes you have, the less cheese you have...
So someone could conclude: the more cheese = the less cheese?
I'll leave that up to you. Back to genomics...

Numerous genomic analyses are vulnerable to missing values.
Don't get trapped by missing genotypes in your RADseq dataset.

Use **grur** to **visualize patterns of missingness** and 
**perform map-independent imputations of missing genotypes** 
(see [features](https://github.com/thierrygosselin/grur#features) below).


## Installation
```r
# Install remotes only if it is missing, then install grur from GitHub
if (!requireNamespace("remotes", quietly = TRUE)) install.packages("remotes")
remotes::install_github("thierrygosselin/grur")
library(grur)
```
Note: not all the packages used for imputations inside **grur** are installed automatically. Why?

* Not all methods will be of interest to you.
* Some modules used for imputations are more complicated to install and,
depending on your OS, will definitely test your R skills and patience.
* By default, you'll be able to run `grur::missing_visualization` to check for
patterns of missingness (the first step...).

## Installation details for additional imputation options

Please follow additional instructions in the [vignette](http://thierrygosselin.github.io/grur/articles/rad_genomics_computer_setup.html) to install the required packages for the imputation options you want to conduct:

| imputation options | package | installation difficulty | install instructions |
|:---------|:---------:|:---------:|:---------|
|**imputation.method = "lightgbm"**|`lightgbm`|difficult|[vignette](http://thierrygosselin.github.io/grur/articles/rad_genomics_computer_setup.html)|
|**imputation.method = "xgboost"**|`xgboost`|moderate|[vignette](http://thierrygosselin.github.io/grur/articles/rad_genomics_computer_setup.html)|
|**imputation.method = "rf"**|`randomForestSRC`|moderate|[vignette](http://thierrygosselin.github.io/grur/articles/rad_genomics_computer_setup.html)|
|**imputation.method = "rf_pred"**|`ranger`|easy|`install.packages("ranger")`|
|**if using pmm > 0**|`missRanger`|easy|`install.packages("missRanger")`|
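
Before choosing an imputation option, it can help to check which of the optional packages from the table are already available on your machine. The snippet below is a minimal sketch in base R, not part of **grur**; the object names (`optional_pkgs`, `backend_status`) are made up for this example.

```r
# Optional backends, copied from the table above
# (names = imputation option, values = package that must be installed).
optional_pkgs <- c(
  lightgbm = "lightgbm",
  xgboost  = "xgboost",
  rf       = "randomForestSRC",
  rf_pred  = "ranger",
  pmm      = "missRanger"
)

# requireNamespace() returns TRUE/FALSE without attaching the package.
installed <- vapply(
  optional_pkgs,
  function(pkg) requireNamespace(pkg, quietly = TRUE),
  logical(1)
)

# One row per imputation option, showing whether its backend is available.
backend_status <- data.frame(
  option    = names(optional_pkgs),
  package   = unname(optional_pkgs),
  installed = unname(installed)
)
print(backend_status)
```

Any row with `installed = FALSE` points to a package you would need to install, following the instructions linked in the table, before using the matching option.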

Website and additional info: [https://thierrygosselin.github.io/grur/](https://thierrygosselin.github.io/grur/)

* [Computer setup - installation - troubleshooting](http://thierrygosselin.github.io/grur/articles/rad_genomics_computer_setup.html)
* [Function documentation](http://thierrygosselin.github.io/grur/reference/index.html)
* [grur's features](https://thierrygosselin.github.io/grur/index.html#features)
* [Vignettes](http://thierrygosselin.github.io/grur/articles/index.html)
* How to cite grur: inside R type `citation("grur")`

## Life cycle

**grur** is still experimental. To make the package better, changes are
inevitable: experimental functions will change and argument names will change.
Your code and workflows might break from time to time **until grur is stable**.
Consequently, depending on your tolerance to change, **grur** might not be for you.

* Philosophy, major changes and deprecated functions/arguments are documented in
the life cycle section of each function's documentation.
* [changelog, versions, new features and bug history](https://thierrygosselin.github.io/grur/news/index.html)
* [issues](https://github.com/thierrygosselin/grur/issues/new/choose) and [contributions](https://github.com/thierrygosselin/grur/issues/new/choose)

## Assumptions before imputing your dataset

1. **Filtered data**: Please don't try **grur** with raw data consisting of > 100K SNPs;
you will generate all sorts of biases and you'll be disappointed. **Filter your data first!**
[radiator](https://thierrygosselin.github.io/radiator/) was designed for this.

2. **Correlations**: Machine learning algorithms will work better and 
faster if correlations are reduced to a minimum. If you used 
[filter_rad](https://thierrygosselin.github.io/radiator/reference/filter_rad.html)
to filter your dataset, you should be ok. If not, check your dataset for 
[short and long LD](https://thierrygosselin.github.io/radiator/reference/filter_ld.html).

3. **Pattern of individual heterozygosity**: If you have individual 
heterozygosity patterns and/or correlation of individual heterozygosity with 
missingness, you might want to skip imputation and go back to filter your data. 
Please check out [radiator::detect_mixed_genomes](https://thierrygosselin.github.io/radiator/reference/detect_mixed_genomes.html) and [radiator::detect_het_outliers](https://thierrygosselin.github.io/radiator/reference/detect_het_outliers.html).

4. **Patterns of missingness**: Look for patterns of missingness [(vignette)](https://thierrygosselin.github.io/grur/articles/vignette_missing_data_analysis.html) to better understand the reasons for their presence and tailor the arguments inside grur's imputation module.

5. **Default arguments**: Defaults are there for testing. Please, please, please, don't use
grur's defaults for publications!
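
As a rough illustration of assumptions 3 and 4, the sketch below computes per-individual missingness, per-individual heterozygosity and their correlation on a toy genotype matrix. It uses base R only and simulated data; it is not how **grur** or **radiator** implement these checks, and the 0/1/2 allele-count coding is an assumption made for the example.

```r
set.seed(42)

# Toy genotype matrix: 20 individuals x 200 biallelic SNPs,
# coded as the count of alternate alleles (0, 1 or 2).
n_ind <- 20
n_snp <- 200
geno <- matrix(
  sample(0:2, n_ind * n_snp, replace = TRUE, prob = c(0.5, 0.3, 0.2)),
  nrow = n_ind,
  dimnames = list(paste0("ind_", seq_len(n_ind)), paste0("snp_", seq_len(n_snp)))
)

# Knock out 10% of the genotypes completely at random.
geno[sample(length(geno), size = round(0.1 * length(geno)))] <- NA

# Per-individual proportion of missing genotypes.
ind_missing <- rowMeans(is.na(geno))

# Per-individual observed heterozygosity, using non-missing genotypes only.
ind_het <- rowMeans(geno == 1, na.rm = TRUE)

# A strong correlation here would be a warning sign (assumption 3).
het_miss_cor <- cor(ind_missing, ind_het)

# Per-marker missingness, useful to spot SNPs that should have been filtered.
snp_missing <- colMeans(is.na(geno))

summary(ind_missing)
summary(snp_missing)
het_miss_cor
```

Because the missing genotypes are removed at random here, no strong correlation is expected. With real RADseq data, a marked correlation or a few individuals with extreme values would be a reason to go back to filtering before imputing.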


## Features
| Characteristics | Description |
|:-------------------|:--------------------------------------------------------|
|**Simulate RADseq data**|`simulate_rad`: simulate populations of RADseq data following island or stepping stone models. Inside the function, allele frequencies can be generated with [fastsimcoal2](http://cmpg.unibe.ch/software/fastsimcoal2/) and then used inside the [rmetasim](https://github.com/stranda/rmetasim) simulation engine. *Vignette coming soon*.|
|**Patterns of missingness**|`missing_visualization`: visualize patterns of missing data associated with different variables of your study (lanes, chips, sequencers, populations, sample sites, reads/samples, homozygosity, etc). Similar to PLINK's identity-by-missingness (IBM) analysis, **grur** is more powerful because it generates more analyses and automatically creates tables and figures ([see vignette](https://thierrygosselin.github.io/grur/articles/vignette_missing_data_analysis.html)). <br><br>`generate_missing`: generates missing genotypes in a (simulated) dataset based on a compound Dirichlet-multinomial distribution. *Vignette coming soon*.|
|**Imputations** `grur_imputations`|**Map-independent** imputations of missing genotypes with several algorithms (including machine learning):<br>* **Random Forests** (on-the-fly imputations with randomForestSRC or predictive modelling with ranger and missRanger),<br>* **Extreme Gradient Tree Boosting** (using XGBoost or LightGBM),<br>* **Bayesian PCA** (using bpca in pcaMethods),<br>* **Classic Strawman:** the most frequently observed, non-missing genotype is used for imputation.<br><br>**Hierarchy:** the algorithm's model can account for *strata* groupings, e.g. if patterns of missingness are found in the data.<br><br>**Haplotypes:** automatically detects SNPs on the same LOCUS (read/haplotype) and imputes them jointly, reducing imputation artifacts. *Vignette coming soon*.|
| **Input/Output** | **grur** uses [radiator](https://thierrygosselin.github.io/radiator/index.html) input and output modules. Check out the [overview](https://thierrygosselin.github.io/radiator/articles/get_started.html#overview) of supported file formats.|
|**[ggplot2](https://ggplot2.tidyverse.org)-based plotting**|Visualization: publication-ready figures of important metrics and statistics.|
|**Parallel**|Code designed and optimized for fast computations with, sometimes, progress bars. Works with all OS: Linux, Mac and yes PC!|
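
To make the *Classic Strawman* idea from the table concrete, here is a minimal base R sketch: each missing genotype is replaced by the most frequent non-missing genotype of its marker, either overall or within each stratum. This is an illustration only and not **grur**'s implementation. The helper names (`most_frequent`, `strawman_impute`), the character genotype coding and the tie-breaking rule are assumptions made for the example.

```r
# Return the most frequent non-missing value of a vector.
# Ties are resolved by taking the first value in sorted order (an assumption).
most_frequent <- function(x) {
  counts <- table(x, useNA = "no")
  if (length(counts) == 0L) return(NA_character_)  # nothing observed
  names(counts)[which.max(counts)]
}

# Replace the missing genotypes of each marker (column) by the most frequent
# genotype, either overall (strata = NULL) or within each stratum.
strawman_impute <- function(geno, strata = NULL) {
  if (is.null(strata)) strata <- rep("all", nrow(geno))
  for (j in seq_len(ncol(geno))) {
    for (s in unique(strata)) {
      rows <- which(strata == s)
      to_fill <- rows[is.na(geno[rows, j])]
      geno[to_fill, j] <- most_frequent(geno[rows, j])
    }
  }
  geno
}

# Toy data: 6 individuals in 2 strata, 2 markers, 3 missing genotypes.
geno <- matrix(
  c("A/A", "A/A", "A/A", "G/G", "G/G", NA,
    "C/C", NA,    "C/C", "C/T", "C/T", "C/T"),
  ncol = 2,
  dimnames = list(paste0("ind_", 1:6), c("snp_1", "snp_2"))
)
strata <- c("pop1", "pop1", "pop1", "pop2", "pop2", "pop2")

imputed_overall <- strawman_impute(geno)
imputed_strata  <- strawman_impute(geno, strata = strata)

imputed_overall
imputed_strata
```

In this toy dataset, `ind_6` at `snp_1` is imputed as `A/A` when strata are ignored and as `G/G` when imputing within `pop2`, which shows why accounting for strata matters when the groups differ in genotype frequencies.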