Commit

update docs
Polkas committed Feb 3, 2025
1 parent f17f021 commit b225f21
Showing 2 changed files with 307 additions and 382 deletions.
107 changes: 58 additions & 49 deletions README.md
@@ -1,5 +1,6 @@
# miceFast <a href='https://github.com/polkas/miceFast'><img src='man/figures/miceFast_logo.png' align="right" width="200" /></a>
Maciej Nasinski

**Author**: Maciej Nasinski

[**Check the miceFast website for more details**](https://polkas.github.io/miceFast/index.html)

@@ -8,85 +9,93 @@ Maciej Nasinski
[![codecov](https://codecov.io/gh/Polkas/miceFast/branch/main/graph/badge.svg)](https://app.codecov.io/gh/Polkas/miceFast)
[![Dependencies](https://tinyverse.netlify.app/badge/miceFast)](https://cran.r-project.org/package=miceFast)

Fast imputations under the object-oriented programming paradigm.
Moreover there are offered a few functions built to work with popular R packages such as 'data.table' or 'dplyr'.
The biggest improvement in time performance could be achieve for a calculation where a grouping variable have to be used.
A single evaluation of a quantitative model for the multiple imputations is another major enhancement.
A new major improvement is one of the fastest predictive mean matching in the R world because of presorting and binary search.
## Overview

**miceFast** provides fast methods for imputing missing data, leveraging an object-oriented programming paradigm and optimized linear algebra routines.
The package includes convenient helper functions compatible with **data.table**, **dplyr**, and other popular R packages.

Performance benchmarks (check performance_validity.R file at extdata).
Major speed improvements occur when:
- Using a **grouping variable**, where the data is automatically sorted by group, significantly reducing computation time.
- Performing **multiple imputations**, by evaluating the underlying quantitative model only once for multiple draws.
- Running **Predictive Mean Matching (PMM)**, thanks to presorting and binary search.

[Advanced Usage - Vignette](https://polkas.github.io/miceFast/articles/miceFast-intro.html)
For performance details, see `performance_validity.R` in the `extdata` folder or the [Advanced Usage Vignette](https://polkas.github.io/miceFast/articles/miceFast-intro.html).
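
The presorting and binary search idea behind fast PMM can be illustrated in base R. This is only a minimal sketch under stated assumptions, not the package's internal C++ code: the toy vectors `y_obs`, `yhat_obs`, `yhat_mis`, the donor count `k`, and the helper `pmm_draw()` are all hypothetical names introduced for illustration.

```r
set.seed(1)

# Hypothetical toy inputs: observed outcomes, their model predictions,
# and model predictions for the rows whose outcome is missing
y_obs    <- c(10, 20, 30, 40, 50, 60)
yhat_obs <- c(12, 18, 33, 41, 47, 62)
yhat_mis <- c(15, 45)
k <- 3  # number of candidate donors per missing value

# Presort the observed predictions once
ord <- order(yhat_obs)
yhat_sorted <- yhat_obs[ord]
y_sorted <- y_obs[ord]

pmm_draw <- function(target) {
  n <- length(yhat_sorted)
  # findInterval() performs a binary search on the sorted vector and
  # returns how many sorted values are <= target
  pos <- findInterval(target, yhat_sorted)
  # The k nearest predictions must lie within k positions on either side
  lo <- max(1, pos - k + 1)
  hi <- min(n, pos + k)
  cand <- lo:hi
  d <- abs(yhat_sorted[cand] - target)
  nearest <- cand[order(d)[seq_len(min(k, length(cand)))]]
  # Draw one donor at random and return its observed outcome
  y_sorted[nearest[sample.int(length(nearest), 1)]]
}

imputed <- vapply(yhat_mis, pmm_draw, numeric(1))
```

Because the sort happens once, each missing value costs a logarithmic-time lookup plus a scan of at most `2 * k` candidates, rather than a pass over all observed predictions.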

## Installation

You can install **miceFast** from CRAN:
```r
install.packages('miceFast')
install.packages("miceFast")
```

or

Or install the development version from GitHub:
```r
# install.packages("devtools")
devtools::install_github("polkas/miceFast")
```

**Recommended to download boosted BLAS library, even x100 faster:**
## Quick Example

- Linux users recommended to download Optimized BLAS (linear algebra) library: `sudo apt-get install libopenblas-dev`
- Apple vecLib BLAS:
```bash
cd /Library/Frameworks/R.framework/Resources/lib
ln -sf /System/Library/Frameworks/Accelerate.framework/Frameworks/vecLib.framework/Versions/Current/libBLAS.dylib libRblas.dylib
```

## Quick Implementation
Below is a short demonstration. See the [vignette](https://polkas.github.io/miceFast/articles/miceFast-intro.html) for advanced usage and best practices.

```r
library(miceFast)

set.seed(1234)
data(air_miss)

# plot NA structure
# Visualize the NA structure
upset_NA(air_miss, 6)

naive_fill_NA(air_miss)

# Check out the vignette for an advance usage
# There is required a thorough examination
# Simple and naive fill
imputed_data <- naive_fill_NA(air_miss)

# Other packages - popular simple solutions
# Compare with other packages:
# Hmisc
data.frame(Map(function(x) Hmisc::impute(x, 'random'), air_miss))
library(Hmisc)
data.frame(Map(function(x) Hmisc::impute(x, "random"), air_miss))

#mice
# mice
library(mice)
mice::complete(mice::mice(air_miss, printFlag = FALSE))

```

**Quick Reference Table**
---

| Function | Description |
|----------------------|----------------------|
| `new(miceFast)` | OOP instance with bunch of methods - check out vignette |
| `fill_NA()` | imputation - lda,lm_pred,lm_bayes,lm_noise |
| `fill_NA_N()` | multiple imputation - pmm,lm_bayes,lm_noise |
| `VIF()` | Variance inflation factor |
| `naive_fill_NA()` | auto imputations |
| `compare_imp()` | comparing imputations |
| `upset_NA()` | visualize NA structure - UpSetR::upset|
## Key Features

Summing up, `miceFast` offer a relevant reduction of a calculations time for:
- **Object-Oriented Interface** via `miceFast` objects (Rcpp modules).
- **Convenient Helpers**:
  - `fill_NA()`: Single imputation (LDA, `lm_pred`, `lm_bayes`, `lm_noise`).
  - `fill_NA_N()`: Multiple imputations (PMM, `lm_bayes`, `lm_noise`).
  - `VIF()`: Variance Inflation Factor calculations.
  - `naive_fill_NA()`: Automatic naive imputations.
  - `compare_imp()`: Compare original vs. imputed values.
  - `upset_NA()`: Visualize NA structure using [UpSetR](https://cran.r-project.org/package=UpSetR).

- Linear Discriminant Analysis around **(x5)**
- where a grouping variable have to be used **(around x10 depending on data dimensions and number of groups and even more than x100 although compared to data.table only a few k faster or even the same)** because of pre-sorting by grouping variable
- multiple imputations is faster around **x(a number of multiple imputations)** because the core of a model is evaluated only ones.
- Variance inflation factors (VIF) **(x5)** because the unnecessary linear regression is not evaluated - we need only inverse of X'X
- Predictive mean matching (PMM) **(x3)** because of pre-sorting and binary search (`mice` algorithm was improved too).
**Quick Reference Table**:

![](man/figures/g_summary.png)
| Function | Description |
|-----------------|-----------------------------------------------------------------------------|
| `new(miceFast)` | Creates an OOP instance with numerous imputation methods (see the vignette). |
| `fill_NA()` | Single imputation: LDA, `lm_pred`, `lm_bayes`, `lm_noise`. |
| `fill_NA_N()` | Multiple imputations (N repeats): PMM, `lm_bayes`, `lm_noise`. |
| `VIF()` | Computes Variance Inflation Factors. |
| `naive_fill_NA()` | Performs automatic, naive imputations. |
| `compare_imp()` | Compares imputations vs. original data. |
| `upset_NA()` | Visualizes NA structure using an UpSet plot. |
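
The "N repeats" idea in the `fill_NA_N()` row can be sketched in base R: fit the model once, then reuse it for every draw. This is a hedged illustration on the built-in `airquality` data, in the spirit of a linear model with added noise; it does not call miceFast, and the names `fit`, `pred`, `sigma_hat`, `N`, and `draws` are assumptions made for the example.

```r
set.seed(1234)

aq <- airquality
obs <- !is.na(aq$Ozone)

# The model is evaluated only once, on the rows where Ozone is observed
fit <- lm(Ozone ~ Wind + Temp, data = aq[obs, ])

# Point predictions and residual standard error are reused for every draw
pred <- predict(fit, newdata = aq[!obs, ])
sigma_hat <- summary(fit)$sigma

# N noisy draws per missing value: one column per imputation
N <- 10
draws <- replicate(N, pred + rnorm(length(pred), mean = 0, sd = sigma_hat))

# One simple way to combine the draws: average them row-wise
imputed_mean <- rowMeans(draws)
```

Each extra imputation then costs only a vector of random numbers, which is why the saving grows with the number of imputations.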

---

## Performance Highlights

Environment: R 4.2.1 Mac M1
Benchmark testing (on R 4.2, macOS M1) shows **miceFast** can significantly reduce computation time, especially in these scenarios:

- **Linear Discriminant Analysis (LDA)**: ~5x faster.
- **Grouping Variable Imputations**: ~10x faster (and can exceed 100x in some edge cases).
- **Multiple Imputations**: roughly N times faster for N imputations, since the model is computed only once.
- **Variance Inflation Factors (VIF)**: ~5x faster, because we only compute the inverse of X'X.
- **Predictive Mean Matching (PMM)**: ~3x faster, thanks to presorting and binary search.
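
The VIF point above rests on a standard identity: for standardized predictors, the VIFs are the diagonal of the inverse of the cross-product (correlation) matrix, so no per-predictor regression is needed. The base R sketch below checks this on the built-in `airquality` data. It is an illustration of the identity, not the package's implementation, and the variable names are assumptions.

```r
# Complete cases of three predictors from the built-in airquality data
X <- as.matrix(na.omit(airquality[, c("Solar.R", "Wind", "Temp")]))

# Standardize the columns so that crossprod(Z) / (n - 1) is the correlation matrix
Z <- scale(X)
XtX <- crossprod(Z) / (nrow(Z) - 1)

# VIFs are the diagonal of the inverse: a single matrix inversion
vif_inverse <- diag(solve(XtX))

# Reference: the textbook definition 1 / (1 - R^2), one regression per predictor
vif_regression <- sapply(colnames(X), function(v) {
  others <- setdiff(colnames(X), v)
  fit <- lm(reformulate(others, response = v), data = as.data.frame(X))
  1 / (1 - summary(fit)$r.squared)
})

# The two approaches agree up to floating-point error
all.equal(unname(vif_inverse), unname(vif_regression))
```

The regression route fits one model per predictor, while the inverse route needs only one small matrix inversion, which is where the saving comes from.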

![](man/figures/g_summary.png)

If you are interested about the procedure of testing performance and validity check performance_validity.R file at the extdata folder.
For more details, refer to the [`performance_validity.R` file](extdata/performance_validity.R) in the `extdata` folder.