---
pagetitle: "Feature Engineering A-Z | Model Based Imputation"
---
# Model Based Imputation {#sec-missing-model}
::: {style="visibility: hidden; height: 0px;"}
## Model Based Imputation
:::
Model-based imputation is where we find the remaining types of imputation at our disposal.
It is quite a big and broad topic, and this chapter will try to do it justice.
We start with simpler methods.
Remember, this chapter specifically covers methods where more than one variable is used for the imputation.
One option is grouped versions of the simple imputation methods seen in @sec-missing-simple.
Instead of imputing with the overall mean, you impute with the mean within a given group, as defined by another categorical variable.
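To make this concrete, here is a minimal sketch of grouped mean imputation using dplyr, assuming a hypothetical data frame `dat` with a numeric column `value` and a grouping column `group`:

```r
library(dplyr)

# Replace each missing value with the mean of its group,
# computed from the non-missing values within that group.
dat |>
  group_by(group) |>
  mutate(value = if_else(is.na(value), mean(value, na.rm = TRUE), value)) |>
  ungroup()
```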
You could also fit a linear regression model with the target variable as the variable you intend to impute, and other complete variables as predictors.
![Linear imputation in action. The left-hand side shows the linear fit between a predictor and the target variable. Missing values are shown along the x-axis. The right-hand side shows how the missing values are imputed using the linear fit.](diagrams/missing-impute-linear.png){#fig-missing-impute-linear}
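In code, the idea behind @fig-missing-impute-linear could look something like this sketch, assuming a hypothetical data frame `dat` with a target column `y` containing missing values and a complete predictor `x`:

```r
# Fit a linear model on the complete cases
# (lm() drops rows with missing values by default).
fit <- lm(y ~ x, data = dat)

# Predict and fill in the missing values of y.
missing <- is.na(dat$y)
dat$y[missing] <- predict(fit, newdata = dat[missing, ])
```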
This idea extends to most other types of models.
K-nearest neighbors and tree-based models are common choices for this task.
For these models, you need to make sure that the predictors are usable, which means they cannot have missing values themselves.
You could in theory use a series of models: first impute the variables that have missing data, then use those imputed variables as predictors when imputing another variable.
Methods such as Multivariate Imputation by Chained Equations (MICE) [@van2012flexible] fall into this category as well.
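As a rough sketch of how such a chained approach is typically run with the mice package (assuming a hypothetical data frame `dat` with missing values in several columns):

```r
library(mice)

# Run the chained equations algorithm, producing 5 completed data sets.
imp <- mice(dat, m = 5, seed = 1234)

# Extract the first completed data set.
completed <- complete(imp, 1)
```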
## Pros and Cons
### Pros
- Likely to get better performance than simple imputation
### Cons
- More complex model
- Lower interpretability
## R Examples
There are a number of steps in the recipes package that fall into this category.
Among them are `step_impute_bag()`, `step_impute_knn()`, and `step_impute_linear()`.
::: callout-caution
# TODO
find a better data set
:::
Below we show how to impute using a K-nearest neighbors model with `step_impute_knn()`.
We specify the variable to impute first, and then with `impute_with` we specify which variables are used as predictors in the model.
```{r}
#| label: step_impute_knn
#| message: false
library(recipes)

impute_knn_rec <- recipe(mpg ~ ., data = mtcars) |>
  step_impute_knn(disp, neighbors = 1, impute_with = imp_vars(vs, am, hp, drat))

impute_knn_rec |>
  prep() |>
  juice()
```
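The other steps work in much the same way.
For instance, we can swap in a linear imputation model with `step_impute_linear()` on the same toy data:

```{r}
#| label: step_impute_linear
impute_linear_rec <- recipe(mpg ~ ., data = mtcars) |>
  step_impute_linear(disp, impute_with = imp_vars(vs, am, hp, drat))

impute_linear_rec |>
  prep() |>
  juice()
```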
## Python Examples
I'm not aware of a good way to do this with models other than KNN in scikit-learn.
Please file an [issue on github](https://github.com/EmilHvitfeldt/feature-engineering-az/issues/new?assignees=&labels=&projects=&template=general-issue.md&title) if you know of a good way.
```{python}
#| label: python-setup
#| echo: false
import pandas as pd
from sklearn import set_config
set_config(transform_output="pandas")
pd.set_option('display.precision', 3)
```
We are using the `ames` data set for examples.
{sklearn} provides the `KNNImputer()` class, which we can use.
```{python}
#| label: knnimputer
from feazdata import ames
from sklearn.compose import ColumnTransformer
from sklearn.impute import KNNImputer

ct = ColumnTransformer(
    [('knn_impute', KNNImputer(), ['Sale_Price', 'Lot_Area', 'Wood_Deck_SF', 'Mas_Vnr_Area'])],
    remainder="passthrough")

ct.fit(ames)
ct.transform(ames)
```
The argument `n_neighbors` is something you might have to tune to get good performance out of this type of imputation method.
```{python}
#| label: knnimputer-n_neighbors
ct = ColumnTransformer(
    [('knn_impute', KNNImputer(n_neighbors=15), ['Sale_Price', 'Lot_Area', 'Wood_Deck_SF', 'Mas_Vnr_Area'])],
    remainder="passthrough")

ct.fit(ames)
ct.transform(ames)
```