-
Notifications
You must be signed in to change notification settings - Fork 8
/
missing-simple.qmd
236 lines (189 loc) · 8.33 KB
/
missing-simple.qmd
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130
131
132
133
134
135
136
137
138
139
140
141
142
143
144
145
146
147
148
149
150
151
152
153
154
155
156
157
158
159
160
161
162
163
164
165
166
167
168
169
170
171
172
173
174
175
176
177
178
179
180
181
182
183
184
185
186
187
188
189
190
191
192
193
194
195
196
197
198
199
200
201
202
203
204
205
206
207
208
209
210
211
212
213
214
215
216
217
218
219
220
221
222
223
224
225
226
227
228
229
230
231
232
233
234
235
236
---
pagetitle: "Feature Engineering A-Z | Simple Imputation"
---
# Simple Imputation {#sec-missing-simple}
::: {style="visibility: hidden; height: 0px;"}
## Simple Imputation
:::
When dealing with missing data, the first suggestion is often imputation.
**Simple imputation** as covered in this chapter refers to the method where for each variable a single value is selected, and all missing values will be replaced by that value.
It is important to remind ourselves why we impute missing values.
Because the model we are using isn't able to handle missing values natively.
Simple imputation is the simplest way to handle missing values.
The way we find this value will depend on the variable type, and within each type of variable, we find multiple ways to decide on a value.
The main categories are categorical and numeric variables.
For categorical we have 2 main options.
We can replace the value of `NA` with `"missing"` or some other unused value.
This would denote that the value was missing, without trying to pick a value.
It would change the cardinality of the predictors as variables with levels "new", "used", and "old" would go from 3 levels to 4.
Which may be undesirable.
Especially with ordinal variables as it is unclear where the `"missing"` should go in the ordering.
The simple imputation way to deal with categorization is to find the value that we want to impute with.
Typically this will be calculated with the mode, e.i.
the most frequent value.
It doesn't have to be the most frequent value, you could set up the imputation to pick the 5th most common value or the least common value.
But this is quite uncommon, and the mode is used by far the most.
These methods also work for low cardinality integers.
For numeric values, we don't have the option to add a new type of value.
One option is to manually select the imputed value.
If this approach is taken, then it should be done with utmost caution!
It would also make it an unlearned imputation method.
What is typically done is that some value is picked as the replacement.
Be it the median, mean or even mode.
Datetime variables will be a different story.
One could use the mode, or an adjusted mean or median.
Another way is to let the value extraction work first, and then apply imputation to the extracted variables.
Time series data is different enough that it has its chapter in @sec-time-series-missing.
One of the main downsides to simple imputation is that it can lead to impossible configurations in the data.
Imagine that the total square area is missing, but we know the number of rooms and number of bedrooms.
Certain combinations are more likely than others.
Below is the classic ames data set
```{r}
#| label: fig-missing-simple-heatmap
#| echo: false
#| message: false
#| fig-cap: |
#| All houses have fewer bedrooms than rooms themselves.
#| fig-alt: |
#| Tile chart. Total number of rooms along the x-axis, number of bedrooms
#| along the y-axis. All of the observations have more rooms than bedrooms
#| with most of the observations having around 3 bedrooms and 6 rooms in
#| total.
library(tidymodels)
ames |>
count(TotRms_AbvGrd, Bedroom_AbvGr) |>
ggplot(aes(TotRms_AbvGrd, Bedroom_AbvGr, fill = log(n))) +
geom_tile() +
geom_abline(slope = 1, intercept = 0) +
coord_fixed() +
theme_minimal() +
scale_fill_viridis_c()
```
There can't be more bedrooms than the total number of rooms.
And we see that in the data.
The average number of bedrooms is `r round(mean(ames$Bedroom_AbvGr), 2)`, and if we round, then it will be `r round(mean(ames$Bedroom_AbvGr), 0)`.
That is perfectly fine for a house with an average number of rooms, but it will be impossible for small houses and quite inadequate for large houses.
This is bad but can be seen as an improvement to the situation where the model didn't fit because a missing value was present.
This scenario is part of the motivation for @sec-missing-model.
Other drawbacks of simple include; reducing the variance and standard deviation of the data.
This happens because we are adding zero-variance information to the variables.
In the same vein, we are changing the distribution of our variables, which can also affect downstream modeling and feature engineering.
Below we see this in effect, as more and more missing data, leads to a larger peak of the mean of the distribution.
```{r}
#| label: fig-missing-simple-mean
#| echo: false
#| message: false
#| fig-cap: |
#| Same distribution with a longer and longer spike at the mean.
#| fig-alt: |
#| Facetted histogram chart. Predictor along the x-axis, count along the
#| y-axis. facetted along 0%, 5%, 10%, and 15%. All of the distributions
#| appear identical, with the the 5%, 10%, and 15% having a simple spike at
#| the mean value, each being higher than the previous.
set.seed(1234)
x <- c(rnorm(1000, mean = 100, sd = 20), rnorm(600, mean = 150, sd = 10))
x <- sample(x)
x_05 <- x
x_05[seq_len(floor(length(x) * 0.05))] <- mean(x)
x_10 <- x
x_10[seq_len(floor(length(x) * 0.10))] <- mean(x)
x_15 <- x
x_15[seq_len(floor(length(x) * 0.15))] <- mean(x)
tibble(
x = c(x, x_05, x_10, x_15),
percentage = rep(c("0", "5%", "10%", "15%"), each = 1600)
) |>
mutate(percentage = factor(percentage, levels = c("0", "5%", "10%", "15%"))) |>
ggplot(aes(x)) +
geom_histogram(bins = 50) +
facet_wrap(~percentage) +
theme_minimal()
```
Another thing we can do, while we stay in this domain of only using a single variable to impute itself, is to impute using the original distribution.
So instead of imputing by the mode, a sample is drawn from the non-missing values, and that value is used.
This is then done for each observation.
Each observation won't get the same values, but it will preserve the distribution, variance and standard deviation.
It won't help with the relationship between variables and as an added downside, it adds noise into imputation, making it a seeded feature engineering method.
On the implementation side, we need to be careful about how we extract the original distribution.
This distribution needs to be saved for later reapply.
Imagine a numeric predictor, if it is a non-integer, it will likely take many unique values, if not all unique values.
We might want to bin the data to create a distribution that way to the same memory.
## Pros and Cons
### Pros
- Fast computationally
- Easy to explain what was done
### Cons
- Doesn't preserve relationship between predictors
- reducing the variance and standard deviation of the data
- unlikely to help performance
## R Examples
The recipes package contains several steps.
It includes the steps `step_impute_mean()`, `step_impute_median()` and `step_impute_mode()` which imputes by the mean, median and mode respectively.
::: callout-caution
# TODO
find a good data set with missing values
:::
```{r}
#| label: step_impute_
library(recipes)
library(modeldata)
impute_rec <- recipe(Sale_Price ~ ., data = ames) |>
step_impute_mean(contains("Area")) |>
step_impute_median(contains("_SF")) |>
step_impute_mode(all_nominal_predictors()) |>
prep()
```
We can use the `tidy()` function to find the estimated mean
```{r}
#| label: tidy-one
impute_rec |>
tidy(1)
```
estimated median
```{r}
#| label: tidy-two
impute_rec |>
tidy(2)
```
and estimated mode
```{r}
#| label: tidy-three
impute_rec |>#|
tidy(3)
```
::: callout-caution
# TODO
wait for the distribution step
:::
## Python Examples
```{python}
#| label: python-setup
#| echo: false
import pandas as pd
from sklearn import set_config
set_config(transform_output="pandas")
pd.set_option('display.precision', 3)
```
We are using the `ames` data set for examples.
{sklearn} provided the `SimpleImputer()` method we can use.
The main argument we will use is `strategy` which we can set to determine the type of imputing.
```{python}
#| label: simpleimputer
from feazdata import ames
from sklearn.compose import ColumnTransformer
from sklearn.impute import SimpleImputer
ct = ColumnTransformer(
[('mean_impute', SimpleImputer(strategy='mean'), ['Sale_Price', 'Lot_Area', 'Wood_Deck_SF', 'Mas_Vnr_Area'])],
remainder="passthrough")
ct.fit(ames)
ct.transform(ames)
```
Setting `strategy='median'` switches the imputer to do median imputing.
```{python}
#| label: simpleimputer-median
ct = ColumnTransformer(
[('median_impute', SimpleImputer(strategy='median'), ['Sale_Price', 'Lot_Area', 'Wood_Deck_SF', 'Mas_Vnr_Area'])],
remainder="passthrough")
ct.fit(ames)
ct.transform(ames)
```