-
Notifications
You must be signed in to change notification settings - Fork 11
/
categorical-label.qmd
128 lines (96 loc) · 4.29 KB
/
categorical-label.qmd
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
---
pagetitle: "Feature Engineering A-Z | Label Encoding"
---
# Label Encoding {#sec-categorical-label}
::: {style="visibility: hidden; height: 0px;"}
## Label Encoding
:::
**Label encoding** (also called **integer encoding**) is a method that maps the categorical levels into the integers `1` through `n` where `n` is the number of levels.
::: callout-note
Some implementations map to the values `0` through `n - 1`.
This chapter will talk about this method as if it were mapped to `1` through `n`.
:::
This method is a *trained* method since the preprocessor needs to keep a record of the possible values and their corresponding integer value.
Unseen levels can be encoded outside the range to be either `0` or `n + 1`, allowing unseen levels to be handled with minimal extra work.
::: callout-caution
# TODO
add diagram
:::
This method is often not ideal as the ordering of the levels will matter a lot for the performance of the model that needs to make sense of the generated numeric variables.
For a variable with the levels "Studio", "Apartment", "Loft", "Duplex".
This variable contains `4! = 4 * 3 * 2 * 1 = 24` different orderings.
And since the number of permutations is calculated with factorials, this number goes up fast.
With just 10 levels we are looking at 3,628,800 different orderings.
Even if some orders are better than others, It would be a very slow task to iterate through to find which ones are good.
If you have prior information about the levels, then you should use [Ordinal Encoding](categorical-ordinal).
If you are working on an implementation that works with factors, then they will be used.
Otherwise, the ordering most likely will be alphabetical or in order of occurrence.
You should check the documentation of your implementation to figure out which.
The performance of this method will depend a lot on the model.
If you are working with a linear model, then you most likely are out of luck as it wouldn't be able to use a variable where the values 2, 6, and 10 provide evidence one way, and the rest provide evidence the other way.
Three-based models will be able to do better but would do even better if label encoding was applied to begin with.
## Pros and Cons
### Pros
- Only produces a single numeric variable for each categorical variable
- Has a way to handle unseen levels, although poorly
### Cons
- Ordering of the levels matters a lot!
- Will very often give inferior performance compared to other methods.
## R Examples
We will be using the `ames` data set for these examples.
The `step_dummy()` function allows us to perform dummy encoding and one-hot encoding.
```{r}
#| label: set-seed
#| echo: false
set.seed(1234)
# To avoid changing recipe ID columns
```
```{r}
#| label: show-data
#| message: false
library(recipes)
library(modeldata)
data("ames")
ames |>
select(Sale_Price, MS_SubClass, MS_Zoning)
```
Looking at the levels of `MS_SubClass` we see that levels are set in a specific way.
It isn't alphabetical, but there isn't one clear order.
No clarification of the ordering can be done in the data documentation <http://jse.amstat.org/v19n3/decock/DataDocumentation.txt>.
```{r}
#| label: ms_subclass-levels
ames |> pull(MS_SubClass) |> levels()
```
We will be using the `step_integer()` step for this, which defaults to 1-based indexing
```{r}
#| label: step_integer
label_rec <- recipe(Sale_Price ~ ., data = ames) |>
step_integer(all_nominal_predictors()) |>
prep()
label_rec |>
bake(new_data = NULL, starts_with("MS_SubClass"), starts_with("MS_Zoning"))
```
## Python Examples
```{python}
#| label: python-setup
#| echo: false
import pandas as pd
from sklearn import set_config
set_config(transform_output="pandas")
pd.set_option('display.precision', 3)
```
We are using the `ames` data set for examples.
{sklearn} provided the `OrdinalEncoder()` method we can use.
Below we see how it can be used with the `MS_Zoning` columns.
We call this method label encoding only when `categories='auto'` as it automatically labels `0` to `n_categories - 1`.
```{python}
#| label: ordinalencoder
from feazdata import ames
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import OrdinalEncoder
ct = ColumnTransformer(
[('label', OrdinalEncoder(categories='auto'), ['MS_Zoning'])],
remainder="passthrough")
ct.fit(ames)
ct.transform(ames).filter(regex="label.*")
```