-
Notifications
You must be signed in to change notification settings - Fork 8
/
categorical-ordinal.qmd
147 lines (112 loc) · 4.83 KB
/
categorical-ordinal.qmd
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130
131
132
133
134
135
136
137
138
139
140
141
142
143
144
145
146
147
---
pagetitle: "Feature Engineering A-Z | Ordinal Encoding"
---
# Ordinal Encoding {#sec-categorical-ordinal}
::: {style="visibility: hidden; height: 0px;"}
## Ordinal Encoding
:::
This method is similar to @sec-categorical-label, except that we manually specify the mapping.
This method is generally used for *ordinal* variables as they are encoded with a natural ordering.
Where this method shines compared to integer encoding is that we allow arbitrary values for encoding, thus we can have (cold = 1, warm = 5, hot = 20).
But we might as well use (cold = -1, warm = 0, hot = 1) or (cold = 1.618, warm = 2.718, hot = 3.141).
Although you would have a hard time justifying the latter.
Nothing is stopping you from using this method with an unordered categorical variable, you just need to spend some time justifying your levels.
::: callout-note
This book's framing of ordinal encoding is more general than other sources, in so far as it is described as manually giving values to levels of a categorical, whether it is ordered or not.
:::
This method feels like it but isn't a trained method.
This is because we are providing the record of the possible values and their corresponding integer value.
Unseen levels can be manually specified, but it isn't entirely obvious what their value should be.
::: callout-caution
# TODO
add diagram
:::
Manually setting values for your levels comes with some upsides and downsides.
Assuming that you have the domain expertise to apply numeric values for the levels, removes a lot of the guesswork.
This can be very effective if the numeric values selected for the levels have some intrinsic meaning.
It can remove a lot of the guesswork and trial and error that we see in integer encoding.
The downside is the other side of the coin.
We need to have the domain expertise to be able to give the levels meaningful values, otherwise, we are doing no better than integer encoding.
## Pros and Cons
### Pros
- Only produces a single numeric variable for each categorical variable
- Preserves the natural ordering of ordered values
### Cons
- Will very often give inferior performance compared to other methods
- Unseen levels need to be manually specified
## R Examples
We will be using the `ames` data set for these examples.
The `step_dummy()` function allows us to perform dummy encoding and one-hot encoding.
```{r}
#| label: set-seed
#| echo: false
set.seed(1234)
# To avoid changing recipe ID columns
```
```{r}
#| label: show-data
#| message: false
library(recipes)
library(modeldata)
data("ames")
ames |>
select(Lot_Shape, Land_Slope)
```
Looking at the levels of `Lot_Shape` and `Land_Slope` we see that they match the levels in the documentation <http://jse.amstat.org/v19n3/decock/DataDocumentation.txt>.
Furthermore, these variables are listed as ordinal, they just aren't denoted like this in this data set.
```{r}
#| label: show-levels
ames |> pull(Lot_Shape) |> levels()
ames |> pull(Land_Slope) |> levels()
```
We will fix that by turning them into ordered factors.
```{r}
#| label: as-ordered
ames <- ames |>
mutate(across(.cols = c(Lot_Shape, Land_Slope), .fns = as.ordered))
```
to perform ordinal encoding we will use the `step_ordinalscore()` step.
This defaults to giving each level values between `1` and `n`, much like `step_integer()`.
```{r}
#| label: step_ordinalscore
ordinal_rec <- recipe(Sale_Price ~ ., data = ames) |>
step_ordinalscore(Lot_Shape, Land_Slope) |>
prep()
ordinal_rec |>
bake(new_data = NULL, starts_with("Lot_Shape"), starts_with("Land_Slope"))
```
What we can do is define a special transformation function for each of the steps.
One way is to use the `case_when()` function
```{r}
#| label: lot_shape_transformer
Lot_Shape_transformer <- function(x) {
case_when(
x == "Regular" ~ 0,
x == "Slightly_Irregular" ~ -1,
x == "Moderately_Irregular" ~ -5,
x == "Irregular" ~ -10
)
}
```
If you have the values for each of the levels as a vector or data, you can write the function to use that information as well.
```{r}
#| label: land_slope_transformer
Land_Slope_values <- c(Gtl = 0, Mod = 1, Sev = 5)
Land_Slope_transformer <- function(x) {
Land_Slope_values[x]
}
```
With these functions, we can now apply them to the respective columns by using the `convert` argument.
```{r}
#| label: step_ordinalscore-twice
ordinal_rec <- recipe(Sale_Price ~ ., data = ames) |>
step_ordinalscore(Lot_Shape, convert = Lot_Shape_transformer) |>
step_ordinalscore(Land_Slope, convert = Land_Slope_transformer) |>
prep()
ordinal_rec |>
bake(new_data = NULL, starts_with("Lot_Shape"), starts_with("Land_Slope")) |>
distinct()
```
## Python Examples
I'm not aware of a good way to do this in a scikit-learn way.
Please file an [issue on github](https://github.com/EmilHvitfeldt/feature-engineering-az/issues/new?assignees=&labels=&projects=&template=general-issue.md&title) if you know of a good way.