- One column with a number representation of the category
- The impact of violating this ordering assumption is really dependent on the downstream models.
- Using ordinal encoding will output ordinal categories. This means that there is an order in the resulting categories (e.g.
0 < 1 < 2
). - Ordinal encoding is often a good strategy with tree-based models.
- Linear models will be impacted by misordered categories while tree-based models will not.
- Using ordinal encoding will output ordinal categories. This means that there is an order in the resulting categories (e.g.
- You can still use an ordinal encoding with linear models but you need to be sure that:
- the original categories (before encoding) have an ordering;
- the encoded categories follow the same ordering as the original categories.
- Advantage: easily make (textual) categories numerical
- Disadvantage: most ML is based on linear algebra, thus, the model will take into account the vector distances, which is undesired behavior.
- Produces one column with boolean value per category, e.g.
1
- In general one-hot encoding is the encoding strategy used when the downstream models are linear models.
- One-hot encoding categorical variables with high cardinality can cause computational inefficiency in tree-based models. Because of this, it is not recommended to use one-hot encoding in such cases even if the original categories do not have a given order.
- Advantage: value is either 1 or 0, thus distances between all categories are equal.
- Disadvantage: this approach might blow up your feature space if you have many categories. Possible solution is to only use the top X categories and make the rest "Other".
- [[pandas-one-hot-encoding]]
- Alternative for one-hot encoding
- Replace the categorical value with the some descriptive statistic (e.g. mean) of the target variable per category (e.g. the average housing price of the neighbourhood instead of the neighbourhood as a category).
[[data-preprocessing]]