Alternatives
There are several alternative categorical encodings that are implemented in various R machine learning engines and are worth exploring. For example, target encoding replaces each categorical level with the mean (regression) or proportion (classification) of the target variable within that level. Target encoding the Neighborhood feature would change North_Ames to 143517, the average sale price for that neighborhood in the training data.
ames_train %>%
  group_by(Neighborhood) %>%
  summarize(`Avg Sale_Price` = mean(Sale_Price, na.rm = TRUE)) %>%
  head(10) %>%
  kable(caption = "Example of target encoding the Neighborhood feature of the Ames housing data set.") %>%
  kable_styling(bootstrap_options = "striped", full_width = TRUE)

| Neighborhood | Avg Sale_Price |
|---|---|
| North_Ames | 143516.8 |
| College_Creek | 200523.5 |
| Old_Town | 123908.0 |
| Edwards | 133321.1 |
| Somerset | 231310.8 |
| Northridge_Heights | 321503.1 |
| Gilbert | 188170.0 |
| Sawyer | 138925.1 |
| Northwest_Ames | 188151.8 |
| Sawyer_West | 183598.1 |
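
The summary table only shows the per-level means; to use them as a feature, the mapping has to be joined back onto the data. The following is a minimal sketch of that step, not the approach of any particular engine. The toy `train` and `test` tibbles, the column name `Neighborhood_enc`, and the choice to fall back to the overall training mean for unseen levels are all assumptions made for illustration.

```r
library(dplyr)

# Toy data standing in for ames_train / ames_test (values are made up)
train <- tibble(
  Neighborhood = c("A", "A", "B", "B", "B", "C"),
  Sale_Price   = c(100, 120, 200, 220, 240, 300)
)
test <- tibble(Neighborhood = c("A", "C", "D"))  # "D" never appears in train

# Learn the encoding from the training data only
encoding <- train %>%
  group_by(Neighborhood) %>%
  summarize(Neighborhood_enc = mean(Sale_Price, na.rm = TRUE))

# Assumed fallback for levels that were not seen during training
global_mean <- mean(train$Sale_Price, na.rm = TRUE)

# Apply the learned mapping to new data
test_encoded <- test %>%
  left_join(encoding, by = "Neighborhood") %>%
  mutate(Neighborhood_enc = coalesce(Neighborhood_enc, global_mean))

test_encoded
```

Computing the mapping on the training set and only joining it onto new data keeps the test response out of the encoding.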
Target encoding runs the risk of data leakage since you are using the response variable to encode a feature. An alternative that does not use the response is to replace each level with the proportion of observations that fall in that level. In this case, North_Ames would be changed to 0.149.
ames_train %>%
  count(Neighborhood) %>%
  mutate(Proportion = n / sum(n)) %>%
  select(-n) %>%
  head(10) %>%
  kable(caption = 'Example of categorical proportion encoding the Neighborhood feature of the Ames housing data set.') %>%
  kable_styling(bootstrap_options = "striped", full_width = TRUE)

| Neighborhood | Proportion |
|---|---|
| North_Ames | 0.1493411 |
| College_Creek | 0.0971205 |
| Old_Town | 0.0800390 |
| Edwards | 0.0639336 |
| Somerset | 0.0595412 |
| Northridge_Heights | 0.0566130 |
| Gilbert | 0.0541728 |
| Sawyer | 0.0527086 |
| Northwest_Ames | 0.0400195 |
| Sawyer_West | 0.0458760 |
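
As with target encoding, the proportions need to be joined back onto the data before they can serve as a feature. A minimal sketch under stated assumptions follows: the toy `train` and `test` tibbles and the column name `Neighborhood_prop` are hypothetical, and assigning a proportion of 0 to levels unseen in training is one possible convention rather than a fixed rule.

```r
library(dplyr)

# Toy data standing in for ames_train / ames_test (values are made up)
train <- tibble(
  Neighborhood = c("A", "A", "A", "B", "B", "C", "C", "C", "C", "C")
)
test <- tibble(Neighborhood = c("B", "C", "D"))  # "D" never appears in train

# Share of training rows that fall in each level
proportions <- train %>%
  count(Neighborhood) %>%
  mutate(Neighborhood_prop = n / sum(n)) %>%
  select(-n)

# Apply the mapping to new data; unseen levels get 0 (an assumption)
test_encoded <- test %>%
  left_join(proportions, by = "Neighborhood") %>%
  mutate(Neighborhood_prop = coalesce(Neighborhood_prop, 0))

test_encoded
```

Because the response never enters the calculation, this encoding does not leak the target, although two levels with similar frequencies become nearly indistinguishable.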
For other alternatives to categorical encoding, refer to the embed package (https://embed.tidymodels.org/).
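
As a rough illustration of how such an encoding might be specified with embed, the sketch below uses `step_lencode_glm()` inside a recipes pipeline. The simulated `toy` data frame is an assumption standing in for the Ames data, and the argument usage shown here should be checked against the package documentation for the installed version.

```r
library(recipes)
library(embed)

# Simulated data standing in for the Ames data (values are made up)
set.seed(123)
toy <- data.frame(
  Neighborhood = factor(rep(c("A", "B", "C"), each = 10)),
  Sale_Price   = c(rnorm(10, 100, 5), rnorm(10, 200, 5), rnorm(10, 300, 5))
)

# Replace Neighborhood with an effect estimate from a generalized linear model
rec <- recipe(Sale_Price ~ Neighborhood, data = toy) %>%
  step_lencode_glm(Neighborhood, outcome = vars(Sale_Price))

# Estimate the encoding on the training data, then return the encoded data
prepped <- prep(rec, training = toy)
baked   <- bake(prepped, new_data = NULL)

head(baked)
```

Wrapping the encoding in a recipe step means it is estimated when the recipe is prepped on training data and can then be applied to new data with `bake()`.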