Alternatives
Several alternative categorical encodings are implemented in various R machine learning engines and are worth exploring. For example, target encoding replaces a categorical value with the mean (regression) or proportion (classification) of the target variable for that level. Target encoding the Neighborhood feature would change North_Ames to 143517.
ames_train %>%
  group_by(Neighborhood) %>%
  summarize(`Avg Sale_Price` = mean(Sale_Price, na.rm = TRUE)) %>%
  head(10) %>%
  kable(caption = "Example of target encoding the Neighborhood feature of the Ames housing data set.") %>%
  kable_styling(bootstrap_options = "striped", full_width = TRUE)
Neighborhood | Avg Sale_Price |
---|---|
North_Ames | 143516.8 |
College_Creek | 200523.5 |
Old_Town | 123908.0 |
Edwards | 133321.1 |
Somerset | 231310.8 |
Northridge_Heights | 321503.1 |
Gilbert | 188170.0 |
Sawyer | 138925.1 |
Northwest_Ames | 188151.8 |
Sawyer_West | 183598.1 |
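The summary above only lists the per-level means; to use them as a feature, they still have to be joined back onto the rows. The following is a minimal sketch of that step, assuming dplyr is available. The `toy_train` data frame and the `Neighborhood_enc` column name are made up for illustration and are not part of the Ames data.

```r
library(dplyr)

# Hypothetical toy data standing in for ames_train
toy_train <- tibble(
  Neighborhood = c("A", "A", "B", "B", "B", "C"),
  Sale_Price   = c(100, 140, 200, 220, 240, 300)
)

# Per-level mean of the response, computed from the training rows
target_means <- toy_train %>%
  group_by(Neighborhood) %>%
  summarize(Neighborhood_enc = mean(Sale_Price, na.rm = TRUE))

# Join the means back on and drop the original categorical column
toy_encoded <- toy_train %>%
  left_join(target_means, by = "Neighborhood") %>%
  select(-Neighborhood)

toy_encoded
```

Here every row from level "A" receives 120, every row from "B" receives 220, and the single "C" row receives 300.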
Target encoding runs the risk of data leakage since you are using the response variable to encode a feature. An alternative is to replace each level with the proportion of observations that fall in that level. In this case, North_Ames would be changed to 0.149.
ames_train %>%
  count(Neighborhood) %>%
  mutate(Proportion = n / sum(n)) %>%
  select(-n) %>%
  head(10) %>%
  kable(caption = 'Example of categorical proportion encoding the Neighborhood feature of the Ames housing data set.') %>%
  kable_styling(bootstrap_options = "striped", full_width = TRUE)
Neighborhood | Proportion |
---|---|
North_Ames | 0.1493411 |
College_Creek | 0.0971205 |
Old_Town | 0.0800390 |
Edwards | 0.0639336 |
Somerset | 0.0595412 |
Northridge_Heights | 0.0566130 |
Gilbert | 0.0541728 |
Sawyer | 0.0527086 |
Northwest_Ames | 0.0400195 |
Sawyer_West | 0.0458760 |
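A proportion encoding is a mapping from level to value, so in practice it would typically be learned on the training rows and then applied to new data. The sketch below shows one way this might look, assuming dplyr is available. The toy data frames are hypothetical, and assigning 0 to a level that never appeared in training is only one possible choice, not a recommendation from the original text.

```r
library(dplyr)

# Hypothetical toy data standing in for a training and a held-out set
toy_train <- tibble(Neighborhood = c("A", "A", "B", "B", "B", "C"))
toy_test  <- tibble(Neighborhood = c("A", "C", "D"))  # "D" is unseen in training

# Learn the level -> proportion mapping from the training rows only
prop_map <- toy_train %>%
  count(Neighborhood) %>%
  mutate(Proportion = n / sum(n)) %>%
  select(-n)

# Apply the mapping to the held-out rows; unseen levels fall back to 0
# (an assumption made for this sketch)
toy_test_encoded <- toy_test %>%
  left_join(prop_map, by = "Neighborhood") %>%
  mutate(Proportion = coalesce(Proportion, 0))

toy_test_encoded
```

Because the mapping is built from `toy_train` alone and never touches a response column, nothing about the held-out rows or the target feeds into the encoding.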
For other alternatives to categorical encoding, refer to the embed package (https://embed.tidymodels.org/).