Alternatives

Several alternative categorical encodings are implemented in various R machine learning engines and are worth exploring. One is target encoding, which replaces each categorical level with the mean (regression) or proportion (classification) of the target variable within that level. For example, target encoding the Neighborhood feature would change North_Ames to 143517.

ames_train %>%
  group_by(Neighborhood) %>%
  summarize(`Avg Sale_Price` = mean(Sale_Price, na.rm = TRUE)) %>%
  head(10) %>%
  kable(caption = "Example of target encoding the Neighborhood feature of the Ames housing data set.") %>%
  kable_styling(bootstrap_options = "striped", full_width = TRUE)

Table 3.2: Example of target encoding the Neighborhood feature of the Ames housing data set.
Neighborhood	Avg Sale_Price
North_Ames	143516.8
College_Creek	200523.5
Old_Town	123908.0
Edwards	133321.1
Somerset	231310.8
Northridge_Heights	321503.1
Gilbert	188170.0
Sawyer	138925.1
Northwest_Ames	188151.8
Sawyer_West	183598.1
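
The table above only summarizes the level means. To use them as a feature, the means can be treated as a lookup table that is learned from the training data and then joined onto whatever data needs encoding. The following is a minimal sketch under stated assumptions: `toy_train` and `toy_new` are hypothetical stand-ins for the Ames data, `Neighborhood_enc` is an illustrative column name, and falling back to the overall training mean for unseen levels is just one possible choice.

```r
library(dplyr)

# Hypothetical toy data standing in for ames_train and for new observations
toy_train <- tibble(
  Neighborhood = c("A", "A", "B", "B", "B", "C"),
  Sale_Price   = c(100, 200, 300, 400, 500, 600)
)
toy_new <- tibble(Neighborhood = c("A", "C", "D"))  # "D" never appears in training

# Lookup table: one mean per level, computed from the training data only
target_map <- toy_train %>%
  group_by(Neighborhood) %>%
  summarize(Neighborhood_enc = mean(Sale_Price, na.rm = TRUE))

# Assumed fallback for levels not seen in training: the overall training mean
global_mean <- mean(toy_train$Sale_Price, na.rm = TRUE)

# Join the lookup onto the new data and fill unseen levels with the fallback
toy_new_enc <- toy_new %>%
  left_join(target_map, by = "Neighborhood") %>%
  mutate(Neighborhood_enc = coalesce(Neighborhood_enc, global_mean))

toy_new_enc
```

Computing the lookup from the training data alone keeps the response values of new observations out of the encoding.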

Target encoding runs the risk of data leakage since you are using the response variable to encode a feature. An alternative is to replace each level with the proportion of observations that fall in that level. In this case, North_Ames would be changed to 0.149.

ames_train %>%
  count(Neighborhood) %>%
  mutate(Proportion = n / sum(n)) %>%
  select(-n) %>%
  head(10) %>%
  kable(caption = 'Example of categorical proportion encoding the Neighborhood feature of the Ames housing data set.') %>%
  kable_styling(bootstrap_options = "striped", full_width = TRUE)

Table 3.3: Example of categorical proportion encoding the Neighborhood feature of the Ames housing data set.
Neighborhood	Proportion
North_Ames	0.1493411
College_Creek	0.0971205
Old_Town	0.0800390
Edwards	0.0639336
Somerset	0.0595412
Northridge_Heights	0.0566130
Gilbert	0.0541728
Sawyer	0.0527086
Northwest_Ames	0.0400195
Sawyer_West	0.0458760
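
As with the level means, the proportions can be stored as a lookup table and joined back onto the data to create the encoded feature. This is a small sketch rather than a definitive implementation: `toy_train`, `toy_new`, and the column name `Neighborhood_prop` are hypothetical, and assigning a proportion of 0 to levels that never appear in training is an assumption made for illustration.

```r
library(dplyr)

# Hypothetical toy data: level A appears 3 times, B twice, C five times
toy_train <- tibble(
  Neighborhood = c("A", "A", "A", "B", "B", "C", "C", "C", "C", "C")
)
toy_new <- tibble(Neighborhood = c("B", "D"))  # "D" never appears in training

# Lookup table: share of training rows that fall in each level
prop_map <- toy_train %>%
  count(Neighborhood) %>%
  mutate(Neighborhood_prop = n / sum(n)) %>%
  select(-n)

# Join the proportions onto the new data; unseen levels get 0 (an assumption)
toy_new_enc <- toy_new %>%
  left_join(prop_map, by = "Neighborhood") %>%
  mutate(Neighborhood_prop = coalesce(Neighborhood_prop, 0))

toy_new_enc
```

Because this encoding never touches the response, it avoids the leakage concern raised for target encoding, although two levels with similar frequencies become indistinguishable.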

For other alternatives to categorical encoding, refer to the embed package (https://embed.tidymodels.org/).
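
As one illustration of what the embed package offers, its `step_lencode_glm()` recipe step fits a generalized linear model of the outcome on a categorical predictor and uses the coefficients as the encoding, which is closely related to the target encoding shown earlier. The sketch below assumes the recipes, embed, and dplyr packages are installed; `toy_train` is hypothetical data, not the Ames set.

```r
library(dplyr)    # provides vars()
library(recipes)
library(embed)

# Hypothetical toy data standing in for ames_train
toy_train <- tibble(
  Neighborhood = factor(c("A", "A", "B", "B", "B", "C")),
  Sale_Price   = c(100, 200, 300, 400, 500, 600)
)

# Define a recipe that replaces Neighborhood with a GLM-based likelihood encoding
rec <- recipe(Sale_Price ~ Neighborhood, data = toy_train) %>%
  step_lencode_glm(Neighborhood, outcome = vars(Sale_Price))

# Estimate the encoding from the training data, then return the encoded training set
prepped <- prep(rec, training = toy_train)
encoded <- bake(prepped, new_data = NULL)

encoded
```

Wrapping the encoding in a recipe means it is re-estimated whenever the recipe is prepped on a resample, which is one way to limit the leakage risk noted above.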