Alternatives

Several alternative categorical encodings are implemented in various R machine learning engines and are worth exploring. One is target encoding, which replaces each categorical level with the mean (regression) or proportion (classification) of the target variable for that level. For example, target encoding the Neighborhood feature would change North_Ames to roughly 143517, the mean Sale_Price for that neighborhood in the training data.

# Mean Sale_Price per Neighborhood level (first 10 levels shown)
ames_train %>%
  group_by(Neighborhood) %>%
  summarize(`Avg Sale_Price` = mean(Sale_Price, na.rm = TRUE)) %>%
  head(10) %>%
  kable(caption = "Example of target encoding the Neighborhood feature of the Ames housing data set.") %>%
  kable_styling(bootstrap_options = "striped", full_width = TRUE)
Table 3.2: Example of target encoding the Neighborhood feature of the Ames housing data set.
Neighborhood          Avg Sale_Price
North_Ames                  143516.8
College_Creek               200523.5
Old_Town                    123908.0
Edwards                     133321.1
Somerset                    231310.8
Northridge_Heights          321503.1
Gilbert                     188170.0
Sawyer                      138925.1
Northwest_Ames              188151.8
Sawyer_West                 183598.1
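
The summary above only tabulates the level means. To use them as a feature, the means would typically be learned from the training rows and then joined onto any new data. The following is a minimal sketch of that idea, not a definitive implementation: `toy_train`, `toy_test`, `Neighborhood_enc`, and the sale prices are hypothetical stand-ins for the Ames data, and falling back to the overall training mean for unseen levels is an assumption made here for illustration.

```r
library(dplyr)

# Hypothetical toy stand-ins for ames_train and a hold-out set.
toy_train <- tibble(
  Neighborhood = c("North_Ames", "North_Ames", "Gilbert", "Gilbert", "Edwards"),
  Sale_Price   = c(140000, 150000, 180000, 200000, 130000)
)
toy_test <- tibble(
  Neighborhood = c("Gilbert", "North_Ames", "Somerset")  # Somerset is unseen in training
)

# Learn the encoding from the training rows only.
encoding <- toy_train %>%
  group_by(Neighborhood) %>%
  summarize(Neighborhood_enc = mean(Sale_Price, na.rm = TRUE))

# Assumed fallback for levels that never appear in training.
global_mean <- mean(toy_train$Sale_Price, na.rm = TRUE)

# Apply the learned encoding to new data.
toy_test_enc <- toy_test %>%
  left_join(encoding, by = "Neighborhood") %>%
  mutate(Neighborhood_enc = coalesce(Neighborhood_enc, global_mean))

print(toy_test_enc)
```

Because the lookup table is built from `toy_train` alone, the hold-out rows never contribute their own response values to the encoded feature.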

Target encoding runs the risk of data leakage because the response variable is used to encode a feature. An alternative is to replace each level with the proportion of observations that fall in that level. In this case, North_Ames would be changed to 0.149.

# Share of training observations in each Neighborhood level (first 10 levels shown)
ames_train %>%
  count(Neighborhood) %>%
  mutate(Proportion = n / sum(n)) %>%
  select(-n) %>%
  head(10) %>%
  kable(caption = "Example of categorical proportion encoding the Neighborhood feature of the Ames housing data set.") %>%
  kable_styling(bootstrap_options = "striped", full_width = TRUE)
Table 3.3: Example of categorical proportion encoding the Neighborhood feature of the Ames housing data set.
Neighborhood          Proportion
North_Ames             0.1493411
College_Creek          0.0971205
Old_Town               0.0800390
Edwards                0.0639336
Somerset               0.0595412
Northridge_Heights     0.0566130
Gilbert                0.0541728
Sawyer                 0.0527086
Northwest_Ames         0.0400195
Sawyer_West            0.0458760

For other alternatives to categorical encoding, refer to the embed package (https://embed.tidymodels.org/).
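
As a rough illustration of what using embed might look like, the sketch below applies `step_lencode_glm()`, one of the package's likelihood-encoding steps, inside a recipes pipeline. The `toy` data frame and its values are hypothetical, and this is only one of several encoding steps the package offers. For a numeric outcome, the fitted encoding should be close to the per-level mean, but that is worth confirming against the package documentation.

```r
library(dplyr)
library(recipes)
library(embed)

# Hypothetical toy data standing in for ames_train.
toy <- data.frame(
  Neighborhood = factor(c("North_Ames", "North_Ames", "Gilbert",
                          "Gilbert", "Edwards", "Edwards")),
  Sale_Price   = c(140000, 150000, 180000, 200000, 130000, 125000)
)

# Define the recipe: encode Neighborhood using the outcome Sale_Price.
rec <- recipe(Sale_Price ~ Neighborhood, data = toy) %>%
  step_lencode_glm(Neighborhood, outcome = vars(Sale_Price))

# Estimate the encoding on the training data, then return the encoded training set.
prepped <- prep(rec, training = toy)
encoded <- bake(prepped, new_data = NULL)

print(encoded)
```

Keeping the encoding inside a recipe means it is estimated during `prep()` on the training data and reapplied unchanged by `bake()` to new data.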