3.6 Categorical feature engineering

Most models require that the predictors take numeric form. There are exceptions; for example, tree-based models naturally handle numeric or categorical features. However, even tree-based models can benefit from preprocessing categorical features.

  • Lumping

Sometimes a feature contains levels that have very few observations. For example, 28 unique neighborhoods are represented in the Ames training data, but several of them have only a handful of observations.

ames_train %>% 
     count(Neighborhood) %>% 
     arrange(n)
## # A tibble: 28 × 2
##    Neighborhood            n
##    <fct>               <int>
##  1 Landmark                1
##  2 Green_Hills             2
##  3 Greens                  3
##  4 Blueste                 8
##  5 Veenker                15
##  6 Northpark_Villa        17
##  7 Bloomington_Heights    18
##  8 Meadow_Village         22
##  9 Briardale              23
## 10 Clear_Creek            26
## # ℹ 18 more rows
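
Counts like those above are easier to compare against a cutoff once they are expressed as shares of the rows. The following is a minimal base R sketch on made-up data: the `neighborhood` vector and the 5% cutoff are illustrative assumptions, not values from the Ames data.

```r
# Toy stand-in for a factor column such as Neighborhood (values are made up).
neighborhood <- factor(c(rep("A", 60), rep("B", 35), rep("C", 3), rep("D", 2)))

# Share of rows that falls in each level.
level_share <- prop.table(table(neighborhood))

# Levels whose share is below a hypothetical 5% cutoff.
rare_levels <- names(level_share)[as.vector(level_share) < 0.05]

# Collapse the rare levels by hand to see what lumping does conceptually.
lumped <- factor(ifelse(neighborhood %in% rare_levels,
                        "other",
                        as.character(neighborhood)))

print(round(level_share, 2))
print(rare_levels)
print(table(lumped))
```

This only illustrates the idea. Doing the same thing inside a recipe has the advantage that the set of kept levels is learned once from the training data and then reused on new data.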

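Lump levels with step_other - categorical

Levels whose share of the training rows falls below threshold (here 1%) are pooled into a single "other" level.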

lump_rec <- recipe(Sale_Price ~ ., data = ames_train) %>% 
     step_other(Neighborhood, threshold = 0.01, 
                other = "other")

lump_rec %>% 
     prep() %>% 
     bake(new_data = ames_train) %>%
     count(Neighborhood) %>% 
     arrange(n)
## # A tibble: 22 × 2
##    Neighborhood                                n
##    <fct>                                   <int>
##  1 Meadow_Village                             22
##  2 Briardale                                  23
##  3 Clear_Creek                                26
##  4 South_and_West_of_Iowa_State_University    32
##  5 Stone_Brook                                40
##  6 Northridge                                 51
##  7 Timberland                                 52
##  8 other                                      64
##  9 Brookside                                  74
## 10 Crawford                                   75
## # ℹ 12 more rows
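
Two details of step_other are worth checking on a small example: a threshold of 1 or more is read as a count rather than a proportion, and levels that first appear in new data are also sent to the "other" level. The sketch below uses made-up data; the data frames `train` and `new` and the column `grp` are hypothetical names chosen for illustration.

```r
library(recipes)

# Toy training data (made up): level "c" is rare.
train <- data.frame(
  y   = seq_len(100),
  grp = factor(c(rep("a", 60), rep("b", 38), rep("c", 2)))
)

# New data containing the rare level "c" and an unseen level "d".
new <- data.frame(
  y   = c(1, 2, 3),
  grp = factor(c("a", "c", "d"))
)

# threshold >= 1 is treated as a minimum count instead of a proportion.
rec <- recipe(y ~ grp, data = train) %>%
  step_other(grp, threshold = 5, other = "other")

prepped <- prep(rec, training = train)

# Levels retained by the step, as learned from the training data.
print(tidy(prepped, number = 1))

# Apply the learned lumping to the new data.
baked <- bake(prepped, new_data = new)
print(baked$grp)
```

Because the kept levels are fixed at prep() time, the same lumping is applied to any later data set, which is what keeps training and test features consistent.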

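Lump levels with step_mutate - numerical

A numeric feature dominated by a single value can be collapsed into a two-level flag. Screen_Porch, for example, is 0 for the large majority of the training rows.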

ames_train %>% 
     count(Screen_Porch) %>% 
     arrange(desc(n))
## # A tibble: 95 × 2
##    Screen_Porch     n
##           <int> <int>
##  1            0  1869
##  2          144    11
##  3          192     8
##  4          168     7
##  5          180     7
##  6          120     5
##  7          224     5
##  8          100     4
##  9          160     4
## 10          189     4
## # ℹ 85 more rows

lump_screen_porch_rec <- recipe(Sale_Price ~ ., data = ames_train) %>% 
     step_mutate(Screen_Porch_flag = ifelse(Screen_Porch == 0, '0', '>0')) %>% 
     step_string2factor(Screen_Porch_flag)

lump_screen_porch_rec %>% 
     prep() %>% 
     bake(new_data = ames_train) %>%
     count(Screen_Porch_flag) %>% 
     arrange(n)
## # A tibble: 2 × 2
##   Screen_Porch_flag     n
##   <fct>             <int>
## 1 >0                  180
## 2 0                  1869
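
In the output above, ">0" is listed before "0" because the factor levels are sorted as strings. If a fixed level order is preferred, one option is to build the factor directly inside step_mutate with explicit levels. This is a sketch on made-up data, not the approach used above; `toy`, `porch`, and `porch_flag` are hypothetical names.

```r
library(recipes)

# Toy data (made up): most porch areas are zero, a few are positive.
toy <- data.frame(
  price = c(100, 120, 150, 130, 170, 160),
  porch = c(0L, 0L, 144L, 0L, 192L, 0L)
)

flag_rec <- recipe(price ~ ., data = toy) %>%
  step_mutate(
    # Explicit levels fix the order instead of relying on string sorting.
    porch_flag = factor(ifelse(porch == 0, "0", ">0"), levels = c("0", ">0"))
  )

# new_data = NULL returns the processed training data.
flagged <- flag_rec %>%
  prep() %>%
  bake(new_data = NULL)

print(table(flagged$porch_flag))
```

The level order matters mainly for downstream steps such as dummy encoding, where the first level is typically used as the reference.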

Comment: step_normalize combines both step_center and step_scale.