3.6 Categorical feature engineering
Most models require that the predictors take numeric form. There are exceptions; for example, tree-based models naturally handle numeric or categorical features. However, even tree-based models can benefit from preprocessing categorical features.
- Lumping
Sometimes features contain levels that have very few observations. For example, the Ames housing training data contain 28 unique neighborhoods, but several of them have only a few observations.
ames_train %>%
  count(Neighborhood) %>%
  arrange(n)
## # A tibble: 28 × 2
## Neighborhood n
## <fct> <int>
## 1 Landmark 1
## 2 Green_Hills 2
## 3 Greens 3
## 4 Blueste 8
## 5 Veenker 15
## 6 Northpark_Villa 17
## 7 Bloomington_Heights 18
## 8 Meadow_Village 22
## 9 Briardale 23
## 10 Clear_Creek 26
## # ℹ 18 more rows
Lump levels with step_other
- categorical
# Pool neighborhoods that make up less than 1% of the training rows into "other"
lump_rec <- recipe(Sale_Price ~ ., data = ames_train) %>%
  step_other(Neighborhood, threshold = 0.01,
             other = "other")
lump_rec %>%
  prep() %>%
  bake(new_data = ames_train) %>%
  count(Neighborhood) %>%
  arrange(n)
## # A tibble: 22 × 2
## Neighborhood n
## <fct> <int>
## 1 Meadow_Village 22
## 2 Briardale 23
## 3 Clear_Creek 26
## 4 South_and_West_of_Iowa_State_University 32
## 5 Stone_Brook 40
## 6 Northridge 51
## 7 Timberland 52
## 8 other 64
## 9 Brookside 74
## 10 Crawford 75
## # ℹ 12 more rows
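The 64 observations in the new "other" level can be checked by hand. The sketch below is a minimal base R check, not part of the original recipe. It assumes that a threshold below 1 is read as a proportion of training rows (which is how the recipes documentation describes it) and that ames_train has 2049 rows, the sum of the two Screen_Porch_flag counts (1869 + 180) shown later in this section. The names n_rare, n_train, and pooled are introduced here for illustration only.

```r
# Counts of the ten rarest neighborhoods, copied from the first table above
n_rare <- c(Landmark = 1, Green_Hills = 2, Greens = 3, Blueste = 8,
            Veenker = 15, Northpark_Villa = 17, Bloomington_Heights = 18,
            Meadow_Village = 22, Briardale = 23, Clear_Creek = 26)

# Assumed number of training rows: 1869 + 180 from the Screen_Porch_flag counts
n_train <- 1869 + 180

# Levels whose share of the training rows falls below the 1% threshold
pooled <- n_rare[n_rare / n_train < 0.01]

names(pooled)  # the seven neighborhoods that get lumped together
sum(pooled)    # 64, matching the "other" row in the baked output
```

Seven levels fall below the cutoff of roughly 20.5 observations, so the 28 original levels become 28 - 7 + 1 = 22, which agrees with the 22-row tibble above.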
Lump levels with step_mutate
- numerical
ames_train %>%
  count(Screen_Porch) %>%
  arrange(-n)
## # A tibble: 95 × 2
## Screen_Porch n
## <int> <int>
## 1 0 1869
## 2 144 11
## 3 192 8
## 4 168 7
## 5 180 7
## 6 120 5
## 7 224 5
## 8 100 4
## 9 160 4
## 10 189 4
## # ℹ 85 more rows
# Collapse Screen_Porch into a two-level flag: no screen porch ('0') vs. any ('>0')
lump_screen_porch_rec <- recipe(Sale_Price ~ ., data = ames_train) %>%
  step_mutate(Screen_Porch_flag = ifelse(Screen_Porch == 0, '0', '>0')) %>%
  step_string2factor(Screen_Porch_flag)
lump_screen_porch_rec %>%
  prep() %>%
  bake(new_data = ames_train) %>%
  count(Screen_Porch_flag) %>%
  arrange(n)
## # A tibble: 2 × 2
## Screen_Porch_flag n
## <fct> <int>
## 1 >0 180
## 2 0 1869
Comment: step_normalize combines both step_center and step_scale.
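One way to see this equivalence is to run both versions on a small data set and compare the results. The sketch below uses a made-up data frame (toy, with columns y and x) that is not part of the Ames data. It assumes the recipes and dplyr packages are installed.

```r
library(recipes)
library(dplyr)

# Hypothetical toy data, for illustration only
toy <- data.frame(y = c(1, 2, 3, 4, 5),
                  x = c(10, 20, 30, 40, 60))

# One step: subtract the mean and divide by the standard deviation
normalized <- recipe(y ~ x, data = toy) %>%
  step_normalize(x) %>%
  prep() %>%
  bake(new_data = NULL)

# Two steps: center first, then scale
centered_scaled <- recipe(y ~ x, data = toy) %>%
  step_center(x) %>%
  step_scale(x) %>%
  prep() %>%
  bake(new_data = NULL)

# Compare the transformed predictor from both recipes
all.equal(normalized$x, centered_scaled$x)
```

Centering does not change the standard deviation, so scaling after centering divides by the same value that step_normalize uses.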