One-hot & dummy encoding
Many models require that all predictor variables be numeric. Consequently, we need to intelligently transform any categorical variables into numeric representations so that these algorithms can compute. Some packages automate this process (e.g., h2o and caret) while others do not (e.g., glmnet and keras). There are many ways to recode categorical variables as numeric (e.g., one-hot, ordinal, binary, sum, and Helmert).
The most common is referred to as one-hot encoding, where we transform a categorical variable so that each level of the feature is represented as a Boolean value.
We can one-hot or dummy encode with the same function (step_dummy()). By default, step_dummy() will create a full-rank (dummy) encoding, which drops one level to avoid perfect collinearity, but you can change this to a one-hot encoding by setting one_hot = TRUE.
Note: Since one-hot encoding adds new features it can significantly increase the dimensionality of our data.
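To make the full-rank versus one-hot distinction concrete, the following is a minimal sketch on a toy data frame (not the Ames data): with three levels, the default dummy encoding produces two columns while one_hot = TRUE produces three.

```r
library(recipes)

# Toy data: a factor with three levels
df <- data.frame(
  y = c(1, 2, 3, 4),
  x = factor(c("a", "b", "c", "a"))
)

# Dummy (full-rank) encoding: k - 1 = 2 new columns (x_b, x_c)
recipe(y ~ ., data = df) %>%
  step_dummy(x) %>%
  prep() %>%
  bake(new_data = NULL)

# One-hot encoding: k = 3 new columns (x_a, x_b, x_c)
recipe(y ~ ., data = df) %>%
  step_dummy(x, one_hot = TRUE) %>%
  prep() %>%
  bake(new_data = NULL)
```

Dropping one level in the default encoding matters for models (e.g., ordinary least squares) that cannot handle perfectly collinear predictors; tree-based models are typically indifferent.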
Label encoding
Label encoding is a pure numeric conversion of the levels of a categorical variable. If a categorical variable is a factor and it has pre-specified levels then the numeric conversion will be in level order. If no levels are specified, the encoding will be based on alphabetical order.
ames_train %>%
  count(MS_SubClass)
## # A tibble: 15 × 2
## MS_SubClass n
## <fct> <int>
## 1 One_Story_1946_and_Newer_All_Styles 756
## 2 One_Story_1945_and_Older 94
## 3 One_Story_with_Finished_Attic_All_Ages 4
## 4 One_and_Half_Story_Unfinished_All_Ages 13
## 5 One_and_Half_Story_Finished_All_Ages 203
## 6 Two_Story_1946_and_Newer 404
## 7 Two_Story_1945_and_Older 90
## 8 Two_and_Half_Story_All_Ages 14
## 9 Split_or_Multilevel 85
## 10 Split_Foyer 37
## 11 Duplex_All_Styles_and_Ages 76
## 12 One_Story_PUD_1946_and_Newer 132
## 13 Two_Story_PUD_1946_and_Newer 86
## 14 PUD_Multilevel_Split_Level_Foyer 12
## 15 Two_Family_conversion_All_Styles_and_Ages 43
# Label encoded
recipe(Sale_Price ~ ., data = ames_train) %>%
step_integer(MS_SubClass) %>%
prep(ames_train) %>%
bake(ames_train) %>%
count(MS_SubClass)
## # A tibble: 15 × 2
## MS_SubClass n
## <int> <int>
## 1 1 756
## 2 2 94
## 3 3 4
## 4 4 13
## 5 5 203
## 6 6 404
## 7 7 90
## 8 8 14
## 9 9 85
## 10 10 37
## 11 11 76
## 12 12 132
## 13 14 86
## 14 15 12
## 15 16 43
We should be careful with label encoding unordered categorical features because most models will treat them as ordered numeric features. If a categorical feature is naturally ordered then label encoding is a natural choice (most commonly referred to as ordinal encoding). For example, the various quality features in the Ames housing data are ordinal in nature (ranging from Very_Poor to Very_Excellent).
ames_train %>%
  select(contains('Qual'))
## # A tibble: 2,049 × 6
## Overall_Qual Exter_Qual Bsmt_Qual Low_Qual_Fin_SF Kitchen_Qual Garage_Qual
## <fct> <fct> <fct> <int> <fct> <fct>
## 1 Average Typical Typical 0 Typical Typical
## 2 Above_Average Typical Typical 0 Typical Typical
## 3 Good Typical Good 0 Typical Typical
## 4 Above_Average Typical Good 0 Typical Typical
## 5 Average Typical Typical 0 Typical Typical
## 6 Above_Average Typical No_Basement 0 Typical Typical
## 7 Average Typical Typical 0 Typical Typical
## 8 Average Typical Typical 0 Typical Typical
## 9 Average Typical Typical 0 Typical No_Garage
## 10 Average Typical Typical 0 Typical Typical
## # ℹ 2,039 more rows
Ordinal encoding these features provides a natural and intuitive interpretation and can logically be applied to all models.
ames_train %>%
  count(Overall_Qual)
## # A tibble: 10 × 2
## Overall_Qual n
## <fct> <int>
## 1 Very_Poor 3
## 2 Poor 9
## 3 Fair 33
## 4 Below_Average 147
## 5 Average 580
## 6 Above_Average 521
## 7 Good 408
## 8 Very_Good 242
## 9 Excellent 82
## 10 Very_Excellent 24
# Label encoded
recipe(Sale_Price ~ ., data = ames_train) %>%
step_integer(Overall_Qual) %>%
prep(ames_train) %>%
bake(ames_train) %>%
count(Overall_Qual)
## # A tibble: 10 × 2
## Overall_Qual n
## <int> <int>
## 1 1 3
## 2 2 9
## 3 3 33
## 4 4 147
## 5 5 580
## 6 6 521
## 7 7 408
## 8 8 242
## 9 9 82
## 10 10 24
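Since all of the quality features share the same kind of ordering, the same step can encode several of them at once using a tidyselect selector. The following is a minimal sketch on a toy data frame with made-up values; note that on the real Ames data matches("Qual") would also catch the numeric column Low_Qual_Fin_SF shown above, so there you may prefer to name the factor columns explicitly.

```r
library(recipes)

# Toy data with two ordered quality factors (levels in quality order)
qual_levels <- c("Poor", "Typical", "Good")
df <- data.frame(
  Sale_Price   = c(100, 200, 150),
  Exter_Qual   = factor(c("Poor", "Typical", "Good"), levels = qual_levels),
  Kitchen_Qual = factor(c("Typical", "Good", "Good"), levels = qual_levels)
)

# Ordinal encode every column whose name contains "Qual"
recipe(Sale_Price ~ ., data = df) %>%
  step_integer(matches("Qual")) %>%
  prep() %>%
  bake(new_data = NULL)
```

Because the factor levels are pre-specified in quality order, step_integer() maps Poor to 1, Typical to 2, and Good to 3 for both columns.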