5.1 Creating Dummy Variables for Unordered Categories
There are many methods for doing this and, to illustrate, consider a simple example for the day of the week. If we take the seven possible values and convert them into binary dummy variables, the mathematical function required to make the translation is often referred to as a contrast. The most common contrast, the reference cell (or "treatment") parameterization, uses one day as a reference and creates a binary indicator column for each of the remaining six days. These six numeric predictors would take the place of the original categorical variable.
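As a quick sketch (using base R's default treatment contrasts and a synthetic day-of-week factor, not anything from this chapter's data), the seven values collapse to six binary columns:

```r
# Minimal sketch: the default "treatment" contrast in base R uses the first
# factor level (Sunday here) as the reference cell, so seven days yield six
# dummy columns.
dow <- factor(c("Sun", "Mon", "Tue", "Wed", "Thu", "Fri", "Sat"),
              levels = c("Sun", "Mon", "Tue", "Wed", "Thu", "Fri", "Sat"))

model.matrix(~ dow)[, -1]   # drop the intercept column; six indicator columns remain
```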
Why only six?
- If the values of the six dummy variables are known, then the seventh can be directly inferred.
- When the model has an intercept, an additional column of ones for all rows is included. Estimating the parameters of a linear model (as well as other similar models) involves inverting a matrix computed from the predictor columns. If the model includes an intercept and contains dummy variables for all seven days, then the seven day columns would add up (row-wise) to the intercept column, and this linear combination would prevent the matrix inverse from being computed (it is singular). The sketch below illustrates the dependency.
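A minimal sketch of that dependency, reusing the synthetic `dow` factor from the earlier sketch: with an intercept plus all seven indicator columns, the design matrix is rank deficient.

```r
# Keep all seven indicator columns (no contrast) and prepend an intercept.
X <- cbind(`(Intercept)` = 1, model.matrix(~ dow + 0))

qr(X)$rank                        # 7, not 8: the matrix is rank deficient
all(rowSums(X[, -1]) == X[, 1])   # TRUE: the seven day columns sum to the intercept
# solve(crossprod(X))             # would error: the crossproduct matrix is singular
```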
Encodings that keep the complete set of indicator columns are less than full rank and are sometimes called "one-hot" encodings.
Generating the full set of indicator variables may be advantageous for some models that are insensitive to linear dependencies (an example: glmnet).
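As a rough sketch of how that might look with tidymodels (the data frame `chicago_train` and the columns `ridership` and `dow` are assumptions made here for illustration, not objects defined in this chapter), a penalized glmnet model can be paired with a one-hot recipe:

```r
library(tidymodels)

# Hypothetical setup: `chicago_train` with an outcome `ridership` and a
# day-of-week factor `dow` is assumed purely for illustration.
rec <- recipe(ridership ~ dow, data = chicago_train) |>
  step_dummy(all_nominal_predictors(), one_hot = TRUE)

# glmnet's penalty makes it tolerant of the linear dependency among the
# full set of indicator columns.
glmnet_spec <- linear_reg(penalty = 0.01, mixture = 0.5) |>
  set_engine("glmnet")

workflow() |>
  add_recipe(rec) |>
  add_model(glmnet_spec) |>
  fit(data = chicago_train)
```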
What is the interpretation of the dummy variables?
Consider a linear model for the Chicago transit data that only uses the day of the week.
When the training set is used to fit the model, the intercept estimates the mean of the reference cell: the average number of Sunday riders in the training set, estimated to be 3.84K people. The second model parameter, for Monday, is estimated to be 12.61K. In the reference cell model, each dummy variable's coefficient represents the mean value above and beyond the reference cell mean. In this case, the estimate indicates that there were 12.61K more riders on Monday than on Sunday.
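To make the arithmetic concrete, the predicted mean for any non-reference day is the intercept plus that day's coefficient; using the two estimates quoted above:

```r
# Reference cell arithmetic with the estimates quoted above (in thousands):
intercept_sun <- 3.84     # mean Sunday ridership (the reference cell)
beta_mon      <- 12.61    # Monday effect relative to Sunday
intercept_sun + beta_mon  # predicted mean Monday ridership: 16.45K
```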
```r
library(tidymodels)
library(knitr)
library(kableExtra)

train_df <- tibble(m = month.abb, number = seq(1, 12, by = 1))

recipe(number ~ m, data = train_df) |>
  step_dummy(all_nominal_predictors(), one_hot = FALSE) |>
  prep() |>
  bake(new_data = NULL, all_predictors()) |>
  kable(
    caption = "Preprocessing without one-hot encoding (the default); April is the reference level"
  ) |>
  row_spec(row = 4, background = "orange") |>
  kable_styling("striped", full_width = FALSE) |>
  scroll_box(width = "800px")
```
m_Aug | m_Dec | m_Feb | m_Jan | m_Jul | m_Jun | m_Mar | m_May | m_Nov | m_Oct | m_Sep |
---|---|---|---|---|---|---|---|---|---|---|
0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 |
0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 |
0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 |
0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 |
1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 |
0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 |
0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 |
0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
```r
recipe(number ~ m, data = train_df) |>
  step_dummy(all_nominal_predictors(), one_hot = TRUE) |>
  prep() |>
  bake(new_data = NULL, all_predictors()) |>
  kable(caption = "Preprocessing with one-hot encoding.") |>
  kable_styling("striped", full_width = FALSE) |>
  scroll_box(width = "800px")
```
m_Apr | m_Aug | m_Dec | m_Feb | m_Jan | m_Jul | m_Jun | m_Mar | m_May | m_Nov | m_Oct | m_Sep |
---|---|---|---|---|---|---|---|---|---|---|---|
0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 |
1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 |
0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 |
0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 |
0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 |
0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 |
0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 |
0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |