5.1 Creating Dummy Variables for Unordered Categories

There are many methods for doing this. To illustrate, consider a simple example: the day of the week. If we take the seven possible values and convert them into binary dummy variables, the mathematical function that makes the translation is often referred to as a contrast.

The standard contrast produces six dummy variables, one fewer than the number of categories; these six numeric predictors take the place of the original categorical variable.
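This translation can be seen directly in base R; a minimal sketch using the default treatment contrasts (the day ordering below is illustrative):

```r
# The contrast maps a seven-level factor to six binary columns.
# Base R's default "treatment" coding uses the first level as the reference.
dow <- factor(c("Sun", "Mon", "Tue", "Wed", "Thu", "Fri", "Sat"),
              levels = c("Sun", "Mon", "Tue", "Wed", "Thu", "Fri", "Sat"))
contrasts(dow)  # a 7-row (levels) by 6-column (dummy variables) matrix
```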

Why only six?

  • If the values of six of the dummy variables are known, then the seventh can be directly inferred.
  • When the model has an intercept, an additional initial column of ones is included for all rows. Estimating the parameters of a linear model (as well as of other similar models) involves inverting a matrix computed from the predictors. If the model includes an intercept and contains dummy variables for all seven days, the seven day columns would add up (row-wise) to the intercept column, and this linear combination would prevent the matrix inverse from being computed (as the matrix is singular).
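The rank deficiency can be verified numerically; a small sketch with simulated day labels (not the transit data):

```r
# With an intercept and all seven indicator columns, the columns are
# linearly dependent: the seven day columns sum (row-wise) to the intercept.
days <- factor(rep(c("Sun", "Mon", "Tue", "Wed", "Thu", "Fri", "Sat"), 2))
X_full <- cbind(intercept = 1, model.matrix(~ days - 1))  # 8 columns
qr(X_full)$rank  # rank 7 < 8 columns, so X_full is singular
# Dropping one indicator (the default contrast) restores full rank:
X_ref <- model.matrix(~ days)  # intercept + 6 dummies = 7 columns
qr(X_ref)$rank  # rank 7 = number of columns
```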

Keeping the full set of indicator variables produces a less-than-full-rank encoding, sometimes called a “one-hot” encoding.

Generating the full set of indicator variables can be advantageous for models that are insensitive to linear dependencies (for example, glmnet).

What is the interpretation of the dummy variables?

Consider a linear model for the Chicago transit data that only uses the day of the week.

Using the training set to fit the model, the intercept estimates the mean of the reference cell: the average number of Sunday riders in the training set, estimated to be 3.84K people. The second model parameter, for Monday, is estimated to be 12.61K. In the reference cell model, each dummy variable coefficient represents the mean value above and beyond the reference cell mean. In this case, the estimate indicates that there were 12.61K more riders on Monday than on Sunday.
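The reference-cell interpretation can be reproduced on a toy data set (the ridership values below are made up; only the structure mirrors the Chicago example):

```r
# In reference-cell coding, the intercept is the mean of the reference
# level (Sunday) and each coefficient is a difference from that mean.
toy <- data.frame(
  day    = factor(c("Sun", "Mon", "Sun", "Mon"), levels = c("Sun", "Mon")),
  riders = c(3.5, 16.0, 4.2, 16.9)  # made-up ridership, in thousands
)
fit <- lm(riders ~ day, data = toy)
coef(fit)
# (Intercept) = 3.85  -> mean Sunday ridership
# dayMon      = 12.60 -> Monday mean (16.45) minus Sunday mean (3.85)
```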

library(tibble)
library(recipes)
library(knitr)       # kable()
library(kableExtra)  # row_spec(), kable_styling(), scroll_box()

train_df <- tibble(m = month.abb, number = 1:12)

recipe(number ~ m, data = train_df) |> 
  step_dummy(all_nominal_predictors(), one_hot = FALSE) |> 
  prep() |> 
  bake(new_data = NULL, all_predictors()) |> 
  kable(
    caption = "Preprocessing without one-hot encoding (the default); April is the reference level"
  ) |> 
  row_spec(row = 4, background = "orange") |> 
  kable_styling("striped", full_width = FALSE) |> 
  scroll_box(width = "800px")
Table: Preprocessing without one-hot encoding (the default); April (row 4) is the reference level.

m_Aug m_Dec m_Feb m_Jan m_Jul m_Jun m_Mar m_May m_Nov m_Oct m_Sep
    0     0     0     1     0     0     0     0     0     0     0
    0     0     1     0     0     0     0     0     0     0     0
    0     0     0     0     0     0     1     0     0     0     0
    0     0     0     0     0     0     0     0     0     0     0
    0     0     0     0     0     0     0     1     0     0     0
    0     0     0     0     0     1     0     0     0     0     0
    0     0     0     0     1     0     0     0     0     0     0
    1     0     0     0     0     0     0     0     0     0     0
    0     0     0     0     0     0     0     0     0     0     1
    0     0     0     0     0     0     0     0     0     1     0
    0     0     0     0     0     0     0     0     1     0     0
    0     1     0     0     0     0     0     0     0     0     0
recipe(number ~ m, data = train_df) |> 
  step_dummy(all_nominal_predictors(), one_hot = TRUE) |> 
  prep() |> 
  bake(new_data = NULL, all_predictors()) |> 
  kable(
    caption = "Preprocessing with one-hot encoding"
  ) |> 
  kable_styling("striped", full_width = FALSE) |> 
  scroll_box(width = "800px")
Table: Preprocessing with one-hot encoding.

m_Apr m_Aug m_Dec m_Feb m_Jan m_Jul m_Jun m_Mar m_May m_Nov m_Oct m_Sep
    0     0     0     0     1     0     0     0     0     0     0     0
    0     0     0     1     0     0     0     0     0     0     0     0
    0     0     0     0     0     0     0     1     0     0     0     0
    1     0     0     0     0     0     0     0     0     0     0     0
    0     0     0     0     0     0     0     0     1     0     0     0
    0     0     0     0     0     0     1     0     0     0     0     0
    0     0     0     0     0     1     0     0     0     0     0     0
    0     1     0     0     0     0     0     0     0     0     0     0
    0     0     0     0     0     0     0     0     0     0     0     1
    0     0     0     0     0     0     0     0     0     0     1     0
    0     0     0     0     0     0     0     0     0     1     0     0
    0     0     1     0     0     0     0     0     0     0     0     0