5.4 Supervised Encoding Methods
Beyond dummy variables, there are many other ways to craft one or more numerical features from a set of nominal predictors. They include the methods described below.
5.4.1 Likelihood Encoding
In essence, the effect of the factor level on the outcome is measured and this effect is used as new numeric features. For example, for the Ames housing data, we might calculate the mean or median sale price of a house for each neighborhood from the training data and use this statistic to represent the factor level in the model.
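As a minimal sketch of that idea (assuming the modeldata package, which contains the Ames data, is installed), the per-neighborhood statistic can be computed with dplyr; in practice it should be calculated from the training set only:

library(dplyr)

# mean sale price for each neighborhood; these values would stand in for the
# Neighborhood factor levels in the model
modeldata::ames |>
  group_by(Neighborhood) |>
  summarize(mean_sale_price = mean(Sale_Price), .groups = "drop") |>
  slice_head(n = 5)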
For classification problems, a simple logistic regression model can be used to measure the effect between the categorical outcome and the categorical predictor.
If the outcome event occurs with rate \(p\), the odds of that event are defined as \(p / (1 - p)\).
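As an illustration of how the encoding is formed (a sketch, not taken from a specific package), if level \(c\) of the predictor has a within-level event rate \(p_c\), the single numeric value that replaces that level is its estimated log-odds:

\[ \text{encoding}(c) = \log\left(\frac{p_c}{1 - p_c}\right) \]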
Below is an example of a single generalized linear model applied to the hair color feature, which would otherwise have 12 dummy levels.
as_tibble(dplyr::starwars) |>
count(hair_color)
## # A tibble: 13 × 2
## hair_color n
## <chr> <int>
## 1 auburn 1
## 2 auburn, grey 1
## 3 auburn, white 1
## 4 black 13
## 5 blond 3
## 6 blonde 1
## 7 brown 18
## 8 brown, grey 1
## 9 grey 1
## 10 none 37
## 11 unknown 1
## 12 white 4
## 13 <NA> 5
recipe(skin_color ~ hair_color + eye_color + mass,
       data = as_tibble(dplyr::starwars)) |>
  embed::step_lencode_glm(hair_color, outcome = "skin_color") |>
  prep() |>
  bake(new_data = NULL) |>
  slice_sample(n = 10) |>
  kable(caption = "Starwars Characters hair_color GLM embedding") |>
  kable_styling("striped", full_width = FALSE)
Starwars Characters hair_color GLM embedding

| hair_color | eye_color | mass | skin_color |
|---|---|---|---|
| -2.862201 | red | 113 | green |
| -21.566069 | yellow | 55 | fair, green, yellow |
| -21.566069 | blue | 89 | fair |
| -21.566069 | brown | 79 | tan |
| -21.566069 | brown | 84 | light |
| -2.862201 | black | NA | white, blue |
| -21.566069 | brown | NA | tan |
| -21.566069 | blue | NA | fair |
| -2.862201 | red | 140 | metal |
| -21.566069 | brown | NA | light |
While very fast, this method has drawbacks. For example, what happens when a factor level has only a single observation? Theoretically, the log-odds should be infinite in the appropriate direction but, numerically, it is usually capped at a large (and inaccurate) value.
One way around this issue is to use some type of shrinkage method. For example, the overall log-odds can be determined and, if the quality of the data within a factor level is poor, then this level’s effect estimate can be biased towards an overall estimate that disregards the levels of the predictor.
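One way to picture this shrinkage (an illustrative form, not the exact estimator used by the packages below) is as a weighted average of the within-level estimate and the overall estimate, with a weight \(\lambda_c\) that grows with the amount of data in level \(c\):

\[ \hat{\theta}_c = \lambda_c \, \hat{\theta}_c^{\,\text{level}} + (1 - \lambda_c) \, \hat{\theta}^{\,\text{overall}}, \qquad 0 \le \lambda_c \le 1 \]

Sparse or noisy levels get \(\lambda_c\) near zero and are pulled toward the overall estimate; well-populated levels keep values close to their own estimates.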
A common method for shrinking parameter estimates is Bayesian analysis. (One lingering doubt: step_lencode_bayes appears to work only with two-class outcomes.)
as_tibble(datasets::Titanic) |>
count(Class)
recipe(Survived ~ Class + Sex + Age,
       data = as_tibble(datasets::Titanic)) |>
  embed::step_lencode_bayes(Class, outcome = "Survived") |>
  prep() |>
  bake(new_data = NULL) |>
  slice_sample(n = 10)
# A tibble: 10 × 4
Class Sex Age Survived
<dbl> <fct> <fct> <fct>
1 -0.0108 Female Adult Yes
2 -0.0108 Male Adult No
3 -0.00526 Male Child Yes
4 -0.00526 Female Child No
5 -0.00993 Male Adult No
6 -0.0104 Male Child Yes
7 -0.00526 Male Child No
8 -0.0108 Female Adult No
9 -0.00993 Male Child Yes
10 -0.0104 Female Child No
Empirical Bayes methods can also be used, in the form of linear (and generalized linear) mixed models.
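As a sketch of that approach (assuming the embed and lme4 packages are installed), embed::step_lencode_mixed() fits a generalized linear mixed model with the factor levels as random effects and uses the per-level estimates as the encoding:

# hedged sketch: the Titanic example again, encoded with a mixed-model
# (empirical Bayes style) estimate rather than the fully Bayesian one
recipe(Survived ~ Class + Sex + Age,
       data = as_tibble(datasets::Titanic)) |>
  embed::step_lencode_mixed(Class, outcome = "Survived") |>
  prep() |>
  bake(new_data = NULL) |>
  slice_sample(n = 10)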
One issue with effect encoding, independent of the estimation method, is that it increases the possibility of overfitting the training data. Because the encoding is estimated from the same outcome the model will predict, it should be computed inside the resampling scheme so that performance estimates are not optimistically biased.
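A minimal sketch of doing this with tidymodels (assuming the tidymodels meta-package, which bundles rsample, workflows, parsnip, and tune, is installed): the supervised encoding lives inside the recipe, so it is re-estimated on every analysis set rather than on the full training data.

library(tidymodels)

# factor outcome so that logistic_reg() gets the class it expects
titanic_tbl <- as_tibble(datasets::Titanic) |>
  mutate(Survived = factor(Survived))

lencode_rec <- recipe(Survived ~ Class + Sex + Age, data = titanic_tbl) |>
  embed::step_lencode_glm(Class, outcome = "Survived")

lencode_wf <- workflow() |>
  add_recipe(lencode_rec) |>
  add_model(logistic_reg())

# the encoding is re-fit within each analysis set, so the assessment sets give
# an honest estimate of performance
fit_resamples(lencode_wf, resamples = vfold_cv(titanic_tbl, v = 5))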
Another supervised approach comes from the deep learning literature on the analysis of textual data. In addition to the dimension reduction, there is the possibility that these methods can estimate semantic relationships between words so that words with similar themes (e.g., “dog”, “pet”, etc.) have similar values in the new encodings. This technique is not limited to text data and can be used to encode any type of qualitative variable.
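For non-text data, one option (sketched here on the assumption that embed::step_embed() and a working keras/TensorFlow installation are available) is to learn a small entity embedding for a factor directly from the outcome:

# hedged sketch: learn a 3-dimensional embedding for hair_color with a small
# neural network supervised by skin_color (requires keras/TensorFlow)
recipe(skin_color ~ hair_color + eye_color + mass,
       data = as_tibble(dplyr::starwars)) |>
  embed::step_embed(hair_color, outcome = vars(skin_color), num_terms = 3) |>
  prep() |>
  bake(new_data = NULL)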
Below is an example using The Office dialogue and one of the pre-trained GloVe embeddings.
library(textrecipes)
library(schrute)

glove6b <- textdata::embedding_glove6b(dimensions = 100) # the download is 822.2 Mb

schrute::theoffice |>
  slice_sample(n = 10) |>
  select(character, text)
recipe(character ~ text,
       data = schrute::theoffice) |>
  step_tokenize(text, options = list(strip_punct = TRUE)) |>
  step_stem(text) |>
  step_word_embeddings(text, embeddings = glove6b) |>
  prep() |>
  bake(new_data = schrute::theoffice |> slice_sample(n = 10))
The Office dialogue word embeddings with glove6b
# A tibble: 10 × 101
character wordembe…¹ worde…² worde…³ worde…⁴ worde…⁵ worde…⁶
<fct> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
1 Angela -1.02 -0.377 0.797 -1.21 -0.802 0.656
2 Roy 0 0 0 0 0 0
3 Phyllis -1.20 -0.373 3.20 -1.60 -1.62 -0.160
4 Kevin -1.54 1.77 5.02 -4.68 -5.23 2.91
5 Roy -2.15 0.735 4.07 -2.20 -1.30 0.297
6 Jim -0.595 0.419 0.699 -0.328 -1.20 1.70
7 Kelly -2.17 4.38 5.97 -4.91 -4.21 4.13
8 Katy -0.891 -0.889 0.0937 -0.859 1.42 1.49
9 Kevin -0.0308 0.120 0.539 -0.437 -0.739 -0.153
10 Jim -0.395 0.240 1.14 -1.27 -1.47 1.39
# … with 94 more variables: wordembed_text_d7 <dbl>,
# wordembed_text_d8 <dbl>, wordembed_text_d9 <dbl>,
# wordembed_text_d10 <dbl>, wordembed_text_d11 <dbl>,
# wordembed_text_d12 <dbl>, wordembed_text_d13 <dbl>,
# wordembed_text_d14 <dbl>, wordembed_text_d15 <dbl>,
# wordembed_text_d16 <dbl>, wordembed_text_d17 <dbl>,
# wordembed_text_d18 <dbl>, wordembed_text_d19 <dbl>, …
Note that in place of thousands of sparse dummy columns for each tokenized word, the training set consists of 100 numeric feature dimensions.
See also the "Textrecipes series: Pretrained Word Embedding" post by Emil Hvitfeldt.