5.4 Supervised Encoding Methods
Beyond dummy variables, there are many other ways to craft one or more numerical features from a set of nominal predictors. They include the methods described below.
5.4.1 Likelihood Encoding
In essence, the effect of the factor level on the outcome is measured and this effect is used as new numeric features. For example, for the Ames housing data, we might calculate the mean or median sale price of a house for each neighborhood from the training data and use this statistic to represent the factor level in the model.
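As a minimal sketch of that idea (assuming the modeldata package, which contains the Ames data, is installed), the per-neighborhood statistic can be computed with dplyr; in practice it should be calculated from the training set only:

library(dplyr)

# mean sale price for each neighborhood; these values would stand in for the
# Neighborhood factor levels in the model
modeldata::ames |>
  group_by(Neighborhood) |>
  summarize(mean_sale_price = mean(Sale_Price), .groups = "drop") |>
  slice_head(n = 5)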
For classification problems, a simple logistic regression model can be used to measure the effect between the categorical outcome and the categorical predictor.
If the outcome event occurs with rate \(p\), the odds of that event are defined as \(p / (1 - p)\).
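As an illustration of how the encoding is formed (a sketch, not taken from a specific package), if level \(c\) of the predictor has a within-level event rate \(p_c\), the single numeric value that replaces that level is its estimated log-odds:

\[ \text{encoding}(c) = \log\left(\frac{p_c}{1 - p_c}\right) \]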
Below is an example of a single generalized linear model applied to the hair color feature, which would otherwise have 12 dummy levels.
as_tibble(dplyr::starwars) |>
count(hair_color)
## # A tibble: 13 × 2
## hair_color n
## <chr> <int>
## 1 auburn 1
## 2 auburn, grey 1
## 3 auburn, white 1
## 4 black 13
## 5 blond 3
## 6 blonde 1
## 7 brown 18
## 8 brown, grey 1
## 9 grey 1
## 10 none 37
## 11 unknown 1
## 12 white 4
## 13 <NA> 5
recipe(skin_color ~ hair_color + eye_color + mass,
       data = as_tibble(dplyr::starwars)) |>
  embed::step_lencode_glm(hair_color, outcome = "skin_color") |>
  prep() |>
  bake(new_data = NULL) |>
  slice_sample(n = 10) |>
  kable(caption = "Starwars Characters hair_color GLM embedding") |>
  kable_styling("striped", full_width = FALSE)
Starwars Characters hair_color GLM embedding

| hair_color | eye_color | mass | skin_color |
|---|---|---|---|
| -2.862201 | red | 113 | green |
| -21.566069 | yellow | 55 | fair, green, yellow |
| -21.566069 | blue | 89 | fair |
| -21.566069 | brown | 79 | tan |
| -21.566069 | brown | 84 | light |
| -2.862201 | black | NA | white, blue |
| -21.566069 | brown | NA | tan |
| -21.566069 | blue | NA | fair |
| -2.862201 | red | 140 | metal |
| -21.566069 | brown | NA | light |
While very fast, this method has drawbacks. For example, what happens when a factor level has only a single observation? Theoretically, the log-odds should be infinite in the appropriate direction but, numerically, it is usually capped at a large (and inaccurate) value.
One way around this issue is to use some type of shrinkage method. For example, the overall log-odds can be determined and, if the quality of the data within a factor level is poor, then this level’s effect estimate can be biased towards an overall estimate that disregards the levels of the predictor.
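One way to picture this shrinkage (an illustrative form, not the exact estimator used by the packages below) is as a weighted average of the within-level estimate and the overall estimate, with a weight \(\lambda_c\) that grows with the amount of data in level \(c\):

\[ \hat{\theta}_c = \lambda_c \, \hat{\theta}_c^{\,\text{level}} + (1 - \lambda_c) \, \hat{\theta}^{\,\text{overall}}, \qquad 0 \le \lambda_c \le 1 \]

Sparse or noisy levels get \(\lambda_c\) near zero and are pulled toward the overall estimate; well-populated levels keep values close to their own estimates.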
A common method for shrinking parameter estimates is Bayesian analysis. (One lingering doubt: step_lencode_bayes appears to work only with two-class outcomes.)
as_tibble(datasets::Titanic) |>
count(Class)
recipe(Survived ~ Class + Sex + Age,
       data = as_tibble(datasets::Titanic)) |>
  embed::step_lencode_bayes(Class, outcome = "Survived") |>
  prep() |>
  bake(new_data = NULL) |>
  slice_sample(n = 10)
# A tibble: 10 × 4
Class Sex Age Survived
<dbl> <fct> <fct> <fct>
1 -0.0108 Female Adult Yes
2 -0.0108 Male Adult No
3 -0.00526 Male Child Yes
4 -0.00526 Female Child No
5 -0.00993 Male Adult No
6 -0.0104 Male Child Yes
7 -0.00526 Male Child No
8 -0.0108 Female Adult No
9 -0.00993 Male Child Yes
10 -0.0104 Female Child No
Empirical Bayes methods can also be used, in the form of linear (and generalized linear) mixed models.
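As a sketch of that approach (assuming the embed and lme4 packages are installed), embed::step_lencode_mixed() fits a generalized linear mixed model with the factor levels as random effects and uses the per-level estimates as the encoding:

# hedged sketch: the Titanic example again, encoded with a mixed-model
# (empirical Bayes style) estimate rather than the fully Bayesian one
recipe(Survived ~ Class + Sex + Age,
       data = as_tibble(datasets::Titanic)) |>
  embed::step_lencode_mixed(Class, outcome = "Survived") |>
  prep() |>
  bake(new_data = NULL) |>
  slice_sample(n = 10)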
One issue with effect encoding, independent of the estimation method, is that it increases the possibility of overfitting the training data. Because the encoding is estimated from the same outcome the model will predict, it should be computed inside the resampling scheme so that performance estimates are not optimistically biased.
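A minimal sketch of doing this with tidymodels (assuming the tidymodels meta-package, which bundles rsample, workflows, parsnip, and tune, is installed): the supervised encoding lives inside the recipe, so it is re-estimated on every analysis set rather than on the full training data.

library(tidymodels)

# factor outcome so that logistic_reg() gets the class it expects
titanic_tbl <- as_tibble(datasets::Titanic) |>
  mutate(Survived = factor(Survived))

lencode_rec <- recipe(Survived ~ Class + Sex + Age, data = titanic_tbl) |>
  embed::step_lencode_glm(Class, outcome = "Survived")

lencode_wf <- workflow() |>
  add_recipe(lencode_rec) |>
  add_model(logistic_reg())

# the encoding is re-fit within each analysis set, so the assessment sets give
# an honest estimate of performance
fit_resamples(lencode_wf, resamples = vfold_cv(titanic_tbl, v = 5))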
Another supervised approach comes from the deep learning literature on the analysis of textual data. In addition to the dimension reduction, there is the possibility that these methods can estimate semantic relationships between words so that words with similar themes (e.g., “dog”, “pet”, etc.) have similar values in the new encodings. This technique is not limited to text data and can be used to encode any type of qualitative variable.
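For non-text data, one option (sketched here on the assumption that embed::step_embed() and a working keras/TensorFlow installation are available) is to learn a small entity embedding for a factor directly from the outcome:

# hedged sketch: learn a 3-dimensional embedding for hair_color with a small
# neural network supervised by skin_color (requires keras/TensorFlow)
recipe(skin_color ~ hair_color + eye_color + mass,
       data = as_tibble(dplyr::starwars)) |>
  embed::step_embed(hair_color, outcome = vars(skin_color), num_terms = 3) |>
  prep() |>
  bake(new_data = NULL)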
Below is an example using The Office dialogue and one of the pre-trained GloVe embeddings.
library(textrecipes)
library(schrute)

glove6b <- textdata::embedding_glove6b(dimensions = 100) # the download is 822.2 Mb

schrute::theoffice |>
  slice_sample(n = 10) |>
  select(character, text)
recipe(character ~ text,
       data = schrute::theoffice) |>
  step_tokenize(text, options = list(strip_punct = TRUE)) |>
  step_stem(text) |>
  step_word_embeddings(text, embeddings = glove6b) |>
  prep() |>
  bake(new_data = schrute::theoffice |> slice_sample(n = 10))
The Office dialogue word embeddings with glove6b
# A tibble: 10 × 101
character wordembe…¹ worde…² worde…³ worde…⁴ worde…⁵ worde…⁶
<fct> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
1 Angela -1.02 -0.377 0.797 -1.21 -0.802 0.656
2 Roy 0 0 0 0 0 0
3 Phyllis -1.20 -0.373 3.20 -1.60 -1.62 -0.160
4 Kevin -1.54 1.77 5.02 -4.68 -5.23 2.91
5 Roy -2.15 0.735 4.07 -2.20 -1.30 0.297
6 Jim -0.595 0.419 0.699 -0.328 -1.20 1.70
7 Kelly -2.17 4.38 5.97 -4.91 -4.21 4.13
8 Katy -0.891 -0.889 0.0937 -0.859 1.42 1.49
9 Kevin -0.0308 0.120 0.539 -0.437 -0.739 -0.153
10 Jim -0.395 0.240 1.14 -1.27 -1.47 1.39
# … with 94 more variables: wordembed_text_d7 <dbl>,
# wordembed_text_d8 <dbl>, wordembed_text_d9 <dbl>,
# wordembed_text_d10 <dbl>, wordembed_text_d11 <dbl>,
# wordembed_text_d12 <dbl>, wordembed_text_d13 <dbl>,
# wordembed_text_d14 <dbl>, wordembed_text_d15 <dbl>,
# wordembed_text_d16 <dbl>, wordembed_text_d17 <dbl>,
# wordembed_text_d18 <dbl>, wordembed_text_d19 <dbl>, …
Note that in place of thousands of sparse dummy columns for each tokenized word, the training set consists of 100 numeric feature dimensions.
See also the "Textrecipes series: Pretrained Word Embedding" post by Emil Hvitfeldt.