17.2 Encoding methods:

  1. Effect or likelihood encodings —-> No Pooling and Partial Pooling

“you create an effect encoding for your categorical variable”

Effect encodings replace the levels of a categorical variable with a numeric value estimated from the relationship between those levels and another numeric variable in the data set, usually the outcome.

An example would be:

“These steps use a generalized linear model to estimate the effect of each level in a categorical predictor on the outcome.”

lencode stands for linear encoding

  • step_lencode_glm() —-> generalized linear model (no pooling)
  • step_lencode_mixed() —-> mixed or hierarchical generalized linear model (partial pooling)
  • step_lencode_bayes() —-> Bayesian hierarchical model (partial pooling)

The {embed} package documentation provides more detailed information about the different step_*() functions that can be used.

partial pooling —> the effect estimates for each level are shrunk toward the overall mean (a minimal recipe sketch follows)
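
A minimal sketch of a partial-pooling recipe, assuming the {tidymodels}, {embed}, and {lme4} packages are installed and using the Ames housing data from {modeldata}:

    library(tidymodels)
    library(embed)

    data(ames, package = "modeldata")
    ames <- ames %>% mutate(Sale_Price = log10(Sale_Price))

    set.seed(502)
    ames_split <- initial_split(ames, prop = 0.80, strata = Sale_Price)
    ames_train <- training(ames_split)

    # Partial pooling: step_lencode_mixed() fits a mixed-effects GLM and
    # replaces Neighborhood with each level's estimated effect, shrunk
    # toward the overall mean (small neighborhoods are shrunk the most).
    ames_mixed_rec <-
      recipe(Sale_Price ~ Neighborhood + Gr_Liv_Area, data = ames_train) %>%
      step_log(Gr_Liv_Area, base = 10) %>%
      step_lencode_mixed(Neighborhood, outcome = vars(Sale_Price))

    # prep() estimates the per-level effects; tidy() shows the encoding table.
    prep(ames_mixed_rec) %>% tidy(number = 2)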

  2. Feature hashing

Feature hashing also creates dummy variables, but it uses only the value of the category (passed through a hash function) to assign it to one of a predefined, fixed-size pool of dummy variables. It is well suited to text data and high-cardinality predictors; a runnable sketch follows the snippets below.

  • rlang::hash()

    mutate(Hash = map_chr(..categorical.., hash))

The neighborhood values are the “keys”, while the hash-function outputs are the “hashes”. The number of possible hash values can be customized, as it is a hyperparameter.

  • strtoi() —-> Convert Strings to Integers

    mutate(Hash = strtoi(substr(Hash, 26, 32), base = 16L), Hash = Hash %% 16)
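
Putting the two snippets together, a minimal runnable sketch (assuming the {tidyverse} and {rlang} packages, with the Ames Neighborhood column playing the role of the categorical variable):

    library(tidyverse)
    library(rlang)

    data(ames, package = "modeldata")

    # Hash each Neighborhood value (the "key") to a hexadecimal string,
    # convert the last few hex digits to an integer, and reduce modulo 16
    # so every level is assigned to one of 16 predefined dummy columns.
    ames_hashed <- ames %>%
      mutate(
        Hash = map_chr(Neighborhood, hash),
        Hash = strtoi(substr(Hash, 26, 32), base = 16L),
        Hash = Hash %% 16
      )

    ames_hashed %>% count(Neighborhood, Hash)

Within a recipe, the same idea is available via textrecipes::step_dummy_hash(), where the number of hash columns is set by the num_terms argument.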

  3. Entity embeddings

Entity embeddings transform a categorical variable with many levels into a set of lower-dimensional numeric vectors.

The embeddings are learned via a TensorFlow neural network.

  • step_embed() —-> embeddings learned with a TensorFlow neural network (see the sketch after this list)
  • step_woe() —-> weight of evidence transformation (based on Bayes factors); requires a binary outcome
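
A hedged sketch of both steps, assuming {tidymodels} and {embed} are installed; step_embed() additionally requires keras/TensorFlow, and the weight-of-evidence example assumes the credit_data set from {modeldata} with its binary Status outcome:

    library(tidymodels)
    library(embed)

    # Entity embeddings: learn a 3-dimensional numeric representation of
    # Neighborhood with a small TensorFlow network (keras must be installed).
    data(ames, package = "modeldata")
    ames <- ames %>% mutate(Sale_Price = log10(Sale_Price))

    ames_embed_rec <-
      recipe(Sale_Price ~ Neighborhood + Gr_Liv_Area, data = ames) %>%
      step_embed(
        Neighborhood,
        outcome = vars(Sale_Price),
        num_terms = 3,      # length of the embedding vector for each level
        hidden_units = 16,  # hidden layer used while fitting the network
        options = embed_control(epochs = 10, validation_split = 0.2)
      )

    # Weight of evidence: requires a binary outcome, here Status (good/bad).
    data(credit_data, package = "modeldata")

    credit_woe_rec <-
      recipe(Status ~ ., data = credit_data) %>%
      step_woe(Job, Home, outcome = vars(Status))

prep()-ping either recipe estimates the encodings, and tidy() on the prepped recipe shows the per-level values.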

17.2.1 Cohort 4

Meeting chat log
00:44:41    Stephen.Charlesworth:   https://www.amazon.com/Machine-Learning-Design-Patterns-Preparation/dp/1098115783
00:50:20    Federica Gazzelloni:    https://dl.acm.org/doi/10.1145/507533.507538
00:52:42    Federica Gazzelloni:    https://arxiv.org/abs/1611.09477
00:53:26    Stephen.Charlesworth:   https://community.tibco.com/feed-items/comparison-different-encoding-methods-using-tibco-data-science