17.2 Encoding methods:

  1. Effect or likelihood encodings —-> No Pooling and Partial Pooling

“you create an effect encoding for your categorical variable”

Effect encodings replace the levels of a categorical variable with a numeric value estimated from the relationship between those levels and another numeric variable in the data set, usually the outcome.

An example would be:

“These steps use a generalized linear model to estimate the effect of each level in a categorical predictor on the outcome.”

lencode stands for linear encoding

  • step_lencode_glm() —-> generalized linear model (no pooling)
  • step_lencode_mixed() —-> mixed or hierarchical generalized linear model (partial pooling)
  • step_lencode_bayes() —-> Bayesian hierarchical model (partial pooling)

The {embed} package documentation provides more detailed information about the different step_*() functions that can be used.

partial pooling —> the effect estimates for each level are shrunk toward the overall mean (a minimal recipe sketch follows)
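
A minimal sketch of a partial-pooling recipe, assuming the {tidymodels}, {embed}, and {lme4} packages are installed and using the Ames housing data from {modeldata}:

    library(tidymodels)
    library(embed)

    data(ames, package = "modeldata")
    ames <- ames %>% mutate(Sale_Price = log10(Sale_Price))

    set.seed(502)
    ames_split <- initial_split(ames, prop = 0.80, strata = Sale_Price)
    ames_train <- training(ames_split)

    # Partial pooling: step_lencode_mixed() fits a mixed-effects GLM and
    # replaces Neighborhood with each level's estimated effect, shrunk
    # toward the overall mean (small neighborhoods are shrunk the most).
    ames_mixed_rec <-
      recipe(Sale_Price ~ Neighborhood + Gr_Liv_Area, data = ames_train) %>%
      step_log(Gr_Liv_Area, base = 10) %>%
      step_lencode_mixed(Neighborhood, outcome = vars(Sale_Price))

    # prep() estimates the per-level effects; tidy() shows the encoding table.
    prep(ames_mixed_rec) %>% tidy(number = 2)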

  2. Feature hashing

Feature hashing also creates dummy variables, but it uses only the value of the category (passed through a hash function) to assign it to one of a predefined, fixed-size pool of dummy variables. It is well suited to text data and high-cardinality predictors; a runnable sketch follows the snippets below.

  • rlang::hash()

    mutate(Hash = map_chr(..categorical.., hash))

The neighborhood values are the “keys”, while the hash-function outputs are the “hashes”. The number of possible hash values can be customized, as it is a hyperparameter.

  • strtoi() —-> Convert Strings to Integers

    mutate(Hash = strtoi(substr(Hash, 26, 32), base = 16L), Hash = Hash %% 16)
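
Putting the two snippets together, a minimal runnable sketch (assuming the {tidyverse} and {rlang} packages, with the Ames Neighborhood column playing the role of the categorical variable):

    library(tidyverse)
    library(rlang)

    data(ames, package = "modeldata")

    # Hash each Neighborhood value (the "key") to a hexadecimal string,
    # convert the last few hex digits to an integer, and reduce modulo 16
    # so every level is assigned to one of 16 predefined dummy columns.
    ames_hashed <- ames %>%
      mutate(
        Hash = map_chr(Neighborhood, hash),
        Hash = strtoi(substr(Hash, 26, 32), base = 16L),
        Hash = Hash %% 16
      )

    ames_hashed %>% count(Neighborhood, Hash)

Within a recipe, the same idea is available via textrecipes::step_dummy_hash(), where the number of hash columns is set by the num_terms argument.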

  3. Entity embeddings

Entity embeddings transform a categorical variable with many levels into a set of lower-dimensional numeric vectors.

The embeddings are learned via a TensorFlow neural network.

  • step_embed() —-> embeddings learned with a TensorFlow neural network (see the sketch after this list)
  • step_woe() —-> weight of evidence transformation (based on Bayes factors); requires a binary outcome
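
A hedged sketch of both steps, assuming {tidymodels} and {embed} are installed; step_embed() additionally requires keras/TensorFlow, and the weight-of-evidence example assumes the credit_data set from {modeldata} with its binary Status outcome:

    library(tidymodels)
    library(embed)

    # Entity embeddings: learn a 3-dimensional numeric representation of
    # Neighborhood with a small TensorFlow network (keras must be installed).
    data(ames, package = "modeldata")
    ames <- ames %>% mutate(Sale_Price = log10(Sale_Price))

    ames_embed_rec <-
      recipe(Sale_Price ~ Neighborhood + Gr_Liv_Area, data = ames) %>%
      step_embed(
        Neighborhood,
        outcome = vars(Sale_Price),
        num_terms = 3,      # length of the embedding vector for each level
        hidden_units = 16,  # hidden layer used while fitting the network
        options = embed_control(epochs = 10, validation_split = 0.2)
      )

    # Weight of evidence: requires a binary outcome, here Status (good/bad).
    data(credit_data, package = "modeldata")

    credit_woe_rec <-
      recipe(Status ~ ., data = credit_data) %>%
      step_woe(Job, Home, outcome = vars(Status))

prep()-ping either recipe estimates the encodings, and tidy() on the prepped recipe shows the per-level values.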

17.2.1 Cohort 4

Meeting chat log
00:44:41    Stephen.Charlesworth:   https://www.amazon.com/Machine-Learning-Design-Patterns-Preparation/dp/1098115783
00:50:20    Federica Gazzelloni:    https://dl.acm.org/doi/10.1145/507533.507538
00:52:42    Federica Gazzelloni:    https://arxiv.org/abs/1611.09477
00:53:26    Stephen.Charlesworth:   https://community.tibco.com/feed-items/comparison-different-encoding-methods-using-tibco-data-science