Chapter 5 Encoding Categorical Predictors
Categorical (also called nominal) predictors are those that contain qualitative data.
Examples include:
- Education level
- ZIP code
- Text
- Day of the week
- Color
A large majority of models require that all predictors be numeric.
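As a minimal sketch of what "numeric" means here, the base R function `model.matrix()` can expand a factor into indicator (dummy) columns. The `colors` data frame and its values are hypothetical and only serve as an illustration.

```r
# Hypothetical toy data: one unordered categorical predictor.
colors <- data.frame(
  color = factor(c("red", "green", "blue", "green"),
                 levels = c("blue", "green", "red"))
)

# model.matrix() expands the factor into numeric indicator columns.
# With the default treatment contrasts, the first level ("blue") acts as
# the reference level and is absorbed into the intercept column.
dummies <- model.matrix(~ color, data = colors)
print(dummies)
```

Three levels yield two indicator columns plus the intercept; a row of zeros in both indicators corresponds to the reference level.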
knitr::opts_chunk$set(fig.path = "images/")
suppressPackageStartupMessages({
  library(tidyverse)
  library(tidymodels)
  library(embed)
  library(cli)
  library(kableExtra)
})
A summary of parsnip model preprocessors, from Tidy Modeling with R by Max Kuhn and Julia Silge:
model | dummy | zv | impute | decorrelate | normalize | transform |
---|---|---|---|---|---|---|
bag_mars() | ✔ | × | ✔ | ◌ | × | ◌ |
bag_tree() | × | × | × | ◌¹ | × | × |
bart() | × | × | × | ◌¹ | × | × |
boost_tree() | ײ | ◌ | ✔² | ◌¹ | × | × |
C5_rules() | × | × | × | × | × | × |
cubist_rules() | × | × | × | × | × | × |
decision_tree() | × | × | × | ◌¹ | × | × |
discrim_flexible() | ✔ | × | ✔ | ✔ | × | ◌ |
discrim_linear() | ✔ | ✔ | ✔ | ✔ | × | ◌ |
discrim_regularized() | ✔ | ✔ | ✔ | ✔ | × | ◌ |
gen_additive_mod() | ✔ | ✔ | ✔ | ✔ | × | ◌ |
linear_reg() | ✔ | ✔ | ✔ | ✔ | × | ◌ |
logistic_reg() | ✔ | ✔ | ✔ | ✔ | × | ◌ |
mars() | ✔ | × | ✔ | ◌ | × | ◌ |
mlp() | ✔ | ✔ | ✔ | ✔ | ✔ | ✔ |
multinom_reg() | ✔ | ✔ | ✔ | ✔ | ײ | ◌ |
naive_Bayes() | × | ✔ | ✔ | ◌¹ | × | × |
nearest_neighbor() | ✔ | ✔ | ✔ | ◌ | ✔ | ✔ |
pls() | ✔ | ✔ | ✔ | × | ✔ | ✔ |
poisson_reg() | ✔ | ✔ | ✔ | ✔ | × | ◌ |
rand_forest() | × | ◌ | ✔² | ◌¹ | × | × |
rule_fit() | ✔ | × | ✔ | ◌¹ | ✔ | × |
svm_*() | ✔ | ✔ | ✔ | ✔ | ✔ | ✔ |
In the table, ✔ indicates that the method is required for the model and × indicates that it is not. The ◌ symbol means that the model may be helped by the technique but it is not required.
Algorithms for tree-based models naturally handle splitting both numeric and categorical predictors. These algorithms employ a series of if/then statements that sequentially split the data into groups.
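The if/then structure can be illustrated with a hand-written rule set. This is only a sketch: the `homes` data, the split points, and the `assign_group()` helper are hypothetical, and a real tree algorithm would learn the splits from the data rather than have them written by hand.

```r
# Hypothetical toy data with one numeric and one categorical predictor.
homes <- data.frame(
  sqft = c(900, 1500, 2200, 1200, 3000),
  zone = factor(c("urban", "rural", "urban", "suburban", "rural"))
)

# A hand-written rule set: the first split uses the numeric predictor,
# the second uses a subset of the categorical predictor's levels.
assign_group <- function(sqft, zone) {
  if (sqft < 1400) {
    "group 1"
  } else if (zone %in% c("urban", "suburban")) {
    "group 2"
  } else {
    "group 3"
  }
}

# Apply the rules row by row.
homes$group <- mapply(assign_group, homes$sqft, as.character(homes$zone))
print(homes)
```

Note that the categorical split works directly on the level labels, so no numeric encoding of `zone` is needed.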
A naive Bayes model will create a cross-tabulation between a categorical predictor and the outcome class. We will return to this point in the final section of this chapter.
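A rough sketch of such a cross-tabulation in base R follows. The `day` and `class` vectors are hypothetical toy data, and the code shows only the counting step, not a full naive Bayes fit.

```r
# Hypothetical toy data: a categorical predictor and a two-class outcome.
day   <- factor(c("Mon", "Mon", "Tue", "Tue", "Tue", "Wed"))
class <- factor(c("yes", "no", "yes", "yes", "no", "no"))

# Cross-tabulate the outcome class (rows) against the predictor level (columns).
counts <- table(class, day)
print(counts)

# Normalizing each row gives the within-class proportions of each level,
# i.e. an estimate of Pr(day | class) for this predictor.
cond_prob <- prop.table(counts, margin = 1)
print(cond_prob)
```

Because the model works from these counts, the predictor can stay in its original categorical form.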
Simple categorical variables can also be classified as ordered or unordered.
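In R, this distinction maps onto unordered and ordered factors. The sketch below uses hypothetical `color` and `education` vectors; the particular level ordering is an assumption made for illustration.

```r
# Unordered: the levels have no intrinsic ranking.
color <- factor(c("red", "green", "blue"))

# Ordered: the levels are ranked in the sequence given by `levels`.
education <- factor(
  c("high school", "bachelor", "master", "bachelor"),
  levels = c("high school", "bachelor", "master"),
  ordered = TRUE
)

print(is.ordered(color))
print(is.ordered(education))

# Comparisons are only meaningful for ordered factors.
below_master <- education < "master"
print(below_master)
```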