Chapter 5 Encoding Categorical Predictors

Categorical (also called nominal) predictors are those that contain qualitative data.

Examples include:

Education level
ZIP code
Text
Day of the week
Color

A large majority of models require that all predictors be numeric.
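As a minimal sketch of what "numeric" means here, base R's `model.matrix()` can expand a factor into indicator (dummy) columns. The `toy` data frame and its `color` column are hypothetical and only serve as an illustration.

```r
# Hypothetical toy data: one categorical predictor with three levels.
toy <- data.frame(
  color = factor(c("red", "green", "blue", "green"))
)

# model.matrix() expands the factor into numeric indicator columns.
# With the default treatment contrasts, the first level ("blue",
# alphabetically) is the reference level and gets no column of its own.
dummies <- model.matrix(~ color, data = toy)
dummies
```

Each row now carries a 0/1 value per non-reference level, which is the form most modeling functions expect.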

Table 5.1 summarizes the preprocessing recommended for each parsnip model. It is taken from Tidy Modeling with R by Max Kuhn and Julia Silge.

knitr::opts_chunk$set(fig.path = "images/")

suppressPackageStartupMessages({
  library(tidyverse)
  library(tidymodels)
  library(embed)
  library(cli)
  library(kableExtra)
})

Table 5.1: Preprocessing methods for different models.
model	dummy	zv	impute	decorrelate	normalize	transform
`bag_mars()`	✔	×	✔	◌	×	◌
`bag_tree()`	×	×	×	◌¹	×	×
`bart()`	×	×	×	◌¹	×	×
`boost_tree()`	×²	◌	✔²	◌¹	×	×
`C5_rules()`	×	×	×	×	×	×
`cubist_rules()`	×	×	×	×	×	×
`decision_tree()`	×	×	×	◌¹	×	×
`discrim_flexible()`	✔	×	✔	✔	×	◌
`discrim_linear()`	✔	✔	✔	✔	×	◌
`discrim_regularized()`	✔	✔	✔	✔	×	◌
`gen_additive_mod()`	✔	✔	✔	✔	×	◌
`linear_reg()`	✔	✔	✔	✔	×	◌
`logistic_reg()`	✔	✔	✔	✔	×	◌
`mars()`	✔	×	✔	◌	×	◌
`mlp()`	✔	✔	✔	✔	✔	✔
`multinom_reg()`	✔	✔	✔	✔	×²	◌
`naive_Bayes()`	×	✔	✔	◌¹	×	×
`nearest_neighbor()`	✔	✔	✔	◌	✔	✔
`pls()`	✔	✔	✔	×	✔	✔
`poisson_reg()`	✔	✔	✔	✔	×	◌
`rand_forest()`	×	◌	✔²	◌¹	×	×
`rule_fit()`	✔	×	✔	◌¹	✔	×
`svm_*()`	✔	✔	✔	✔	✔	✔

In the table, ✔ indicates that the method is required for the model and × indicates that it is not. The ◌ symbol means that the model may be helped by the technique but it is not required. The zv column refers to removing zero-variance predictors.

Algorithms for tree-based models naturally handle splitting both numeric and categorical predictors. These algorithms employ a series of if/then statements that sequentially split the data into groups.
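The if/then structure can be sketched by hand in base R. This is not a fitted tree: the data, the split points, and the `assign_group()` helper are all hypothetical, whereas a real algorithm would choose the splits from the data.

```r
# Hypothetical toy data with one categorical and one numeric predictor.
toy <- data.frame(
  day  = factor(c("Sat", "Mon", "Sun", "Tue", "Sat")),
  temp = c(25, 18, 12, 22, 8)
)

# A hand-written, two-level "tree": each if/else is one split.
# The first split groups category levels directly, with no dummy variables.
assign_group <- function(day, temp) {
  if (day %in% c("Sat", "Sun")) {
    if (temp >= 15) "weekend_warm" else "weekend_cold"
  } else {
    "weekday"
  }
}

# Apply the rules row by row.
toy$group <- mapply(
  assign_group,
  as.character(toy$day),
  toy$temp,
  USE.NAMES = FALSE
)
toy
```

The categorical split tests membership in a set of levels, while the numeric split tests a threshold; both end up as the same kind of rule.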

A naive Bayes model will create a cross-tabulation between a categorical predictor and the outcome class. We will return to this point in the final section of this chapter.
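A rough illustration of such a cross-tabulation, using base R's `table()` and `prop.table()` on hypothetical data (the `toy` data frame and its values are invented for this sketch):

```r
# Hypothetical toy data: a categorical predictor and a two-class outcome.
toy <- data.frame(
  color = factor(c("red", "red", "blue", "blue", "blue", "green")),
  class = factor(c("yes", "no", "yes", "yes", "no", "no"))
)

# Cross-tabulate predictor levels (rows) against outcome classes (columns).
counts <- table(toy$color, toy$class)

# Dividing each column by its total estimates the probability of each
# color within each class, so every column sums to 1.
cond_prob <- prop.table(counts, margin = 2)
cond_prob
```

Because the table is built directly from the category labels, no numeric encoding of `color` is needed for this step.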

Simple categorical variables can also be classified as ordered or unordered.
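In R, this distinction maps onto plain versus ordered factors. The sketch below uses illustrative level names that are assumptions, not values from the chapter.

```r
# Unordered: the levels have no inherent ranking.
color <- factor(c("red", "blue", "green"))

# Ordered: levels are listed in their ranking, lowest first.
# The level names here are illustrative assumptions.
education <- factor(
  c("bachelor", "high school", "doctorate"),
  levels  = c("high school", "bachelor", "doctorate"),
  ordered = TRUE
)

is.ordered(color)      # FALSE
is.ordered(education)  # TRUE

# Comparisons respect the declared ranking of an ordered factor.
below_doctorate <- education < "doctorate"
below_doctorate
```

Declaring the order explicitly matters because R would otherwise sort the levels alphabetically.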