Chapter 5 Encoding Categorical Predictors

Categorical (also called nominal) predictors are those that contain qualitative data.

Examples include:

  • Education level
  • ZIP code
  • Text
  • Day of the week
  • Color

A large majority of models require that all predictors be numeric.
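As a minimal sketch of what "numeric" means in practice, base R's `model.matrix()` can expand a factor into 0/1 indicator columns. The `color_df` data frame is hypothetical, and indicator (dummy) columns are only one of several possible encodings.

```r
# Hypothetical data: one unordered categorical predictor
color_df <- data.frame(color = factor(c("red", "blue", "green", "blue")))

# model.matrix() expands the factor into 0/1 indicator columns.
# With the default treatment contrasts, the first level ("blue")
# is the reference level and gets no column of its own.
mm <- model.matrix(~ color, data = color_df)
print(mm)
```

The result is a numeric matrix with an intercept column plus one indicator column for each non-reference level.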

Table 5.1 summarizes the preprocessing methods for parsnip models. It is taken from Tidy Modeling with R by Max Kuhn and Julia Silge.

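# Save generated figures under images/
knitr::opts_chunk$set(fig.path = "images/")

# Load the packages used in this chapter without startup messages
suppressPackageStartupMessages({
  library(tidyverse)
  library(tidymodels)
  library(embed)
  library(cli)
  library(kableExtra)
})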
Table 5.1: Preprocessing methods for different models.
model dummy zv impute decorrelate normalize transform
bag_mars() × ×
bag_tree() × × × ◌¹ × ×
bart() × × × ◌¹ × ×
boost_tree() ×² ✔² ◌¹ × ×
C5_rules() × × × × × ×
cubist_rules() × × × × × ×
decision_tree() × × × ◌¹ × ×
discrim_flexible() × ×
discrim_linear() ×
discrim_regularized() ×
gen_additive_mod() ×
linear_reg() ×
logistic_reg() ×
mars() × ×
mlp()
multinom_reg() ×²
naive_Bayes() × ◌¹ × ×
nearest_neighbor()
pls() ×
poisson_reg() ×
rand_forest() × ✔² ◌¹ × ×
rule_fit() × ◌¹ ×
svm_*()

In the table, ✔ indicates that the method is required for the model and × indicates that it is not. The symbol ◌ means that the model may be helped by the technique but it is not required.
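The column names in Table 5.1 can be read as preprocessing operations. The sketch below shows one way they might map onto recipes steps. The mapping (for example, `step_corr()` for "decorrelate" and `step_YeoJohnson()` for "transform"), the `iris` data, and the 0.9 threshold are assumptions for illustration, not a reading of any particular table row.

```r
library(recipes)  # part of tidymodels

# Illustrative recipe only: which steps a given model needs depends on
# its row in the table. iris stands in for real data.
rec <- recipe(Sepal.Length ~ ., data = iris) %>%
  step_impute_mean(all_numeric_predictors()) %>%            # impute
  step_YeoJohnson(all_numeric_predictors()) %>%             # transform
  step_dummy(all_nominal_predictors()) %>%                  # dummy
  step_zv(all_predictors()) %>%                             # zv
  step_corr(all_numeric_predictors(), threshold = 0.9) %>%  # decorrelate
  step_normalize(all_numeric_predictors())                  # normalize

# Estimate the steps on the training data and return the processed set
baked <- bake(prep(rec), new_data = NULL)
str(baked)
```

After baking, every remaining column is numeric, which is the form most models in the table expect.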

Algorithms for tree-based models naturally handle splitting both numeric and categorical predictors. These algorithms employ a series of if/then statements that sequentially split the data into groups.
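To make the if/then structure concrete, the following sketch fits a small regression tree with `rpart` on `mtcars`. Treating `cyl` as a factor is an assumption made here so that the model sees one categorical and one numeric predictor. The printed tree shows whichever splits the algorithm happened to choose.

```r
library(rpart)  # recommended package that ships with standard R installs

# Assumption for illustration: treat the number of cylinders as categorical;
# wt stays numeric.
cars_df <- transform(mtcars, cyl = factor(cyl))

fit <- rpart(mpg ~ cyl + wt, data = cars_df)

# Each printed node is one if/then rule. Numeric splits compare a predictor
# against a threshold; categorical splits list the levels sent down a branch.
print(fit)
```

No dummy variables were created: the factor was passed to the tree as-is.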

A naive Bayes model will create a cross-tabulation between a categorical predictor and the outcome class. We will return to this point in the final section of this chapter.
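The cross-tabulation can be sketched in base R with `table()` and `prop.table()`. The `toy` data frame is hypothetical, and this shows only the counting step, not a full naive Bayes fit.

```r
# Hypothetical toy data: a categorical predictor and a two-class outcome
toy <- data.frame(
  day   = factor(c("Mon", "Mon", "Tue", "Tue", "Tue", "Wed", "Wed", "Wed")),
  class = factor(c("yes", "no",  "yes", "yes", "no",  "no",  "no",  "yes"))
)

# Counts of each predictor level within each outcome class
xtab <- table(day = toy$day, class = toy$class)
print(xtab)

# Dividing each column by its total estimates P(day | class),
# so every column sums to 1
cond_prob <- prop.table(xtab, margin = 2)
print(cond_prob)
```

Because the model works from these per-level frequencies, the categorical predictor never has to be converted to numbers.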

Simple categorical variables can also be classified as ordered or unordered.
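In R, this distinction corresponds to unordered and ordered factors. The sketch below uses hypothetical values. The education levels and their ranking are assumptions chosen only to illustrate the idea.

```r
# Unordered (nominal): the levels have no inherent ranking
color <- factor(c("red", "blue", "green"))

# Ordered (ordinal): hypothetical levels, listed from lowest to highest
education <- factor(
  c("bachelor", "high school", "graduate"),
  levels  = c("high school", "bachelor", "graduate"),
  ordered = TRUE
)

is.ordered(color)            # FALSE
is.ordered(education)        # TRUE

# Comparisons are meaningful only for ordered factors
education[1] > education[2]  # TRUE: "bachelor" ranks above "high school"
max(education)               # graduate
```

Day of the week and color are usually treated as unordered, while education level has a natural ordering that an encoding may want to preserve.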