Chapter 5 Encoding Categorical Predictors
Categorical (also called nominal) predictors are those that contain qualitative data.
Examples include:
- Education level
- ZIP code
- Text
- Day of the week
- Color
A large majority of models require that all predictors be numeric.
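To make the numeric requirement concrete, here is a minimal sketch in base R of how a single unordered categorical predictor can be expanded into numeric indicator (dummy) columns. The `toy` data frame and its values are hypothetical and exist only for illustration.

```r
# Hypothetical toy data: one unordered categorical predictor.
toy <- data.frame(
  color = factor(c("red", "green", "blue", "green"))
)

# model.matrix() expands the factor into numeric indicator columns.
# With an intercept in the formula, the first factor level ("blue",
# since levels are sorted alphabetically) acts as the reference level
# and gets no column of its own.
dummies <- model.matrix(~ color, data = toy)
print(dummies)
```

A model that only accepts numeric inputs can then be fit on the indicator columns rather than on the original factor.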
```r
suppressPackageStartupMessages({
  library(tidyverse)
  library(tidymodels)
  library(embed)
  library(cli)
  library(kableExtra)
})
```

A summary of parsnip model preprocessors, from Tidy Modeling with R by Max Kuhn and Julia Silge:

| model | dummy | zv | impute | decorrelate | normalize | transform |
|---|---|---|---|---|---|---|
| bag_mars() | ✔ | × | ✔ | ◌ | × | ◌ |
| bag_tree() | × | × | × | ◌¹ | × | × |
| bart() | × | × | × | ◌¹ | × | × |
| boost_tree() | ײ | ◌ | ✔² | ◌¹ | × | × |
| C5_rules() | × | × | × | × | × | × |
| cubist_rules() | × | × | × | × | × | × |
| decision_tree() | × | × | × | ◌¹ | × | × |
| discrim_flexible() | ✔ | × | ✔ | ✔ | × | ◌ |
| discrim_linear() | ✔ | ✔ | ✔ | ✔ | × | ◌ |
| discrim_regularized() | ✔ | ✔ | ✔ | ✔ | × | ◌ |
| gen_additive_mod() | ✔ | ✔ | ✔ | ✔ | × | ◌ |
| linear_reg() | ✔ | ✔ | ✔ | ✔ | × | ◌ |
| logistic_reg() | ✔ | ✔ | ✔ | ✔ | × | ◌ |
| mars() | ✔ | × | ✔ | ◌ | × | ◌ |
| mlp() | ✔ | ✔ | ✔ | ✔ | ✔ | ✔ |
| multinom_reg() | ✔ | ✔ | ✔ | ✔ | ײ | ◌ |
| naive_Bayes() | × | ✔ | ✔ | ◌¹ | × | × |
| nearest_neighbor() | ✔ | ✔ | ✔ | ◌ | ✔ | ✔ |
| pls() | ✔ | ✔ | ✔ | × | ✔ | ✔ |
| poisson_reg() | ✔ | ✔ | ✔ | ✔ | × | ◌ |
| rand_forest() | × | ◌ | ✔² | ◌¹ | × | × |
| rule_fit() | ✔ | × | ✔ | ◌¹ | ✔ | × |
| svm_*() | ✔ | ✔ | ✔ | ✔ | ✔ | ✔ |
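In the table, ✔ indicates that the method is required for the model and × indicates that it is not. The ◌ symbol means that the model may be helped by the technique, but it is not required. The columns are: dummy (convert qualitative predictors to numeric indicator columns), zv (remove zero-variance predictors), impute (fill in missing values), decorrelate (reduce correlation among predictors), normalize (center and scale), and transform (make predictor distributions more symmetric). In the source table, footnote ¹ notes that decorrelating predictors may not improve performance but can improve the estimation of variable importance scores, and footnote ² notes that the needed preprocessing depends on the implementation (for example, tree-based boosting generally does not need dummy variables, but the xgboost engine does).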
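As a rough illustration of how one row of the table translates into code, the sketch below builds a recipe for a model such as `nearest_neighbor()`, whose row marks dummy, zv, impute, and normalize as required. The `toy` data and the step order are assumptions made for this example, not a prescription from the source.

```r
library(recipes)

# Hypothetical toy data: a categorical predictor, a numeric predictor with
# one missing value, a constant (zero-variance) column, and a numeric outcome.
toy <- data.frame(
  y     = c(1.2, 3.4, 2.2, 5.1, 4.3, 2.9),
  color = factor(c("red", "green", "blue", "green", "red", "blue")),
  size  = c(10, NA, 30, 40, 50, 60),
  const = rep(1, 6)
)

rec <- recipe(y ~ ., data = toy)
rec <- step_impute_mean(rec, all_numeric_predictors())  # impute
rec <- step_dummy(rec, all_nominal_predictors())        # dummy
rec <- step_zv(rec, all_predictors())                   # zv
rec <- step_normalize(rec, all_numeric_predictors())    # normalize

# Estimate the steps on the training data, then return the processed data.
prepped <- prep(rec, training = toy)
baked <- bake(prepped, new_data = NULL)
print(baked)
```

The zero-variance filter is placed before normalization here because scaling a constant column would divide by a standard deviation of zero.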
Algorithms for tree-based models naturally handle splitting both numeric and categorical predictors. These algorithms employ a series of if/then statements that sequentially split the data into groups.
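The following sketch shows a tree splitting directly on a factor, with no dummy variables created first. It uses the `rpart` package as one possible tree implementation. The `toy` data are hypothetical and constructed so that `color` fully determines the outcome while `size` carries no information.

```r
library(rpart)

# Hypothetical toy data: the outcome depends only on the categorical predictor.
toy <- data.frame(
  color = factor(rep(c("red", "green", "blue"), each = 20)),
  size  = rep(1:10, times = 6)
)
toy$y <- factor(ifelse(toy$color %in% c("red", "blue"), "yes", "no"))

# The factor is passed as-is; rpart searches over groupings of its levels.
fit <- rpart(y ~ color + size, data = toy, method = "class")

# The printed tree reads as if/then rules, e.g. "if color is green then ...".
print(fit)
```

Each printed node corresponds to one if/then rule, and the split on `color` groups factor levels rather than comparing numeric values.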
A naive Bayes model will create a cross-tabulation between a categorical predictor and the outcome class. We will return to this point in the final section of this chapter.
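The cross-tabulation described above can be sketched in base R as follows. The `toy` data are hypothetical. The column-wise proportions are one way to view the conditional probabilities of each category given the class, which is the quantity a naive Bayes model would use for a categorical predictor.

```r
# Hypothetical toy data: one categorical predictor and a two-class outcome.
toy <- data.frame(
  day   = factor(c("Mon", "Tue", "Mon", "Wed", "Tue", "Mon", "Wed", "Wed")),
  class = factor(c("yes", "no", "yes", "no", "yes", "no", "no", "yes"))
)

# Cross-tabulation of predictor categories (rows) by outcome class (columns).
counts <- table(day = toy$day, class = toy$class)
print(counts)

# Column-wise proportions: estimated P(day | class). Each column sums to 1.
cond_prob <- prop.table(counts, margin = 2)
print(cond_prob)
```

No numeric encoding of `day` is needed at any point, since the model works from the table of counts.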
Simple categorical variables can also be classified as ordered or unordered.
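In R, this distinction maps onto unordered and ordered factors. The short sketch below uses hypothetical values. The level order given for `education` is an assumption made for the example.

```r
# Unordered: the levels have no inherent ranking.
color <- factor(c("red", "green", "blue", "green"))

# Ordered: the levels are ranked, so the order must be stated explicitly.
education <- factor(
  c("high school", "bachelor", "master", "bachelor"),
  levels  = c("high school", "bachelor", "master"),
  ordered = TRUE
)

print(is.ordered(color))
print(is.ordered(education))

# Comparisons are only meaningful for ordered factors.
below_master <- education < "master"
print(below_master)
```

The distinction matters for encoding: by default, R's modeling functions use treatment (indicator) contrasts for unordered factors and polynomial contrasts for ordered ones.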
