5.5 Encodings for Ordered Data

Suppose that the levels of a factor have a natural ordering, such as low, medium, and high.

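For ordered factors, R uses polynomial contrasts by default to characterize the relationships among the levels numerically. With three levels, one column encodes a linear trend (.L) and the other a quadratic trend (.Q).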

values <- c("low", "medium", "high")
dat <- data.frame(x = ordered(values, levels = values))

# https://bookdown.org/max/FES/encodings-for-ordered-data.html#tab:categorical-ordered-table
model.matrix(~ x, dat)

##   (Intercept)           x.L        x.Q
## 1           1 -7.071068e-01  0.4082483
## 2           1 -9.073800e-17 -0.8164966
## 3           1  7.071068e-01  0.4082483
## attr(,"assign")
## [1] 0 1 1
## attr(,"contrasts")
## attr(,"contrasts")$x
## [1] "contr.poly"

# using recipes ----------------------------------------------------------------

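# https://bookdown.org/max/FES/encodings-for-ordered-data.html#tab:categorical-ordered-table
library(recipes)

# step_dummy() encodes the ordered factor with the same polynomial contrasts;
# the resulting columns are named x_1 (linear) and x_2 (quadratic)
recipe(~ x, data = dat) |>
  step_dummy(x) |>
  prep() |>
  bake(new_data = NULL)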

## # A tibble: 3 × 2
##         x_1    x_2
##       <dbl>  <dbl>
## 1 -7.07e- 1  0.408
## 2 -9.07e-17 -0.816
## 3  7.07e- 1  0.408
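
The columns in both outputs come from base R's `contr.poly()`. As a quick sketch (the object name `cp` is only illustrative), the contrast matrix for three levels can be inspected directly, along with a check that its columns sum to zero and are orthonormal:

```r
# Polynomial contrast matrix for a 3-level ordered factor
cp <- contr.poly(3)
cp

# Each column sums to zero (up to rounding)
colSums(cp)

# The cross-product is the identity matrix (up to rounding),
# so the linear and quadratic columns are orthonormal
crossprod(cp)
```

By default `contr.poly()` treats the levels as equally spaced, which is part of why these contrasts may not suit every ordered predictor.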

It is important to recognize that the patterns described by polynomial contrasts may not effectively relate a predictor to the response. For example, one might expect “low” and “medium” samples to have a roughly equivalent response while “high” samples have a much different one. Polynomial contrasts are unlikely to model this kind of trend effectively.
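
To make this concrete, here is a small sketch with made-up response values (the data frame `toy` and its numbers are purely illustrative, not taken from the source). The response is identical for "low" and "medium" and jumps for "high". With the full polynomial encoding, the step is split across both the linear and the quadratic coefficient, so neither term on its own describes it; a linear-only fit cannot keep "low" and "medium" equal.

```r
# Hypothetical data: flat response for low/medium, a jump for high
lv <- c("low", "medium", "high")
toy <- data.frame(
  x = ordered(lv, levels = lv),
  y = c(0, 0, 1)
)

# Full polynomial encoding: both x.L and x.Q are non-zero for this step pattern
fit_full <- lm(y ~ x, data = toy)
coef(fit_full)

# Keep only the linear contrast column as a numeric predictor
toy$x_lin <- contr.poly(3)[as.integer(toy$x), 1]
fit_lin <- lm(y ~ x_lin, data = toy)

# The fitted values now differ between low and medium
fitted(fit_lin)
```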

Alternatives to polynomial contrasts:

Leave the predictors as unordered factors.
Translate the ordered categories into a single set of numeric scores based on context-specific information.
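
Both alternatives can be sketched in base R. The score values below are hypothetical placeholders; in practice they would come from subject-matter knowledge. The first encoding assumes R's default treatment contrasts for unordered factors.

```r
lv <- c("low", "medium", "high")
x_ord <- ordered(lv, levels = lv)

# Alternative 1: drop the ordering so the factor gets ordinary dummy variables
x_unord <- factor(x_ord, ordered = FALSE)
mm_unord <- model.matrix(~ x_unord)
mm_unord

# Alternative 2: map each level to a context-specific numeric score
# (these scores are made up for illustration)
scores <- c(low = 1, medium = 2, high = 10)
x_score <- unname(scores[as.character(x_ord)])
x_score
```

The unequal spacing in `scores` is one way to express that "high" is expected to sit far from the other two levels, something equally spaced polynomial contrasts do not encode directly.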

Simple visualizations and context-specific expertise can be used to judge whether either of these approaches is a good idea.