5.5 Encodings for Ordered Data
Suppose that the factors have a relative ordering, like “low”, “medium”, and “high”. R uses a technique called polynomial contrasts to numerically characterize the relationships.
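# ordered categories, listed from lowest to highest
values <- c("low", "medium", "high")
# ordered() keeps the level order given in `levels`
dat <- data.frame(x = ordered(values, levels = values))
dat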
# https://bookdown.org/max/FES/encodings-for-ordered-data.html#tab:categorical-ordered-table
model.matrix(~ x, dat)
##   (Intercept)           x.L        x.Q
## 1           1 -7.071068e-01  0.4082483
## 2           1 -9.073800e-17 -0.8164966
## 3           1  7.071068e-01  0.4082483
## attr(,"assign")
## [1] 0 1 1
## attr(,"contrasts")
## attr(,"contrasts")$x
## [1] "contr.poly"
# using recipes ----------------------------------------------------------------
library(recipes)

# step_dummy() applies polynomial contrasts to ordered factors by default
recipe(~ x, data = dat) |>
  step_dummy(x) |>
  prep() |>
  bake(new_data = NULL)
## # A tibble: 3 × 2
##         x_1    x_2
##       <dbl>  <dbl>
## 1 -7.07e- 1  0.408
## 2 -9.07e-17 -0.816
## 3  7.07e- 1  0.408
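The linear and quadratic columns in both outputs come from the contrast matrix that base R builds for an ordered factor. As a minimal sketch, the matrix for three levels can be inspected directly with `contr.poly()`; the object name `cp` is only illustrative.

```r
# Contrast matrix for a three-level ordered factor
cp <- contr.poly(3)
cp

# Each column sums to (numerically) zero
colSums(cp)

# The columns are orthonormal, so the cross-product is the identity matrix
crossprod(cp)
```

The near-zero entry in the linear column (on the order of 1e-17) is floating-point noise; it represents exactly zero for the middle level.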
It is important to recognize that patterns described by polynomial contrasts may not effectively relate a predictor to the response. For example, in some cases, one might expect a trend where “low” and “medium” samples have a roughly equivalent response but “high” samples have a much different response. In this case, polynomial contrasts are unlikely to be effective at modeling this trend.
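A small sketch can make this concrete. The response values below are hypothetical (flat from the first level to the second, then a jump at “high”) and are not taken from the book. With three levels, the linear and quadratic terms together can still reproduce the three means exactly, but the step pattern is split across both terms, and the linear term alone misses it.

```r
values <- c("low", "medium", "high")
x <- ordered(values, levels = values)

# Hypothetical response: flat for the first two levels, then a jump
y <- c(1, 1, 5)

# Fit with polynomial contrasts (the default for ordered factors)
fit <- lm(y ~ x)
coef(fit)

# Predictions that use only the intercept and the linear term
X <- model.matrix(~ x)
linear_only <- X[, 1:2] %*% coef(fit)[1:2]
cbind(observed = y, linear_only = as.numeric(linear_only))
```

Both the linear and the quadratic coefficients come out clearly nonzero here, so neither contrast on its own describes the "flat, then jump" shape.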
Other alternatives to polynomial contrasts:
- Leave the predictors as unordered factors.
- Translate the ordered categories into a single set of numeric scores based on context-specific information.
Simple visualizations and context-specific expertise can be used to understand whether either of these approaches is a good idea.
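The two alternatives can be sketched in base R as follows. The numeric scores are purely hypothetical placeholders; in practice they would come from subject-matter knowledge. The column names `x_unordered` and `x_score` are also just illustrative.

```r
values <- c("low", "medium", "high")
dat <- data.frame(x = ordered(values, levels = values))

# Alternative 1: drop the ordering so R falls back to indicator columns
dat$x_unordered <- factor(dat$x, ordered = FALSE)
mm <- model.matrix(~ x_unordered, dat)
mm

# Alternative 2: map each category to a single numeric score
# (hypothetical scores, chosen only for illustration)
scores <- c(low = 1, medium = 2, high = 10)
dat$x_score <- unname(scores[as.character(dat$x)])
dat
```

In a recipes workflow, `step_unorder()` and `step_ordinalscore()` cover similar ground, though `step_ordinalscore()` uses the integer level codes by default rather than custom scores.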