5.7 Factors versus Dummy Variables in Tree-Based Models
Certain types of models can use categorical predictors in their natural form, without first converting them to dummy variables. For example, a decision tree for the Chicago ridership data could split directly on the day of the week:
    if day in {Sun, Sat} then ridership = 4.4K
    else ridership = 17.3K
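As a rough illustration, here is a minimal sketch in Python (with hypothetical ridership values, not the actual Chicago data) of how a factor-aware split can test set membership in a single rule:

    import pandas as pd

    # Hypothetical daily ridership values (in thousands); not the real Chicago data.
    df = pd.DataFrame({
        "day": ["Mon", "Tue", "Wed", "Sat", "Sun", "Sat", "Sun"],
        "ridership": [17.1, 17.5, 17.4, 4.9, 3.8, 5.0, 3.9],
    })

    # A factor-aware tree can evaluate set membership in one split:
    weekend = df["day"].isin(["Sat", "Sun"])
    print(df.loc[weekend, "ridership"].mean())   # terminal node for {Sat, Sun}
    print(df.loc[~weekend, "ridership"].mean())  # terminal node for weekdays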
Suppose instead that the day of the week had been converted to dummy variables. What would have occurred? In this case, the resulting model is slightly more complex, since the tree can only create rules as a function of a single dummy variable at a time:
    if day = Sun then ridership = 3.84K
    else if day = Sat then ridership = 4.96K
    else ridership = 17.30K
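The same effect can be reproduced with scikit-learn, whose trees only accept numeric inputs. A sketch (again with the made-up values above) shows that the dummy-encoded tree needs two consecutive splits to isolate the weekend:

    import pandas as pd
    from sklearn.tree import DecisionTreeRegressor, export_text

    df = pd.DataFrame({
        "day": ["Mon", "Tue", "Wed", "Sat", "Sun", "Sat", "Sun"],
        "ridership": [17.1, 17.5, 17.4, 4.9, 3.8, 5.0, 3.9],
    })

    # One binary column per day; the tree can only split on one at a time.
    X = pd.get_dummies(df["day"])
    tree = DecisionTreeRegressor(max_depth=2).fit(X, df["ridership"])
    print(export_text(tree, feature_names=list(X.columns)))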
So, for models such as decision trees and naive Bayes, does it matter how the categorical features are encoded?
To answer this question, a series of experiments was conducted. The results were as follows:
For these data sets, there was no real difference in the area under the ROC curve between the two encoding methods. In terms of performance, differences between the encodings appear to be rare (but can occur).
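The original experiments are not reproduced here, but a minimal sketch of how such a comparison could be run, using LightGBM on simulated data (both the model and the data are assumptions, not those from the text), might look like:

    import numpy as np
    import pandas as pd
    from lightgbm import LGBMClassifier
    from sklearn.model_selection import cross_val_score

    # Simulated data: the outcome is a noisy function of weekend vs. weekday.
    rng = np.random.default_rng(0)
    n = 1000
    days = ["Mon", "Tue", "Wed", "Thu", "Fri", "Sat", "Sun"]
    day = pd.Series(rng.choice(days, size=n))
    base = day.isin(["Sat", "Sun"]).to_numpy().astype(int)
    y = np.where(rng.random(n) < 0.1, 1 - base, base)  # 10% label noise

    # Factor encoding: LightGBM handles pandas 'category' columns natively.
    X_factor = pd.DataFrame({"day": day.astype("category")})
    auc_factor = cross_val_score(LGBMClassifier(), X_factor, y,
                                 scoring="roc_auc", cv=5)

    # Dummy encoding: one binary indicator per level.
    X_dummy = pd.get_dummies(day, prefix="day").astype(int)
    auc_dummy = cross_val_score(LGBMClassifier(), X_dummy, y,
                                scoring="roc_auc", cv=5)

    print(auc_factor.mean(), auc_dummy.mean())  # typically very close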
One other statistic was computed for each of the simulations: the time to train the models.
Here, there was a very strong trend: factor-encoded models trained more efficiently than their dummy variable counterparts.
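Continuing the sketch above (and reusing its X_factor, X_dummy, y, and LGBMClassifier), training time can be compared directly; one would expect the gap to grow with the number of levels, since dummy encoding multiplies the number of columns:

    import time

    # Reusing X_factor, X_dummy, and y from the previous sketch.
    t0 = time.perf_counter()
    LGBMClassifier().fit(X_factor, y)
    t_factor = time.perf_counter() - t0

    t0 = time.perf_counter()
    LGBMClassifier().fit(X_dummy, y)
    t_dummy = time.perf_counter() - t0

    print(f"factor: {t_factor:.3f}s, dummy: {t_dummy:.3f}s")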
One other effect of how qualitative predictors are encoded is related to summary measures. Many techniques, especially tree-based models, calculate variable importance scores that are relative measures of how much a predictor affected the outcome. For example, trees measure the effect of a specific split on the improvement in model performance (e.g., impurity or residual error). As predictors are used in splits, these improvements are aggregated and can be used as the importance scores. When a qualitative predictor is decomposed into dummy variables, its importance is decomposed as well: each binary indicator receives its own score, which can dilute the apparent contribution of the original predictor.
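To make this dilution concrete, here is a small sketch using scikit-learn's impurity-based feature_importances_ (with the same made-up ridership values as above):

    import pandas as pd
    from sklearn.tree import DecisionTreeRegressor

    df = pd.DataFrame({
        "day": ["Mon", "Tue", "Wed", "Sat", "Sun", "Sat", "Sun"],
        "ridership": [17.1, 17.5, 17.4, 4.9, 3.8, 5.0, 3.9],
    })

    X = pd.get_dummies(df["day"])
    tree = DecisionTreeRegressor(max_depth=2).fit(X, df["ridership"])

    # The day-of-week effect is split across the Sat and Sun indicators
    # instead of being credited to a single 'day' predictor.
    for name, score in zip(X.columns, tree.feature_importances_):
        print(f"{name}: {score:.3f}")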