8.21 Example: Random forests versus bagging (gene expression data)

  • High-dimensional biological data set: expression measurements for 4,718 genes on tissue samples from 349 patients

  • Each patient sample has a qualitative label with 15 different levels: normal or one of 14 different cancer types

  • Want to predict cancer type based on the 500 genes that have the largest variance in the training set

  • Randomly divided the observations into training and test sets, then applied random forests (or bagging) to the training set for three different values of \(m\), the number of predictors available at each split (a sketch of this experiment follows the list)
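
Below is a minimal sketch of the experiment, assuming scikit-learn. Since the gene expression data itself is not reproduced here, a simulated matrix stands in for it; the dimensions (349 samples, 4,718 genes, 15 classes) and the three values of \(m\) (namely \(p\), \(p/2\), and \(\sqrt{p}\)) mirror the description above. Bagging is recovered as the special case where every predictor is available at each split.

```python
# Minimal sketch (not the book's code): random forests vs. bagging,
# controlled through max_features. The data are simulated placeholders.
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)
n, p_full, p = 349, 4718, 500

# Placeholder expression matrix and a 15-level class label.
X_full = rng.normal(size=(n, p_full))
y = rng.integers(0, 15, size=n)

X_tr_full, X_te_full, y_tr, y_te = train_test_split(
    X_full, y, test_size=0.3, random_state=0)

# Keep the 500 genes with the largest variance in the training set.
top = np.argsort(X_tr_full.var(axis=0))[-p:]
X_tr, X_te = X_tr_full[:, top], X_te_full[:, top]

# m = p is bagging; smaller m gives a random forest.
for m in (p, p // 2, int(np.sqrt(p))):
    clf = RandomForestClassifier(
        n_estimators=500, max_features=m, random_state=0)
    clf.fit(X_tr, y_tr)
    print(f"m = {m:3d}: test error = {1 - clf.score(X_te, y_te):.3f}")
```

With the real data, recording the test error at increasing values of `n_estimators` and plotting one curve per value of \(m\) would reproduce the comparison shown in Figure 8.4.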


Figure 8.4: Results from random forests for the 15-class gene expression data set with \(p = 500\) predictors. The test error is displayed as a function of the number of trees. Random forests (\(m < p\)) lead to a slight improvement over bagging (\(m = p\)). A single classification tree has an error rate of 45.7%.