Notes

Introduction: Tree-based methods

  • Involve stratifying or segmenting the predictor space into a number of simple regions
  • Are simple and useful for interpretation
  • However, basic decision trees are NOT competitive with the best supervised learning approaches in terms of prediction accuracy
  • Thus, we also discuss bagging, random forests, and boosting (i.e., tree-based ensemble methods) to grow multiple trees which are then combined to yield a single consensus prediction
  • These can result in dramatic improvements in prediction accuracy (but some loss of interpretability)
  • Can be applied to both regression and classification

Regression Trees

First, let’s take a look at the Hitters dataset.


The full dataset contains 322 rows and 21 columns (4 character columns: Names, League, Division, NewLeague; 17 numeric). After removing the 59 players with a missing Salary, 263 observations remain:
# A tibble: 263 × 5
   Names              Hits Years Salary log_Salary
   <chr>             <dbl> <dbl>  <dbl>      <dbl>
 1 -Alan Ashby          81    14  475         6.16
 2 -Alvin Davis        130     3  480         6.17
 3 -Andre Dawson       141    11  500         6.21
 4 -Andres Galarraga    87     2   91.5       4.52
 5 -Alfredo Griffin    169    11  750         6.62
 6 -Al Newman           37     2   70         4.25
 7 -Argenis Salazar     73     3  100         4.61
 8 -Andres Thomas       81     2   75         4.32
 9 -Andre Thornton      92    13 1100         7.00
10 -Alan Trammell      159    10  517.        6.25
# ℹ 253 more rows
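Below is a minimal sketch of how the table above could have been produced; the file path Hitters.csv and the object name `hitters` are assumptions rather than the notes' actual code.

```r
# Minimal sketch (assumed file path and object names, not the original code)
library(dplyr)
library(readr)

hitters <- read_csv("Hitters.csv", show_col_types = FALSE) %>%
  filter(!is.na(Salary)) %>%        # drop the 59 players with missing Salary
  mutate(log_Salary = log(Salary))  # model log salary, as in the tree below

hitters %>% select(Names, Hits, Years, Salary, log_Salary)
```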

  • For the Hitters data, a regression tree for predicting the log salary of a baseball player based on:

    1. number of years that he has played in the major leagues
    2. number of hits that he made in the previous year

Terminology

The three-region partition for the Hitters data set from the regression tree
  • Overall, the tree stratifies or segments the players into three regions of predictor space:

    • R1 = {X | Years < 4.5}
    • R2 = {X | Years >= 4.5, Hits < 117.5}
    • R3 = {X | Years >= 4.5, Hits >= 117.5}

    where R1, R2, and R3 are the terminal nodes (leaves), and the split points (drawn as green lines in the plot of the predictor space) are the internal nodes

  • The number in each leaf/terminal node is the mean of the response for the observations that fall there
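As a concrete illustration (not part of the original notes), the partition above can be reproduced by fitting a two-predictor tree and pruning it to three leaves; the `hitters` object from the earlier sketch and the `tree` package are assumptions.

```r
# Sketch: fit the two-predictor tree and prune to three leaves (R1, R2, R3)
library(tree)

fit    <- tree(log_Salary ~ Years + Hits, data = hitters)
pruned <- prune.tree(fit, best = 3)   # three terminal nodes
plot(pruned); text(pruned, pretty = 0)
```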

Interpretation of results: regression tree (Hitters data)

  1. Years is the most important factor in determining Salary: players with less experience earn lower salaries than more experienced players
  2. Given that a player is less experienced, the number of Hits that he made in the previous year seems to play little role in his Salary
  3. But among players who have been in the major leagues for 5 or more years, the number of Hits made in the previous year does affect Salary: players who made more Hits last year tend to have higher salaries
  4. This is surely an over-simplification, but compared to a regression model, it is easy to display, interpret and explain

Tree-building process (regression)

  1. Divide the predictor space — that is, the set of possible values for \(X_1,X_2, . . . ,X_p\) — into \(J\) distinct and non-overlapping regions, \(R_1,R_2, . . . ,R_J\)
  • In theory the regions could have ANY shape - they don't have to be boxes - but in practice we divide the space into boxes
  2. For every observation that falls into the region \(R_j\), we make the same prediction: the mean of the response values in \(R_j\)
  3. The goal is to find boxes \(R_1, . . . ,R_J\) that minimize the \(RSS\), given by

\[\mathrm{RSS}=\sum_{j=1}^{J}\sum_{i{\in}R_j}^{}(y_i - \hat{y}_{R_j})^2\]

where \(\hat{y}_{R_j}\) is the mean response for the training observations within the \(j\)th box

  • Unfortunately, it is computationally infeasible to consider every possible partition of the feature space into \(J\) boxes.

Recursive binary splitting

So, take a top-down, greedy approach known as recursive binary splitting:

  • top-down because it begins at the top of the tree and then successively splits the predictor space
  • greedy because at each step of the tree-building process, the best split is made at that particular step, rather than looking ahead and picking a split that will lead to a better tree in some future step
  1. First, select the predictor \(X_j\) and the cutpoint \(s\) such that splitting the predictor space into the regions \(\{X \mid X_j < s\}\) and \(\{X \mid X_j \ge s\}\) leads to the greatest possible reduction in RSS
  2. Repeat the process looking for the best predictor and best cutpoint to split data further (i.e., split one of the 2 previously identified regions - not the entire predictor space) minimizing the RSS within each of the resulting regions
  3. Continue until a stopping criterion is reached, e.g., no region contains more than five observations
  4. Again, we predict the response for a given test observation using the mean of the training observations in the region to which that test observation belongs
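To make the greedy step concrete, here is a small sketch (an illustration, not the notes' code) that searches every predictor and every candidate cutpoint for the single split minimizing the RSS of the two resulting regions.

```r
# Exhaustive search for one greedy binary split (illustrative sketch)
best_split <- function(dat, response = "log_Salary") {
  y <- dat[[response]]
  best <- list(rss = Inf)
  for (x_name in setdiff(names(dat), response)) {
    x <- dat[[x_name]]
    for (s in sort(unique(x))) {
      left  <- y[x < s]
      right <- y[x >= s]
      if (length(left) == 0 || length(right) == 0) next
      rss <- sum((left - mean(left))^2) + sum((right - mean(right))^2)
      if (rss < best$rss) best <- list(predictor = x_name, cutpoint = s, rss = rss)
    }
  }
  best
}

# Assuming the hitters object from earlier; this recovers the same first
# partition as the Years < 4.5 split (the cutpoint is reported at an observed
# value rather than a midpoint)
best_split(hitters[, c("Years", "Hits", "log_Salary")])
```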

but …

  • The previous method may result in a tree that overfits the data. Why?
  • Tree is too leafy (complex)
  • A better strategy is to have a smaller tree with fewer splits, which will reduce variance and lead to better interpretation of results (at the cost of a little bias)
  • So we will prune

Pruning a tree

  1. Grow a very large tree \(T_0\) as before
  2. Apply cost-complexity pruning to \(T_0\) to obtain a sequence of BEST subtrees, as a function of \(\alpha\)

Cost complexity pruning minimizes (Eq. 8.4) \(\sum_{m=1}^{|T|}\sum_{x_i{\in}R_m}(y_i-\hat{y}_{R_m})^2 + \alpha|T|\)

where

\(\alpha \geq 0\)

\(|T|\) is the number of terminal nodes of the subtree \(T\)

\(R_m\) is the rectangle/region (i.e., the subset of predictor space) corresponding to the \(m\)th terminal node

\(\hat{y}_{R_m}\) is the mean response for the training observations in \(R_m\)

  • the tuning parameter \(\alpha\) controls a trade-off between:

    1. the subtree’s complexity (the number of terminal nodes)
    2. the subtree’s fit to the training data
  3. Choose \(\alpha\) using K-fold cross-validation

    • repeat steps 1) and 2) on each \((K-1)/K\) fraction of the training data
    • average the results and pick \(\alpha\) to minimize the average MSE
    • recall that in K-fold cross-validation (say K = 5): the model is estimated on 80% of the data five different times, predictions are made for the remaining 20%, and the test MSEs are averaged
  4. Return to the subtree from Step 2) that corresponds to the chosen value of \(\alpha\)

An example: tree pruning (Hitters dataset)

  • Results of fitting and pruning a regression tree on the Hitters data using 9 of the features
  • Randomly divided the data set in half (132 observations in training, 131 observations in the test set)
  • Built large regression tree on training data and varied \(\alpha\) in Eq. 8.4 to create subtrees with different numbers of terminal nodes
  • Finally, performed 6-fold cross-validation to estimate the cross-validated MSE of the trees as a function of \(\alpha\)

Training, cross-validation, and test MSE are shown as a function of the number of terminal nodes in the pruned tree. Standard error bands are displayed. The minimum cross-validation error occurs at a tree size of 3.
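A hedged sketch of this workflow with the `tree` package follows; the exact 9-feature model and random split from the notes are not reproduced, so the feature set and the object names (`hitters`, `train`, `test`) are assumptions.

```r
# Grow a large tree, then cost-complexity prune with 6-fold CV (sketch)
library(tree)

dat <- as.data.frame(hitters[, sapply(hitters, is.numeric)])
dat$Salary <- NULL                    # predict log_Salary, not raw Salary
set.seed(1)
train_id <- sample(nrow(dat), floor(nrow(dat) / 2))
train <- dat[train_id, ]
test  <- dat[-train_id, ]

big_tree  <- tree(log_Salary ~ ., data = train)      # step 1: grow a large tree
cv_out    <- cv.tree(big_tree, K = 6)                # 6-fold CV over the alpha sequence
best_size <- cv_out$size[which.min(cv_out$dev)]      # tree size minimizing CV deviance
pruned    <- prune.tree(big_tree, best = best_size)  # the chosen subtree

mean((predict(pruned, newdata = test) - test$log_Salary)^2)  # test MSE
```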

Classification trees

  • Very similar to a regression tree except it predicts a qualitative (vs quantitative) response
  • We predict that each observation belongs to the most commonly occurring class of training observations in the region to which it belongs
  • In the classification setting, RSS cannot be used as a criterion for making the binary splits
  • A natural alternative to RSS is the classification error rate, i.e., the fraction of the training observations in that region that do not belong to the most common class:

\[E = 1 - \max_k(\hat{p}_{mk})\]

where \(\hat{p}_{mk}\) is the proportion of training observations in the \(m\)th region that are from the \(k\)th class

  • However, this error rate is not sufficiently sensitive for growing the tree: \(E\) does not change much as the tree grows

  • So, 2 other measures are preferable:

    • The Gini Index defined by \[G = \sum_{k=1}^{K}\hat{p}_{mk}(1-\hat{p}_{mk})\] is a measure of total variance across the K classes
    • The Gini index takes on a small value if all of the \(\hat{p}_{mk}\)’s are close to 0 or 1
    • For this reason the Gini index is referred to as a measure of node purity - a small value indicates that a node contains predominantly observations from a single class
    • An alternative to the Gini index is cross-entropy given by

    \[D = - \sum_{k=1}^{K}\hat{p}_{mk}\log(\hat{p}_{mk})\]

  • The Gini index and cross-entropy are very similar numerically
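A small helper (added for illustration) makes the comparison concrete: it computes the classification error rate, the Gini index, and the cross-entropy for a vector of class proportions in a node.

```r
# Node impurity measures for a vector of class proportions p_hat (must sum to 1)
node_impurity <- function(p_hat) {
  c(class_error   = 1 - max(p_hat),
    gini          = sum(p_hat * (1 - p_hat)),
    cross_entropy = -sum(ifelse(p_hat > 0, p_hat * log(p_hat), 0)))
}

node_impurity(c(0.5, 0.5))  # maximally impure node: all three measures are large
node_impurity(c(0.9, 0.1))  # nearly pure node: Gini and entropy are small
```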

Example: classification tree (Heart dataset)

  • Data contain a binary outcome HD (heart disease Y or N based on angiographic test) for 303 patients who presented with chest pain
  • 13 predictors including Age, Sex, Chol (a cholesterol measurement), and other heart and lung function measurements
  • Cross-validation yields a tree with six terminal nodes

Heart data. Top: The unpruned tree. Bottom Left: Cross-validation error, training, and test error, for different sizes of the pruned tree. Bottom Right: The pruned tree corresponding to the minimal cross-validation error.
  • Comment: Classification trees can be built even when categorical PREDICTORS are present, e.g., the first split is on Thal, which is categorical (the ‘a’ in Thal:a indicates the first level of the predictor, i.e., Normal)
  • Additionally, notice that some of the splits yield two terminal nodes that have the same predicted value (see red box)
  • Regardless of the value of RestECG, a response value of Yes is predicted for those observations
  • Why is the split performed at all?
    • Because it leads to increased node purity: all 9 of the observations corresponding to the right-hand leaf have a response value of Yes, whereas 7/11 of those corresponding to the left-hand leaf have a response value of Yes
  • Why is node purity important?
    • Suppose that we have a test observation that belongs to the region given by that right-hand leaf. Then we can be pretty certain that its response value is Yes. In contrast, if a test observation belongs to the region given by the left-hand leaf, then its response value is probably Yes, but we are much less certain
  • Even though the split RestECG<1 does not reduce the classification error, it improves the Gini index and the entropy, which are more sensitive to node purity
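A hedged sketch of the Heart fit; the data frame name `Heart`, the outcome column HD (as written in these notes), and factor coding of the categorical predictors are all assumptions.

```r
# Grow, cross-validate, and prune a classification tree on the Heart data (sketch)
library(tree)

heart_fit <- tree(HD ~ ., data = Heart)
cv_heart  <- cv.tree(heart_fit, FUN = prune.misclass)  # CV on the classification error rate
best_size <- cv_heart$size[which.min(cv_heart$dev)]    # six leaves in the notes' example
pruned    <- prune.misclass(heart_fit, best = best_size)
plot(pruned); text(pruned, pretty = 0)
```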

Advantages/Disadvantages of decision trees

  • Trees can be displayed graphically and are very easy to explain to people
  • They mirror human decision-making
  • Can handle qualitative predictors without the need for dummy variables

but,

  • They do not have the same level of predictive accuracy as some of the other regression and classification approaches
  • Can be very non-robust (i.e., a small change in the data can cause large change in the final estimated tree)
  • To improve performance, we can use an ensemble method, which combines many simple ‘building blocks’ (i.e., regression or classification trees) to obtain a single and potentially very powerful model
  • Ensemble methods include: bagging, random forests, boosting, and Bayesian additive regression trees

Bagging

  • Bagging, also known as bootstrap aggregation, is a general-purpose procedure for reducing the variance of a statistical learning method

  • It’s useful and frequently used in the context of decision trees

  • Recall that given a set of \(n\) independent observations \(Z_1,..., Z_n\), each with variance \(\sigma^2\), the variance of the mean \(\bar{Z}\) of the observations is given by \(\sigma^2/n\)

  • So, averaging a set of observations reduces variance

  • But, this is not practical because we generally do not have access to multiple training sets!

  • What can we do?

  • Cue the bootstrap, i.e., take repeated samples from the single training set

  • Generate \(B\) different bootstrapped training data sets

  • Then train our method on the \(b\)th bootstrapped training set to get \(\hat{f}^{*b}(x)\), the prediction at a point \(x\)

  • Average all the predictions to obtain \[\hat{f}_{bag}(x) = \frac{1}{B}\sum_{b=1}^B\hat{f}^{*b}(x)\]

  • In the case of classification trees:

    • for each test observation:
      • record the class predicted by each of the \(B\) trees
      • take a majority vote: the overall prediction is the most commonly occurring class among the \(B\) predictions

Comment: The number of trees \(B\) is not a critical parameter with bagging - a large \(B\) will not lead to overfitting
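A sketch of bagging with the randomForest package, reusing the `train` and `test` objects assumed in the earlier pruning sketch; setting mtry equal to the number of predictors makes randomForest() perform bagging.

```r
# Bagging = a random forest that considers all p predictors at every split
library(randomForest)

p <- ncol(train) - 1                 # number of predictors (everything except log_Salary)
set.seed(1)
bag_fit <- randomForest(log_Salary ~ ., data = train,
                        mtry = p,    # use ALL p predictors at each split
                        ntree = 500, # B = 500 bagged trees
                        importance = TRUE)
mean((predict(bag_fit, newdata = test) - test$log_Salary)^2)  # test MSE
```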

Out-of-bag error estimation

  • But how do we estimate the test error of a bagged model?
  • It’s pretty straightforward:
    1. Because trees are repeatedly fit to bootstrapped subsets of observations, on average each bagged tree uses about 2/3 of the observations
    2. The leftover 1/3 not used to fit a given bagged tree are called out-of-bag (OOB) observations
    3. We can predict the response for the \(i\)th observation using each of the trees in which that observation was OOB. Gives around B/3 predictions for the \(i\)th observation (which we then average)
    4. This estimate is essentially the LOO cross-validation error for bagging (if \(B\) is large)
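With randomForest, the OOB quantities come for free from the bagged fit above (`bag_fit` is the assumed object from the previous sketch).

```r
# predict() with no newdata returns the OOB prediction for each training observation
oob_pred <- predict(bag_fit)
mean((train$log_Salary - oob_pred)^2)  # OOB estimate of the test MSE

tail(bag_fit$mse, 1)                   # OOB MSE after all 500 trees (regression fits)
```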

Variable importance measures

  • Bagging results in improved accuracy over prediction using a single tree
  • But, it can be difficult to interpret the resulting model:
    • we can’t represent the statistical learning procedure using a single tree
    • it’s not clear which variables are most important to the procedure (i.e., we have many trees each of which may give a differing view on the importance of a given predictor)
  • So, which predictors are important?
    • An overall summary of the importance of each predictor can be obtained by recording the total amount that the \(RSS\) (for regression) or the Gini index (for classification) decreases due to splits over a given predictor, averaged over all \(B\) trees
      • a large value = important predictor

A variable importance plot for the Heart data. Variable importance is computed using the mean decrease in Gini index, and expressed relative to the maximum.
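The corresponding measures can be pulled directly from a randomForest fit (a sketch assuming the `bag_fit` object from above).

```r
importance(bag_fit)  # %IncMSE and IncNodePurity (total RSS decrease from splits on each variable)
varImpPlot(bag_fit)  # a plot like the one described in the caption above
```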

Random forests

  • A problem with bagging is that bagged trees may be highly similar to each other.
  • For example, if there is a strong predictor in the data set, most of the bagged trees will use this strong predictor in the top split so that
    • the trees will look quite similar
    • predictions from the bagged trees will be highly correlated
  • Averaging many highly correlated quantities does not lead to as large a reduction in variance as averaging many uncorrelated quantities

Random forests: advantages over bagging

  • Random forests overcome this problem by forcing each split to consider only a subset of the predictors (typically a random sample \(m \approx \sqrt{p}\))
  • Thus at each split, the algorithm is NOT ALLOWED to consider a majority of the available predictors (essentially \((p - m)/p\) of the splits will not even consider the strong predictor, giving other predictors a chance)
  • This decorrelates the trees and makes the average of the resulting trees less variable (more reliable)
  • Only difference between bagging and random forests is the choice of predictor subset size \(m\) at each split: if a random forest is built using \(m = p\) that’s just bagging
  • For both, we build a number of decision trees on bootstrapped training samples
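The same call with a smaller mtry gives a random forest (again assuming the `train`, `test`, and `p` objects from the earlier sketches).

```r
# Random forest: only m ≈ sqrt(p) predictors are candidates at each split
set.seed(1)
rf_fit <- randomForest(log_Salary ~ ., data = train,
                       mtry = floor(sqrt(p)),  # m ≈ sqrt(p); m = p would be bagging
                       ntree = 500, importance = TRUE)
mean((predict(rf_fit, newdata = test) - test$log_Salary)^2)  # compare with the bagging test MSE
```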

Example: Random forests versus bagging (gene expression data)

  • High-dimensional biological data set: contains gene expression measurements of 4,718 genes measured on tissue samples from 349 patients
  • Each of the patient samples has a qualitative label with 15 different levels: Normal or one of 14 different cancer types
  • Want to predict cancer type based on the 500 genes that have the largest variance in the training set
  • Randomly divided the observations into training/test and applied random forests (or bagging) to the training set for 3 different values of \(m\) (the number of predictors available at each split)

Results from random forests for the 15-class gene expression data set with p = 500 predictors. The test error is displayed as a function of the number of trees. Random forests (m < p) lead to a slight improvement over bagging (m = p). A single classification tree has an error rate of 45.7%.
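A hedged sketch of this comparison; `expr` (a samples-by-genes matrix) and `type` (a factor with the 15 class labels) are hypothetical objects standing in for the book's data.

```r
# Compare bagging (m = p) with random forests (m < p) on the top-500-variance genes
library(randomForest)

top500 <- order(apply(expr, 2, var), decreasing = TRUE)[1:500]  # 500 highest-variance genes
X <- expr[, top500]
set.seed(1)
tr <- sample(nrow(X), floor(nrow(X) / 2))

for (m in c(500, 250, floor(sqrt(500)))) {   # m = p (bagging), p/2, sqrt(p)
  fit <- randomForest(x = X[tr, ], y = type[tr],
                      xtest = X[-tr, ], ytest = type[-tr],
                      mtry = m, ntree = 500)
  cat("m =", m, " test error =", fit$test$err.rate[fit$ntree, "Test"], "\n")
}
```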

Boosting

  • Yet another approach to improve prediction accuracy from a decision tree
  • Can also be applied to many statistical learning methods for regression or classification
  • Recall that in bagging each tree is built on a bootstrap training data set
  • In boosting, each tree is grown sequentially using information from previously grown trees:
    • given the current model, we fit a decision tree to the residuals of the model (rather than the outcome Y) as the response
    • we then add this new decision tree into the fitted function (model) in order to update the residuals
    • Why? this way each tree is built on information that the previous trees were unable to ‘catch’

Boosting algorithm

  1. Set \(\hat{f}(x) = 0\) and the residuals \(r_i = y_i\) for all \(i\) in the training set
  2. For \(b = 1, 2, \ldots, B\), repeat:
     a. fit a tree \(\hat{f}^b\) with \(d\) splits (\(d + 1\) terminal nodes) to the training data \((X, r)\)
     b. update the model by adding a shrunken version of the new tree: \(\hat{f}(x) \leftarrow \hat{f}(x) + \lambda\hat{f}^b(x)\)
     c. update the residuals: \(r_i \leftarrow r_i - \lambda\hat{f}^b(x_i)\)
  3. Output the boosted model \(\hat{f}(x) = \sum_{b=1}^{B}\lambda\hat{f}^b(x)\)

where:

\(\hat{f}(x)\) is the boosted model (the running sum of the fitted trees \(\hat{f}^b\))

\(r\) = residuals

\(d\) = number of splits in each tree (controls the complexity of the boosted ensemble)

\(\lambda\) = shrinkage parameter (a small positive number that controls the rate at which boosting learns; typically 0.01 or 0.001 but right choice can depend on the problem)

  • Each of the trees can be small, with just a few terminal nodes (determined by \(d\))
  • By fitting small trees to the residuals, we slowly improve our model (\(\hat{f}\)) in areas where it doesn’t perform well
  • The shrinkage parameter \(\lambda\) slows the process down further, allowing more and different shaped trees to ‘attack’ the residuals
  • Unlike bagging and random forests, boosting can OVERFIT if \(B\) is too large. \(B\) is selected via cross-validation
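A sketch with the gbm package on the assumed Hitters `train` and `test` objects; n.trees, interaction.depth, and shrinkage correspond to B, d, and lambda, and B is chosen by cross-validation.

```r
# Boosted regression trees (sketch); pick B by cross-validation
library(gbm)

set.seed(1)
boost_fit <- gbm(log_Salary ~ ., data = train,
                 distribution = "gaussian",
                 n.trees = 5000,          # grow many trees, then pick the best B below
                 interaction.depth = 1,   # d = 1 split per tree (stumps)
                 shrinkage = 0.01,        # lambda
                 cv.folds = 5)
best_B <- gbm.perf(boost_fit, method = "cv")   # B minimizing the CV error
mean((predict(boost_fit, newdata = test, n.trees = best_B) - test$log_Salary)^2)
```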

Example: Boosting versus random forests

Results from performing boosting and random forests on the 15-class gene expression data set in order to predict cancer versus normal. The test error is displayed as a function of the number of trees. For the two boosted models, lambda = 0.01. Depth-1 trees slightly outperform depth-2 trees, and both outperform the random forest, although the standard errors are around 0.02, making none of these differences significant. The test error rate for a single tree is 24 %.
  • Notice that because the growth of a particular tree takes into account the other trees that have already been grown, smaller trees are typically sufficient in boosting (versus random forests)
  • Random forests and boosting are among the state-of-the-art methods for supervised learning (but, their results can be difficult to interpret)

Bayesian additive regression trees (BART)

  • Recall that in bagging and random forests, each tree is built on a random sample of the data and/or predictors, and each tree is built independently of the others; in boosting, each new tree tries to capture signal that the existing trees have missed
  • BART is related to both approaches - what is new is HOW the new trees are generated
  • NOTE: only BART for regression is described in the book

BART notation

  • Let \(K\) be the total number of regression trees and
  • \(B\) be the number of iterations the BART algorithm will run for
  • Let \(\hat{f}^b_k(x)\) be the prediction at \(x\) for the \(k\)th regression tree used in the \(b\)th iteration of the BART algorithm
  • At the end of each iteration, the \(K\) trees from that iteration will be summed:

\[\hat{f}^b(x) = \sum_{k=1}^{K}\hat{f}^b_k(x)\] for \(b=1,...,B\)

BART algorithm

  • In the first iteration of the BART algorithm, all \(K\) trees are initialized to have 1 root node, with \(\hat{f}^1_k(x) = \frac{1}{nK}\sum_{i=1}^{n}y_i\)
    • i.e., the mean of the response values divided by the total number of trees
  • Thus, for the first iteration (\(b = 1\)), the prediction for all \(K\) trees is just the mean of the response

\(\hat{f}^1(x) = \sum_{k=1}^K\hat{f}^1_k(x) = \sum_{k=1}^K\frac{1}{nK}\sum_{i=1}^{n}y_i = \frac{1}{n}\sum_{i=1}^{n}y_i\)

BART algorithm: iteration 2 and on

  • In subsequent iterations, BART updates each of the \(K\) trees one at a time
  • In the \(b\)th iteration to update the \(k\)th tree, we subtract from each response value the predictions from all but the \(k\)th tree, to obtain a partial residual:

\(r_i = y_i - \sum_{k'<k}\hat{f}^b_{k'}(x_i) - \sum_{k'>k}\hat{f}^{b-1}_{k'}(x_i)\)

for the \(i\)th observation, \(i = 1, …, n\)

  • Rather than fitting a fresh tree to this partial residual, BART chooses a perturbation of the tree from the previous iteration, \(\hat{f}^{b-1}_{k}\), favoring perturbations that improve the fit to the partial residual
  • To perturb trees:
    • change the structure of the tree by adding/pruning branches
    • change the prediction in each terminal node of the tree
  • The output of BART is a collection of prediction models:

\(\hat{f}^b(x) = \sum_{k=1}^{K}\hat{f}^b_k(x)\)

for \(b = 1, 2,…, B\)

BART algorithm: figure

  • Comment: the first few prediction models obtained in the earlier iterations (known as the burn-in period; denoted by \(L\)) are typically thrown away since they tend to not provide very good results, like you throw away the first pancake of the batch

BART: additional details

  • A key element of BART is that a fresh tree is NOT fit to the current partial residual: instead, we improve the fit to the current partial residual by slightly modifying the tree obtained in the previous iteration (Step 3(a)ii)
  • This guards against overfitting since it limits how “hard” the data is fit in each iteration
  • Additionally, the individual trees are typically pretty small
  • BART, as the name suggests, can be viewed as a Bayesian approach to fitting an ensemble of trees:
    • each time a tree is randomly perturbed to fit the residuals = drawing a new tree from a posterior distribution

To apply BART:

  • We must select the number of trees \(K\), the number of iterations \(B\) and the number of burn-in iterations \(L\)
  • Typically, large values are chosen for \(B\) and \(K\) and a moderate value for \(L\): e.g. \(K\) = 200, \(B\) = 1,000 and \(L\) = 100
  • BART has been shown to have impressive out-of-the-box performance - i.e., it performs well with minimal tuning
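A sketch using gbart() from the BART package (the function used in the ISLR lab); `x_train`, `y_train`, and `x_test` are assumed objects, and the three arguments map onto K, B, and L.

```r
library(BART)

set.seed(1)
bart_fit <- gbart(x.train = x_train, y.train = y_train, x.test = x_test,
                  ntree  = 200,   # K: number of trees
                  ndpost = 1000,  # B: iterations kept after burn-in
                  nskip  = 100)   # L: burn-in iterations discarded

yhat_test <- bart_fit$yhat.test.mean  # average of the B per-iteration test predictions
```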