8.13 Example: classification tree (Heart data)

  • The data contain a binary outcome HD (heart disease Yes or No, based on an angiographic test) for 303 patients who presented with chest pain

  • 13 predictors, including Age, Sex, Chol (a cholesterol measurement), and other heart and lung function measurements

  • Cross-validation yields a pruned tree with six terminal nodes; a minimal fitting-and-pruning sketch follows the figure caption below

Figure 8.2: Heart data. Top: The unpruned tree. Bottom Left: Cross-validation error, training, and test error, for different sizes of the pruned tree. Bottom Right: The pruned tree corresponding to the minimal cross-validation error.
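
The grow-then-prune workflow summarized above can be sketched in code. The following is a minimal sketch using scikit-learn's cost-complexity pruning rather than the R `tree` package behind the figure; the file name "Heart.csv" and the column names (e.g. "AHD" for the outcome) are assumptions, not taken from the source.

```python
# Minimal sketch: grow a large classification tree, then choose the pruned
# subtree by cross-validation (scikit-learn cost-complexity pruning).
# File name and column names are assumptions for this sketch.
import pandas as pd
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeClassifier

heart = pd.read_csv("Heart.csv").dropna()
X = pd.get_dummies(heart.drop(columns=["AHD"]))  # one-hot encode categorical predictors
y = heart["AHD"]                                 # binary outcome: heart disease Yes/No

# Grow a large (unpruned) tree, then obtain the sequence of pruning alphas
full_tree = DecisionTreeClassifier(random_state=0).fit(X, y)
alphas = full_tree.cost_complexity_pruning_path(X, y).ccp_alphas

# Pick the alpha (i.e. the subtree size) with the best cross-validated accuracy
cv_scores = [cross_val_score(DecisionTreeClassifier(ccp_alpha=a, random_state=0),
                             X, y, cv=10).mean()
             for a in alphas]
best_alpha = alphas[int(pd.Series(cv_scores).idxmax())]

pruned = DecisionTreeClassifier(ccp_alpha=best_alpha, random_state=0).fit(X, y)
print("terminal nodes in pruned tree:", pruned.get_n_leaves())
```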

  • NOTE: classification trees can also be built when qualitative (categorical) predictors are present; e.g., the first split is on Thal, which is categorical (the ‘a’ in Thal:a indicates the first level of that predictor, i.e. Normal)
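
As a toy illustration (the level names below are assumptions, not the exact coding behind the figure): R's tree functions split on the levels of a factor directly, which is where the Thal:a label comes from, whereas a scikit-learn tree needs qualitative predictors expanded into 0/1 indicator columns first.

```python
# Hypothetical illustration of preparing a qualitative predictor such as Thal
# for a scikit-learn tree; the level names here are assumptions for the sketch.
import pandas as pd

thal = pd.Series(["normal", "fixed", "reversible", "normal"], name="Thal")
indicators = pd.get_dummies(thal, prefix="Thal")  # one 0/1 column per level
print(indicators)

# A split such as "Thal_normal <= 0.5" in the fitted scikit-learn tree then
# plays the same role as the "Thal:a" (first level) split shown in the figure.
```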

  • Additionally, notice that some of the splits yield two terminal nodes that have the same predicted value (see the red box in the figure)

  • Regardless of the value of RestECG, a response value of Yes is predicted for those observations

  • Why is the split performed at all?

    • Because it leads to increased node purity: all 9 of the observations corresponding to the right-hand leaf have a response value of Yes, whereas only 7 of the 11 observations corresponding to the left-hand leaf have a response value of Yes
  • Why is node purity important?

    • Suppose that we have a test observation that belongs to the region given by that right-hand leaf. Then we can be pretty certain that its response value is Yes. In contrast, if a test observation belongs to the region given by the left-hand leaf, then its response value is probably Yes, but we are much less certain
  • Even though the split RestECG<1 does not reduce the classification error, it improves the Gini index and the entropy, which are more sensitive to node purity
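
The claim above can be checked numerically from the leaf counts quoted earlier (9 Yes / 0 No in the right-hand leaf, 7 Yes / 4 No in the left-hand leaf, hence 16 Yes / 4 No in the parent node). The short sketch below, assuming exactly those counts, shows the classification error unchanged at 0.20 while the Gini index and the entropy both decrease after the RestECG<1 split.

```python
# Compare node-impurity measures before and after the RestECG<1 split,
# using the leaf counts quoted in the text above.
from math import log2

def error(p):    # misclassification rate for a two-class node with Yes-proportion p
    return 1 - max(p, 1 - p)

def gini(p):     # Gini index for a two-class node
    return 2 * p * (1 - p)

def entropy(p):  # (cross-)entropy for a two-class node, in bits
    return 0.0 if p in (0, 1) else -(p * log2(p) + (1 - p) * log2(1 - p))

parent_p = 16 / 20                 # 16 Yes out of 20 observations before the split
left_p, left_n = 7 / 11, 11        # left leaf: 7 Yes / 4 No
right_p, right_n = 9 / 9, 9        # right leaf: 9 Yes / 0 No

for name, f in [("error", error), ("gini", gini), ("entropy", entropy)]:
    before = f(parent_p)
    # weighted average impurity over the two child nodes
    after = (left_n * f(left_p) + right_n * f(right_p)) / (left_n + right_n)
    print(f"{name:8s} before={before:.3f} after={after:.3f}")

# expected output (to 3 decimals):
# error    before=0.200 after=0.200
# gini     before=0.320 after=0.255
# entropy  before=0.722 after=0.520
```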