2.2 Assessing Model Accuracy
There is no free lunch in statistics: no one method dominates all others over all possible problems.
Selecting the best approach can be challenging in practice.
2.2.1 Measuring Quality of Fit
On a particular data set, one specific method may work best, but some other method may work better on a similar but different data set.
MSE
We always need some way to measure how well a model’s predictions actually match the observed data.
In the regression setting, the most commonly used measure is the mean squared error (MSE), given by \[ \mathrm{MSE} = \frac{1}{n}\sum_{i=1}^n\big(y_i-\hat{f}(x_i)\big)^2,\] where \(\hat{f}(x_i)\) is the prediction that \(\hat{f}\) gives for the \(i\)th observation.
The MSE will be small if the predicted responses are very close to the true responses, and will be large if for some of the observations the predicted and true responses differ substantially.
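As a concrete illustration, here is a minimal sketch of this computation in Python; the observed responses and fitted values below are made-up numbers used purely for illustration.

```python
import numpy as np

# Made-up observed responses y_i and fitted values \hat{f}(x_i) from some model
# trained on the same data (both arrays are illustrative, not from a real data set).
y = np.array([3.1, 2.4, 5.0, 4.2])            # observed responses
f_hat = np.array([2.9, 2.6, 4.7, 4.5])        # predictions \hat{f}(x_i)

train_mse = np.mean((y - f_hat) ** 2)         # (1/n) * sum of squared errors
print(train_mse)
```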
Training vs. Test
The MSE in the above equation is computed using the training data that was used to fit the model, and so should more accurately be referred to as the training MSE.
In general, we do not really care how well the method works on the training data.
We are interested in the accuracy of the predictions that we obtain when we apply our method to previously unseen test data.
If we had a large number of test observations \((x_0, y_0)\), we could compute the average squared prediction error for these test observations, \[\mathrm{Ave}\big(y_0 - \hat{f}(x_0)\big)^2.\]
We’d like to select the model for which this quantity is as small as possible on unseen, future samples.
The degrees of freedom is a quantity that summarizes the flexibility of a curve. The training MSE declines monotonically as flexibility increases.
Overfitting
As model flexibility increases, training MSE will decrease, but the test MSE may not. When a given method yields a small training MSE but a large test MSE, we are said to be overfitting the data.
Overfitting refers specifically to the case in which a less flexible model would have yielded a smaller test MSE.
The expected test MSE, for a given value \(x_0\), can always be decomposed into the sum of three fundamental quantities: the variance of \(\hat{f}(x_0)\), the squared bias of \(\hat{f}(x_0)\), and the variance of the error term \(\epsilon\). That is,
\[E\big(y_0 - \hat{f}(x_0)\big)^2 = \mathrm{Var}\big(\hat{f}(x_0)\big) +[\mathrm{Bias}\big(\hat{f}(x_0)\big)]^2 + \mathrm{Var}(\epsilon)\]
Here the notation \(E\big(y_0 - \hat{f}(x_0)\big)^2\) defines the expected test MSE at \(x_0\), and refers to the average test MSE that we would obtain if we repeatedly estimated \(f\) using a large number of training sets, and tested each at \(x_0\). The overall expected test MSE can be computed by averaging \(E\big(y_0 - \hat{f}(x_0)\big)^2\) over all possible values of \(x_0\) in the test set.
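The decomposition can be checked by simulation. The sketch below is illustrative only: the true function, the noise level, and the cubic-polynomial estimator are all assumptions. It repeatedly draws a training set, refits the model, records \(\hat{f}(x_0)\), and then compares the variance and squared bias of those predictions (plus \(\mathrm{Var}(\epsilon)\)) with the expected test MSE at \(x_0\).

```python
import numpy as np

rng = np.random.default_rng(0)

def f(x):                       # assumed "true" regression function (illustrative)
    return np.sin(x)

sigma = 0.5                     # standard deviation of the irreducible error epsilon
x0 = 1.0                        # fixed test point
n, n_sims, degree = 50, 2000, 3

preds = np.empty(n_sims)
for s in range(n_sims):
    # draw a fresh training set and refit the model each time
    x = rng.uniform(0.0, 3.0, n)
    y = f(x) + rng.normal(0.0, sigma, n)
    coefs = np.polyfit(x, y, degree)        # cubic polynomial fit, as an example estimator
    preds[s] = np.polyval(coefs, x0)        # \hat{f}(x0) from this training set

var_fhat = preds.var()                      # Var(\hat{f}(x0))
bias_sq = (preds.mean() - f(x0)) ** 2       # [Bias(\hat{f}(x0))]^2
print(var_fhat, bias_sq, sigma**2, var_fhat + bias_sq + sigma**2)
```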
In Figure 2.9, the orange, blue and green squares indicate the MSEs associated with the corresponding curves in the left-hand panel. A more restricted and hence smoother curve has fewer degrees of freedom than a wiggly curve; note that linear regression is at the most restrictive end, with two degrees of freedom.
As the flexibility of the statistical learning method increases, we observe a monotone decrease in the training MSE and a U-shape in the test MSE. This is a fundamental property of statistical learning that holds regardless of the particular data set at hand and regardless of the statistical method being used.
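The simulation sketch below typically reproduces this pattern. The data-generating function, the noise level, and the use of polynomial degree as a stand-in for flexibility are all assumptions made for illustration.

```python
import numpy as np

rng = np.random.default_rng(1)

def f(x):                        # assumed true regression function (illustrative)
    return np.sin(2.0 * x)

# one training set and one large test set drawn from the same model
x_tr = rng.uniform(0.0, 3.0, 60);   y_tr = f(x_tr) + rng.normal(0.0, 0.4, 60)
x_te = rng.uniform(0.0, 3.0, 2000); y_te = f(x_te) + rng.normal(0.0, 0.4, 2000)

for degree in [1, 2, 3, 5, 10, 15]:            # polynomial degree as a proxy for flexibility
    coefs = np.polyfit(x_tr, y_tr, degree)     # fit on the training data only
    mse_tr = np.mean((y_tr - np.polyval(coefs, x_tr)) ** 2)
    mse_te = np.mean((y_te - np.polyval(coefs, x_te)) ** 2)
    print(f"degree {degree:2d}: training MSE {mse_tr:.3f}, test MSE {mse_te:.3f}")
```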
Figure 2.10 displays another example, in which the true \(f\) is approximately linear.
However, because the truth is close to linear, the test MSE only decreases slightly before increasing again, so that the orange least squares fit is substantially better than the highly flexible green curve.
Figure 2.11 displays an example in which \(f\) is highly non-linear. The training and test MSE curves still exhibit the same general patterns, but now there is a rapid decrease in both curves before the test MSE starts to increase slowly.
We need to select a statistical learning method that simultaneously achieves low variance and low bias.
2.2.2 The Bias-Variance Trade-Off
As we use more flexible methods, the variance will increase and the bias will decrease. As we increase the flexibility of a class of methods, the bias tends to initially decrease faster than the variance increases. However, at some point increasing flexibility has little impact on the bias but starts to significantly increase the variance. When this happens the test MSE increases.
In all three cases, the variance increases and the bias decreases as the method's flexibility increases. The relationship between bias, variance, and test set MSE given in the decomposition above is referred to as the bias-variance trade-off.
The challenge lies in finding a method for which both the variance and the squared bias are low. This trade-off is one of the most important recurring themes in this book.
2.2.3 The Classification Setting
The most common approach for quantifying the accuracy of our estimate \(\hat{f}\) is the training error rate, the proportion of mistakes that are made if we apply our estimate \(\hat{f}\) to the training observations: \[\frac{1}{n}\sum_{i=1}^{n}I(y_i \ne \hat{y}_i).\]
The above equation computes the fraction of incorrect classifications: \(\hat{y}_i\) is the predicted class label for the \(i\)th observation, and the indicator \(I(y_i \ne \hat{y}_i)\) equals 1 if the prediction is wrong and 0 otherwise.
The equation is referred to as the training error rate because it is computed based on the data that was used to train our classifier.
Again, we are most interested in the error rates that result from applying our classifier to test observations that were not used in training.
Test Error
The test error rate associated with a set of test observations of the form \((x_0, y_0)\) is given by
\[\mathrm{Ave}\big(I(y_0 \ne \hat{y}_0)\big),\]
where \(\hat{y}_0\) is the predicted class label that results from applying the classifier to the test observation with predictor \(x_0\). A good classifier is one for which the test error is smallest.
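In code, both the training and test error rates reduce to the fraction of mismatched labels. A minimal sketch (with made-up labels) follows.

```python
import numpy as np

# Made-up true labels and classifier predictions, purely for illustration
y_true = np.array(["blue", "orange", "blue", "blue", "orange"])
y_hat  = np.array(["blue", "blue",   "blue", "orange", "orange"])

# error rate = average of the indicator I(y_i != y_hat_i)
error_rate = np.mean(y_true != y_hat)
print(error_rate)   # 0.4: two of the five observations are misclassified
```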
The Bayes Classifier
Hypothetical – cannot be done in practice
The test error rate is minimized, on average, by a very simple classifier that assigns each observation to the most likely class, given its predictor values. In other words, we should simply assign a test observation with predictor vector \(x_0\) to the class \(j\) for which
\[\mathrm{Pr}(Y=j|X=x_0)\] is largest.
Note that this is a conditional probability: it is the probability that \(Y = j\), given the observed predictor vector \(x_0\). This very simple classifier is called the Bayes classifier.
Bayes Classifier Decision Boundary
Figure 2.13 provides an example using a simulated data set in a two-dimensional space consisting of predictors \(X_1\) and \(X_2\). The orange and blue circles correspond to training observations that belong to two different classes. For each value of \(X_1\) and \(X_2\), there is a different probability of the response being orange or blue.
The purple dashed line represents the points where the probability is exactly 50%. This is called the Bayes decision boundary. An observation that falls on the orange side of the boundary will be assigned to the orange class, and similarly an observation on the blue side of the boundary will be assigned to the blue class.
The Bayes classifier produces the lowest possible test error rate, called the Bayes error rate.
The overall Bayes error rate is given by
\[1- E\big(\mathop{\mathrm{max}}_{j}\mathrm{Pr}(Y=j|X)\big),\]
where the expectation averages the probability over all possible values of \(X\). The Bayes error rate is analogous to the irreducible error, discussed earlier.
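When the conditional probabilities are known, as in a simulation, both the Bayes classifier and the Bayes error rate can be computed directly. The sketch below assumes a one-dimensional predictor and a particular form for \(\mathrm{Pr}(Y=\mathrm{orange}|X=x)\), chosen purely for illustration.

```python
import numpy as np

rng = np.random.default_rng(2)

def p_orange(x):
    # assumed conditional probability Pr(Y = orange | X = x) for this toy example
    return 1.0 / (1.0 + np.exp(-4.0 * (x - 0.5)))

x = rng.uniform(0.0, 1.0, 100_000)                                      # test predictor values
y = np.where(rng.uniform(size=x.size) < p_orange(x), "orange", "blue")  # simulated true labels

# Bayes classifier: assign each observation to its most likely class given x
bayes_pred = np.where(p_orange(x) > 0.5, "orange", "blue")

empirical_error = np.mean(bayes_pred != y)                               # its realized test error
bayes_error = 1.0 - np.mean(np.maximum(p_orange(x), 1.0 - p_orange(x)))  # 1 - E[max_j Pr(Y=j|X)]
print(empirical_error, bayes_error)   # the two agree up to simulation noise
```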
K-Nearest Neighbors
KNN classifies an unlabeled observation into one of several labeled groups, using the labels of the training observations closest to it.
For real data, we do not know the conditional distribution of \(Y\) given \(X\), and so computing the Bayes classifier is impossible.
Many approaches attempt to estimate the conditional distribution of \(Y\) given \(X\), and then classify a given observation to the class with highest estimated probability. One such method is the K-nearest neighbors (KNN) classifier.
Given a positive integer \(K\) and a test observation \(x_0\), the KNN classifier first identifies the \(K\) points in the training data that are closest to \(x_0\), represented by \(\mathcal{N}_0\). It then estimates the conditional probability for class \(j\) as the fraction of points in \(\mathcal{N}_0\) whose response equals \(j\):
\[\mathrm{Pr}(Y=j|X=x_0) = \frac{1}{K}\sum_{i \in \mathcal{N}_0} I (y_i = j)\]
Finally, KNN classifies the test observation \(x_0\) to the class with the largest estimated probability.
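A minimal from-scratch sketch of this procedure is shown below; the Euclidean distance metric, the toy training data, and the class labels are all illustrative choices.

```python
import numpy as np

def knn_predict(X_train, y_train, x0, K=3):
    """Classify a single test point x0 with a plain K-nearest-neighbors vote."""
    dists = np.sqrt(((X_train - x0) ** 2).sum(axis=1))      # distance to every training point
    neighbors = np.argsort(dists)[:K]                       # indices of the K closest points (N_0)
    labels, counts = np.unique(y_train[neighbors], return_counts=True)
    # estimated Pr(Y = j | X = x0) is counts / K; return the class with the largest estimate
    return labels[np.argmax(counts)]

# toy training data (illustrative values and labels)
X_train = np.array([[0.0, 0.0], [0.2, 0.1], [1.0, 1.0], [0.9, 1.2], [1.1, 0.8]])
y_train = np.array(["blue", "blue", "orange", "orange", "orange"])
print(knn_predict(X_train, y_train, np.array([0.95, 0.9]), K=3))   # -> "orange"
```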
Applying the KNN approach with \(K = 3\) at all of the possible values for \(X_1\) and \(X_2\) traces out the corresponding KNN decision boundary.
Despite the fact that it is a very simple approach, KNN can produce classifiers that are surprisingly close to the optimal Bayes classifier.
KNN with Different K
The choice of K has a drastic effect on the KNN classifier obtained.
Nearest neighbor methods can be lousy when \(p\) is large.
Reason: the curse of dimensionality. Nearest neighbors tend to be far away in high dimensions.
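A quick simulation sketch makes this concrete. With training points drawn uniformly from the unit hypercube (an assumption made purely for illustration), the distance from a central query point to its nearest neighbor grows as \(p\) increases.

```python
import numpy as np

rng = np.random.default_rng(3)
n = 1000                                   # training points, uniform in the unit hypercube

for p in [1, 2, 5, 10, 50]:
    X = rng.uniform(size=(n, p))
    x0 = np.full(p, 0.5)                   # query point at the center of the cube
    dists = np.sqrt(((X - x0) ** 2).sum(axis=1))
    print(f"p = {p:2d}: distance to the nearest neighbor = {dists.min():.3f}")
```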
KNN Tuning
As we use more flexible classification methods, the training error rate will decline but the test error rate may not.
As \(1/K\) increases, the method becomes more flexible. As in the regression setting, the training error rate consistently declines as the flexibility increases.
However, the test error exhibits a characteristic U-shape, declining at first (with a minimum at approximately \(K\) = 10) before increasing again when the method becomes excessively flexible and overfits.
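A small simulation sketch of this tuning behavior follows. The data-generating model and the simple one-dimensional KNN implementation are assumptions chosen for illustration; sweeping over \(K\) typically shows the training error rising with \(K\) (i.e., falling with flexibility \(1/K\)) while the test error is lowest at an intermediate value of \(K\).

```python
import numpy as np

rng = np.random.default_rng(4)

def simulate(n):
    # assumed data-generating model, chosen so the Bayes decision boundary is non-trivial
    x = rng.uniform(0.0, 1.0, (n, 1))
    p = 1.0 / (1.0 + np.exp(-6.0 * np.sin(2.0 * np.pi * x[:, 0])))   # Pr(Y = 1 | X = x)
    y = (rng.uniform(size=n) < p).astype(int)
    return x, y

def knn_error_rate(X_tr, y_tr, X, y, K):
    errors = 0
    for xi, yi in zip(X, y):
        nearest = np.argsort(np.abs(X_tr[:, 0] - xi[0]))[:K]   # the K nearest training points
        pred = int(y_tr[nearest].mean() > 0.5)                 # majority vote
        errors += int(pred != yi)
    return errors / len(y)

X_tr, y_tr = simulate(200)
X_te, y_te = simulate(1000)
for K in [1, 5, 10, 25, 100, 200]:
    tr = knn_error_rate(X_tr, y_tr, X_tr, y_tr, K)
    te = knn_error_rate(X_tr, y_tr, X_te, y_te, K)
    print(f"K = {K:3d}: training error {tr:.3f}, test error {te:.3f}")
```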