Choosing the best model
We need to penalize models for having too many predictors
Whatever the method, \(RSS\) decreases / \(R^2\) increases as we go from \(\mathcal{M}_k\) to \(\mathcal{M}_{k+1}\). Thus, \(\mathcal{M}_p\) always wins that contest.
Always choosing \(\mathcal{M}_p\) forfeits both benefits of subset selection: model interpretability and variance reduction (i.e., less overfitting)
We’ll need to estimate test error!
Adjustment Methods
- \(C_p = \frac{1}{n}(RSS + 2k\hat{\sigma}^2)\)
    - \(\hat{\sigma}^2\) is an estimate of the variance of the error \(\epsilon\) associated with each response measurement
    - typically estimated using the full model \(\mathcal{M}_p\)
    - if \(p \approx n\), this estimate will be poor, or even zero
- \(AIC = 2k - 2\ln(\hat{L})\)
- \(BIC = k \cdot \ln(n) - 2\ln(\hat{L})\)
- adjusted \(R^2 = 1 - \frac{RSS}{TSS} \cdot \frac{n-1}{n-k-1}\)
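As a concrete illustration, here is a minimal Python sketch (the `selection_criteria` helper and its signature are hypothetical, not from any library) that computes these four quantities from a candidate model's \(RSS\), assuming Gaussian errors so that \(\ln\hat{L} = -\tfrac{n}{2}\left(\ln(2\pi) + \ln(RSS/n) + 1\right)\) and taking \(k\) to count only the predictors:

```python
import numpy as np

def selection_criteria(rss, tss, n, k, sigma2_full):
    """Adjustment criteria for a candidate model M_k (hypothetical helper).

    rss         : residual sum of squares of the candidate model M_k
    tss         : total sum of squares of the response
    n           : number of observations
    k           : number of predictors in the candidate model
    sigma2_full : estimate of Var(eps), typically taken from the full model M_p
    """
    # Mallows' C_p: RSS penalized by 2*k*sigma^2, scaled by 1/n
    cp = (rss + 2 * k * sigma2_full) / n

    # Maximized Gaussian log-likelihood: ln L_hat = -(n/2) * (ln(2*pi) + ln(RSS/n) + 1)
    log_lik = -0.5 * n * (np.log(2 * np.pi) + np.log(rss / n) + 1)

    aic = 2 * k - 2 * log_lik          # AIC = 2k - 2 ln(L_hat)
    bic = k * np.log(n) - 2 * log_lik  # BIC = k ln(n) - 2 ln(L_hat)

    # Adjusted R^2: RSS/TSS penalized by the degrees-of-freedom ratio
    adj_r2 = 1 - (rss / tss) * (n - 1) / (n - k - 1)

    return {"Cp": cp, "AIC": aic, "BIC": bic, "adj_R2": adj_r2}
```

We pick the candidate model with the smallest \(C_p\), \(AIC\), or \(BIC\), or the largest adjusted \(R^2\).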