Choosing the best model

  • We need to penalize models for using too many predictors

  • Whatever the selection method, the training \(RSS\) decreases / training \(R^2\) increases as we go from \(\mathcal{M}_k\) to \(\mathcal{M}_{k+1}\). Thus, \(\mathcal{M}_p\) always wins that contest (see the sketch after this list).

  • Going with \(\mathcal{M}_p\) provides neither of the benefits we're after: model interpretability and variance reduction (i.e. less overfitting)

  • We’ll need to estimate test error!
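
A quick numerical illustration of the first point above. This is a minimal sketch on synthetic data (the variable names and the use of scikit-learn are my own choices, not from the notes): training \(R^2\) never decreases as we add predictors, even when the extra predictors are pure noise.

```python
import numpy as np
from sklearn.linear_model import LinearRegression

rng = np.random.default_rng(0)
n, p = 100, 10
X = rng.normal(size=(n, p))
# Only the first 3 predictors actually matter; the rest are noise.
y = X[:, :3] @ np.array([2.0, -1.0, 0.5]) + rng.normal(size=n)

for k in range(1, p + 1):
    fit = LinearRegression().fit(X[:, :k], y)
    r2 = fit.score(X[:, :k], y)  # training R^2
    print(f"M_{k}: training R^2 = {r2:.4f}")  # non-decreasing in k
```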

Adjustment Methods

  • \(C_p = \frac{1}{n}(RSS + 2k\hat{\sigma}^2)\)
  • \(\hat{\sigma}^2\) is an "estimate of the variance of the error \(\epsilon\) associated with each response measurement"
    • typically estimated using \(\mathcal{M}_p\)
    • if \(p \approx n\), the estimate is going to be poor or even zero.
  • \(AIC = 2k - 2\ln(\hat{L})\)
  • \(BIC = k \cdot \ln(n) - 2\ln(\hat{L})\)
  • adjusted \(R^2 = 1 - \frac{RSS}{TSS} \cdot \frac{n-1}{n-k-1}\) (a sketch computing all four criteria follows this list)
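
A minimal sketch of the four criteria for a Gaussian linear model, following the formulas as written above (\(k\) = number of predictors in the submodel). The Gaussian log-likelihood expression and the scikit-learn fitting are my assumptions about how one would implement this; the notes don't prescribe a tool.

```python
import numpy as np
from sklearn.linear_model import LinearRegression

def criteria(X_sub, y, sigma2_hat):
    """Compute C_p, AIC, BIC, adjusted R^2 for one candidate submodel."""
    n, k = X_sub.shape
    fit = LinearRegression().fit(X_sub, y)
    rss = np.sum((y - fit.predict(X_sub)) ** 2)
    tss = np.sum((y - y.mean()) ** 2)
    # Maximized Gaussian log-likelihood, with sigma^2 MLE = RSS / n
    log_lik = -0.5 * n * (np.log(2 * np.pi) + np.log(rss / n) + 1)
    return {
        "Cp":     (rss + 2 * k * sigma2_hat) / n,
        "AIC":    2 * k - 2 * log_lik,
        "BIC":    k * np.log(n) - 2 * log_lik,
        "adj_R2": 1 - (rss / tss) * (n - 1) / (n - k - 1),
    }

# sigma2_hat would typically come from the full model M_p, e.g.
# sigma2_hat = rss_full / (n - p - 1)
```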

Avoiding Adjustment Methods

  • \(\hat{\sigma}^2\) can be hard to come by
  • adjustment methods make assumptions about the true model (e.g. Gaussian errors)
  • so cross-validate!
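
A sketch of the cross-validation alternative, again with illustrative synthetic data and nested candidate models \(X[:, :k]\) (in best-subset or stepwise selection the candidates would come from the search itself): pick the model size with the lowest cross-validated MSE, no \(\hat{\sigma}^2\) or distributional assumptions required.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import cross_val_score

rng = np.random.default_rng(1)
n, p = 100, 10
X = rng.normal(size=(n, p))
y = X[:, :3] @ np.array([2.0, -1.0, 0.5]) + rng.normal(size=n)

cv_mse = []
for k in range(1, p + 1):
    scores = cross_val_score(LinearRegression(), X[:, :k], y,
                             scoring="neg_mean_squared_error", cv=10)
    cv_mse.append(-scores.mean())  # estimated test MSE for M_k

best_k = int(np.argmin(cv_mse)) + 1
print(f"Cross-validation picks M_{best_k} (CV MSE = {cv_mse[best_k - 1]:.3f})")
```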