6.4 Considerations in High Dimensions

Figure 6.22
- Data sets containing more features p than observations n are often referred to as high-dimensional.
- Modern data can have a huge number of predictors (e.g., 500k SNPs, or every word ever entered in a search).
- When n <= p, least squares linear regression can fit the training data perfectly (it memorizes it), but often performs poorly on test data.
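
A minimal sketch of the last point, assuming simulated Gaussian data in which the response is unrelated to every feature. The sample sizes, the seed, and the helper name `train_test_mse` are illustrative choices, not taken from the book. As p approaches n = 20, the training MSE of least squares should fall to essentially zero while the test MSE does not.

```python
import numpy as np

rng = np.random.default_rng(0)
n_train, n_test = 20, 1000  # illustrative sizes (assumption)

def train_test_mse(p):
    """Fit least squares with an intercept on p pure-noise features."""
    X_train = rng.normal(size=(n_train, p))
    X_test = rng.normal(size=(n_test, p))
    # The response is generated independently of every feature.
    y_train = rng.normal(size=n_train)
    y_test = rng.normal(size=n_test)
    # Prepend a column of ones for the intercept.
    A_train = np.column_stack([np.ones(n_train), X_train])
    A_test = np.column_stack([np.ones(n_test), X_test])
    # lstsq returns the minimum-norm solution when p + 1 > n.
    beta, *_ = np.linalg.lstsq(A_train, y_train, rcond=None)
    train_mse = np.mean((y_train - A_train @ beta) ** 2)
    test_mse = np.mean((y_test - A_test @ beta) ** 2)
    return train_mse, test_mse

results = {p: train_test_mse(p) for p in (1, 5, 10, 15, 19, 25)}
for p, (tr, te) in results.items():
    print(f"p={p:2d}  train MSE={tr:.4f}  test MSE={te:.4f}")
```

With p = 19 plus an intercept there are as many coefficients as observations, so the fit interpolates the training data; the test MSE is computed on fresh draws from the same distribution.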

Figure 6.23 - simulated data set with n = 20 training observations; the features are all unrelated to the outcome.
Lasso (etc) vs Dimensionality
- Reducing flexibility with the methods in this chapter (forward stepwise selection, ridge regression, the lasso, principal components regression) can help.
- It’s important to choose good tuning parameters for whatever method you use.
- Features that aren’t associated with Y increase test error (“curse of dimensionality”).
- Noise features get fit to the noise in the training data, and the noise in the test data is different.
- When p > n, never use training MSE, p-values, R^2, etc., as evidence of a good fit, because they are likely to be wildly optimistic compared with performance on test data.
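
The points above can be sketched with a small simulation. This is an illustration under stated assumptions, not the book's experiment: it swaps in ridge regression for the lasso because ridge has a closed form that needs only NumPy, and the sample sizes, the lambda grid, the seed, and the helper names `ridge_fit` and `simulate` are all hypothetical. Only the first 20 features affect Y; the rest are noise. The tuning parameter is chosen on a separate validation set.

```python
import numpy as np

rng = np.random.default_rng(1)
n_train, n_val, n_test, n_signal = 100, 1000, 1000, 20  # assumptions

def ridge_fit(X, y, lam):
    """Closed-form ridge coefficients (no intercept; data are mean zero)."""
    p = X.shape[1]
    return np.linalg.solve(X.T @ X + lam * np.eye(p), X.T @ y)

def simulate(p):
    """Return (chosen lambda, train MSE, test MSE) with p total features."""
    beta_true = np.zeros(p)
    beta_true[:n_signal] = 1.0  # only the first 20 features matter

    def draw(n):
        X = rng.normal(size=(n, p))
        return X, X @ beta_true + rng.normal(size=n)

    X_tr, y_tr = draw(n_train)
    X_val, y_val = draw(n_val)
    X_te, y_te = draw(n_test)

    # Pick lambda by validation error, never by training error.
    best = None
    for lam in (0.01, 0.1, 1.0, 10.0, 100.0, 1000.0):
        b = ridge_fit(X_tr, y_tr, lam)
        val_mse = np.mean((y_val - X_val @ b) ** 2)
        if best is None or val_mse < best[0]:
            best = (val_mse, lam, b)
    _, lam, b = best
    train_mse = np.mean((y_tr - X_tr @ b) ** 2)
    test_mse = np.mean((y_te - X_te @ b) ** 2)
    return lam, train_mse, test_mse

ridge_results = {p: simulate(p) for p in (20, 50, 500)}
for p, (lam, tr, te) in ridge_results.items():
    print(f"p={p:3d}  lambda={lam:g}  train MSE={tr:.3f}  test MSE={te:.3f}")
```

The number of signal features is held fixed, so any rise in test MSE as p grows comes from the added noise features, and the gap between training and test MSE at large p shows why training error is not evidence of a good fit.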

Figure 6.24