6.4 Considerations in High Dimensions | Introduction to Statistical Learning Using R Book Club

6.4 Considerations in High Dimensions

Figure 6.22

Data sets containing more features $p$ than observations $n$ are often referred to as high-dimensional.
Modern data can have a huge number of predictors (e.g., 500k SNPs, or every word ever entered into a search engine).
When $n \le p$, linear regression memorizes the training data (a perfect fit), but can perform poorly on test data.

Figure 6.23 - simulated data set with $n = 20$ training observations; the features added to the model are all unrelated to the outcome.
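
A minimal sketch of the idea behind this figure, written in plain Python to stay dependency-free. It assumes standard normal features and an independently drawn standard normal outcome, fits least squares without an intercept at $p = n = 20$, and compares train and test MSE. The seed, the 1,000-observation test set, and the helper names `solve`, `mse`, and `simulate` are illustrative assumptions, not taken from the book.

```python
import random


def solve(A, b):
    """Solve the square system A x = b by Gaussian elimination with partial pivoting."""
    n = len(A)
    M = [row[:] + [b_i] for row, b_i in zip(A, b)]  # augmented matrix [A | b]
    for col in range(n):
        # Swap in the row with the largest pivot for numerical stability.
        pivot = max(range(col, n), key=lambda r: abs(M[r][col]))
        M[col], M[pivot] = M[pivot], M[col]
        for r in range(col + 1, n):
            factor = M[r][col] / M[col][col]
            for c in range(col, n + 1):
                M[r][c] -= factor * M[col][c]
    # Back substitution on the upper-triangular system.
    x = [0.0] * n
    for i in reversed(range(n)):
        s = sum(M[i][j] * x[j] for j in range(i + 1, n))
        x[i] = (M[i][n] - s) / M[i][i]
    return x


def mse(X, y, beta):
    """Mean squared error of the linear predictions X @ beta against y."""
    preds = [sum(b * v for b, v in zip(beta, row)) for row in X]
    return sum((p - t) ** 2 for p, t in zip(preds, y)) / len(y)


def simulate(n, p, rng):
    """Draw n observations of p noise features and an outcome independent of them."""
    X = [[rng.gauss(0, 1) for _ in range(p)] for _ in range(n)]
    y = [rng.gauss(0, 1) for _ in range(n)]
    return X, y


rng = random.Random(1)  # arbitrary seed, for reproducibility only
n = p = 20              # as many features as training observations
X_train, y_train = simulate(n, p, rng)
X_test, y_test = simulate(1000, p, rng)

# With p = n the least squares problem is a square system: the fit interpolates y.
beta = solve(X_train, y_train)
train_mse = mse(X_train, y_train, beta)
test_mse = mse(X_test, y_test, beta)

print(f"train MSE: {train_mse:.2e}")
print(f"test MSE:  {test_mse:.2f}")
```

The train MSE should be zero up to floating-point error even though no feature carries any signal, while the test MSE cannot fall below the noise variance of the outcome.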

Lasso (etc) vs Dimensionality

Reducing flexibility (the methods in this chapter: forward stepwise selection, ridge regression, the lasso, principal components regression) can help.
It’s important to choose good tuning parameters for whatever method you use.
Features that aren’t associated with $Y$ increase test error (“curse of dimensionality”).
- The model fits noise in the training data, and the noise in the test data is different.
When $p > n$, never use train MSE, p-values, $R^2$, etc., as evidence of goodness of fit: they can look excellent on the training data even when the model does badly on new data. Report test-set or cross-validation error instead.
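
One way to see why the training metrics carry no information here, sketched under the assumption that the $n \times (p+1)$ design matrix $X$ (intercept column included) has rank $n$, which is the generic case when $p + 1 \ge n$, and that the outcome is not constant, so $\mathrm{TSS} > 0$. Some coefficient vector $\hat\beta$ then satisfies $X\hat\beta = y$ exactly, whatever $y$ is, so:

$$
\mathrm{RSS} = \lVert y - X\hat\beta \rVert^2 = 0, \qquad \text{train MSE} = \frac{\mathrm{RSS}}{n} = 0, \qquad R^2 = 1 - \frac{\mathrm{RSS}}{\mathrm{TSS}} = 1
$$

The residual degrees of freedom $n - p - 1$ are also zero or negative, so the usual estimate $\hat\sigma^2 = \mathrm{RSS}/(n - p - 1)$, and with it the standard errors and p-values, cannot be computed.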

Figure 6.24