Ridge Regression
Ridge regression is very similar to least squares, except that the coefficients are estimated by minimizing a slightly different quantity:
\(\hat{\beta}^{OLS} \equiv \underset{\beta}{\operatorname{argmin}}\,(RSS)\)
\(\hat{\beta}^R \equiv \underset{\beta}{\operatorname{argmin}}\,(RSS+\lambda\sum_{j=1}^p{\beta_j^2})\)
\(\lambda \ge 0\) is a tuning parameter (hyperparameter) controlling the strength of the shrinkage penalty
there’s one model parameter the penalty doesn’t shrink
- the intercept \(\hat{\beta}_0\)
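A minimal sketch of this objective in practice, assuming scikit-learn (whose `alpha` argument plays the role of \(\lambda\), and which fits the intercept separately, unpenalized) and simulated data chosen only for illustration:

```python
import numpy as np
from sklearn.linear_model import Ridge

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 5))
beta_true = np.array([3.0, -2.0, 0.0, 1.5, 0.0])  # hypothetical coefficients
y = X @ beta_true + rng.normal(size=100)

# alpha is scikit-learn's name for lambda; larger values shrink the
# coefficients harder, while the intercept is left unpenalized.
for lam in [0.01, 1.0, 100.0]:
    fit = Ridge(alpha=lam).fit(X, y)
    print(f"lambda={lam:>6}: intercept={fit.intercept_:.3f}, "
          f"coef={np.round(fit.coef_, 3)}")
```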
Ridge Regression, Visually
\[\|\beta\|_2 = \sqrt{\sum_{j=1}^p{\beta_j^2}}\]
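A sketch of why this norm matters visually, again assuming scikit-learn and simulated data: the ratio \(\|\hat{\beta}^R_\lambda\|_2 / \|\hat{\beta}^{OLS}\|_2\), often used as the x-axis of ridge coefficient plots, falls from 1 toward 0 as \(\lambda\) grows.

```python
import numpy as np
from sklearn.linear_model import LinearRegression, Ridge

rng = np.random.default_rng(1)
X = rng.normal(size=(200, 5))
y = X @ np.array([3.0, -2.0, 0.0, 1.5, 0.0]) + rng.normal(size=200)

ols_norm = np.linalg.norm(LinearRegression().fit(X, y).coef_)
# ||beta_hat^R||_2 / ||beta_hat^OLS||_2 shrinks monotonically toward 0
for lam in [0.01, 1.0, 10.0, 100.0, 1000.0]:
    r = np.linalg.norm(Ridge(alpha=lam).fit(X, y).coef_) / ols_norm
    print(f"lambda={lam:>7}: norm ratio = {r:.3f}")
```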
Note the decrease in test MSE relative to least squares, and further that ridge regression is not computationally expensive: “One can show that computations required to solve (6.5), simultaneously for all values of \(\lambda\), are almost identical to those for fitting a model using least squares.”
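One way to make that claim concrete is the SVD: after a single factorization of \(X\) (roughly the cost of one least squares fit), the ridge solution for any \(\lambda\) follows from \(\hat{\beta}^R_\lambda = V\,\mathrm{diag}\!\big(d_j/(d_j^2+\lambda)\big)\,U^\top y\). A minimal numpy sketch, assuming centered predictors so the unpenalized intercept can be handled separately:

```python
import numpy as np

rng = np.random.default_rng(2)
X = rng.normal(size=(100, 5))
y = X @ np.array([3.0, -2.0, 0.0, 1.5, 0.0]) + rng.normal(size=100)

# One SVD of X; each additional lambda then reuses the factorization
# and costs only elementwise work on the p singular values.
U, d, Vt = np.linalg.svd(X, full_matrices=False)
Uty = U.T @ y
for lam in [0.01, 1.0, 100.0]:
    beta = Vt.T @ (d / (d**2 + lam) * Uty)  # = (X'X + lam*I)^(-1) X'y
    print(f"lambda={lam:>6}: coef={np.round(beta, 3)}")
```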