Notes

What is Statistical Learning?

This chapter deals with developing an accurate model that can be used to predict some value.

Notation:

  • Input variables: $X_1, \dots, X_p$
    Also known as predictors, features, independent variables.
  • Output variable: Y
    Also known as response or dependent variable.

We assume there is some relationship between $Y$ and $X = (X_1, \dots, X_p)$, which we write as:

$$Y = f(X) + \epsilon,$$

where $\epsilon$ is a random error term that is independent of $X$ and has mean zero, and $f$ represents the systematic information that $X$ provides about $Y$.

Income data set

In essence, statistical learning deals with different approaches to estimate f .
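To make the model $Y = f(X) + \epsilon$ concrete, here is a minimal simulation sketch. The linear `f`, the noise level, and the range of `x` are assumptions chosen purely for illustration; in a real problem $f$ is unknown and only the pairs `(x, y)` are observed.

```python
import random
import statistics


def f(x):
    # Hypothetical "true" systematic relationship; unknown in practice.
    return 2.0 + 3.0 * x


def simulate(n, noise_sd=1.0, seed=0):
    """Draw n observations (x, y) from Y = f(X) + eps."""
    rng = random.Random(seed)
    xs = [rng.uniform(0.0, 10.0) for _ in range(n)]
    # eps has mean zero and is drawn independently of x.
    eps = [rng.gauss(0.0, noise_sd) for _ in range(n)]
    ys = [f(x) + e for x, e in zip(xs, eps)]
    return xs, ys, eps


xs, ys, eps = simulate(n=5000)
print("sample mean of eps:", round(statistics.fmean(eps), 3))
```

The sample mean of `eps` should be close to zero, matching the mean-zero assumption on the error term.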

Why estimate f?

Two main reasons to estimate f:

Prediction

  • Predict Y using a set of inputs X .

  • Representation: $\hat{Y} = \hat{f}(X)$, where $\hat{f}$ represents our estimate for $f$, and $\hat{Y}$ our prediction for $Y$.

  • In this setting, $\hat{f}$ is often treated as a black box, meaning we don’t mind not knowing the exact form of $\hat{f}$, as long as it generates accurate predictions for $Y$.

  • The accuracy of $\hat{Y}$ depends on:

    • Reducible error
      • Due to $\hat{f}$ not being a perfect estimate for $f$.
      • Can be reduced by using a proper statistical learning technique.
    • Irreducible error
      • Due to ϵ and its variability.
      • ϵ is independent from X, so no matter how well we estimate f, we can’t reduce this error.
  • The quantity $\epsilon$ may contain unmeasured variables that would be useful for predicting $Y$, or it may contain unmeasurable variation, so no prediction model will be perfect.

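  • Mathematical form, after choosing predictors $X$ and an estimate $\hat{f}$:

$$E(Y - \hat{Y})^2 = E[f(X) + \epsilon - \hat{f}(X)]^2 = \underbrace{[f(X) - \hat{f}(X)]^2}_{\text{reducible}} + \underbrace{\mathrm{Var}(\epsilon)}_{\text{irreducible}}.$$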

In practice, we almost never know how the variability of $\epsilon$ affects our model, so in this book we will focus on techniques for estimating $f$.
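The split of the expected squared error into a reducible and an irreducible part follows from expanding the square. The sketch below treats $X$ and $\hat{f}$ as fixed, so the only random quantity is $\epsilon$; the intermediate line is added here for clarity and is not part of the original notes.

```latex
\begin{aligned}
E(Y - \hat{Y})^2
  &= E\big[f(X) + \epsilon - \hat{f}(X)\big]^2 \\
  &= [f(X) - \hat{f}(X)]^2
     + 2\,[f(X) - \hat{f}(X)]\,E(\epsilon)
     + E(\epsilon^2) \\
  &= [f(X) - \hat{f}(X)]^2 + \mathrm{Var}(\epsilon)
\end{aligned}
```

The last step uses $E(\epsilon) = 0$, which removes the cross term and gives $E(\epsilon^2) = \mathrm{Var}(\epsilon)$.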

Inference

In this case, we are interested in understanding the association between $Y$ and $X_1, \dots, X_p$.

  • For example:
    • Which predictors are most associated with the response?
    • What is the relationship between the response and each predictor?
    • Can this relationship be summarized by a linear equation, or is it more complex?

Here the exact form of $\hat{f}$ is required, so it cannot be treated as a black box.

Linear models are easier to interpret but can lack prediction accuracy, while non-linear models can be more accurate but less interpretable.

How do we estimate f ?

  • First, let’s agree on some conventions:

    • $n$: Number of observations.
    • $x_{ij}$: Value of the $j$th predictor for the $i$th observation.
    • $y_i$: Response variable for the $i$th observation.
    • Training data:
      • Set of observations.
      • Used to estimate $f$.
      • $\{(x_1, y_1), \dots, (x_n, y_n)\}$, where $x_i = (x_{i1}, \dots, x_{ip})^T$.
  • Goal: Find a function $\hat{f}$ such that $Y \approx \hat{f}(X)$ for any observation $(X, Y)$.

  • Most statistical methods for achieving this goal can be characterized as either parametric or non-parametric.

Parametric methods

  • Steps:

    1. Make an assumption about the form of $f$.
      It could be linear ($f(X) = \beta_0 + \beta_1 X_1 + \dots + \beta_p X_p$, with parameters $\beta_0, \dots, \beta_p$ to be estimated) or not.
    2. Once the model has been selected, we need a procedure to fit it using the training data.
      The most common such fitting procedure is (ordinary) least squares.
  • Via these steps, the problem of estimating f has been reduced to a problem of estimating a set of parameters.

  • We can make the model more flexible by including more parameters, but this can lead to overfitting the data: the model follows the errors/noise too closely and does not yield accurate estimates of the response for observations outside the original training data.
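The two parametric steps can be sketched for the simplest case of a single predictor, where the assumed form is $f(X) = \beta_0 + \beta_1 X$ and least squares has a closed-form solution. The function name `fit_least_squares` and the synthetic data (generated from an assumed true model $1 + 2X$) are illustrative choices, not something taken from the notes.

```python
import random


def fit_least_squares(xs, ys):
    """Return (b0, b1) minimising sum((y - b0 - b1 * x)^2) for one predictor."""
    n = len(xs)
    x_bar = sum(xs) / n
    y_bar = sum(ys) / n
    sxy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    sxx = sum((x - x_bar) ** 2 for x in xs)
    b1 = sxy / sxx            # slope estimate
    b0 = y_bar - b1 * x_bar   # intercept estimate
    return b0, b1


# Synthetic training data from an assumed true model f(X) = 1 + 2X.
rng = random.Random(1)
xs = [rng.uniform(0, 10) for _ in range(200)]
ys = [1.0 + 2.0 * x + rng.gauss(0, 1) for x in xs]

b0, b1 = fit_least_squares(xs, ys)
print(f"b0 = {b0:.2f}, b1 = {b1:.2f}")
```

Estimating $f$ has been reduced to estimating two numbers, which should land near the assumed values 1 and 2 used to generate the data.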

Non-parametric methods

  • No assumptions about the form of f are made.
  • Instead, we seek an estimate of $f$ that gets as close to the data points as possible.
  • Has the potential to fit a wider range of possible forms for $f$.
  • Typically requires a very large number of observations (compared to the parametric approach) in order to accurately estimate $f$.
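As one possible illustration of a non-parametric estimate, the sketch below averages the responses of the `k` training points nearest to a query point (k-nearest-neighbour averaging). The notes do not name a specific method at this point, so this choice, the sine-shaped `f`, and the noise level are all assumptions made for the example.

```python
import math
import random


def knn_predict(train_x, train_y, x0, k):
    """Average the responses of the k training points nearest to x0."""
    nearest = sorted(range(len(train_x)), key=lambda i: abs(train_x[i] - x0))[:k]
    return sum(train_y[i] for i in nearest) / k


# Synthetic data from an assumed non-linear f(x) = sin(x).
rng = random.Random(2)
train_x = [rng.uniform(0, 2 * math.pi) for _ in range(500)]
train_y = [math.sin(x) + rng.gauss(0, 0.2) for x in train_x]

x0 = math.pi / 2
print(round(knn_predict(train_x, train_y, x0, k=25), 2))  # true f(x0) is 1
```

No functional form is assumed anywhere in `knn_predict`; the price is that the prediction is only as good as the number of training points that happen to lie near `x0`.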

The trade-off between prediction accuracy and model interpretability

We’ve seen that parametric models are usually restrictive, while non-parametric models are flexible. This has consequences for interpretability:

  • Restrictive models are usually more interpretable, so they are useful for inference.
  • Flexible models can be difficult to interpret, due to the complexity of $\hat{f}$.

Even when prediction is the only goal, a less flexible method will often give more accurate predictions, because highly flexible models can overfit the data.

Supervised vs Unsupervised Learning

In supervised learning, we wish to fit a model that relates inputs/predictors to some output.

In unsupervised learning, we lack a response variable to predict. Instead, we seek to understand the relationships between the variables or between the observations.

There are instances where a mix of such methods is required (semi-supervised learning problems), but that topic will not be covered in this book.

Regression vs Classification problems

  • If the response is
    • Quantitative, then, it’s a regression problem.
    • Categorical, then, it’s a classification problem.
  • Most of the methods covered in this book can be applied regardless of the predictor variable type, but categorical predictors will require some pre-processing.
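The notes do not say which pre-processing is meant. One common option, assumed here for illustration, is to code a categorical predictor as 0/1 indicator columns. The helper name `one_hot` is hypothetical.

```python
def one_hot(values):
    """Map each category to a 0/1 indicator column (one column per category)."""
    categories = sorted(set(values))
    rows = [[1 if v == c else 0 for c in categories] for v in values]
    return categories, rows


categories, rows = one_hot(["red", "blue", "red", "green"])
print(categories)  # ['blue', 'green', 'red']
print(rows)        # [[0, 0, 1], [1, 0, 0], [0, 0, 1], [0, 1, 0]]
```

In a linear model with an intercept, one of the indicator columns is usually dropped so that the remaining columns are not redundant.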

Assessing model accuracy

  • There is no single best method for statistical learning; a method’s efficacy depends on the data set.

  • For a specific data set, how do we select the best statistical learning approach?

Measuring the quality of fit

  • The performance of a statistical learning method can be evaluated by comparing the model’s predictions with the true responses.

  • Most commonly used measure for this:

    • Mean squared error
    • $\mathrm{MSE} = \frac{1}{n} \sum_{i=1}^{n} (y_i - \hat{f}(x_i))^2$
    • A small MSE means that the predicted and the true responses are very close.
  • We want the model to accurately predict unseen data (testing data), not so much the training data, where the response is already known.

  • The best model will be the one which produces the lowest test MSE, not the lowest training MSE.

  • The model with the lowest training MSE is not guaranteed to have the lowest test MSE.

Training MSE vs Test MSE
  • Fundamental property: For any data set and any statistical learning method used, as the flexibility of the statistical learning method increases:
    • The training MSE decreases monotonically.
    • The test MSE graph has a U-shape.

As model flexibility increases, training MSE will decrease, but the test MSE may not.

  • Small training MSE but big test MSE implies having overfitted the data.

  • Whether or not we have overfitted, we almost always expect training MSE < test MSE, because most statistical learning methods seek to minimize the training MSE.

  • Estimating the test MSE is very difficult, usually because of a lack of data. Later in this book, we’ll discuss approaches to estimate the minimum point of the test MSE curve.
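The gap between training and test MSE can be seen in a small simulation. This sketch uses k-nearest-neighbour averaging as the learning method, with smaller `k` meaning a more flexible fit. The method, the sine-shaped `f`, and the sample sizes are assumptions chosen for illustration, and a large simulated test set stands in for the unseen data that is rarely available in practice.

```python
import math
import random


def knn_predict(train_x, train_y, x0, k):
    """Average the responses of the k training points nearest to x0."""
    nearest = sorted(range(len(train_x)), key=lambda i: abs(train_x[i] - x0))[:k]
    return sum(train_y[i] for i in nearest) / k


def mse(xs, ys, predict):
    """Mean squared error of predict() over the observations (xs, ys)."""
    return sum((y - predict(x)) ** 2 for x, y in zip(xs, ys)) / len(xs)


def make_data(n, rng):
    # Assumed true model: f(x) = sin(x), noise sd 0.3.
    xs = [rng.uniform(0, 2 * math.pi) for _ in range(n)]
    ys = [math.sin(x) + rng.gauss(0, 0.3) for x in xs]
    return xs, ys


rng = random.Random(3)
train_x, train_y = make_data(100, rng)
test_x, test_y = make_data(1000, rng)

results = {}
for k in (100, 50, 10, 5, 1):  # smaller k = more flexible fit
    predict = lambda x, k=k: knn_predict(train_x, train_y, x, k)
    results[k] = (mse(train_x, train_y, predict), mse(test_x, test_y, predict))
    print(f"k={k:3d}  train MSE={results[k][0]:.3f}  test MSE={results[k][1]:.3f}")
```

With `k=100` the fit is a horizontal line at the mean response; with `k=1` every training point is reproduced exactly, so the training MSE is zero while the test MSE is not.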

The Bias-Variance Trade-off

  • Definition: The expected test MSE at $x_0$, $E(y_0 - \hat{f}(x_0))^2$, refers to the average test MSE that we would obtain by repeatedly estimating $f$ using a large number of training sets and testing each estimate at $x_0$.

  • Definition: The variance of a statistical learning method that produces an estimate $\hat{f}$ refers to how much $\hat{f}$ would change if it were estimated using a different training set.

  • Definition: Bias refers to the error introduced by approximating a possibly complicated real-life problem by a much simpler model (that is, how $f$ and the possible estimates $\hat{f}$ differ).

  • As a general rule, the more flexible a statistical method, the higher its variance and lower its bias.

  • For any given value x0, the following can be proved:

$$E(y_0 - \hat{f}(x_0))^2 = \mathrm{Var}(\hat{f}(x_0)) + [\mathrm{Bias}(\hat{f}(x_0))]^2 + \mathrm{Var}(\epsilon)$$

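  • Because variance and squared bias are non-negative, the previous equation implies that, to minimize the expected test error, we require a statistical learning method that achieves both low variance and low bias. It also implies that the expected test MSE can never fall below $\mathrm{Var}(\epsilon)$, the irreducible error.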

  • The tradeoff:

    • Extremely low bias but high variance: For example, draw a line which passes over every single point in the training data.
    • Extremely low variance but high bias: For example, fit a horizontal line to the data.
    • The challenge lies in finding a method for which both the variance and the squared bias are low.

  • In a real-life situation, $f$ is usually unknown, so it’s not possible to explicitly compute the test MSE, bias or variance of a statistical method.

  • The test MSE can be estimated using cross-validation, but we’ll discuss it later in this book.
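Although these quantities cannot be computed for real data, they can be approximated in a simulation where $f$ is chosen by us. The sketch below compares the two extreme cases from the trade-off list (a horizontal line, and a fit that passes through the nearest training point) at a single point `x0`. The sine-shaped `f`, `SIGMA`, the sample size, and the function names are all assumptions made for illustration.

```python
import math
import random
import statistics

SIGMA = 0.3  # sd of eps, so Var(eps) = 0.09


def f(x):
    return math.sin(x)  # assumed true f, known only because we simulate


def draw_training_set(rng, n=30):
    xs = [rng.uniform(0, 2 * math.pi) for _ in range(n)]
    ys = [f(x) + rng.gauss(0, SIGMA) for x in xs]
    return xs, ys


def horizontal_line(xs, ys, x0):
    return sum(ys) / len(ys)  # ignores x0: rigid, high bias


def nearest_neighbour(xs, ys, x0):
    i = min(range(len(xs)), key=lambda j: abs(xs[j] - x0))
    return ys[i]  # follows the data closely: flexible, high variance


def decompose(method, x0, reps=2000, seed=4):
    """Approximate (expected test MSE, variance, squared bias) at x0."""
    rng = random.Random(seed)
    preds, sq_errors = [], []
    for _ in range(reps):
        xs, ys = draw_training_set(rng)   # a fresh training set each time
        pred = method(xs, ys, x0)
        y0 = f(x0) + rng.gauss(0, SIGMA)  # fresh test response at x0
        preds.append(pred)
        sq_errors.append((y0 - pred) ** 2)
    variance = statistics.pvariance(preds)
    bias_sq = (statistics.fmean(preds) - f(x0)) ** 2
    return statistics.fmean(sq_errors), variance, bias_sq


for method in (horizontal_line, nearest_neighbour):
    test_mse, variance, bias_sq = decompose(method, x0=math.pi / 2)
    total = variance + bias_sq + SIGMA ** 2
    print(f"{method.__name__:18s} test MSE={test_mse:.3f}  "
          f"var={variance:.3f}  bias^2={bias_sq:.3f}  sum={total:.3f}")
```

For each method, the simulated test MSE should be close to the sum of variance, squared bias, and $\mathrm{Var}(\epsilon)$, up to simulation noise.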

The Classification setting

Let’s see how the concepts just discussed change when the response is a categorical variable.

The most common approach for quantifying the accuracy of our estimate $\hat{f}$ is the training error rate, the proportion of mistakes made by applying $\hat{f}$ to the training observations:

$$\frac{1}{n} \sum_{i=1}^{n} I(y_i \neq \hat{y}_i),$$

where $I(y_i \neq \hat{y}_i)$ is 1 when $y_i \neq \hat{y}_i$ (a misclassification) and 0 otherwise.

  • The test error rate is defined as $\mathrm{Average}(I(y_i \neq \hat{y}_i))$, where the average is taken over the test observations by comparing each prediction $\hat{y}_i$ with the true response $y_i$.

  • A good classifier is one for which the test error is smallest.
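The error rate is simply the fraction of observations for which the predicted class differs from the true class. A minimal sketch, with a hypothetical helper name `error_rate` and made-up labels:

```python
def error_rate(y_true, y_pred):
    """Proportion of observations whose predicted class differs from the true class."""
    if len(y_true) != len(y_pred):
        raise ValueError("y_true and y_pred must have the same length")
    mistakes = sum(1 for y, y_hat in zip(y_true, y_pred) if y != y_hat)
    return mistakes / len(y_true)


y_true = ["yes", "no", "no", "yes", "no"]
y_pred = ["yes", "no", "yes", "yes", "yes"]
print(error_rate(y_true, y_pred))  # 2 mistakes out of 5 -> 0.4
```

Applied to training observations this gives the training error rate; applied to held-out observations it gives the test error rate.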