Notes

What is statistical learning?

Statistical learning provides the theoretical foundation for the machine learning framework. It connects the fields of statistics, linear algebra, and functional analysis.

In particular, statistical learning theory deals with the problem of finding a predictive function based on data, which is what is best known as supervised learning. In this book we will see more than just theory: we will also deal with unsupervised learning and work through practical applications.

  • Supervised: “Building a model to predict an output from inputs.” (A quick code sketch follows this list.)
    • Predict wage from age, education, and year.
    • Predict market direction from previous days' performance.
  • Unsupervised: Inputs but no specific outputs, find relationships and structure.
    • Identify clusters within cancer cell lines.
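
A minimal sketch of that distinction in code, using scikit-learn on made-up data (scikit-learn and the toy data are assumptions here; the book’s own labs use the ISLP package introduced below):

# Supervised learning fits inputs X to a known output y;
# unsupervised learning looks for structure in X alone.
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.cluster import KMeans

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 2))            # 100 observations, 2 features
y = 3 * X[:, 0] - 2 * X[:, 1] + rng.normal(scale=0.5, size=100)

# Supervised: learn a rule mapping X to y, then predict for new inputs
reg = LinearRegression().fit(X, y)
print(reg.predict(X[:5]))

# Unsupervised: no y at all; just group the observations
clusters = KMeans(n_clusters=2, n_init=10).fit_predict(X)
print(clusters[:5])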

Why ISLP?

  • “Facilitate the transition of statistical learning from an academic to a mainstream field.”
  • Machine learning is useful to everyone; let’s all learn enough to use it responsibly.
  • Python “labs” make this material accessible for this community!

Premises of ISLP

From Page 9 of the Introduction:

  • “Many statistical learning methods are relevant and useful in a wide range of academic and non-academic disciplines, beyond just the statistical sciences.”
  • “Statistical learning should not be viewed as a series of black boxes.”
  • “While it is important to know what job is performed by each cog, it is not necessary to have the skills to construct the machine inside the box!”
  • “We presume that the reader is interested in applying statistical learning methods to real-world problems.”

Notation

  • n = number of observations (rows)
  • p = number of features/variables (columns)
  • We’ll come back here if we need to as we go!
  • Some symbols they assume we know (put together in a short example below):
    • \(\in\) = “is an element of”, “in”
    • \(\mathbb{R}\) = “real numbers”
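
Putting those symbols together, the book’s general setup reads like this (just the standard pattern from the notation section, restated here):

\[
\mathbf{X} \in \mathbb{R}^{n \times p}, \qquad x_{ij} = \text{value of the } j\text{th variable for the } i\text{th observation}, \qquad y_i \in \mathbb{R},\ i = 1, \dots, n.
\]

So for the wage example, each row of \(\mathbf{X}\) is one worker (\(n\) rows), each column is one variable such as age, education, or year (\(p\) columns), and \(y_i\) is that worker’s wage.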

What have we gotten ourselves into?

An Introduction to Statistical Learning (ISL, by James, Witten, Hastie, and Tibshirani) is a collection of modern statistical methods for modeling and making predictions from real-world data.

It is a middle way between theoretical statistics and the practice of applying statistics to real-world problems.

It can be considered a user manual, with self-contained Python labs that lead you through the use of different methods for applying statistical analysis to different kinds of data.

  • 2: Terminology & main concepts
  • 3-4: Classic linear methods
  • 5: Resampling (so we can choose the best method)
  • 6: Modern updates to linear methods
  • 7+: Beyond Linearity (we can worry about details as we get there)

Where’s the data?

pip install ISLP

We’ll look at this data in more detail below.
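
Once it is installed, each dataset loads as a pandas data frame (a quick sketch; load_data is the accessor the ISLP package provides, and 'Wage' is one of the bundled datasets):

from ISLP import load_data

# Load the Wage data referenced in the supervised-learning example above
Wage = load_data('Wage')
print(Wage.shape)      # (number of observations n, number of variables p)
print(Wage.columns)
print(Wage.head())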

Some useful resources:

Some more theoretical resources:

  • The Elements of Statistical Learning (ESL, 2nd edition), by Hastie, Tibshirani, and Friedman

What is covered in the book?

The book provides a series of tools, classified as supervised or unsupervised techniques, for understanding data.

The second edition of the book (2021) adds coverage of more recent methods, including deep learning, survival analysis, and multiple testing.

Editions

How is the book divided?

The book is divided into 13 chapters covering:

  • Introduction and Statistical Learning:
    • Supervised Versus Unsupervised Learning
    • Regression Versus Classification Problems

Linear statistical learning

  • Linear Regression:
    • basic concepts
    • introduction of the K-nearest neighbors classifier
  • Classification:
    • logistic regression
    • linear discriminant analysis
  • Resampling Methods (a cross-validation sketch follows this list):
    • cross-validation
    • the bootstrap
  • Linear Model Selection and Regularization: potential improvements over standard linear regression
    • stepwise selection
    • ridge regression
    • principal components regression
    • the lasso
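
A hedged sketch of how cross-validation helps choose among these linear methods, using scikit-learn on synthetic data rather than the book’s ISLP lab code:

import numpy as np
from sklearn.linear_model import LinearRegression, Ridge, Lasso
from sklearn.model_selection import cross_val_score

# Synthetic data: many features, only two of which actually matter
rng = np.random.default_rng(1)
X = rng.normal(size=(200, 50))
y = X[:, 0] - 2 * X[:, 1] + rng.normal(scale=1.0, size=200)

# 5-fold cross-validation estimates out-of-sample R^2 for each candidate model
for name, model in [("OLS", LinearRegression()),
                    ("Ridge", Ridge(alpha=10.0)),
                    ("Lasso", Lasso(alpha=0.1))]:
    scores = cross_val_score(model, X, y, cv=5)
    print(name, round(scores.mean(), 3))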

Non-linear statistical learning

  • Moving Beyond Linearity:

    • Polynomial Regression (sketched after this list)
    • Regression Splines
    • Smoothing Splines
    • Local Regression
    • Generalized Additive Models
  • Tree-Based Methods:

    • Decision Trees
    • Bagging, Random Forests, Boosting, and Bayesian Additive Regression Trees
  • Support Vector Machines (linear and non-linear classification)

  • Deep Learning (non-linear regression and classification)

  • Survival Analysis and Censored Data

  • Unsupervised Learning:

    • Principal components analysis
    • K-means clustering
    • Hierarchical clustering
  • Multiple Testing
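
As a taste of the “Moving Beyond Linearity” material, here is a minimal polynomial-regression sketch (again using scikit-learn as an assumption; the book’s labs work through the ISLP helpers instead):

import numpy as np
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import PolynomialFeatures
from sklearn.linear_model import LinearRegression

# A clearly non-linear relationship that a straight line would miss
rng = np.random.default_rng(2)
x = np.sort(rng.uniform(-3, 3, size=150)).reshape(-1, 1)
y = np.sin(x).ravel() + rng.normal(scale=0.2, size=150)

# Degree-4 polynomial regression: expand x into x, x^2, x^3, x^4, then fit a linear model
poly_model = make_pipeline(PolynomialFeatures(degree=4), LinearRegression())
poly_model.fit(x, y)
print(poly_model.predict(np.array([[0.0], [1.5]])))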

Each chapter includes a self-contained Python lab on the topic.

Some examples of the problems addressed with statistical analysis

  • Identify the risk factors for certain types of cancer
  • Predict whether someone will have a heart attack on the basis of demographic, diet, and clinical measurements
  • Email spam detection
  • Classify a tissue sample into one of several cancer classes, based on a gene expression profile
  • Establish the relationship between salary and demographic variables in population survey data

(source)

Datasets provided in the ISLP package

The book provides the ISLP Python package with all the datasets needed for the analysis.

Datasets in ISLP package
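
For example, each dataset comes back with the \(n \times p\) layout from the notation section (a small sketch; the names below are ones the book uses and are assumed to be bundled in ISLP):

from ISLP import load_data

# A few of the book's datasets and their (observations, variables) shapes
for name in ['Wage', 'Smarket', 'Hitters']:
    df = load_data(name)
    print(name, df.shape)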