Notes
What is classification?
Predicting a qualitative response for an observation can be referred to as classifying that observation, since it involves assigning the observation to a category, or class.
Some questions that can be solved with classification
- A person arrives at the emergency room with a set of symptoms that could possibly be attributed to one of three medical conditions. Which of the three conditions does the individual have?
- An online banking service must be able to determine whether or not a transaction being performed on the site is fraudulent, on the basis of the user’s IP address, past transaction history, and so forth.
- On the basis of DNA sequence data for a number of patients with and without a given disease, a biologist would like to figure out which DNA mutations are deleterious (disease-causing) and which are not.
Why not linear regression?
- Linear regression should not be used to predict a qualitative (categorical) variable with more than 2 levels that does not have a natural ordering.
- Even when the levels are ordered, linear regression should not be used unless the gap between each pair of adjacent levels can reasonably be treated as the same.
- If your qualitative outcome variable has only 2 levels, you could recode it as a dummy variable with 0/1 coding and fit linear regression. This is still not recommended, because the probability estimates may not be meaningful; for example, some predictions can fall below 0 or above 1.
- A linear regression model of the relationship between `default` and `balance` leads to negative estimated probabilities of default for bank balances close to zero.
Examples of categorical variables that are not appropriate as outcome variables for linear regression?
Logistic Regression
Rather than modeling the response Y directly, logistic regression models the probability that Y belongs to a particular category.
Figure: Classification using the `Default` data. Left: estimated probability of `default` using linear regression; some estimated probabilities are negative. The orange ticks indicate the 0/1 values coded for `default` (No or Yes). Right: predicted probabilities of `default` using logistic regression; all probabilities lie between 0 and 1.
- Logistic regression uses the logistic function, p(X) = e^(β0 + β1X) / (1 + e^(β0 + β1X)), which gives probability estimates between 0 and 1 for all values of X.
- The logistic function always produces an S-shaped curve: probabilities come close to, but never below, zero and close to, but never above, one.
- The probability of the response, p(X), can be predicted for any value of X. In linear regression, β1 is the average change in Y associated with a one-unit increase in X. By contrast, in a logistic regression model, increasing X by one unit changes the log odds by β1.
- Regardless of the value of X, if β1 is positive then increasing X will be associated with increasing p(X), and if β1 is negative then increasing X will be associated with decreasing p(X).
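These two points can be checked numerically in a short sketch; the coefficient values below are illustrative, not fitted to any data set:

```python
import math

def logistic(beta0, beta1, x):
    """p(X) = e^(b0 + b1*X) / (1 + e^(b0 + b1*X)), always between 0 and 1."""
    return 1.0 / (1.0 + math.exp(-(beta0 + beta1 * x)))

def log_odds(p):
    """The logit, log(p / (1 - p)), which is linear in X."""
    return math.log(p / (1.0 - p))

b0, b1 = -3.0, 0.5    # illustrative coefficients, not fitted values
p1 = logistic(b0, b1, 4.0)
p2 = logistic(b0, b1, 5.0)
# A one-unit increase in X changes the log odds by exactly b1:
print(round(log_odds(p2) - log_odds(p1), 10))  # -> 0.5
```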
Maximum Likelihood
Using maximum likelihood, the coefficient estimates β0 and β1 are chosen so that the predicted probability p(xi) for each observation corresponds as closely as possible to the observed response yi. Formally, the estimates maximize the likelihood function: the product of p(xi) over observations with yi = 1, times the product of (1 − p(xi)) over observations with yi = 0.
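A minimal sketch of this idea, using plain gradient ascent on the log-likelihood over a tiny made-up data set. Real software maximizes the likelihood with more sophisticated algorithms (such as iteratively reweighted least squares); the data and learning rate here are purely illustrative:

```python
import math

def logistic(z):
    return 1.0 / (1.0 + math.exp(-z))

def log_likelihood(b0, b1, xs, ys):
    """Sum of log p(x_i) over y_i = 1 plus log(1 - p(x_i)) over y_i = 0."""
    total = 0.0
    for x, y in zip(xs, ys):
        p = logistic(b0 + b1 * x)
        total += math.log(p) if y == 1 else math.log(1.0 - p)
    return total

# Tiny made-up data set: larger x makes y = 1 more likely (the classes
# overlap, so the maximum likelihood estimates exist and are finite).
xs = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0]
ys = [0, 0, 1, 0, 1, 1]

# Gradient ascent on the log-likelihood.
b0, b1, lr = 0.0, 0.0, 0.05
for _ in range(5000):
    g0 = sum(y - logistic(b0 + b1 * x) for x, y in zip(xs, ys))
    g1 = sum((y - logistic(b0 + b1 * x)) * x for x, y in zip(xs, ys))
    b0, b1 = b0 + lr * g0, b1 + lr * g1

print(b1 > 0)  # fitted slope is positive: larger x raises the probability
```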
Table: For the `Default` data, estimated coefficients of the logistic regression model that predicts the probability of `default` using `balance`. A one-unit increase in `balance` is associated with an increase in the log odds of `default` by 0.0055 units.

Making Predictions
Once we have our estimated coefficients, we can predict the probability of `default` for any value of `balance`.
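For example, using coefficient values like those reported in ISLR for the `Default` data (an intercept of about −10.65 together with the 0.0055 `balance` coefficient above):

```python
import math

# Coefficient values like those reported in ISLR for the Default data:
# intercept about -10.65, balance coefficient 0.0055.
b0, b1 = -10.65, 0.0055

def predicted_default_prob(balance):
    """Estimated Pr(default = Yes) for a given credit card balance."""
    return 1.0 / (1.0 + math.exp(-(b0 + b1 * balance)))

print(predicted_default_prob(1000) < 0.01)   # well under 1% at balance 1000
print(predicted_default_prob(2000) > 0.5)    # better than even at balance 2000
```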
Qualitative predictors
Instead of a quantitative predictor like credit balance, we could use a qualitative, or categorical, variable, like whether or not someone is a student, to predict whether or not someone will default.
Multiple Logistic Regression
The equation for simple logistic regression can be generalized to include coefficient estimates for multiple predictors, X1, ..., Xp.
Figure: Confounding in the `Default` data. Left: default rates are shown for students (orange) and non-students (blue). The solid lines display default rate as a function of `balance`, while the horizontal broken lines display the overall default rates. Right: boxplots of `balance` for students (orange) and non-students (blue).
Multinomial Logistic Regression
This is used in the setting where K > 2 classes. In multinomial, we select a single class to serve as the baseline.
However, the interpretation of the coefficients in a multinomial logistic regression model must be done with care, since it is tied to the choice of baseline.
Alternatively, you can use softmax coding, where we treat all K classes symmetrically rather than selecting a baseline. This means we estimate coefficients for all K classes, rather than estimating coefficients for K − 1 classes. The fitted probabilities are the same under either coding; only the interpretation of the coefficients changes.
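A sketch of the softmax transformation's symmetric treatment of the K classes; the scores below are arbitrary linear-predictor values, not fitted coefficients:

```python
import math

def softmax(scores):
    """Pr(Y = k | X) = e^(s_k) / sum over l of e^(s_l), where s_k is the
    linear predictor for class k; coefficients are estimated for all K
    classes, with no baseline class."""
    m = max(scores)                          # shift for numerical stability
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

probs = softmax([2.0, 1.0, 0.1])
print([round(p, 3) for p in probs])  # -> [0.659, 0.242, 0.099]
```

Adding the same constant to every score leaves the probabilities unchanged, which mirrors why the fitted probabilities in the baseline formulation do not depend on which class is chosen as the baseline.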
Generative Models for Classification
Generative models model the distribution of the predictors X separately in each of the response classes (i.e., for each value of Y), and then use Bayes' theorem to obtain estimates of Pr(Y = k | X = x).
Why not logistic regression?
When there is substantial separation between the two classes, the parameter estimates for the logistic regression model are surprisingly unstable.
If the distribution of the predictors X is approximately normal in each of the classes and the sample size is small, then the generative modelling may be more accurate than logistic regression.
Generative modelling can be naturally extended to the case of more than two response classes.
3 classifiers to approximate the Bayes classifier
- Linear Discriminant Analysis (LDA): normal distribution for the predictors, with a covariance matrix that is common to all K classes.
- Quadratic Discriminant Analysis (QDA): normal distribution for the predictors, with a covariance matrix that is NOT common to all K classes.
- Naive Bayes: predictors assumed uncorrelated within each class, no distributional assumption required, useful when n is small.
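As a sketch of the shared idea behind these classifiers, here is the one-predictor, shared-variance case (the 1-D version of the LDA model), with assumed class means, priors, and sigma:

```python
import math

def normal_pdf(x, mu, sigma):
    """Density of a normal distribution with mean mu and std dev sigma."""
    return math.exp(-((x - mu) ** 2) / (2 * sigma ** 2)) / (sigma * math.sqrt(2 * math.pi))

def posterior(x, mus, priors, sigma):
    """Bayes' theorem with one normal density per class and a shared sigma:
    the one-predictor version of the LDA model."""
    numerators = [pi * normal_pdf(x, mu, sigma) for mu, pi in zip(mus, priors)]
    total = sum(numerators)
    return [n / total for n in numerators]

# Two classes with assumed means -1 and +1, equal priors, shared sigma = 1.
post = posterior(0.3, mus=[-1.0, 1.0], priors=[0.5, 0.5], sigma=1.0)
print(post.index(max(post)))  # -> 1: x = 0.3 is closer to the second mean
```

With equal priors and a shared sigma, the decision boundary sits at the midpoint of the two means, which is why LDA's boundary is linear.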
Comparison of classification methods
Empirical comparison
6 scenarios to empirically compare the performance of logistic regression, linear discriminant analysis (LDA), quadratic discriminant analysis (QDA), naive Bayes, and K Nearest Neighbors (KNN). Each of the 6 different scenarios was a binary classification problem with 2 quantitative predictors.
Linear Bayes decision boundaries
Scenario 1: 20 training observations for each class, uncorrelated normal distribution predictors
Scenario 2: Similar to scenario 1, but predictors had a correlation of -0.5.
Scenario 3: Substantial negative correlation, and predictors were generated from the t-distribution, with 50 training observations for each class.
Non-linear Bayes decision boundaries
Scenario 4: normal distribution, with a correlation of 0.5 between the predictors in the first class and a correlation of −0.5 between the predictors in the second class.
Scenario 5: normal distribution with uncorrelated predictors, responses were sampled from the logistic function applied to a complicated non-linear function of the predictors
Scenario 6: normal distribution with a different diagonal covariance matrix for each class and a very small sample size of 6 in each class
Poisson Regression
Why not linear regression?
As with predicting the probability of a qualitative variable, negative predictions for count data are not meaningful.
Figure: Left: the coefficients associated with the month of the year. Bike usage is highest in the spring and fall, and lowest in the winter. Right: the coefficients associated with the hour of the day. Bike usage is highest during peak commute times, and lowest overnight.
Figure: Left: on the Bikeshare dataset, the number of bikers is displayed on the y-axis, and the hour of the day is displayed on the x-axis. For the most part, as the mean number of bikers increases, so does the variance in the number of bikers, violating the assumption of homoscedasticity. A smoothing spline fit is shown in green. Right: the log of the number of bikers is displayed on the y-axis.
Poisson distribution
The Poisson distribution is typically used to model counts. If Y follows a Poisson distribution with mean λ, then its variance also equals λ: the larger the mean of Y, the larger its variance.
Poisson regression
Rather than modeling the number of bikers, Y, as a Poisson distribution with a fixed mean value like λ = 5, we would like to allow the mean to vary as a function of the covariates: log(λ(X1, ..., Xp)) = β0 + β1X1 + ... + βpXp.
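A sketch of the log link with illustrative (not fitted) coefficients: exponentiating the linear predictor guarantees a positive mean, so predicted counts can never be negative.

```python
import math

# Log link: log(lam(X)) = b0 + b1*X, so lam(X) = exp(b0 + b1*X) is positive
# for every X.
def poisson_mean(b0, b1, x):
    return math.exp(b0 + b1 * x)

def poisson_pmf(k, lam):
    """Pr(Y = k) for a Poisson variable with mean (and variance) lam."""
    return lam ** k * math.exp(-lam) / math.factorial(k)

b0, b1 = 1.0, 0.3          # illustrative values, not fitted to Bikeshare
print(poisson_mean(b0, b1, -50) > 0)   # even extreme x gives a positive mean
```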
Generalized Linear Models
Linear, logistic, and Poisson regression are three examples of a class of models known as generalized linear models (GLMs).
Each of them uses predictors X1, ..., Xp to predict a response Y.
In general, we can perform a regression by modeling the response Y as coming from a particular member of the exponential family, and then transforming the mean of the response so that the transformed mean is a linear function of the predictors. Any regression approach that follows this very general recipe is known as a generalized linear model.
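The recipe can be sketched by listing the three link functions used above (identity, logit, and log) and checking that each inverse link recovers the linear predictor:

```python
import math

# The three GLMs in these notes differ in the link function that makes the
# transformed mean of Y linear in the predictors.
links = {
    "linear":   lambda mu: mu,                       # identity link
    "logistic": lambda mu: math.log(mu / (1 - mu)),  # logit link
    "poisson":  lambda mu: math.log(mu),             # log link
}

eta = 0.4  # an arbitrary value of the linear predictor b0 + b1*X1 + ...
# Applying each link to its inverse link recovers eta, confirming the recipe:
print(abs(links["logistic"](1 / (1 + math.exp(-eta))) - eta) < 1e-9)
print(abs(links["poisson"](math.exp(eta)) - eta) < 1e-9)
```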