4.17 Addendum - Logistic Regression Assumptions
library(dplyr)
library(titanic)
library(car)
## Loading required package: carData
Source: The 6 Assumptions of Logistic Regression (With Examples)
Source: Assumptions of Logistic Regression, Clearly Explained
Logistic regression is a method for fitting a regression model when the response variable is binary.
4.17.0.1 Assumption #1 - The response variable is binary
Examples:
Yes or No
Male or Female
Pass or Fail
For more than two possible outcomes, a multinomial (or, if the outcomes are ordered, an ordinal) logistic regression is the model of choice.
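As a quick sanity check (a sketch, using the titanic_train data from the titanic package loaded above), the response can be tabulated to confirm it takes only two values:

# Confirm the response is binary: Survived should take only the values 0 and 1
table(titanic_train$Survived)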
4.17.0.2 Assumption #2 - Observations are independent
As with OLS regression, logistic regression requires that the observations are independent (i.i.d., independent and identically distributed).
The easiest way to check this is to plot the residuals against time (i.e. the order of observations) and see whether the pattern looks random.
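A minimal sketch of this check, assuming a fitted binomial glm such as the log_reg model built in the example further below:

# Deviance residuals against observation order; a random scatter
# (no trend or clustering) is consistent with independent observations
plot(residuals(log_reg, type = "deviance"),
     xlab = "Observation order", ylab = "Deviance residual")
abline(h = 0, lty = 2)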
4.17.0.3 Assumption #3 - No multicollinearity among predictors
Multicollinearity occurs when two or more explanatory variables are highly correlated with each other, such that they do not provide unique or independent information to the regression model. If the degree of correlation between variables is high enough, it can cause problems when fitting and interpreting the model.
Use the variance inflation factor (VIF) to check for multicollinearity (values > 10 indicate strong collinearity among predictors).
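A minimal sketch, again assuming the log_reg model fitted in the example below; car::vif() works directly on a glm object:

# Variance inflation factors; values above 10 suggest problematic collinearity
car::vif(log_reg)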
4.17.0.4 Assumption #4 - No extreme outliers
Logistic regression assumes that there are no extreme outliers or influential observations in the dataset.
Use Cook's distance for each observation to flag influential points.
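A minimal sketch, assuming the log_reg model fitted in the example below:

# Cook's distance for each observation; the 4/n cutoff is a common rule of thumb
cooks_d <- cooks.distance(log_reg)
plot(cooks_d, type = "h", ylab = "Cook's distance")
abline(h = 4 / length(cooks_d), lty = 2)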
4.17.0.5 Assumption #5 - There is a linear relationship between explanatory variables and the logit of the response variable
Logistic regression assumes that there exists a linear relationship between each explanatory variable and the logit of the response variable. Recall that the logit is defined as:
Logit(p) = log(p / (1-p)) where p is the probability of a positive outcome.
Use the Box-Tidwell test to check this assumption.
Example:
titanic <- titanic_train %>%
  select(Survived, Age, Fare) %>%
  na.omit() %>%
  janitor::clean_names()
glimpse(titanic)
## Rows: 714
## Columns: 3
## $ survived <int> 0, 1, 1, 1, 0, 0, 0, 1, 1, 1, 1, 0, 0, 0, 1, 0, 0, 0, 1, 1, 1…
## $ age <dbl> 22, 38, 26, 35, 35, 54, 2, 27, 14, 4, 58, 20, 39, 14, 55, 2, …
## $ fare <dbl> 7.2500, 71.2833, 7.9250, 53.1000, 8.0500, 51.8625, 21.0750, 1…
Build the model
# survived (target) ~ age + fare
log_reg <- glm(survived ~ age + fare, data = titanic, family = binomial(link = "logit"))
summary(log_reg)
##
## Call:
## glm(formula = survived ~ age + fare, family = binomial(link = "logit"),
## data = titanic)
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -0.417055 0.185976 -2.243 0.02493 *
## age -0.017578 0.005666 -3.103 0.00192 **
## fare 0.017258 0.002617 6.596 4.23e-11 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 964.52 on 713 degrees of freedom
## Residual deviance: 891.34 on 711 degrees of freedom
## AIC: 897.34
##
## Number of Fisher Scoring iterations: 5
Box-Tidwell test
# Box-Tidwell requires strictly positive predictors, so shift age and fare by 1
titanic <- titanic %>%
  mutate(age_1 = age + 1, fare_1 = fare + 1)
boxTidwell(survived ~ age_1 + fare_1, data = titanic)
## MLE of lambda Score Statistic (t) Pr(>|t|)
## age_1 -0.48631 1.3237 0.186
## fare_1 0.10111 -5.5619 3.782e-08 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## iterations = 10
##
## Score test for null hypothesis that all lambdas = 1:
## F = 18.066, df = 2 and 709, Pr(>F) = 2.225e-08
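The score test for fare_1 is highly significant while age_1 is not, so the linearity-in-the-logit assumption looks reasonable for age but violated for fare; the estimated lambda for fare_1 is close to 0, which points towards a log transformation. A sketch of one possible remedy (not part of the original analysis) is to refit the model with log-transformed fare:

# Hypothetical remedy: refit with log-transformed fare;
# log(fare + 1) keeps the zero fares defined
log_reg2 <- glm(survived ~ age + log(fare + 1),
                data = titanic, family = binomial(link = "logit"))
summary(log_reg2)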
4.17.0.6 Assumption #6 - Sample size must be sufficiently large
Logistic regression assumes that the sample size of the dataset is large enough to draw valid conclusions from the fitted logistic regression model.
As a rule of thumb, you should have a minimum of 10 cases with the least frequent outcome for each explanatory variable. For example, if you have 3 explanatory variables and the expected probability of the least frequent outcome is 0.20, then you should have a sample size of at least (10*3) / 0.20 = 150.
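The rule of thumb can be written as n_min = (10 * k) / p, where k is the number of explanatory variables and p is the expected probability of the least frequent outcome. A minimal sketch of the calculation from the example above:

# Minimum sample size by the 10-events-per-variable rule of thumb
k <- 3      # number of explanatory variables
p <- 0.20   # expected probability of the least frequent outcome
(10 * k) / p
## [1] 150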