4.17 Addendum - Logistic Regression Assumptions

library(dplyr)
library(titanic)
library(car)
## Loading required package: carData

Source: The 6 Assumptions of Logistic Regression (With Examples)

Source: Assumptions of Logistic Regression, Clearly Explained

Logistic regression is a method for fitting a regression model when the response variable is binary.

4.17.0.1 Assumption #1 - The response variable is binary

Examples:

  • Yes or No

  • Male or Female

  • Pass or Fail

For more than two possible outcomes, a multinomial (or, when the outcomes are ordered, an ordinal) logistic regression is the model of choice.
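
As a quick sketch, the response can be tabulated to confirm it takes exactly two values (using the titanic_train data from the titanic package loaded above):

# The response should have exactly two levels (here 0 = died, 1 = survived)
table(titanic_train$Survived)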

4.17.0.2 Assumption #2 - Observations are independent

As with OLS regression, logistic regression requires that the observations are iid (independent and identically distributed).

The easiest check is to plot the residuals against time (i.e. the order of observations) and see whether the pattern looks random, as sketched below.
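
A minimal sketch, assuming a fitted glm such as the log_reg model built later in this section:

# Deviance residuals in observation order; a random scatter around zero
# supports the independence assumption
plot(residuals(log_reg, type = "deviance"),
     xlab = "Observation order", ylab = "Deviance residual")
abline(h = 0, lty = 2)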

4.17.0.3 Assumption #3 - No multicollinearity among predictors

Multicollinearity occurs when two or more explanatory variables are highly correlated to each other, such that they do not provide unique or independent information in the regression model. If the degree of correlation is high enough between variables, it can cause problems when fitting and interpreting the model.

Use the variance inflation factor (VIF) to check for multicollinearity (values > 10 indicate strong collinearity among predictors).
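
A sketch using the vif() function from the car package loaded above, applied to a fitted model such as the log_reg model built later in this section:

# Variance inflation factor per predictor; values above 10 signal strong
# multicollinearity (some sources already flag values above 5)
vif(log_reg)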

4.17.0.4 Assumption #4 - No extreme outliers

Logistic regression assumes that there are no extreme outliers or influential observations in the dataset.

Compute Cook's distance for each observation to flag overly influential points, as sketched below.
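
A sketch, again assuming a fitted model such as the log_reg model below; the 4/n cutoff is a common rule of thumb, not a fixed standard:

# Cook's distance for each observation of the fitted model
cd <- cooks.distance(log_reg)
plot(cd, type = "h", ylab = "Cook's distance")
abline(h = 4 / length(cd), lty = 2)  # rough 4/n cutoff for influential points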

4.17.0.5 Assumption #5 - There is a linear relationship between explanatory variables and the logit of the response variable

Logistic regression assumes that there exists a linear relationship between each explanatory variable and the logit of the response variable. Recall that the logit is defined as:

Logit(p) = log(p / (1 - p)), where p is the probability of a positive outcome.

Use the Box-Tidwell test to check this assumption.

Example:

# Keep the response and two numeric predictors, drop rows with missing
# values, and standardize the column names
titanic <- titanic_train %>% 
     select(Survived, Age, Fare) %>% 
     na.omit() %>% 
     janitor::clean_names()

glimpse(titanic)
## Rows: 714
## Columns: 3
## $ survived <int> 0, 1, 1, 1, 0, 0, 0, 1, 1, 1, 1, 0, 0, 0, 1, 0, 0, 0, 1, 1, 1…
## $ age      <dbl> 22, 38, 26, 35, 35, 54, 2, 27, 14, 4, 58, 20, 39, 14, 55, 2, …
## $ fare     <dbl> 7.2500, 71.2833, 7.9250, 53.1000, 8.0500, 51.8625, 21.0750, 1…

Build the model

# Fit a logistic regression of survived (target) on age and fare
log_reg <- glm(survived ~ age + fare, data = titanic, family = binomial(link = "logit"))

summary(log_reg)
## 
## Call:
## glm(formula = survived ~ age + fare, family = binomial(link = "logit"), 
##     data = titanic)
## 
## Coefficients:
##              Estimate Std. Error z value Pr(>|z|)    
## (Intercept) -0.417055   0.185976  -2.243  0.02493 *  
## age         -0.017578   0.005666  -3.103  0.00192 ** 
## fare         0.017258   0.002617   6.596 4.23e-11 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 964.52  on 713  degrees of freedom
## Residual deviance: 891.34  on 711  degrees of freedom
## AIC: 897.34
## 
## Number of Fisher Scoring iterations: 5

Box-Tidwell test

# Shift both variables by 1 so every value is strictly positive
# (the Box-Tidwell procedure requires positive predictors; fare contains zeros)
titanic <- titanic %>% 
     mutate(age_1 = age + 1, fare_1 = fare + 1)

boxTidwell(survived ~ age_1 + fare_1, data = titanic)
##        MLE of lambda Score Statistic (t)  Pr(>|t|)    
## age_1       -0.48631              1.3237     0.186    
## fare_1       0.10111             -5.5619 3.782e-08 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## iterations =  10 
## 
## Score test for null hypothesis that all lambdas = 1:
## F = 18.066, df = 2 and 709, Pr(>F) = 2.225e-08
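
Here the score statistic is significant for fare_1 but not for age_1, so the linearity-in-the-logit assumption appears violated for fare only. One common remedy, sketched below under the assumption that a log transform is adequate, is to refit the model with a transformed fare:

# Refit with log-transformed fare; the +1 guards against zero fares.
# This is one possible fix, not the only one.
log_reg2 <- glm(survived ~ age + log(fare + 1),
                data = titanic, family = binomial(link = "logit"))
summary(log_reg2)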

4.17.0.6 Assumption #6 - Sample size must be sufficiently large

Logistic regression assumes that the sample size of the dataset is large enough to draw valid conclusions from the fitted logistic regression model.

As a rule of thumb, you should have at least 10 cases of the least frequent outcome for each explanatory variable. For example, if you have 3 explanatory variables and the expected probability of the least frequent outcome is 0.20, then you should have a sample size of at least (10 * 3) / 0.20 = 150.
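
The arithmetic from the example, as a quick sketch in R:

# Rule of thumb: n >= (10 * number of predictors) / P(least frequent outcome)
k <- 3     # explanatory variables
p <- 0.20  # expected probability of the least frequent outcome
(10 * k) / p
## [1] 150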