6.3.1 Example: Cholera

Deaths due to cholera over 12 months in 1849 (from the {HistData} package):

library(tidyverse)
library(HistData)

cholera <- HistData::CholeraDeaths1849 %>%
  filter(cause_of_death == "Cholera") %>%
  select(date, deaths)

cholera %>% head()
## # A tibble: 6 × 2
##   date       deaths
##   <date>      <dbl>
## 1 1849-01-01     13
## 2 1849-01-02     19
## 3 1849-01-03     28
## 4 1849-01-04     24
## 5 1849-01-05     23
## 6 1849-01-06     39

Response (y) = deaths.

Predictor (x) = date.

How to model this?

Review the relationship between response and predictor:

  • Equation of a line: y=β0+β1x or (y=mx+b)

  • β0 - intercept, or value of y when x=0

  • β1 - average change in y for each unit increase in x

  • x - value of the predictor

Often there are multiple predictors, adding more complexity to a model.

Modelling estimates these values

Linear models and Locally estimated scatter plot smoothing (LOESS)

Linear model

  • Assumes linear relationship.

Locally estimated scatter plot smoothing (LOESS)

  • No underlying assumptions made about the structure of the data.

Comparing linear and non-linear models

Plotting a linear model against a generalised additive model (GAM):

GAM - Generalised Additive Model

What are they?

  • Flexible extensions of linear models.
  • Useful for modelling complex relationships between response (y) and predictor(s) (x1,...xn).

g(μ)=β0+f(x)

  • Predictors are transformed through each function xf(x).
  • Link function g(μ) is chosen based on response distribution.

How to do this in R?

# Load the mgcv package
library(mgcv)

# Use the days instead of the full date
cholera$days <- row_number(cholera)

# Look at the first six rows
head(cholera)
## # A tibble: 6 × 3
##   date       deaths  days
##   <date>      <dbl> <int>
## 1 1849-01-01     13     1
## 2 1849-01-02     19     2
## 3 1849-01-03     28     3
## 4 1849-01-04     24     4
## 5 1849-01-05     23     5
## 6 1849-01-06     39     6
# Fit a GAM using the gam() function
gam_model <- gam(deaths ~ s(days), data = cholera)

# Print the summary of the GAM model
summary(gam_model)
## 
## Family: gaussian 
## Link function: identity 
## 
## Formula:
## deaths ~ s(days)
## 
## Parametric coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  146.008      3.583   40.74   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Approximate significance of smooth terms:
##           edf Ref.df     F p-value    
## s(days) 8.969      9 431.4  <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## R-sq.(adj) =  0.914   Deviance explained = 91.6%
## GCV = 4818.7  Scale est. = 4687      n = 365

Key points:

  • Intercept is modelled as a parametric term.
  • Days is a smooth term: s(days).
  • Effective degrees of freedom (EDF) is ~9, indicating a wiggly (non-linear) relationship to deaths.
  • Adjusted R-squared value indicates the model explains approx. 91.4% of variance in y.
  • Deviance value of 91.6% indicates a good fit.
plot(gam_model)

However, the model is very simple with only one predictor. In reality there are likely to be many other factors that influence number of deaths.