13.1 The logistic regression model

# Load packages
library(bayesrules)
library(rstanarm)
library(bayesplot)
library(tidyverse)
library(tidybayes)
library(broom.mixed)

Load and process the data

data(weather_perth)
weather <- weather_perth %>% 
  select(day_of_year, raintomorrow, humidity9am, humidity3pm, raintoday)

weather %>% head
## # A tibble: 6 × 5
##   day_of_year raintomorrow humidity9am humidity3pm raintoday
##         <dbl> <fct>              <int>       <int> <fct>    
## 1          45 No                    55          44 No       
## 2          11 No                    43          16 No       
## 3         261 Yes                   62          85 No       
## 4         347 No                    53          46 No       
## 5         357 No                    65          51 No       
## 6         254 No                    84          59 Yes

Predict raintomorrow:

\[Y=\left\{\begin{matrix} 1 & \text{rain tomorrow}\\ 0 & \text{ otherwise} \end{matrix}\right.\]

13.1.1 Definition of Odds & probability

odds: the probability that the event happens relative to the probability that it doesn’t happen.

\[\pi \epsilon [0,1]\]

\[odds=\frac{\pi}{1-\pi} \epsilon [0,\infty)\]

\[\pi=\frac{odds}{1+odds}\]

To predict raintomorrow What’s an appropriate model structure for the data?

  • Bernoulli (or, equivalently, Binomial with 1 trial)
  • Gamma
  • Beta

What values can \(Y_i\) take and what probability models assume this same set of values?

The Bernoulli probability model is the best candidate for the data, given that \(Y_i\) is a discrete variable which can only take two values, 0 or 1:

\[Y_i|\pi_i \sim Bern(\pi_i)\] \[E(Y_i|\pi_i)=\pi_i\]

The logistic regression model specify how the expected value of rain \(\pi_i\) depends upon predictor \(X_{i1}\):

\[g(\pi_i)=\beta_0+\beta_1 X_{i1}\]

\(\pi_i\) depends upon predictor \(X_{i1}\) through the:

\[g(\pi_i)=log(\pi_i/(1-\pi_i))\]

\[\frac{\pi_i}{1-\pi_i}=e^{\beta_0+\beta_1 X_{i1}}\]

\[\pi_i=\frac{e^{\beta_0+\beta_1 X_{i1}}}{1+e^{\beta_0+\beta_1 X_{i1}}}\] \(\beta_0\) is when all predictors are 0 \(\beta_1=log(odds_{x+1})-log(odds_x)\) \(e^{\beta_1}=\frac{odds_{x+1}}{1+odds_x}\)

\[log(odds)=log(\frac{\pi}{1-\pi})=\beta_0+\beta_1 X_{1}+...+\beta_p X_p\]

weather <- weather_perth %>% 
  select(day_of_year, raintomorrow, 
         humidity9am, humidity3pm, raintoday)

weather %>%head
## # A tibble: 6 × 5
##   day_of_year raintomorrow humidity9am humidity3pm raintoday
##         <dbl> <fct>              <int>       <int> <fct>    
## 1          45 No                    55          44 No       
## 2          11 No                    43          16 No       
## 3         261 Yes                   62          85 No       
## 4         347 No                    53          46 No       
## 5         357 No                    65          51 No       
## 6         254 No                    84          59 Yes

Based on assumptions that it will rain tomorrow with a 20% chance.

weather%>%
  count(raintomorrow)%>%
  mutate(prop=n/sum(n)*100)
## # A tibble: 2 × 3
##   raintomorrow     n  prop
##   <fct>        <int> <dbl>
## 1 No             814  81.4
## 2 Yes            186  18.6
weather%>%
  count(raintomorrow)%>%
  mutate(prop=n/sum(n)*100)%>%
  ggplot(aes(x="",prop,fill=raintomorrow))+
  geom_bar(stat="identity", width=1)+
   coord_polar("y", start=0)+
  theme_void()

\[\pi= 0.2\]

Assumptions:

  • on an average day, there’s a roughly 20% chance of rain.
  • the chance of rain increases when preceded by a day with high humidity or rain.
weather %>%
  ggplot()+
  geom_line(aes(day_of_year,humidity9am),color="steelblue",linewidth=0.5)+
  geom_line(aes(day_of_year,humidity3pm),color="pink",linewidth=0.5)+
  geom_smooth(aes(day_of_year,humidity9am),color="steelblue")+
  geom_smooth(aes(day_of_year,humidity3pm),color="pink")+
  facet_wrap(vars(raintomorrow))+
  labs(title="raintomorrow?")

weather%>%
  ggplot(aes(raintomorrow,color=raintomorrow))+
  geom_density()

Model specification

  • Data

\[Y_i|\beta_0,\beta_1 \sim^{ind} Bern(\pi_i)\] \[log(\frac{\pi_i}{1-\pi_i})=\beta_0+\beta_1 X_{i1}\]

Based on the assumption that it will rain with a 20% chance, the log odds is about -1.4. This is a centered value of a range.

log(0.2/(1-0.2))
## [1] -1.386294

\[log(\frac{\pi_i}{1-\pi_i}) \approx -1.4\]

13.1.2 Specifying the priors:

13.1.2.1 Beta 0 centered (\(\beta_{0c}\))

\[\beta_{0c} \sim N(\mu,\sigma)\]

The range takes consideration that it cannot go below zero, the chance of not rain has a probability of 0. And so it cannot be more than 1. Probability range: [0,1].

We are talking about logs:

log(1)
## [1] 0

Zero is the upper value of the range, for a max value probability of 1:

exp(log(1))
## [1] 1

Range: [? , 0]

\[-1.4 + 2x=0\]

x=1.4/2
x
## [1] 0.7
-1.4 + 2*0.7;
## [1] 0
-1.4 - 2*0.7
## [1] -2.8

Range: (-2.8,0]

\[\beta_{0c} \sim N(-1.4,0.7^2)\]

The odds of rain on an average day could be somewhere between 0.06 and 1:

exp(-2.8);exp(0)
## [1] 0.06081006
## [1] 1

\[(e^{-2.8},e^0) \approx (0.06,1)\]

The probability of rain on an average day could be somewhere between 0.057 and 0.50:

0.06/(1+0.06);
## [1] 0.05660377
1/(1+1)
## [1] 0.5

\[(\frac{0.06}{1+0.06},\frac{1}{(1+1)}) \approx (0.057,0.5)\]

13.1.2.2 Beta 1 (\(\beta_1\))

\[\beta_1 \sim N(\mu,\sigma)\]

With the \(\beta_1\) we want to identify if the chance of rain increases when preceded by a day with high humidity.

Our range in this case has a lower limit of zero, and we estimate the upper limit to be 0.14: Range [0, 0.14]

\[(e^0,e^{0.14}) \approx (1,1.150)\]

The central value for this range is 0.07:

(0+0.14)/2
## [1] 0.07

Calculate the \(\sigma\) value:

\[0.07 + 2x = 0.14\]

x1= (0.14-0.07)/2
x1
## [1] 0.035

So, the Normal distribution parameters are:

  • \(\mu = 0.07\)
  • \(\sigma=0.035\)

\[\beta_{1} \sim N(0.07,0.035^2)\]

p1<- plot_normal(mean = -1.4, sd = 0.7) + 
  labs(x = "beta_0c", y = "pdf")
p2 <- plot_normal(mean = 0.07, sd = 0.035) + 
  labs(x = "beta_1", y = "pdf")
library(patchwork)
p1|p2

We simulate 20,000 prior plausible pairs of \(\beta_{0c}\) and \(\beta_1\) to describe a prior plausible relationship between the probability of rain tomorrow and today’s 9 a.m. humidity:

Run a prior simulation:

prior_PD = TRUE
rain_model_prior <- stan_glm(raintomorrow ~ humidity9am,
                             data = weather, 
                             family = binomial,
                             prior_intercept = normal(-1.4, 0.7),
                             prior = normal(0.07, 0.035),
                             chains = 4, iter = 5000*2, seed = 84735,
                             prior_PD = TRUE)

rain_model_prior

And plot just 100 of them:

add_fitted_draws(rain_model_prior, n = 100) 
set.seed(84735)

# Plot 100 prior models with humidity
weather %>% 
  add_fitted_draws(rain_model_prior, n = 100) %>% 
  ggplot(aes(x = humidity9am, y = raintomorrow)) +
    geom_line(aes(y = .value, group = .draw), size = 0.1)

# Plot the observed proportion of rain in 100 prior datasets
weather %>% 
  add_predicted_draws(rain_model_prior, n = 100) %>% 
  group_by(.draw) %>% 
  summarize(proportion_rain = mean(.prediction == 1)) %>% 
  ggplot(aes(x = proportion_rain)) +
    geom_histogram(color = "white")