13.1 The logistic regression model

# Load packages
library(bayesrules)
library(rstanarm)
library(bayesplot)
library(tidyverse)
library(tidybayes)
library(broom.mixed)

Load and process the data

data(weather_perth)
weather <- weather_perth %>% 
  select(day_of_year, raintomorrow, humidity9am, humidity3pm, raintoday)

weather %>% head
## # A tibble: 6 × 5
##   day_of_year raintomorrow humidity9am humidity3pm raintoday
##         <dbl> <fct>              <int>       <int> <fct>    
## 1          45 No                    55          44 No       
## 2          11 No                    43          16 No       
## 3         261 Yes                   62          85 No       
## 4         347 No                    53          46 No       
## 5         357 No                    65          51 No       
## 6         254 No                    84          59 Yes

Predict raintomorrow:

Y={1rain tomorrow0 otherwise

13.1.1 Definition of Odds & probability

odds: the probability that the event happens relative to the probability that it doesn’t happen.

πϵ[0,1]

odds=π1πϵ[0,)

π=odds1+odds

To predict raintomorrow What’s an appropriate model structure for the data?

  • Bernoulli (or, equivalently, Binomial with 1 trial)
  • Gamma
  • Beta

What values can Yi take and what probability models assume this same set of values?

The Bernoulli probability model is the best candidate for the data, given that Yi is a discrete variable which can only take two values, 0 or 1:

Yi|πiBern(πi) E(Yi|πi)=πi

The logistic regression model specify how the expected value of rain πi depends upon predictor Xi1:

g(πi)=β0+β1Xi1

πi depends upon predictor Xi1 through the:

g(πi)=log(πi/(1πi))

πi1πi=eβ0+β1Xi1

πi=eβ0+β1Xi11+eβ0+β1Xi1 β0 is when all predictors are 0 β1=log(oddsx+1)log(oddsx) eβ1=oddsx+11+oddsx

log(odds)=log(π1π)=β0+β1X1+...+βpXp

weather <- weather_perth %>% 
  select(day_of_year, raintomorrow, 
         humidity9am, humidity3pm, raintoday)

weather %>%head
## # A tibble: 6 × 5
##   day_of_year raintomorrow humidity9am humidity3pm raintoday
##         <dbl> <fct>              <int>       <int> <fct>    
## 1          45 No                    55          44 No       
## 2          11 No                    43          16 No       
## 3         261 Yes                   62          85 No       
## 4         347 No                    53          46 No       
## 5         357 No                    65          51 No       
## 6         254 No                    84          59 Yes

Based on assumptions that it will rain tomorrow with a 20% chance.

weather%>%
  count(raintomorrow)%>%
  mutate(prop=n/sum(n)*100)
## # A tibble: 2 × 3
##   raintomorrow     n  prop
##   <fct>        <int> <dbl>
## 1 No             814  81.4
## 2 Yes            186  18.6
weather%>%
  count(raintomorrow)%>%
  mutate(prop=n/sum(n)*100)%>%
  ggplot(aes(x="",prop,fill=raintomorrow))+
  geom_bar(stat="identity", width=1)+
   coord_polar("y", start=0)+
  theme_void()

π=0.2

Assumptions:

  • on an average day, there’s a roughly 20% chance of rain.
  • the chance of rain increases when preceded by a day with high humidity or rain.
weather %>%
  ggplot()+
  geom_line(aes(day_of_year,humidity9am),color="steelblue",linewidth=0.5)+
  geom_line(aes(day_of_year,humidity3pm),color="pink",linewidth=0.5)+
  geom_smooth(aes(day_of_year,humidity9am),color="steelblue")+
  geom_smooth(aes(day_of_year,humidity3pm),color="pink")+
  facet_wrap(vars(raintomorrow))+
  labs(title="raintomorrow?")

weather%>%
  ggplot(aes(raintomorrow,color=raintomorrow))+
  geom_density()

Model specification

  • Data

Yi|β0,β1indBern(πi) log(πi1πi)=β0+β1Xi1

Based on the assumption that it will rain with a 20% chance, the log odds is about -1.4. This is a centered value of a range.

log(0.2/(1-0.2))
## [1] -1.386294

log(πi1πi)1.4

13.1.2 Specifying the priors:

13.1.2.1 Beta 0 centered (β0c)

β0cN(μ,σ)

The range takes consideration that it cannot go below zero, the chance of not rain has a probability of 0. And so it cannot be more than 1. Probability range: [0,1].

We are talking about logs:

log(1)
## [1] 0

Zero is the upper value of the range, for a max value probability of 1:

exp(log(1))
## [1] 1

Range: [? , 0]

1.4+2x=0

x=1.4/2
x
## [1] 0.7
-1.4 + 2*0.7;
## [1] 0
-1.4 - 2*0.7
## [1] -2.8

Range: (-2.8,0]

β0cN(1.4,0.72)

The odds of rain on an average day could be somewhere between 0.06 and 1:

exp(-2.8);exp(0)
## [1] 0.06081006
## [1] 1

(e2.8,e0)(0.06,1)

The probability of rain on an average day could be somewhere between 0.057 and 0.50:

0.06/(1+0.06);
## [1] 0.05660377
1/(1+1)
## [1] 0.5

(0.061+0.06,1(1+1))(0.057,0.5)

13.1.2.2 Beta 1 (β1)

β1N(μ,σ)

With the β1 we want to identify if the chance of rain increases when preceded by a day with high humidity.

Our range in this case has a lower limit of zero, and we estimate the upper limit to be 0.14: Range [0, 0.14]

(e0,e0.14)(1,1.150)

The central value for this range is 0.07:

(0+0.14)/2
## [1] 0.07

Calculate the σ value:

0.07+2x=0.14

x1= (0.14-0.07)/2
x1
## [1] 0.035

So, the Normal distribution parameters are:

  • μ=0.07
  • σ=0.035

β1N(0.07,0.0352)

p1<- plot_normal(mean = -1.4, sd = 0.7) + 
  labs(x = "beta_0c", y = "pdf")
p2 <- plot_normal(mean = 0.07, sd = 0.035) + 
  labs(x = "beta_1", y = "pdf")
library(patchwork)
p1|p2

We simulate 20,000 prior plausible pairs of β0c and β1 to describe a prior plausible relationship between the probability of rain tomorrow and today’s 9 a.m. humidity:

Run a prior simulation:

prior_PD = TRUE
rain_model_prior <- stan_glm(raintomorrow ~ humidity9am,
                             data = weather, 
                             family = binomial,
                             prior_intercept = normal(-1.4, 0.7),
                             prior = normal(0.07, 0.035),
                             chains = 4, iter = 5000*2, seed = 84735,
                             prior_PD = TRUE)

rain_model_prior

And plot just 100 of them:

add_fitted_draws(rain_model_prior, n = 100) 
set.seed(84735)

# Plot 100 prior models with humidity
weather %>% 
  add_fitted_draws(rain_model_prior, n = 100) %>% 
  ggplot(aes(x = humidity9am, y = raintomorrow)) +
    geom_line(aes(y = .value, group = .draw), size = 0.1)

# Plot the observed proportion of rain in 100 prior datasets
weather %>% 
  add_predicted_draws(rain_model_prior, n = 100) %>% 
  group_by(.draw) %>% 
  summarize(proportion_rain = mean(.prediction == 1)) %>% 
  ggplot(aes(x = proportion_rain)) +
    geom_histogram(color = "white")