13.1 The logistic regression model
# Load packages
library(bayesrules)
library(rstanarm)
library(bayesplot)
library(tidyverse)
library(tidybayes)
library(broom.mixed)
Load and process the data
data(weather_perth)
<- weather_perth %>%
weather select(day_of_year, raintomorrow, humidity9am, humidity3pm, raintoday)
%>% head weather
## # A tibble: 6 × 5
## day_of_year raintomorrow humidity9am humidity3pm raintoday
## <dbl> <fct> <int> <int> <fct>
## 1 45 No 55 44 No
## 2 11 No 43 16 No
## 3 261 Yes 62 85 No
## 4 347 No 53 46 No
## 5 357 No 65 51 No
## 6 254 No 84 59 Yes
Predict raintomorrow
:
Y={1rain tomorrow0 otherwise
13.1.1 Definition of Odds & probability
odds
: the probability that the event happens relative
to the probability that it doesn’t happen.
πϵ[0,1]
odds=π1−πϵ[0,∞)
π=odds1+odds
To predict raintomorrow
What’s an appropriate model structure for the data?
- Bernoulli (or, equivalently, Binomial with 1 trial)
- Gamma
- Beta
What values can Yi take and what probability models assume this same set of values?
The Bernoulli probability model
is the best candidate for the data, given that Yi is a discrete variable which can only take two values, 0 or 1:
Yi|πi∼Bern(πi) E(Yi|πi)=πi
The logistic regression model
specify how the expected value of rain πi depends upon predictor Xi1:
g(πi)=β0+β1Xi1
πi depends upon predictor Xi1 through the:
g(πi)=log(πi/(1−πi))
πi1−πi=eβ0+β1Xi1
πi=eβ0+β1Xi11+eβ0+β1Xi1 β0 is when all predictors are 0 β1=log(oddsx+1)−log(oddsx) eβ1=oddsx+11+oddsx
log(odds)=log(π1−π)=β0+β1X1+...+βpXp
<- weather_perth %>%
weather select(day_of_year, raintomorrow,
humidity9am, humidity3pm, raintoday)
%>%head weather
## # A tibble: 6 × 5
## day_of_year raintomorrow humidity9am humidity3pm raintoday
## <dbl> <fct> <int> <int> <fct>
## 1 45 No 55 44 No
## 2 11 No 43 16 No
## 3 261 Yes 62 85 No
## 4 347 No 53 46 No
## 5 357 No 65 51 No
## 6 254 No 84 59 Yes
Based on assumptions that it will rain tomorrow with a 20% chance.
%>%
weathercount(raintomorrow)%>%
mutate(prop=n/sum(n)*100)
## # A tibble: 2 × 3
## raintomorrow n prop
## <fct> <int> <dbl>
## 1 No 814 81.4
## 2 Yes 186 18.6
%>%
weathercount(raintomorrow)%>%
mutate(prop=n/sum(n)*100)%>%
ggplot(aes(x="",prop,fill=raintomorrow))+
geom_bar(stat="identity", width=1)+
coord_polar("y", start=0)+
theme_void()
π=0.2
Assumptions:
- on an average day, there’s a roughly 20% chance of rain.
- the chance of rain increases when preceded by a day with high humidity or rain.
%>%
weather ggplot()+
geom_line(aes(day_of_year,humidity9am),color="steelblue",linewidth=0.5)+
geom_line(aes(day_of_year,humidity3pm),color="pink",linewidth=0.5)+
geom_smooth(aes(day_of_year,humidity9am),color="steelblue")+
geom_smooth(aes(day_of_year,humidity3pm),color="pink")+
facet_wrap(vars(raintomorrow))+
labs(title="raintomorrow?")
%>%
weatherggplot(aes(raintomorrow,color=raintomorrow))+
geom_density()
Model specification
- Data
Yi|β0,β1∼indBern(πi) log(πi1−πi)=β0+β1Xi1
Based on the assumption that it will rain with a 20% chance, the log odds is about -1.4. This is a centered value of a range.
log(0.2/(1-0.2))
## [1] -1.386294
log(πi1−πi)≈−1.4
13.1.2 Specifying the priors:
13.1.2.1 Beta 0 centered (β0c)
β0c∼N(μ,σ)
The range takes consideration that it cannot go below zero, the chance of not rain has a probability of 0. And so it cannot be more than 1. Probability range: [0,1].
We are talking about logs:
log(1)
## [1] 0
Zero is the upper value of the range, for a max value probability of 1:
exp(log(1))
## [1] 1
Range: [? , 0]
−1.4+2x=0
=1.4/2
x x
## [1] 0.7
-1.4 + 2*0.7;
## [1] 0
-1.4 - 2*0.7
## [1] -2.8
Range: (-2.8,0]
β0c∼N(−1.4,0.72)
The odds of rain
on an average day could be somewhere between 0.06 and 1:
exp(-2.8);exp(0)
## [1] 0.06081006
## [1] 1
(e−2.8,e0)≈(0.06,1)
The probability of rain
on an average day could be somewhere between 0.057 and 0.50:
0.06/(1+0.06);
## [1] 0.05660377
1/(1+1)
## [1] 0.5
(0.061+0.06,1(1+1))≈(0.057,0.5)
13.1.2.2 Beta 1 (β1)
β1∼N(μ,σ)
With the β1 we want to identify if the chance of rain increases
when preceded by a day with high humidity
.
Our range in this case has a lower limit of zero, and we estimate the upper limit to be 0.14: Range [0, 0.14]
(e0,e0.14)≈(1,1.150)
The central value for this range is 0.07:
0+0.14)/2 (
## [1] 0.07
Calculate the σ value:
0.07+2x=0.14
= (0.14-0.07)/2
x1 x1
## [1] 0.035
So, the Normal distribution parameters are:
- μ=0.07
- σ=0.035
β1∼N(0.07,0.0352)
<- plot_normal(mean = -1.4, sd = 0.7) +
p1labs(x = "beta_0c", y = "pdf")
<- plot_normal(mean = 0.07, sd = 0.035) +
p2 labs(x = "beta_1", y = "pdf")
library(patchwork)
|p2 p1
We simulate 20,000 prior plausible pairs of β0c and β1 to describe a prior plausible relationship between the probability of rain tomorrow and today’s 9 a.m. humidity
:
Run a prior simulation:
prior_PD = TRUE
<- stan_glm(raintomorrow ~ humidity9am,
rain_model_prior data = weather,
family = binomial,
prior_intercept = normal(-1.4, 0.7),
prior = normal(0.07, 0.035),
chains = 4, iter = 5000*2, seed = 84735,
prior_PD = TRUE)
rain_model_prior
And plot just 100 of them:
add_fitted_draws(rain_model_prior, n = 100)
set.seed(84735)
# Plot 100 prior models with humidity
%>%
weather add_fitted_draws(rain_model_prior, n = 100) %>%
ggplot(aes(x = humidity9am, y = raintomorrow)) +
geom_line(aes(y = .value, group = .draw), size = 0.1)
# Plot the observed proportion of rain in 100 prior datasets
%>%
weather add_predicted_draws(rain_model_prior, n = 100) %>%
group_by(.draw) %>%
summarize(proportion_rain = mean(.prediction == 1)) %>%
ggplot(aes(x = proportion_rain)) +
geom_histogram(color = "white")