13.1 The logistic regression model
# Load packages
library(bayesrules)
library(rstanarm)
library(bayesplot)
library(tidyverse)
library(tidybayes)
library(broom.mixed)
Load and process the data
data(weather_perth)
<- weather_perth %>%
weather select(day_of_year, raintomorrow, humidity9am, humidity3pm, raintoday)
%>% head weather
## # A tibble: 6 × 5
## day_of_year raintomorrow humidity9am humidity3pm raintoday
## <dbl> <fct> <int> <int> <fct>
## 1 45 No 55 44 No
## 2 11 No 43 16 No
## 3 261 Yes 62 85 No
## 4 347 No 53 46 No
## 5 357 No 65 51 No
## 6 254 No 84 59 Yes
Predict raintomorrow
:
\[Y=\left\{\begin{matrix} 1 & \text{rain tomorrow}\\ 0 & \text{ otherwise} \end{matrix}\right.\]
13.1.1 Definition of Odds & probability
odds
: the probability that the event happens relative
to the probability that it doesn’t happen.
\[\pi \epsilon [0,1]\]
\[odds=\frac{\pi}{1-\pi} \epsilon [0,\infty)\]
\[\pi=\frac{odds}{1+odds}\]
To predict raintomorrow
What’s an appropriate model structure for the data?
- Bernoulli (or, equivalently, Binomial with 1 trial)
- Gamma
- Beta
What values can \(Y_i\) take and what probability models assume this same set of values?
The Bernoulli probability model
is the best candidate for the data, given that \(Y_i\) is a discrete variable which can only take two values, 0 or 1:
\[Y_i|\pi_i \sim Bern(\pi_i)\] \[E(Y_i|\pi_i)=\pi_i\]
The logistic regression model
specify how the expected value of rain \(\pi_i\) depends upon predictor \(X_{i1}\):
\[g(\pi_i)=\beta_0+\beta_1 X_{i1}\]
\(\pi_i\) depends upon predictor \(X_{i1}\) through the:
\[g(\pi_i)=log(\pi_i/(1-\pi_i))\]
\[\frac{\pi_i}{1-\pi_i}=e^{\beta_0+\beta_1 X_{i1}}\]
\[\pi_i=\frac{e^{\beta_0+\beta_1 X_{i1}}}{1+e^{\beta_0+\beta_1 X_{i1}}}\] \(\beta_0\) is when all predictors are 0 \(\beta_1=log(odds_{x+1})-log(odds_x)\) \(e^{\beta_1}=\frac{odds_{x+1}}{1+odds_x}\)
\[log(odds)=log(\frac{\pi}{1-\pi})=\beta_0+\beta_1 X_{1}+...+\beta_p X_p\]
<- weather_perth %>%
weather select(day_of_year, raintomorrow,
humidity9am, humidity3pm, raintoday)
%>%head weather
## # A tibble: 6 × 5
## day_of_year raintomorrow humidity9am humidity3pm raintoday
## <dbl> <fct> <int> <int> <fct>
## 1 45 No 55 44 No
## 2 11 No 43 16 No
## 3 261 Yes 62 85 No
## 4 347 No 53 46 No
## 5 357 No 65 51 No
## 6 254 No 84 59 Yes
Based on assumptions that it will rain tomorrow with a 20% chance.
%>%
weathercount(raintomorrow)%>%
mutate(prop=n/sum(n)*100)
## # A tibble: 2 × 3
## raintomorrow n prop
## <fct> <int> <dbl>
## 1 No 814 81.4
## 2 Yes 186 18.6
%>%
weathercount(raintomorrow)%>%
mutate(prop=n/sum(n)*100)%>%
ggplot(aes(x="",prop,fill=raintomorrow))+
geom_bar(stat="identity", width=1)+
coord_polar("y", start=0)+
theme_void()
\[\pi= 0.2\]
Assumptions:
- on an average day, there’s a roughly 20% chance of rain.
- the chance of rain increases when preceded by a day with high humidity or rain.
%>%
weather ggplot()+
geom_line(aes(day_of_year,humidity9am),color="steelblue",linewidth=0.5)+
geom_line(aes(day_of_year,humidity3pm),color="pink",linewidth=0.5)+
geom_smooth(aes(day_of_year,humidity9am),color="steelblue")+
geom_smooth(aes(day_of_year,humidity3pm),color="pink")+
facet_wrap(vars(raintomorrow))+
labs(title="raintomorrow?")
%>%
weatherggplot(aes(raintomorrow,color=raintomorrow))+
geom_density()
Model specification
- Data
\[Y_i|\beta_0,\beta_1 \sim^{ind} Bern(\pi_i)\] \[log(\frac{\pi_i}{1-\pi_i})=\beta_0+\beta_1 X_{i1}\]
Based on the assumption that it will rain with a 20% chance, the log odds is about -1.4. This is a centered value of a range.
log(0.2/(1-0.2))
## [1] -1.386294
\[log(\frac{\pi_i}{1-\pi_i}) \approx -1.4\]
13.1.2 Specifying the priors:
13.1.2.1 Beta 0 centered (\(\beta_{0c}\))
\[\beta_{0c} \sim N(\mu,\sigma)\]
The range takes consideration that it cannot go below zero, the chance of not rain has a probability of 0. And so it cannot be more than 1. Probability range: [0,1].
We are talking about logs:
log(1)
## [1] 0
Zero is the upper value of the range, for a max value probability of 1:
exp(log(1))
## [1] 1
Range: [? , 0]
\[-1.4 + 2x=0\]
=1.4/2
x x
## [1] 0.7
-1.4 + 2*0.7;
## [1] 0
-1.4 - 2*0.7
## [1] -2.8
Range: (-2.8,0]
\[\beta_{0c} \sim N(-1.4,0.7^2)\]
The odds of rain
on an average day could be somewhere between 0.06 and 1:
exp(-2.8);exp(0)
## [1] 0.06081006
## [1] 1
\[(e^{-2.8},e^0) \approx (0.06,1)\]
The probability of rain
on an average day could be somewhere between 0.057 and 0.50:
0.06/(1+0.06);
## [1] 0.05660377
1/(1+1)
## [1] 0.5
\[(\frac{0.06}{1+0.06},\frac{1}{(1+1)}) \approx (0.057,0.5)\]
13.1.2.2 Beta 1 (\(\beta_1\))
\[\beta_1 \sim N(\mu,\sigma)\]
With the \(\beta_1\) we want to identify if the chance of rain increases
when preceded by a day with high humidity
.
Our range in this case has a lower limit of zero, and we estimate the upper limit to be 0.14: Range [0, 0.14]
\[(e^0,e^{0.14}) \approx (1,1.150)\]
The central value for this range is 0.07:
0+0.14)/2 (
## [1] 0.07
Calculate the \(\sigma\) value:
\[0.07 + 2x = 0.14\]
= (0.14-0.07)/2
x1 x1
## [1] 0.035
So, the Normal distribution parameters are:
- \(\mu = 0.07\)
- \(\sigma=0.035\)
\[\beta_{1} \sim N(0.07,0.035^2)\]
<- plot_normal(mean = -1.4, sd = 0.7) +
p1labs(x = "beta_0c", y = "pdf")
<- plot_normal(mean = 0.07, sd = 0.035) +
p2 labs(x = "beta_1", y = "pdf")
library(patchwork)
|p2 p1
We simulate 20,000 prior plausible pairs of \(\beta_{0c}\) and \(\beta_1\) to describe a prior plausible relationship between the probability of rain tomorrow and today’s 9 a.m. humidity
:
Run a prior simulation:
prior_PD = TRUE
<- stan_glm(raintomorrow ~ humidity9am,
rain_model_prior data = weather,
family = binomial,
prior_intercept = normal(-1.4, 0.7),
prior = normal(0.07, 0.035),
chains = 4, iter = 5000*2, seed = 84735,
prior_PD = TRUE)
rain_model_prior
And plot just 100 of them:
add_fitted_draws(rain_model_prior, n = 100)
set.seed(84735)
# Plot 100 prior models with humidity
%>%
weather add_fitted_draws(rain_model_prior, n = 100) %>%
ggplot(aes(x = humidity9am, y = raintomorrow)) +
geom_line(aes(y = .value, group = .draw), size = 0.1)
# Plot the observed proportion of rain in 100 prior datasets
%>%
weather add_predicted_draws(rain_model_prior, n = 100) %>%
group_by(.draw) %>%
summarize(proportion_rain = mean(.prediction == 1)) %>%
ggplot(aes(x = proportion_rain)) +
geom_histogram(color = "white")