18.2 Hierarchical logistic regression
# Load packages
library(bayesrules)
library(tidyverse)
library(bayesplot)
library(rstanarm)
library(tidybayes)
library(broom.mixed)
library(janitor)
<- climbers_sub %>%
climbers select(expedition_id, member_id, success, year, season,
age, expedition_role, oxygen_used)
Data are from The Himalayan Database, Himalayan Climbing Expeditions data shared through the #tidytuesday project R for Data Science 2020b, made of 2076 climbers, dating back to 1978.
# Import, rename, & clean data
data(climbers_sub)
<- climbers_sub %>%
climbers select(expedition_id, member_id, success, year, season,
age, expedition_role, oxygen_used)nrow(climbers)
## [1] 2076
%>%head climbers
# A tibble: 6 × 8
expedition_id member_id success year season age expedition_role oxygen_used
<chr> <fct> <lgl> <dbl> <fct> <dbl> <fct> <lgl>
1 AMAD81101 AMAD8110… TRUE 1981 Spring 28 Climber FALSE
2 AMAD81101 AMAD8110… TRUE 1981 Spring 27 Exp Doctor FALSE
3 AMAD81101 AMAD8110… TRUE 1981 Spring 35 Deputy Leader FALSE
4 AMAD81101 AMAD8110… TRUE 1981 Spring 37 Climber FALSE
5 AMAD81101 AMAD8110… TRUE 1981 Spring 43 Climber FALSE
6 AMAD81101 AMAD8110… FALSE 1981 Spring 38 Climber FALSE
To generate a frequency table we use the tabyl()
function:
?janitor::tabyl
%>%
climbers ::tabyl(success) janitor
success n percent
FALSE 1269 0.6112717
TRUE 807 0.3887283
# Size per expedition
<- climbers %>% count(expedition_id)
climbers_per_expedition
# Number of expeditions
nrow(climbers_per_expedition)
## [1] 200
The individual climber outcomes are independent.
%>%
climbers_per_expedition head(3)
## # A tibble: 3 × 2
## expedition_id n
## <chr> <int>
## 1 AMAD03107 5
## 2 AMAD03327 6
## 3 AMAD05338 12
More than 75 of our 200 expeditions had a 0% success rate. In contrast, nearly 20 expeditions had a 100% success rate.
There’s quite a bit of variability in expedition success rates.
# Calculate the success rate for each exhibition
<- climbers %>%
expedition_success group_by(expedition_id) %>%
summarize(success_rate = mean(success))

Figure 18.2: Histogram of the success rates for the 200 climbing expeditions
18.2.1 Model building & simulation
We use the Bernoulli model for binary response variable: expedtion success
Yij={0yes1no
Xij1=age of climber j in expedition j Xij2=climber i received oxygen in expedition j
Bayesian model
Yij|πij∼Bern(πij) Complete pooling expands this simple model into a logistic regression model
Yij|β0,β1,β2ind∼Bern(πij)
with
- β0 the typical baseline success rate across all expeditions
- β1 the global relationship between success and age when controlling for oxygen use
- β2 the global relationship between success and oxygen use when **controlling for age
log(πij1−πij)=β0+β1Xij1+β2Xij2 We need to take account for the grouping structure of our data.
# Calculate the success rate by age and oxygen use
<- climbers %>%
data_by_age_oxygen group_by(age, oxygen_used) %>%
summarize(success_rate = mean(success),.groups="drop")

Figure 18.3: Scatterplot of the success rate among climbers by age and oxygen use
We substitute β0 with the centered intercept β0c.
β0c∼N(m0,s20)
m0=0 and s20=2.52, s21=0.242, s22=5.512
β0j|β0,σ0ind∼N(β0,s20)
and,
σ0∼Exp(1)
Reframe the random intercepts logistic regression model as a tweaks to the global intercept:
log(πij1−πij)=(β0+b0j)+β1Xij1+β2Xij2
- b0j depends on the variability
b0j|σ0ind∼N(0,σ20)
- σ0 captures the between-group variability in success rates from expedition to expedition
Set the model formula to hierarchical grouping structure:
success ~ age + oxygen_used + (1 | expedition_id)
<- stan_glmer(
climb_model ~ age + oxygen_used + (1 | expedition_id),
success data = climbers, family = binomial,
prior_intercept = normal(0, 2.5, autoscale = TRUE),
prior = normal(0, 2.5, autoscale = TRUE),
prior_covariance = decov(reg = 1, conc = 1, shape = 1, scale = 1),
chains = 4, iter = 5000*2, seed = 84735
)
Confirm prior specifications
prior_summary(climb_model)
MCMC diagnostics
mcmc_trace(climb_model, size = 0.1)
mcmc_dens_overlay(climb_model)
mcmc_acf(climb_model)
neff_ratio(climb_model)
rhat(climb_model)
Define success rate function
<- function(x){mean(x == 1)} success_rate
Posterior predictive check
pp_check(climb_model, nreps = 100,
plotfun = "stat", stat = "success_rate") +
xlab("success rate")

(#fig:18-pp_check_fig)The histogram displays the proportion of climbers that were successful in each of 100 posterior simulated datasets
18.2.2 Posterior analysis
Posterior summaries for our global regression parameters reveal that the likelihood of success decreases with age.
tidy(climb_model,
effects = "fixed",
conf.int = TRUE,
conf.level = 0.80)
Confidence intervals are expressed as the log(odds) to the odds scale, so will be converted to:
(e−conf.low,e−conf.high)=(a,b)
On the probability of success scale
π=e−β0−β1X1+β2X21+e−β0−β1X1+β2X2
Results are: both with oxygen and without, the probability of success decreases with age.
%>%
climbers add_fitted_draws(climb_model, n = 100, re_formula = NA) %>%
ggplot(aes(x = age, y = success, color = oxygen_used)) +
geom_line(aes(y = .value, group = paste(oxygen_used, .draw)),
alpha = 0.1) +
labs(y = "probability of success")
18.2.3 Posterior classification
New expedition
<- data.frame(
new_expedition age = c(20, 20, 60, 60), oxygen_used = c(FALSE, TRUE, FALSE, TRUE),
expedition_id = rep("new", 4))
new_expedition
## age oxygen_used expedition_id
## 1 20 FALSE new
## 2 20 TRUE new
## 3 60 FALSE new
## 4 60 TRUE new
Posterior predictions of binary outcome
set.seed(84735)
<- posterior_predict(climb_model, newdata = new_expedition) binary_prediction
# First 3 prediction sets
head(binary_prediction, 3)
## 1 2 3 4
## [1,] 0 1 0 1
## [2,] 0 1 0 0
## [3,] 0 1 0 1
Summarize the posterior predictions of Y
colMeans(binary_prediction)
## 1 2 3 4
## 0.28470 0.80185 0.14755 0.64565