13.7 Building a logisic regression model

  • Instead of going through the well example, lets look at exercise 13.1. This uses the NES data but looking at more variables. There are 70 total in the data set. i picked out some of them. Note that there was also a ‘race’ categorical but I could not find figure out the categories. (A preliminary model not shown here indicates that race =2 is the only strong predictor)

part a+ b:

file_nes <- here::here("data/nes.txt")
nes92 <-
  file_nes %>% 
  read.table() %>% 
  as_tibble() %>%
  filter(year == 1992) %>%
  filter(xor(dvote, rvote)) %>%
  select(rvote, age, female, black, educ1, educ2, educ3, income)  %>%
  mutate(female = as.factor(female), black = as.factor(black))
pairs(~ educ1 + educ2 + educ3, data = nes92)

educ1 seems to be on a different scale. I am just going to combine 2 and 3 and call it good for this exercise.

nes92 <- nes92 |> mutate(educ = (educ2 + educ3)/2)

I will treat education as continuous just as was done for income.

I included some interactions that apriori seem important

fit_all <- stan_glm(rvote ~ educ + black + income + female + age + age:black + income:black + female:income, data = nes92, family=binomial(link= 'logit'), refresh = 0)
print(fit_all, digits = 3)
## stan_glm
##  family:       binomial [logit]
##  formula:      rvote ~ educ + black + income + female + age + age:black + income:black + 
##     female:income
##  observations: 1179
##  predictors:   9
## ------
##                Median MAD_SD
## (Intercept)    -1.916  0.436
## educ            0.053  0.043
## black1         -4.024  1.652
## income          0.459  0.104
## female1         0.996  0.430
## age             0.002  0.004
## black1:age      0.028  0.021
## black1:income  -0.055  0.333
## income:female1 -0.342  0.125
## 
## ------
## * For help interpreting the printed output see ?print.stanreg
## * For info on the priors used see ?prior_summary.stanreg

Lets look at loo. The baseline loo is

-0.693*nrow(nes92)
## [1] -817.047
loo_all <- loo(fit_all)
loo_all
## 
## Computed from 4000 by 1179 log-likelihood matrix.
## 
##          Estimate   SE
## elpd_loo   -727.9 12.0
## p_loo         8.5  0.9
## looic      1455.8 24.0
## ------
## MCSE of elpd_loo is 0.1.
## MCSE and ESS estimates assume MCMC draws (r_eff in [0.5, 1.3]).
## 
## All Pareto k estimates are good (k < 0.7).
## See help('pareto-k-diagnostic') for details.

Simplify a bit by droping the poorest predictors

fit_2 <- stan_glm(rvote ~ educ + black + income + age + female + female:income, data = nes92, family=binomial(link= 'logit'), refresh = 0)
print(fit_2, digits = 3)
## stan_glm
##  family:       binomial [logit]
##  formula:      rvote ~ educ + black + income + age + female + female:income
##  observations: 1179
##  predictors:   7
## ------
##                Median MAD_SD
## (Intercept)    -1.937  0.432
## educ            0.051  0.046
## black1         -2.745  0.377
## income          0.460  0.104
## age             0.003  0.004
## female1         0.999  0.423
## income:female1 -0.344  0.129
## 
## ------
## * For help interpreting the printed output see ?print.stanreg
## * For info on the priors used see ?prior_summary.stanreg
loo_2 <- loo(fit_2)
loo_2
## 
## Computed from 4000 by 1179 log-likelihood matrix.
## 
##          Estimate   SE
## elpd_loo   -727.2 11.8
## p_loo         7.1  0.4
## looic      1454.5 23.6
## ------
## MCSE of elpd_loo is 0.0.
## MCSE and ESS estimates assume MCMC draws (r_eff in [0.5, 1.2]).
## 
## All Pareto k estimates are good (k < 0.7).
## See help('pareto-k-diagnostic') for details.

Ok and lets try even simpler:

fit_3 <- stan_glm(rvote ~ black + income + female + female:income, data = nes92, family=binomial(link= 'logit'), refresh = 0)
print(fit_3, digits = 3)
## stan_glm
##  family:       binomial [logit]
##  formula:      rvote ~ black + income + female + female:income
##  observations: 1179
##  predictors:   5
## ------
##                Median MAD_SD
## (Intercept)    -1.678  0.345
## black1         -2.750  0.385
## income          0.488  0.098
## female1         1.009  0.418
## income:female1 -0.346  0.124
## 
## ------
## * For help interpreting the printed output see ?print.stanreg
## * For info on the priors used see ?prior_summary.stanreg
loo_3 <- loo(fit_3)
loo_3
## 
## Computed from 4000 by 1179 log-likelihood matrix.
## 
##          Estimate   SE
## elpd_loo   -726.1 11.7
## p_loo         5.1  0.4
## looic      1452.2 23.5
## ------
## MCSE of elpd_loo is 0.0.
## MCSE and ESS estimates assume MCMC draws (r_eff in [0.3, 1.0]).
## 
## All Pareto k estimates are good (k < 0.7).
## See help('pareto-k-diagnostic') for details.

Finally, leave out the last interaction:

fit_4 <- stan_glm(rvote ~ black + income + female, data = nes92, family=binomial(link= 'logit'), refresh = 0)
print(fit_4, digits = 3)
## stan_glm
##  family:       binomial [logit]
##  formula:      rvote ~ black + income + female
##  observations: 1179
##  predictors:   4
## ------
##             Median MAD_SD
## (Intercept) -0.967  0.213
## black1      -2.718  0.377
## income       0.272  0.060
## female1     -0.103  0.125
## 
## ------
## * For help interpreting the printed output see ?print.stanreg
## * For info on the priors used see ?prior_summary.stanreg
loo_4 <- loo(fit_4)
loo_compare(loo_all,  loo_2, loo_3, loo_4)
##         elpd_diff se_diff
## fit_3    0.0       0.0   
## fit_2   -1.2       1.3   
## fit_all -1.8       1.9   
## fit_4   -3.0       2.8

From this, the simplest fit with the interaction is best, but it is not strongly significant.

  1. Looking at fit_3, what are the important variables? Discuss during meeting.

QUESTION: How can I set up this test data tibble in a less labor intensive way?

test_data = tibble(income = rep(seq(1:5),4), female =c(rep(1,10), rep(0,10)), black = rep(c(rep(0,5), rep(1,5)),2))
test_data$female = as.factor(test_data$female)
test_data$black = as.factor(test_data$black)
test_data$rvote_pred = predict(fit_3, newdata = test_data, type="response")

test_data %>% filter(black == 0) %>%
ggplot()+ geom_line(aes(x=income, y=rvote_pred, color = female)) + 
     geom_jitter(aes(x=income, y=rvote, color=female), width = .2, height = .1, data=nes92)

test_data %>% filter(female == 0) %>%
ggplot()+ geom_line(aes(x=income, y=rvote_pred, color = black)) + 
     geom_jitter(aes(x=income, y=rvote, color=black), width = .2, height = .1, data=nes92)

test_data %>% filter(female == 1) %>%
ggplot()+ geom_line(aes(x=income, y=rvote_pred, color = black)) + 
     geom_jitter(aes(x=income, y=rvote, color=black), width = .2, height = .1, data=nes92)