14.3 One Categorical Predictor

Suppose an Antarctic researcher comes across a penguin that weighs less than 4200g with a 195mm-long flipper and 50mm-long bill. Our goal is to help this researcher identify the species of this penguin: Adelie, Chinstrap, or Gentoo

image code
penguins |>
  drop_na(above_average_weight) |>
  ggplot(aes(fill = above_average_weight, x = species)) + 
  geom_bar(position = "fill") + 
  labs(title = "<span style = 'color:#067476'>For which species is a<br>below-average weight most likely?</span>",
       subtitle = "(focus on the <span style = 'color:#c65ccc'>below-average</span> category)",
       caption = "R4DS Book Club") +
  scale_fill_manual(values = c("#c65ccc", "#fb7504")) +
  theme_minimal() +
  theme(plot.title = element_markdown(face = "bold", size = 24),
        plot.subtitle = element_markdown(size = 16))

14.3.1 Recall: Bayes Rule

\[f(y|x_{1}) = \frac{\text{prior}\cdot\text{likelihood}}{\text{normalizing constant}} = \frac{f(y) \cdot L(y|x_{1})}{f(x_{1})}\] where, by the Law of Total Probability,

\[\begin{array}{rcl} f(x_{1} & = & \displaystyle\sum_{\text{all } y'} f(y')L(y'|x_{1}) \\ ~ & = & f(y' = A)L(y' = A|x_{1}) + f(y' = C)L(y' = C|x_{1}) + f(y' = G)L(y' = G|x_{1}) \\ \end{array}\]

over our three penguin species.

14.3.2 Calculation

penguins %>% 
  select(species, above_average_weight) %>% 
  na.omit() %>% 
  tabyl(species, above_average_weight) %>% 
  adorn_totals(c("row", "col"))
##    species   0   1 Total
##     Adelie 126  25   151
##  Chinstrap  61   7    68
##     Gentoo   6 117   123
##      Total 193 149   342

Prior probabilities:

\[f(y = A) = \frac{151}{342}, \quad f(y = C) = \frac{68}{342}, \quad f(y = G) = \frac{123}{342}\]

Likelihoods:

\[\begin{array}{rcccl} L(y = A | x_{1} = 0) & = & \frac{126}{151} & \approx & 0.8344 \\ L(y = C | x_{1} = 0) & = & \frac{61}{68} & \approx & 0.8971 \\ L(y = G | x_{1} = 0) & = & \frac{6}{123} & \approx & 0.0488 \\ \end{array}\]

Total probability:

\[f(x_{1} = 0) = \frac{151}{342}\cdot\frac{126}{151} + \frac{68}{342}\cdot\frac{61}{68} + \frac{123}{342}\cdot\frac{6}{123} = \frac{193}{342}\]

Bayes’ Rules:

\[\begin{array}{rcccccl} f(y = A | x_{1} = 0) & = & \frac{f(y = A) \cdot L(y = A | x_{1} = 0)}{f(x_{1} = 0)} = \frac{\frac{151}{342}\cdot\frac{126}{151}}{\frac{193}{342}} & \approx & 0.6528 \\ f(y = C | x_{1} = 0) & = & \frac{f(y = A) \cdot L(y = C | x_{1} = 0)}{f(x_{1} = 0)} = \frac{\frac{68}{342}\cdot\frac{61}{68}}{\frac{193}{342}} & \approx & 0.3161 \\ f(y = G | x_{1} = 0) & = & \frac{f(y = A) \cdot L(y = G | x_{1} = 0)}{f(x_{1} = 0)} = \frac{\frac{123}{342}\cdot\frac{6}{123}}{\frac{193}{342}} & \approx & 0.0311 \\ \end{array}\]

The posterior probability that this penguin is an Adelie is more than double that of the other two species