Naive Bayes Example

Data: Palmer Penguins

There exist multiple penguin species throughout Antarctica, including the Adelie, Chinstrap, and Gentoo. When encountering one of these penguins on an Antarctic trip, we might classify its species

Y={AAdelieCChinstrapGGentoo


This example comes from Chapter 14 of Bayes Rules!


Categorical variable $X_1$: whether the penguin weighs more than the average of 4200 grams

$$X_1 = \begin{cases} 1 & \text{above-average weight} \\ 0 & \text{below-average weight} \end{cases}$$

AKA culmen length and depth

Numerical variables:

$$X_2 = \text{bill length (mm)}, \qquad X_3 = \text{flipper length (mm)}$$

# bayesrules: data; janitor: tabyl(); tidyverse: wrangling; ggtext: element_markdown()
library(bayesrules); library(janitor); library(tidyverse); library(ggtext)
data(penguins_bayes)
penguins <- penguins_bayes

adelie_color = "#fb7504"
chinstrap_color = "#c65ccc"
gentoo_color = "#067476"

penguins |>
  tabyl(species)
##    species   n   percent
##     Adelie 152 0.4418605
##  Chinstrap  68 0.1976744
##     Gentoo 124 0.3604651

Motivation

Here, we have three categories, whereas logistic regression is limited to classifying binary response variables. As an alternative, naive Bayes classification

  • can classify categorical response variables Y with two or more categories
  • doesn’t require much theory beyond Bayes’ Rule
  • is computationally efficient, i.e., doesn’t require MCMC simulation

But why is it called “naive”?

One Categorical Predictor

Suppose an Antarctic researcher comes across a penguin that weighs less than 4200g with a 195mm-long flipper and 50mm-long bill. Our goal is to help this researcher identify the species of this penguin: Adelie, Chinstrap, or Gentoo.

image code
penguins |>
  drop_na(above_average_weight) |>
  ggplot(aes(fill = above_average_weight, x = species)) + 
  geom_bar(position = "fill") + 
  labs(title = "<span style = 'color:#067476'>For which species is a<br>below-average weight most likely?</span>",
       subtitle = "(focus on the <span style = 'color:#c65ccc'>below-average</span> category)",
       caption = "R4DS Book Club") +
  scale_fill_manual(values = c("#c65ccc", "#fb7504")) +
  theme_minimal() +
  theme(plot.title = element_markdown(face = "bold", size = 24),
        plot.subtitle = element_markdown(size = 16))

Recall: Bayes’ Rule

$$f(y \mid x_1) = \frac{\text{prior} \cdot \text{likelihood}}{\text{normalizing constant}} = \frac{f(y)\,L(y \mid x_1)}{f(x_1)}$$

where, by the Law of Total Probability,

$$f(x_1) = \sum_{\text{all } y} f(y)\,L(y \mid x_1) = f(y=A)\,L(y=A \mid x_1) + f(y=C)\,L(y=C \mid x_1) + f(y=G)\,L(y=G \mid x_1)$$

over our three penguin species.

Calculation

penguins |> 
  select(species, above_average_weight) |> 
  na.omit() |> 
  tabyl(species, above_average_weight) |> 
  adorn_totals(c("row", "col"))
##    species   0   1 Total
##     Adelie 126  25   151
##  Chinstrap  61   7    68
##     Gentoo   6 117   123
##      Total 193 149   342

Prior probabilities:

$$f(y=A) = \frac{151}{342}, \qquad f(y=C) = \frac{68}{342}, \qquad f(y=G) = \frac{123}{342}$$

Likelihoods:

$$\begin{aligned}
L(y=A \mid x_1=0) &= \frac{126}{151} \approx 0.8344 \\
L(y=C \mid x_1=0) &= \frac{61}{68} \approx 0.8971 \\
L(y=G \mid x_1=0) &= \frac{6}{123} \approx 0.0488
\end{aligned}$$

Total probability:

$$f(x_1=0) = \frac{151}{342}\cdot\frac{126}{151} + \frac{68}{342}\cdot\frac{61}{68} + \frac{123}{342}\cdot\frac{6}{123} = \frac{193}{342}$$

Bayes’ Rule:

$$\begin{aligned}
f(y=A \mid x_1=0) &= \frac{f(y=A)\,L(y=A \mid x_1=0)}{f(x_1=0)} = \frac{\frac{151}{342}\cdot\frac{126}{151}}{\frac{193}{342}} \approx 0.6528 \\
f(y=C \mid x_1=0) &= \frac{f(y=C)\,L(y=C \mid x_1=0)}{f(x_1=0)} = \frac{\frac{68}{342}\cdot\frac{61}{68}}{\frac{193}{342}} \approx 0.3161 \\
f(y=G \mid x_1=0) &= \frac{f(y=G)\,L(y=G \mid x_1=0)}{f(x_1=0)} = \frac{\frac{123}{342}\cdot\frac{6}{123}}{\frac{193}{342}} \approx 0.0311
\end{aligned}$$

The posterior probability that this penguin is an Adelie is more than double that of the other two species.
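These hand calculations can be double-checked in base R using only the counts from the contingency table above (a quick sketch; no modeling package required):

```r
# counts from the contingency table (x1 = 0: below-average weight)
priors      <- c(A = 151, C = 68, G = 123) / 342     # f(y)
likelihoods <- c(A = 126/151, C = 61/68, G = 6/123)  # L(y | x1 = 0)

numerators <- priors * likelihoods
posteriors <- numerators / sum(numerators)  # normalize by f(x1 = 0) = 193/342
round(posteriors, 4)  # A: 0.6528, C: 0.3161, G: 0.0311
```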

One Numerical Predictor

Let’s ignore the penguin’s weight for now and classify its species using only the fact that it has a 50mm-long bill.

image code
penguins|>
  ggplot(aes(x = bill_length_mm, fill = species)) + 
  geom_density(alpha = 0.7) + 
  geom_vline(xintercept = 50, linetype = "dashed", linewidth = 3) + 
  labs(title = "<span style = 'color:#c65ccc'>For which species is a<br>50mm-long bill the most common?</span>",
       subtitle = "one numerical predictor",
       caption = "R4DS Book Club") +
  scale_fill_manual(values = c(adelie_color, chinstrap_color, gentoo_color)) +
  theme_minimal() +
  theme(plot.title = element_markdown(face = "bold", size = 24),
        plot.subtitle = element_markdown(size = 16))

Our data points to our penguin being a Chinstrap. However:

  • we must weigh this data against the fact that Chinstraps are the rarest of these three species
  • it’s difficult to compute likelihoods like $L(y=A \mid x_2=50)$ directly from the raw data, since essentially no observed penguin has a bill of exactly 50mm

This is where one “naive” part of naive Bayes classification comes into play. The naive Bayes method typically assumes that any quantitative predictor, here X2, is continuous and conditionally normal:

$$X_2 \mid (Y=A) \sim N(\mu_A, \sigma_A^2), \qquad X_2 \mid (Y=C) \sim N(\mu_C, \sigma_C^2), \qquad X_2 \mid (Y=G) \sim N(\mu_G, \sigma_G^2)$$

Prior Probability Distributions

# Calculate sample mean and sd for each Y group
penguins |> 
  group_by(species) |> 
  summarize(mean = mean(bill_length_mm, na.rm = TRUE), 
            sd = sd(bill_length_mm, na.rm = TRUE))
## # A tibble: 3 × 3
##   species    mean    sd
##   <fct>     <dbl> <dbl>
## 1 Adelie     38.8  2.66
## 2 Chinstrap  48.8  3.34
## 3 Gentoo     47.5  3.08

image code
penguins |>
  ggplot(aes(x = bill_length_mm, color = species)) + 
  stat_function(fun = dnorm, args = list(mean = 38.8, sd = 2.66), 
                aes(color = "Adelie"), linewidth = 3) +
  stat_function(fun = dnorm, args = list(mean = 48.8, sd = 3.34),
                aes(color = "Chinstrap"), linewidth = 3) +
  stat_function(fun = dnorm, args = list(mean = 47.5, sd = 3.08),
                aes(color = "Gentoo"), linewidth = 3) + 
  geom_vline(xintercept = 50, linetype = "dashed") + 
  labs(title = "<span style = 'color:#c65ccc'>Prior Probabilities</span>",
       subtitle = "conditionally normal",
       caption = "R4DS Book Club") +
  scale_color_manual(values = c(adelie_color, chinstrap_color, gentoo_color)) +
  theme_minimal() +
  theme(plot.title = element_markdown(face = "bold", size = 24),
        plot.subtitle = element_markdown(size = 16))

Computing the likelihoods in R:

# L(y = A | x_2 = 50) = 2.12e-05
dnorm(50, mean = 38.8, sd = 2.66)

# L(y = C | x_2 = 50) = 0.112
dnorm(50, mean = 48.8, sd = 3.34)

# L(y = G | x_2 = 50) = 0.09317
dnorm(50, mean = 47.5, sd = 3.08)

Total probability:

$$f(x_2=50) = \frac{151}{342}\cdot 0.0000212 + \frac{68}{342}\cdot 0.112 + \frac{123}{342}\cdot 0.09317 \approx 0.05579$$

Bayes’ Rule:

$$\begin{aligned}
f(y=A \mid x_2=50) &= \frac{f(y=A)\,L(y=A \mid x_2=50)}{f(x_2=50)} = \frac{\frac{151}{342}\cdot 0.0000212}{0.05579} \approx 0.0002 \\
f(y=C \mid x_2=50) &= \frac{f(y=C)\,L(y=C \mid x_2=50)}{f(x_2=50)} = \frac{\frac{68}{342}\cdot 0.112}{0.05579} \approx 0.3992 \\
f(y=G \mid x_2=50) &= \frac{f(y=G)\,L(y=G \mid x_2=50)}{f(x_2=50)} = \frac{\frac{123}{342}\cdot 0.09317}{0.05579} \approx 0.6006
\end{aligned}$$

Though a 50mm-long bill is less common among Gentoos than among Chinstraps, the Gentoo’s higher prior probability tips the balance: our naive Bayes classification, based on our prior information and the penguin’s bill length alone, is that this penguin is a Gentoo, the species with the highest posterior probability.
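The bill-length analysis can be verified the same way, reusing the dnorm() likelihoods from above (results agree with the hand calculation up to the rounding of the group means and SDs):

```r
priors      <- c(A = 151, C = 68, G = 123) / 342
likelihoods <- c(A = dnorm(50, mean = 38.8, sd = 2.66),   # Adelie
                 C = dnorm(50, mean = 48.8, sd = 3.34),   # Chinstrap
                 G = dnorm(50, mean = 47.5, sd = 3.08))   # Gentoo
posteriors <- priors * likelihoods / sum(priors * likelihoods)
round(posteriors, 3)  # approximately A: 0.000, C: 0.399, G: 0.601
```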

We’ve now made two naive Bayes classifications of our penguin’s species, one based solely on the fact that our penguin has below-average weight and the other based solely on its 50mm-long bill (in addition to our prior information). And these classifications disagree: we classified the penguin as Adelie in the former analysis and Gentoo in the latter. This discrepancy indicates that there’s room for improvement in our naive Bayes classification method.

Two Predictor Variables

image code
penguins |>
ggplot(aes(x = flipper_length_mm, y = bill_length_mm, 
           color = species)) + 
  geom_point(size = 3) + 
  geom_segment(aes(x = 195, y = 30, xend = 195, yend = 50),
               color = "black", linetype = 2, linewidth = 2) +
  geom_segment(aes(x = 170, y = 50, xend = 195, yend = 50),
               color = "black", linetype = 2, linewidth = 2) +
  labs(title = "<span style = 'color:#c65ccc'>Two Predictor Variables</span>",
       subtitle = "50mm-long bill and 195mm-long flipper",
       caption = "R4DS Book Club") +
  scale_color_manual(values = c(adelie_color, chinstrap_color, gentoo_color)) +
  theme_minimal() +
  theme(plot.title = element_markdown(face = "bold", size = 24),
        plot.subtitle = element_markdown(size = 16))

Generalizing Bayes’ Rule:

$$f(y \mid x_2, x_3) = \frac{f(y)\,L(y \mid x_2, x_3)}{\sum_{\text{all } y'} f(y')\,L(y' \mid x_2, x_3)}$$

Another “naive” assumption: conditional independence of the predictors within each species:

$$L(y \mid x_2, x_3) = f(x_2, x_3 \mid y) = f(x_2 \mid y)\,f(x_3 \mid y)$$

  • mathematically efficient
  • but what about possible correlation between predictors within a species?

# sample statistics of x_3: flipper length
penguins |> 
  group_by(species) |> 
  summarize(mean = mean(flipper_length_mm, na.rm = TRUE), 
            sd = sd(flipper_length_mm, na.rm = TRUE))
## # A tibble: 3 × 3
##   species    mean    sd
##   <fct>     <dbl> <dbl>
## 1 Adelie     190.  6.54
## 2 Chinstrap  196.  7.13
## 3 Gentoo     217.  6.48

Likelihoods of a flipper length of 195 mm:

# L(y = A | x_3 = 195) = 0.04554
dnorm(195, mean = 190, sd = 6.54)

# L(y = C | x_3 = 195) = 0.05541
dnorm(195, mean = 196, sd = 7.13)

# L(y = G | x_3 = 195) = 0.0001934
dnorm(195, mean = 217, sd = 6.48)

Total probability:

$$f(x_2=50, x_3=195) = \frac{151}{342}\cdot 0.0000212 \cdot 0.04554 + \frac{68}{342}\cdot 0.112 \cdot 0.05541 + \frac{123}{342}\cdot 0.09317 \cdot 0.0001931 \approx 0.001241$$

Bayes’ Rule:

$$\begin{aligned}
f(y=A \mid x_2=50, x_3=195) &= \frac{\frac{151}{342}\cdot 0.0000212 \cdot 0.04554}{0.001241} \approx 0.0003 \\
f(y=C \mid x_2=50, x_3=195) &= \frac{\frac{68}{342}\cdot 0.112 \cdot 0.05541}{0.001241} \approx 0.9944 \\
f(y=G \mid x_2=50, x_3=195) &= \frac{\frac{123}{342}\cdot 0.09317 \cdot 0.0001931}{0.001241} \approx 0.0052
\end{aligned}$$

In conclusion, our penguin is almost certainly a Chinstrap.
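The full two-predictor calculation is a one-line extension of the previous check: under the naive independence assumption, we multiply the two conditional densities before normalizing. A sketch in base R, using the rounded group means and SDs from above:

```r
priors      <- c(A = 151, C = 68, G = 123) / 342
lik_bill    <- c(A = dnorm(50, 38.8, 2.66),    # f(x2 = 50 | y)
                 C = dnorm(50, 48.8, 3.34),
                 G = dnorm(50, 47.5, 3.08))
lik_flipper <- c(A = dnorm(195, 190, 6.54),    # f(x3 = 195 | y)
                 C = dnorm(195, 196, 7.13),
                 G = dnorm(195, 217, 6.48))

# "naive" assumption: L(y | x2, x3) = f(x2 | y) * f(x3 | y)
numerators <- priors * lik_bill * lik_flipper
posteriors <- numerators / sum(numerators)
round(posteriors, 4)  # approximately A: 0.0003, C: 0.9944, G: 0.0052
```

For larger problems, packages such as e1071 automate exactly this computation via naiveBayes() (fit with a formula, then call predict() with type = "raw" to obtain posterior probabilities).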