5.3 Class imbalance

In many instances, random splitting is not suitable. This includes datasets that contain class imbalance, meaning one class is dominated by another. Class imbalance is important to detect and take into consideration in data splitting. Performing random splitting on a dataset with severe class imbalance may cause the model to perform badly at validation. You want to avoid allocating the minority class disproportionately into the training or test set. The point is to have the same distribution across the training and test sets. Class imbalance can occur in differing degrees:

Splitting methods suited for datasets containing class imbalance should be considered. Let’s consider a #Tidytuesday dataset on Himalayan expedition members, which Julia Silge recently explored here using {tidymodels}.

library(tidyverse)
library(skimr)
members <- read_csv("https://raw.githubusercontent.com/rfordatascience/tidytuesday/master/data/2020/2020-09-22/members.csv")

skim(members)

Table 5.1: Data summary
Name	members
Number of rows	76519
Number of columns	21
_______________________
Column type frequency:
character	10
logical	6
numeric	5
________________________
Group variables	None

Variable type: character

skim_variable	n_missing	complete_rate	min	max	n_unique
expedition_id	0	1.00	9	9	10350
member_id	0	1.00	12	12	76518
peak_id	0	1.00	4	4	391
peak_name	15	1.00	4	25	390
season	0	1.00	6	7	5
sex	2	1.00	1	1	2
citizenship	10	1.00	2	23	212
expedition_role	21	1.00	4	25	524
death_cause	75413	0.01	3	27	12
injury_type	74807	0.02	3	27	11

Variable type: logical

skim_variable	complete_rate	mean	count
hired	1	0.21	FAL: 60788, TRU: 15731
success	1	0.38	FAL: 47320, TRU: 29199
solo	1	0.00	FAL: 76398, TRU: 121
oxygen_used	1	0.24	FAL: 58286, TRU: 18233
died	1	0.01	FAL: 75413, TRU: 1106
injured	1	0.02	FAL: 74806, TRU: 1713

Variable type: numeric

skim_variable	n_missing	complete_rate	mean	sd	p0	p25	p50	p75	p100	hist
year	0	1.00	2000.36	14.78	1905	1991	2004	2012	2019	▁▁▁▃▇
age	3497	0.95	37.33	10.40	7	29	36	44	85	▁▇▅▁▁
highpoint_metres	21833	0.71	7470.68	1040.06	3800	6700	7400	8400	8850	▁▁▆▃▇
death_height_metres	75451	0.01	6592.85	1308.19	400	5800	6600	7550	8830	▁▁▂▇▆
injury_height_metres	75510	0.01	7049.91	1214.24	400	6200	7100	8000	8880	▁▁▂▇▇

Let’s say we were interested in predicting the likelihood of survival or death for an expedition member. It would be a good idea to check for class imbalance:

library(janitor)

members %>% 
  tabyl(died) %>% 
  adorn_totals("row")

##   died     n    percent
##  FALSE 75413 0.98554607
##   TRUE  1106 0.01445393
##  Total 76519 1.00000000

We can see that nearly 99% of people survive their expedition. This dataset would be ripe for a sampling technique adept at handling such extreme class imbalance. This technique is called stratified sampling, in which “the training/test split is conducted separately within each class and then these subsamples are combined into the overall training and test set”. Operationally, this is done by using the strata argument inside initial_split():

set.seed(123)
members_split <- initial_split(members, prop = 0.80, strata = died)
members_train <- training(members_split)
members_test <- testing(members_split)

5.3.1 Stratified sampling simulation

With simulation we can see the effect of stratification: we expect that the expected value does not change with stratification but the variance is lower.

simulate_stratified_sampling <- function(prop_in_dataset, n_resample = 50,
                                         n_rows = 1000, seed = 45678) {
  set.seed(seed)
  
  data_to_split <- tibble(died = c(
    rep(1, floor(n_rows * prop_in_dataset)),
    rep(0, floor(n_rows * (1 - prop_in_dataset)))
  ))
  
  samples <- map_dfr(seq_len(n_resample), ~{
    initial_split(data_to_split) %>% 
      testing() %>% 
      summarize(died_pct = mean(died))
  }) 
  
  stratified_samples <- map_dfr(seq_len(n_resample), ~{
    initial_split(data_to_split, strata = died) %>% 
      testing() %>% 
      summarize(died_pct = mean(died))
  }) 
  
  rbind(
    samples %>% mutate(stratified = FALSE),
    stratified_samples %>% mutate(stratified = TRUE)
  ) %>% 
    group_by(stratified) %>% 
    summarize(mean = mean(died_pct), var = var(died_pct))
}

rsample does not stratify if class imbalance is more extreme than 10%

simulate_stratified_sampling(0.09)

## # A tibble: 2 × 3
##   stratified   mean      var
##   <lgl>       <dbl>    <dbl>
## 1 FALSE      0.0910 0.000240
## 2 TRUE       0.0917 0.000257

Stratified sampling happens:

simulate_stratified_sampling(0.11)

## # A tibble: 2 × 3
##   stratified  mean      var
##   <lgl>      <dbl>    <dbl>
## 1 FALSE      0.110 0.000257
## 2 TRUE       0.112 0