Case Study
Whenever you aggregate data, it is a good idea to include a count of the observations behind each summary. That way you can check that your conclusions rest on a reasonably large sample rather than a handful of values.
In this example using baseball data from the Lahman package, we compare the number of times a player gets a hit (H) with the number of times they came up to bat (AB, at-bats). The ratio of the two is the player's batting average, stored below as performance. The column n records the total at-bats behind each average, so we can tell a well-supported average from one based on just a few attempts.
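library(tidyverse) # provides group_by(), summarize(), filter(), and ggplot()

batters <- Lahman::Batting |>
  group_by(playerID) |>
  summarize(
    performance = sum(H, na.rm = TRUE) / sum(AB, na.rm = TRUE), # batting average
    n = sum(AB, na.rm = TRUE) # total at-bats behind that average
  )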
batters
## # A tibble: 20,469 × 3
##    playerID  performance     n
##    <chr>           <dbl> <int>
##  1 aardsda01      0          4
##  2 aaronha01      0.305  12364
##  3 aaronto01      0.229    944
##  4 aasedo01       0          5
##  5 abadan01       0.0952    21
##  6 abadfe01       0.111      9
##  7 abadijo01      0.224     49
##  8 abbated01      0.254   3044
##  9 abbeybe01      0.169    225
## 10 abbeych01      0.281   1756
## # ℹ 20,459 more rows
When we plot each player's batting average (performance) against their number of at-bats (n), two patterns stand out:
The variation in batting average is larger among players with fewer at-bats. This is a common pattern: whenever you compare averages across groups, the variation shrinks as the group size grows.
Batting average and number of at-bats are positively correlated, because teams give their best batters the most opportunities to hit.
batters |>
  filter(n > 100) |>
  ggplot(aes(x = n, y = performance)) +
  geom_point(alpha = 1 / 10) +
  geom_smooth(se = FALSE)
## `geom_smooth()` using method = 'gam' and formula = 'y ~ s(x, bs = "cs")'
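The first pattern, more variation among players with fewer at-bats, can be reproduced without any real data. The following is a minimal sketch in base R, assuming every simulated player has the same true batting average and that hits follow a binomial distribution. The value 0.26, the at-bat levels, and all variable names are illustrative assumptions, not taken from the Lahman data.

```r
# Simulate hypothetical players who all share the same true batting average
# and differ only in how many at-bats they get.
set.seed(1)                          # arbitrary seed, for reproducibility
true_avg  <- 0.26                    # assumed true skill (illustrative value)
at_bats   <- c(10, 100, 1000, 10000) # assumed at-bat levels
n_players <- 2000                    # simulated players per at-bat level

# For each at-bat level, draw hits from a binomial distribution and measure
# the spread (standard deviation) of the observed batting averages.
observed_sd <- sapply(at_bats, function(ab) {
  hits <- rbinom(n_players, size = ab, prob = true_avg)
  sd(hits / ab)
})

# Theoretical standard error of a proportion: sqrt(p * (1 - p) / n)
expected_sd <- sqrt(true_avg * (1 - true_avg) / at_bats)

data.frame(at_bats, observed_sd, expected_sd)
```

Even though skill is identical by construction, the observed averages spread out much more at 10 at-bats than at 10,000, which is why a small n makes an extreme average easy to reach by chance.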
If you simply rank players by batting average, the players at the top of the list will be those who had very few at-bats and happened to get a hit. These players may not actually be the most skilled players.
batters |>
  arrange(desc(performance))
## # A tibble: 20,469 × 3
##    playerID  performance     n
##    <chr>           <dbl> <int>
##  1 abramge01           1     1
##  2 alberan01           1     1
##  3 banisje01           1     1
##  4 bartocl01           1     1
##  5 bassdo01            1     1
##  6 birasst01           1     2
##  7 bruneju01           1     1
##  8 burnscb01           1     1
##  9 cammaer01           1     1
## 10 campsh01            1     1
## # ℹ 20,459 more rows
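One possible way to get a more meaningful ranking is to require a minimum number of at-bats before sorting. The sketch below is self-contained: it rebuilds a small table from a few rows shown in the outputs above rather than loading the full data. The names sample_batters and min_at_bats are hypothetical, and the cutoff of 100 is an assumption borrowed from the filter used for the plot, not a recommended threshold.

```r
library(dplyr)

# A handful of rows copied from the printed outputs above; not the full data.
sample_batters <- tibble(
  playerID    = c("aaronha01", "aaronto01", "abbated01",
                  "abbeych01", "abramge01", "birasst01"),
  performance = c(0.305, 0.229, 0.254, 0.281, 1, 1),
  n           = c(12364L, 944L, 3044L, 1756L, 1L, 2L)
)

min_at_bats <- 100 # assumed cutoff, same value as the plot's filter

# Drop players with too few at-bats, then rank the rest by batting average.
ranked <- sample_batters |>
  filter(n > min_at_bats) |>
  arrange(desc(performance))

ranked
```

With this sample, the two players with a perfect average from one or two at-bats drop out, and aaronha01 ranks first. The same filter() and arrange() steps should apply unchanged to the full batters table.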