summarize()

The most important grouped operation is a summary. When summarize() computes a single summary statistic, it reduces the data frame to one row per group. The following example computes the average departure delay by month:

flights |> 
  group_by(month) |> 
  summarize(
    avg_delay = mean(dep_delay)
  )
## # A tibble: 12 × 2
##    month avg_delay
##    <int>     <dbl>
##  1     1        NA
##  2     2        NA
##  3     3        NA
##  4     4        NA
##  5     5        NA
##  6     6        NA
##  7     7        NA
##  8     8        NA
##  9     9        NA
## 10    10        NA
## 11    11        NA
## 12    12        NA

Every result is NA because each month contains at least one flight with a missing value in dep_delay, and mean() returns NA whenever any of its inputs is missing. To drop the missing values before averaging, set na.rm = TRUE:

flights |> 
  group_by(month) |> 
  summarize(
    delay = mean(dep_delay, na.rm = TRUE)
  )
## # A tibble: 12 × 2
##    month delay
##    <int> <dbl>
##  1     1 10.0 
##  2     2 10.8 
##  3     3 13.2 
##  4     4 13.9 
##  5     5 13.0 
##  6     6 20.8 
##  7     7 21.7 
##  8     8 12.6 
##  9     9  6.72
## 10    10  6.24
## 11    11  5.44
## 12    12 16.6
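
The effect of na.rm is easiest to see on a small vector. The following is a minimal sketch; the delays vector is made up for illustration and is not taken from flights:

```r
# Toy vector (hypothetical values, not from the flights data)
delays <- c(5, NA, 15)

# A single missing input makes the whole result missing
mean(delays)

# With na.rm = TRUE the NA is dropped first, leaving (5 + 15) / 2 = 10
mean(delays, na.rm = TRUE)
```

The first call returns NA and the second returns 10, which mirrors what happens within each month in the grouped examples.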

You can create any number of summaries in a single call to summarize(). Many summary functions are useful here; one of them is n(), which returns the number of rows in each group:

flights |> 
  group_by(month) |> 
  summarize(
    delay = mean(dep_delay, na.rm = TRUE), 
    n = n()
  )
## # A tibble: 12 × 3
##    month delay     n
##    <int> <dbl> <int>
##  1     1 10.0  27004
##  2     2 10.8  24951
##  3     3 13.2  28834
##  4     4 13.9  28330
##  5     5 13.0  28796
##  6     6 20.8  28243
##  7     7 21.7  29425
##  8     8 12.6  29327
##  9     9  6.72 27574
## 10    10  6.24 28889
## 11    11  5.44 27268
## 12    12 16.6  28135
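
Note that n() counts every row in the group, including rows where dep_delay is missing, so it can be larger than the number of values that went into the mean. The sketch below illustrates the difference on a small, hypothetical tibble (toy, with made-up values) and assumes dplyr is installed; the n_delay column name is chosen only for this example:

```r
library(dplyr)

# Hypothetical toy data standing in for flights
toy <- tibble(
  month = c(1, 1, 1, 2, 2),
  dep_delay = c(10, NA, 20, 5, 7)
)

toy_summary <- toy |>
  group_by(month) |>
  summarize(
    delay = mean(dep_delay, na.rm = TRUE),
    n = n(),                          # all rows in the group, NAs included
    n_delay = sum(!is.na(dep_delay))  # rows that contributed to the mean
  )

toy_summary
```

For month 1, n is 3 but n_delay is 2, because one of the three rows has a missing delay. Reporting both counts alongside an average is one way to show how much data the average rests on.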