Exercises (groups)

Question 1

Which carrier has the worst average delays? Challenge: can you disentangle the effects of bad airports vs. bad carriers? Why/why not? (Hint: think about flights |> group_by(carrier, dest) |> summarize(n()))

The carrier with worst average delays is F9

flights |>
  group_by(carrier) |>
  summarize(
    delay = mean(arr_delay, na.rm = TRUE),
    n = n()
  ) |>
  arrange(desc(delay))

## # A tibble: 16 × 3
##    carrier  delay     n
##    <chr>    <dbl> <int>
##  1 F9      21.9     685
##  2 FL      20.1    3260
##  3 EV      15.8   54173
##  4 YV      15.6     601
##  5 OO      11.9      32
##  6 MQ      10.8   26397
##  7 WN       9.65  12275
##  8 B6       9.46  54635
##  9 9E       7.38  18460
## 10 UA       3.56  58665
## 11 US       2.13  20536
## 12 VX       1.76   5162
## 13 DL       1.64  48110
## 14 AA       0.364 32729
## 15 HA      -6.92    342
## 16 AS      -9.93    714

To try to disentangle the effects of bad airports vs. bad carriers and using the hint to group and summarize:

# calculate average delay and sort

flights |>
  group_by(carrier, dest) |>
  summarize(
    avg_delay = mean(arr_delay, na.rm = TRUE),
    n = n()
  ) |>
  arrange(desc(avg_delay))

## `summarise()` has grouped output by 'carrier'. You can override using the
## `.groups` argument.

## # A tibble: 314 × 4
## # Groups:   carrier [16]
##    carrier dest  avg_delay     n
##    <chr>   <chr>     <dbl> <int>
##  1 UA      STL       110       2
##  2 OO      ORD       107       1
##  3 OO      DTW        68.5     2
##  4 UA      RDU        56       1
##  5 EV      CAE        42.8   113
##  6 EV      TYS        41.2   323
##  7 EV      PBI        40.7     6
##  8 EV      TUL        33.7   315
##  9 EV      OKC        30.6   346
## 10 UA      JAC        29.9    23
## # ℹ 304 more rows

# add total_delay and sort
flights |>
  group_by(carrier, dest) |>
  summarize(
    delay = sum(arr_delay, na.rm = TRUE),
    n = n()
  ) |>
  mutate(total_delay = delay * n) |> 
  arrange(desc(total_delay))

## `summarise()` has grouped output by 'carrier'. You can override using the
## `.groups` argument.

## # A tibble: 314 × 5
## # Groups:   carrier [16]
##    carrier dest  delay     n total_delay
##    <chr>   <chr> <dbl> <int>       <dbl>
##  1 DL      ATL   77598 10571   820288458
##  2 B6      FLL   77376  6563   507818688
##  3 B6      MCO   73996  6472   478902112
##  4 UA      ORD   40910  6984   285715440
##  5 EV      IAD   59196  4048   239625408
##  6 WN      MDW   49766  4113   204687558
##  7 MQ      RDU   40018  4794   191846292
##  8 US      CLT   20760  8632   179200320
##  9 UA      IAH   25403  6924   175890372
## 10 B6      BOS   34143  4383   149648769
## # ℹ 304 more rows

In the 1st calculation, UA to STL has a very high average delay, but it only happens twice. In the 2nd calculation, DL to ATL has the higher total delay.

Question 2

Find the flights that are most delayed upon departure from each destination.

The top 10 most departure delayed flights from each destination.

flights |>
  group_by(dest) |>
  slice_max(dep_delay, n = 1) |>
  relocate(dep_delay, dest) |>
  arrange(desc(dep_delay))

## # A tibble: 105 × 19
## # Groups:   dest [105]
##    dep_delay dest   year month   day dep_time sched_dep_time arr_time
##        <dbl> <chr> <int> <int> <int>    <int>          <int>    <int>
##  1      1301 HNL    2013     1     9      641            900     1242
##  2      1137 CMH    2013     6    15     1432           1935     1607
##  3      1126 ORD    2013     1    10     1121           1635     1239
##  4      1014 SFO    2013     9    20     1139           1845     1457
##  5      1005 CVG    2013     7    22      845           1600     1044
##  6       960 TPA    2013     4    10     1100           1900     1342
##  7       911 MSP    2013     3    17     2321            810      135
##  8       899 PDX    2013     6    27      959           1900     1236
##  9       898 ATL    2013     7    22     2257            759      121
## 10       896 MIA    2013    12     5      756           1700     1058
## # ℹ 95 more rows
## # ℹ 11 more variables: sched_arr_time <int>, arr_delay <dbl>, carrier <chr>,
## #   flight <int>, tailnum <chr>, origin <chr>, air_time <dbl>, distance <dbl>,
## #   hour <dbl>, minute <dbl>, time_hour <dttm>

# using by in slice_max

flights |>
  slice_max(
    dep_delay, n = 1,
    by = dest
  ) |>
  relocate(dep_delay, dest) |>
  arrange(desc(dep_delay))

## # A tibble: 105 × 19
##    dep_delay dest   year month   day dep_time sched_dep_time arr_time
##        <dbl> <chr> <int> <int> <int>    <int>          <int>    <int>
##  1      1301 HNL    2013     1     9      641            900     1242
##  2      1137 CMH    2013     6    15     1432           1935     1607
##  3      1126 ORD    2013     1    10     1121           1635     1239
##  4      1014 SFO    2013     9    20     1139           1845     1457
##  5      1005 CVG    2013     7    22      845           1600     1044
##  6       960 TPA    2013     4    10     1100           1900     1342
##  7       911 MSP    2013     3    17     2321            810      135
##  8       899 PDX    2013     6    27      959           1900     1236
##  9       898 ATL    2013     7    22     2257            759      121
## 10       896 MIA    2013    12     5      756           1700     1058
## # ℹ 95 more rows
## # ℹ 11 more variables: sched_arr_time <int>, arr_delay <dbl>, carrier <chr>,
## #   flight <int>, tailnum <chr>, origin <chr>, air_time <dbl>, distance <dbl>,
## #   hour <dbl>, minute <dbl>, time_hour <dttm>

Question 3

How do delays vary over the course of the day. Illustrate your answer with a plot.

# Calculate the average departure delay for each hour of the day
delays_by_hour <- flights %>%
  group_by(hour) %>%
  summarize(avg_dep_delay = mean(dep_delay, na.rm = TRUE))


# Plot the results
ggplot(delays_by_hour, aes(x = hour, y = avg_dep_delay)) +
  geom_line() +
  ggtitle("Average Departure Delay by Hour of the Day") +
  xlab("Hour of the Day") +
  ylab("Average Departure Delay (minutes)")

## Warning: Removed 1 row containing missing values or values outside the scale range
## (`geom_line()`).

Question 4

What happens if you supply a negative n to slice_min() and friends?

It doesn’t slice the data but it does arrange the values.

# only shows the lowest 5 rows
flights |> 
  slice_min(dep_delay, n = 5) |>
  relocate(dep_delay)

## # A tibble: 5 × 19
##   dep_delay  year month   day dep_time sched_dep_time arr_time sched_arr_time
##       <dbl> <int> <int> <int>    <int>          <int>    <int>          <int>
## 1       -43  2013    12     7     2040           2123       40           2352
## 2       -33  2013     2     3     2022           2055     2240           2338
## 3       -32  2013    11    10     1408           1440     1549           1559
## 4       -30  2013     1    11     1900           1930     2233           2243
## 5       -27  2013     1    29     1703           1730     1947           1957
## # ℹ 11 more variables: arr_delay <dbl>, carrier <chr>, flight <int>,
## #   tailnum <chr>, origin <chr>, dest <chr>, air_time <dbl>, distance <dbl>,
## #   hour <dbl>, minute <dbl>, time_hour <dttm>

# shows all rows
flights |> 
  slice_min(dep_delay, n = -5) |>
  relocate(dep_delay)

## # A tibble: 336,776 × 19
##    dep_delay  year month   day dep_time sched_dep_time arr_time sched_arr_time
##        <dbl> <int> <int> <int>    <int>          <int>    <int>          <int>
##  1       -43  2013    12     7     2040           2123       40           2352
##  2       -33  2013     2     3     2022           2055     2240           2338
##  3       -32  2013    11    10     1408           1440     1549           1559
##  4       -30  2013     1    11     1900           1930     2233           2243
##  5       -27  2013     1    29     1703           1730     1947           1957
##  6       -26  2013     8     9      729            755     1002            955
##  7       -25  2013    10    23     1907           1932     2143           2143
##  8       -25  2013     3    30     2030           2055     2213           2250
##  9       -24  2013     3     2     1431           1455     1601           1631
## 10       -24  2013     5     5      934            958     1225           1309
## # ℹ 336,766 more rows
## # ℹ 11 more variables: arr_delay <dbl>, carrier <chr>, flight <int>,
## #   tailnum <chr>, origin <chr>, dest <chr>, air_time <dbl>, distance <dbl>,
## #   hour <dbl>, minute <dbl>, time_hour <dttm>

Question 5

Explain what count() does in terms of the dplyr verbs you just learned. What does the sort argument to count() do?

count() is used to count the number of rows in a data frame, grouped by one or more variables.

The sort argument specifies if how the results are sorted (TRUE ascending) is the default.