Exercises (groups)
Question 1
Which carrier has the worst average delays?
Challenge: can you disentangle the effects of bad airports vs. bad carriers?
Why/why not?
(Hint: think about flights |> group_by(carrier, dest) |> summarize(n())
)
The carrier with worst average delays is F9
flights |>
group_by(carrier) |>
summarize(
delay = mean(arr_delay, na.rm = TRUE),
n = n()
) |>
arrange(desc(delay))
## # A tibble: 16 × 3
## carrier delay n
## <chr> <dbl> <int>
## 1 F9 21.9 685
## 2 FL 20.1 3260
## 3 EV 15.8 54173
## 4 YV 15.6 601
## 5 OO 11.9 32
## 6 MQ 10.8 26397
## 7 WN 9.65 12275
## 8 B6 9.46 54635
## 9 9E 7.38 18460
## 10 UA 3.56 58665
## 11 US 2.13 20536
## 12 VX 1.76 5162
## 13 DL 1.64 48110
## 14 AA 0.364 32729
## 15 HA -6.92 342
## 16 AS -9.93 714
To try to disentangle the effects of bad airports vs. bad carriers and using the hint to group and summarize:
# calculate average delay and sort
flights |>
group_by(carrier, dest) |>
summarize(
avg_delay = mean(arr_delay, na.rm = TRUE),
n = n()
) |>
arrange(desc(avg_delay))
## `summarise()` has grouped output by 'carrier'. You can override using the
## `.groups` argument.
## # A tibble: 314 × 4
## # Groups: carrier [16]
## carrier dest avg_delay n
## <chr> <chr> <dbl> <int>
## 1 UA STL 110 2
## 2 OO ORD 107 1
## 3 OO DTW 68.5 2
## 4 UA RDU 56 1
## 5 EV CAE 42.8 113
## 6 EV TYS 41.2 323
## 7 EV PBI 40.7 6
## 8 EV TUL 33.7 315
## 9 EV OKC 30.6 346
## 10 UA JAC 29.9 23
## # ℹ 304 more rows
# add total_delay and sort
flights |>
group_by(carrier, dest) |>
summarize(
delay = sum(arr_delay, na.rm = TRUE),
n = n()
) |>
mutate(total_delay = delay * n) |>
arrange(desc(total_delay))
## `summarise()` has grouped output by 'carrier'. You can override using the
## `.groups` argument.
## # A tibble: 314 × 5
## # Groups: carrier [16]
## carrier dest delay n total_delay
## <chr> <chr> <dbl> <int> <dbl>
## 1 DL ATL 77598 10571 820288458
## 2 B6 FLL 77376 6563 507818688
## 3 B6 MCO 73996 6472 478902112
## 4 UA ORD 40910 6984 285715440
## 5 EV IAD 59196 4048 239625408
## 6 WN MDW 49766 4113 204687558
## 7 MQ RDU 40018 4794 191846292
## 8 US CLT 20760 8632 179200320
## 9 UA IAH 25403 6924 175890372
## 10 B6 BOS 34143 4383 149648769
## # ℹ 304 more rows
In the 1st calculation, UA to STL has a very high average delay, but it only happens twice. In the 2nd calculation, DL to ATL has the higher total delay.
Question 2
Find the flights that are most delayed upon departure from each destination.
The top 10 most departure delayed flights from each destination.
flights |>
group_by(dest) |>
slice_max(dep_delay, n = 1) |>
relocate(dep_delay, dest) |>
arrange(desc(dep_delay))
## # A tibble: 105 × 19
## # Groups: dest [105]
## dep_delay dest year month day dep_time sched_dep_time arr_time
## <dbl> <chr> <int> <int> <int> <int> <int> <int>
## 1 1301 HNL 2013 1 9 641 900 1242
## 2 1137 CMH 2013 6 15 1432 1935 1607
## 3 1126 ORD 2013 1 10 1121 1635 1239
## 4 1014 SFO 2013 9 20 1139 1845 1457
## 5 1005 CVG 2013 7 22 845 1600 1044
## 6 960 TPA 2013 4 10 1100 1900 1342
## 7 911 MSP 2013 3 17 2321 810 135
## 8 899 PDX 2013 6 27 959 1900 1236
## 9 898 ATL 2013 7 22 2257 759 121
## 10 896 MIA 2013 12 5 756 1700 1058
## # ℹ 95 more rows
## # ℹ 11 more variables: sched_arr_time <int>, arr_delay <dbl>, carrier <chr>,
## # flight <int>, tailnum <chr>, origin <chr>, air_time <dbl>, distance <dbl>,
## # hour <dbl>, minute <dbl>, time_hour <dttm>
# using by in slice_max
flights |>
slice_max(
dep_delay, n = 1,
by = dest
) |>
relocate(dep_delay, dest) |>
arrange(desc(dep_delay))
## # A tibble: 105 × 19
## dep_delay dest year month day dep_time sched_dep_time arr_time
## <dbl> <chr> <int> <int> <int> <int> <int> <int>
## 1 1301 HNL 2013 1 9 641 900 1242
## 2 1137 CMH 2013 6 15 1432 1935 1607
## 3 1126 ORD 2013 1 10 1121 1635 1239
## 4 1014 SFO 2013 9 20 1139 1845 1457
## 5 1005 CVG 2013 7 22 845 1600 1044
## 6 960 TPA 2013 4 10 1100 1900 1342
## 7 911 MSP 2013 3 17 2321 810 135
## 8 899 PDX 2013 6 27 959 1900 1236
## 9 898 ATL 2013 7 22 2257 759 121
## 10 896 MIA 2013 12 5 756 1700 1058
## # ℹ 95 more rows
## # ℹ 11 more variables: sched_arr_time <int>, arr_delay <dbl>, carrier <chr>,
## # flight <int>, tailnum <chr>, origin <chr>, air_time <dbl>, distance <dbl>,
## # hour <dbl>, minute <dbl>, time_hour <dttm>
Question 3
How do delays vary over the course of the day. Illustrate your answer with a plot.
# Calculate the average departure delay for each hour of the day
delays_by_hour <- flights %>%
group_by(hour) %>%
summarize(avg_dep_delay = mean(dep_delay, na.rm = TRUE))
# Plot the results
ggplot(delays_by_hour, aes(x = hour, y = avg_dep_delay)) +
geom_line() +
ggtitle("Average Departure Delay by Hour of the Day") +
xlab("Hour of the Day") +
ylab("Average Departure Delay (minutes)")
## Warning: Removed 1 row containing missing values or values outside the scale range
## (`geom_line()`).
Question 4
What happens if you supply a negative n
to slice_min()
and friends?
It doesn’t slice the data but it does arrange the values.
## # A tibble: 5 × 19
## dep_delay year month day dep_time sched_dep_time arr_time sched_arr_time
## <dbl> <int> <int> <int> <int> <int> <int> <int>
## 1 -43 2013 12 7 2040 2123 40 2352
## 2 -33 2013 2 3 2022 2055 2240 2338
## 3 -32 2013 11 10 1408 1440 1549 1559
## 4 -30 2013 1 11 1900 1930 2233 2243
## 5 -27 2013 1 29 1703 1730 1947 1957
## # ℹ 11 more variables: arr_delay <dbl>, carrier <chr>, flight <int>,
## # tailnum <chr>, origin <chr>, dest <chr>, air_time <dbl>, distance <dbl>,
## # hour <dbl>, minute <dbl>, time_hour <dttm>
## # A tibble: 336,776 × 19
## dep_delay year month day dep_time sched_dep_time arr_time sched_arr_time
## <dbl> <int> <int> <int> <int> <int> <int> <int>
## 1 -43 2013 12 7 2040 2123 40 2352
## 2 -33 2013 2 3 2022 2055 2240 2338
## 3 -32 2013 11 10 1408 1440 1549 1559
## 4 -30 2013 1 11 1900 1930 2233 2243
## 5 -27 2013 1 29 1703 1730 1947 1957
## 6 -26 2013 8 9 729 755 1002 955
## 7 -25 2013 10 23 1907 1932 2143 2143
## 8 -25 2013 3 30 2030 2055 2213 2250
## 9 -24 2013 3 2 1431 1455 1601 1631
## 10 -24 2013 5 5 934 958 1225 1309
## # ℹ 336,766 more rows
## # ℹ 11 more variables: arr_delay <dbl>, carrier <chr>, flight <int>,
## # tailnum <chr>, origin <chr>, dest <chr>, air_time <dbl>, distance <dbl>,
## # hour <dbl>, minute <dbl>, time_hour <dttm>
Question 5
Explain what count()
does in terms of the dplyr verbs you just learned.
What does the sort
argument to count()
do?
count()
is used to count the number of rows in a data frame, grouped by one or more variables.
The sort
argument specifies if how the results are sorted (TRUE
ascending) is the default.