Exercises (rows)
(NOTE: using relocate()
and mutate()
from next section for better visualization)
Question 1
In a single pipeline for each condition, find all flights that meet the condition:
# Had an arrival delay of two or more hours
flights |>
filter(arr_delay >= 120) |>
arrange(desc(arr_delay)) |>
relocate(arr_delay)
## # A tibble: 10,200 × 19
## arr_delay year month day dep_time sched_dep_time dep_delay arr_time
## <dbl> <int> <int> <int> <int> <int> <dbl> <int>
## 1 1272 2013 1 9 641 900 1301 1242
## 2 1127 2013 6 15 1432 1935 1137 1607
## 3 1109 2013 1 10 1121 1635 1126 1239
## 4 1007 2013 9 20 1139 1845 1014 1457
## 5 989 2013 7 22 845 1600 1005 1044
## 6 931 2013 4 10 1100 1900 960 1342
## 7 915 2013 3 17 2321 810 911 135
## 8 895 2013 7 22 2257 759 898 121
## 9 878 2013 12 5 756 1700 896 1058
## 10 875 2013 5 3 1133 2055 878 1250
## # ℹ 10,190 more rows
## # ℹ 11 more variables: sched_arr_time <int>, carrier <chr>, flight <int>,
## # tailnum <chr>, origin <chr>, dest <chr>, air_time <dbl>, distance <dbl>,
## # hour <dbl>, minute <dbl>, time_hour <dttm>
## # A tibble: 9,313 × 19
## dest year month day dep_time sched_dep_time dep_delay arr_time
## <chr> <int> <int> <int> <int> <int> <dbl> <int>
## 1 IAH 2013 1 1 517 515 2 830
## 2 IAH 2013 1 1 533 529 4 850
## 3 IAH 2013 1 1 623 627 -4 933
## 4 IAH 2013 1 1 728 732 -4 1041
## 5 IAH 2013 1 1 739 739 0 1104
## 6 IAH 2013 1 1 908 908 0 1228
## 7 IAH 2013 1 1 1028 1026 2 1350
## 8 IAH 2013 1 1 1044 1045 -1 1352
## 9 IAH 2013 1 1 1114 900 134 1447
## 10 IAH 2013 1 1 1205 1200 5 1503
## # ℹ 9,303 more rows
## # ℹ 11 more variables: sched_arr_time <int>, arr_delay <dbl>, carrier <chr>,
## # flight <int>, tailnum <chr>, origin <chr>, air_time <dbl>, distance <dbl>,
## # hour <dbl>, minute <dbl>, time_hour <dttm>
# Were operated by United, American, or Delta
flights |>
filter(carrier %in% c("UA", "DL", "AA")) |>
relocate(carrier)
## # A tibble: 139,504 × 19
## carrier year month day dep_time sched_dep_time dep_delay arr_time
## <chr> <int> <int> <int> <int> <int> <dbl> <int>
## 1 UA 2013 1 1 517 515 2 830
## 2 UA 2013 1 1 533 529 4 850
## 3 AA 2013 1 1 542 540 2 923
## 4 DL 2013 1 1 554 600 -6 812
## 5 UA 2013 1 1 554 558 -4 740
## 6 AA 2013 1 1 558 600 -2 753
## 7 UA 2013 1 1 558 600 -2 924
## 8 UA 2013 1 1 558 600 -2 923
## 9 AA 2013 1 1 559 600 -1 941
## 10 UA 2013 1 1 559 600 -1 854
## # ℹ 139,494 more rows
## # ℹ 11 more variables: sched_arr_time <int>, arr_delay <dbl>, flight <int>,
## # tailnum <chr>, origin <chr>, dest <chr>, air_time <dbl>, distance <dbl>,
## # hour <dbl>, minute <dbl>, time_hour <dttm>
# Departed in summer (July, August, and September)
flights |>
filter(month %in% c(7, 8, 9)) |>
relocate(month)
## # A tibble: 86,326 × 19
## month year day dep_time sched_dep_time dep_delay arr_time sched_arr_time
## <int> <int> <int> <int> <int> <dbl> <int> <int>
## 1 7 2013 1 1 2029 212 236 2359
## 2 7 2013 1 2 2359 3 344 344
## 3 7 2013 1 29 2245 104 151 1
## 4 7 2013 1 43 2130 193 322 14
## 5 7 2013 1 44 2150 174 300 100
## 6 7 2013 1 46 2051 235 304 2358
## 7 7 2013 1 48 2001 287 308 2305
## 8 7 2013 1 58 2155 183 335 43
## 9 7 2013 1 100 2146 194 327 30
## 10 7 2013 1 100 2245 135 337 135
## # ℹ 86,316 more rows
## # ℹ 11 more variables: arr_delay <dbl>, carrier <chr>, flight <int>,
## # tailnum <chr>, origin <chr>, dest <chr>, air_time <dbl>, distance <dbl>,
## # hour <dbl>, minute <dbl>, time_hour <dttm>
# Arrived more than two hours late, but didn’t leave late
flights |>
filter(arr_delay >= 120 & dep_delay <= 0) |>
relocate(arr_delay, dep_delay)
## # A tibble: 29 × 19
## arr_delay dep_delay year month day dep_time sched_dep_time arr_time
## <dbl> <dbl> <int> <int> <int> <int> <int> <int>
## 1 124 -1 2013 1 27 1419 1420 1754
## 2 130 0 2013 10 7 1350 1350 1736
## 3 124 -2 2013 10 7 1357 1359 1858
## 4 122 -3 2013 10 16 657 700 1258
## 5 194 -2 2013 11 1 658 700 1329
## 6 140 -3 2013 3 18 1844 1847 39
## 7 124 -5 2013 4 17 1635 1640 2049
## 8 179 -2 2013 4 18 558 600 1149
## 9 143 -5 2013 4 18 655 700 1213
## 10 127 -3 2013 5 22 1827 1830 2217
## # ℹ 19 more rows
## # ℹ 11 more variables: sched_arr_time <int>, carrier <chr>, flight <int>,
## # tailnum <chr>, origin <chr>, dest <chr>, air_time <dbl>, distance <dbl>,
## # hour <dbl>, minute <dbl>, time_hour <dttm>
# Were delayed by at least an hour, but made up over 30 minutes in flight
flights |>
filter(dep_delay >= 60 & dep_delay - arr_delay > 30) |>
relocate(dep_delay, arr_delay)
## # A tibble: 1,844 × 19
## dep_delay arr_delay year month day dep_time sched_dep_time arr_time
## <dbl> <dbl> <int> <int> <int> <int> <int> <int>
## 1 285 246 2013 1 1 2205 1720 46
## 2 116 73 2013 1 1 2326 2130 131
## 3 162 128 2013 1 3 1503 1221 1803
## 4 99 66 2013 1 3 1839 1700 2056
## 5 65 28 2013 1 3 1850 1745 2148
## 6 102 67 2013 1 3 1941 1759 2246
## 7 65 1 2013 1 3 1950 1845 2228
## 8 60 24 2013 1 3 2015 1915 2135
## 9 177 141 2013 1 3 2257 2000 45
## 10 137 105 2013 1 4 1917 1700 2135
## # ℹ 1,834 more rows
## # ℹ 11 more variables: sched_arr_time <int>, carrier <chr>, flight <int>,
## # tailnum <chr>, origin <chr>, dest <chr>, air_time <dbl>, distance <dbl>,
## # hour <dbl>, minute <dbl>, time_hour <dttm>
Question 2
Sort flights to find the flights with longest departure delays. Find the flights that left earliest in the morning.
## # A tibble: 336,776 × 19
## dep_delay dep_time year month day sched_dep_time arr_time sched_arr_time
## <dbl> <int> <int> <int> <int> <int> <int> <int>
## 1 271 1 2013 4 10 1930 106 2101
## 2 266 1 2013 5 22 1935 154 2140
## 3 251 1 2013 6 24 1950 105 2130
## 4 212 1 2013 7 1 2029 236 2359
## 5 181 1 2013 1 31 2100 124 2225
## 6 181 1 2013 2 11 2100 111 2225
## 7 153 1 2013 3 18 2128 247 2355
## 8 151 1 2013 6 25 2130 249 14
## 9 76 1 2013 2 24 2245 121 2354
## 10 72 1 2013 1 13 2249 108 2357
## # ℹ 336,766 more rows
## # ℹ 11 more variables: arr_delay <dbl>, carrier <chr>, flight <int>,
## # tailnum <chr>, origin <chr>, dest <chr>, air_time <dbl>, distance <dbl>,
## # hour <dbl>, minute <dbl>, time_hour <dttm>
Question 3
Sort flights to find the fastest flights. (Hint: Try including a math calculation inside of your function.)
## # A tibble: 336,776 × 20
## speed year month day dep_time sched_dep_time dep_delay arr_time
## <dbl> <int> <int> <int> <int> <int> <dbl> <int>
## 1 703. 2013 5 25 1709 1700 9 1923
## 2 650. 2013 7 2 1558 1513 45 1745
## 3 648 2013 5 13 2040 2025 15 2225
## 4 641. 2013 3 23 1914 1910 4 2045
## 5 591. 2013 1 12 1559 1600 -1 1849
## 6 564 2013 11 17 650 655 -5 1059
## 7 557. 2013 2 21 2355 2358 -3 412
## 8 556. 2013 11 17 759 800 -1 1212
## 9 554. 2013 11 16 2003 1925 38 17
## 10 554. 2013 11 16 2349 2359 -10 402
## # ℹ 336,766 more rows
## # ℹ 12 more variables: sched_arr_time <int>, arr_delay <dbl>, carrier <chr>,
## # flight <int>, tailnum <chr>, origin <chr>, dest <chr>, air_time <dbl>,
## # distance <dbl>, hour <dbl>, minute <dbl>, time_hour <dttm>
Question 4
Was there a flight on every day of 2013?
## # A tibble: 365 × 4
## year month day n
## <int> <int> <int> <int>
## 1 2013 11 28 634
## 2 2013 11 29 661
## 3 2013 1 19 674
## 4 2013 10 12 676
## 5 2013 1 26 680
## 6 2013 8 31 680
## 7 2013 2 2 682
## 8 2013 9 28 682
## 9 2013 2 9 684
## 10 2013 10 19 684
## # ℹ 355 more rows
## # A tibble: 365 × 3
## year month day
## <int> <int> <int>
## 1 2013 1 1
## 2 2013 1 2
## 3 2013 1 3
## 4 2013 1 4
## 5 2013 1 5
## 6 2013 1 6
## 7 2013 1 7
## 8 2013 1 8
## 9 2013 1 9
## 10 2013 1 10
## # ℹ 355 more rows
Question 5
Which flights traveled the farthest distance? Which traveled the least distance?
# Flights that traveled the farthest distance:
flights |>
arrange(desc(distance)) |>
relocate(distance)
## # A tibble: 336,776 × 19
## distance year month day dep_time sched_dep_time dep_delay arr_time
## <dbl> <int> <int> <int> <int> <int> <dbl> <int>
## 1 4983 2013 1 1 857 900 -3 1516
## 2 4983 2013 1 2 909 900 9 1525
## 3 4983 2013 1 3 914 900 14 1504
## 4 4983 2013 1 4 900 900 0 1516
## 5 4983 2013 1 5 858 900 -2 1519
## 6 4983 2013 1 6 1019 900 79 1558
## 7 4983 2013 1 7 1042 900 102 1620
## 8 4983 2013 1 8 901 900 1 1504
## 9 4983 2013 1 9 641 900 1301 1242
## 10 4983 2013 1 10 859 900 -1 1449
## # ℹ 336,766 more rows
## # ℹ 11 more variables: sched_arr_time <int>, arr_delay <dbl>, carrier <chr>,
## # flight <int>, tailnum <chr>, origin <chr>, dest <chr>, air_time <dbl>,
## # hour <dbl>, minute <dbl>, time_hour <dttm>
## # A tibble: 336,776 × 19
## distance year month day dep_time sched_dep_time dep_delay arr_time
## <dbl> <int> <int> <int> <int> <int> <dbl> <int>
## 1 17 2013 7 27 NA 106 NA NA
## 2 80 2013 1 3 2127 2129 -2 2222
## 3 80 2013 1 4 1240 1200 40 1333
## 4 80 2013 1 4 1829 1615 134 1937
## 5 80 2013 1 4 2128 2129 -1 2218
## 6 80 2013 1 5 1155 1200 -5 1241
## 7 80 2013 1 6 2125 2129 -4 2224
## 8 80 2013 1 7 2124 2129 -5 2212
## 9 80 2013 1 8 2127 2130 -3 2304
## 10 80 2013 1 9 2126 2129 -3 2217
## # ℹ 336,766 more rows
## # ℹ 11 more variables: sched_arr_time <int>, arr_delay <dbl>, carrier <chr>,
## # flight <int>, tailnum <chr>, origin <chr>, dest <chr>, air_time <dbl>,
## # hour <dbl>, minute <dbl>, time_hour <dttm>
Question 6
Does it matter what order you used filter() and arrange() if you’re using both? Why/why not? Think about the results and how much work the functions would have to do.
If arrange()
is used before filter()
, the sorting will be applied to all the rows in the dataset, including those that will be filtered out later. This can result in unnecessary work for the computer and slower performance. On the other hand, if filter()
is used before arrange()
, the sorting will only be applied to the remaining rows after filtering, which can be more efficient and faster.