Exercises (rows)

(NOTE: using relocate() and mutate() from next section for better visualization)

Question 1

In a single pipeline for each condition, find all flights that meet the condition:

# Had an arrival delay of two or more hours
flights |>
  filter(arr_delay >= 120) |>
  arrange(desc(arr_delay)) |>
  relocate(arr_delay)
## # A tibble: 10,200 × 19
##    arr_delay  year month   day dep_time sched_dep_time dep_delay arr_time
##        <dbl> <int> <int> <int>    <int>          <int>     <dbl>    <int>
##  1      1272  2013     1     9      641            900      1301     1242
##  2      1127  2013     6    15     1432           1935      1137     1607
##  3      1109  2013     1    10     1121           1635      1126     1239
##  4      1007  2013     9    20     1139           1845      1014     1457
##  5       989  2013     7    22      845           1600      1005     1044
##  6       931  2013     4    10     1100           1900       960     1342
##  7       915  2013     3    17     2321            810       911      135
##  8       895  2013     7    22     2257            759       898      121
##  9       878  2013    12     5      756           1700       896     1058
## 10       875  2013     5     3     1133           2055       878     1250
## # ℹ 10,190 more rows
## # ℹ 11 more variables: sched_arr_time <int>, carrier <chr>, flight <int>,
## #   tailnum <chr>, origin <chr>, dest <chr>, air_time <dbl>, distance <dbl>,
## #   hour <dbl>, minute <dbl>, time_hour <dttm>
# Flew to Houston (IAH or HOU)
flights |>
  filter(dest %in% c("IAH", "HOU")) |>
  relocate(dest)
## # A tibble: 9,313 × 19
##    dest   year month   day dep_time sched_dep_time dep_delay arr_time
##    <chr> <int> <int> <int>    <int>          <int>     <dbl>    <int>
##  1 IAH    2013     1     1      517            515         2      830
##  2 IAH    2013     1     1      533            529         4      850
##  3 IAH    2013     1     1      623            627        -4      933
##  4 IAH    2013     1     1      728            732        -4     1041
##  5 IAH    2013     1     1      739            739         0     1104
##  6 IAH    2013     1     1      908            908         0     1228
##  7 IAH    2013     1     1     1028           1026         2     1350
##  8 IAH    2013     1     1     1044           1045        -1     1352
##  9 IAH    2013     1     1     1114            900       134     1447
## 10 IAH    2013     1     1     1205           1200         5     1503
## # ℹ 9,303 more rows
## # ℹ 11 more variables: sched_arr_time <int>, arr_delay <dbl>, carrier <chr>,
## #   flight <int>, tailnum <chr>, origin <chr>, air_time <dbl>, distance <dbl>,
## #   hour <dbl>, minute <dbl>, time_hour <dttm>
# Were operated by United, American, or Delta
flights |>
  filter(carrier %in% c("UA", "DL", "AA")) |>
  relocate(carrier)
## # A tibble: 139,504 × 19
##    carrier  year month   day dep_time sched_dep_time dep_delay arr_time
##    <chr>   <int> <int> <int>    <int>          <int>     <dbl>    <int>
##  1 UA       2013     1     1      517            515         2      830
##  2 UA       2013     1     1      533            529         4      850
##  3 AA       2013     1     1      542            540         2      923
##  4 DL       2013     1     1      554            600        -6      812
##  5 UA       2013     1     1      554            558        -4      740
##  6 AA       2013     1     1      558            600        -2      753
##  7 UA       2013     1     1      558            600        -2      924
##  8 UA       2013     1     1      558            600        -2      923
##  9 AA       2013     1     1      559            600        -1      941
## 10 UA       2013     1     1      559            600        -1      854
## # ℹ 139,494 more rows
## # ℹ 11 more variables: sched_arr_time <int>, arr_delay <dbl>, flight <int>,
## #   tailnum <chr>, origin <chr>, dest <chr>, air_time <dbl>, distance <dbl>,
## #   hour <dbl>, minute <dbl>, time_hour <dttm>
# Departed in summer (July, August, and September)
flights |>
  filter(month %in% c(7, 8, 9)) |>
  relocate(month)
## # A tibble: 86,326 × 19
##    month  year   day dep_time sched_dep_time dep_delay arr_time sched_arr_time
##    <int> <int> <int>    <int>          <int>     <dbl>    <int>          <int>
##  1     7  2013     1        1           2029       212      236           2359
##  2     7  2013     1        2           2359         3      344            344
##  3     7  2013     1       29           2245       104      151              1
##  4     7  2013     1       43           2130       193      322             14
##  5     7  2013     1       44           2150       174      300            100
##  6     7  2013     1       46           2051       235      304           2358
##  7     7  2013     1       48           2001       287      308           2305
##  8     7  2013     1       58           2155       183      335             43
##  9     7  2013     1      100           2146       194      327             30
## 10     7  2013     1      100           2245       135      337            135
## # ℹ 86,316 more rows
## # ℹ 11 more variables: arr_delay <dbl>, carrier <chr>, flight <int>,
## #   tailnum <chr>, origin <chr>, dest <chr>, air_time <dbl>, distance <dbl>,
## #   hour <dbl>, minute <dbl>, time_hour <dttm>
# Arrived more than two hours late, but didn’t leave late
flights |> 
  filter(arr_delay >= 120 & dep_delay <= 0) |>
  relocate(arr_delay, dep_delay)
## # A tibble: 29 × 19
##    arr_delay dep_delay  year month   day dep_time sched_dep_time arr_time
##        <dbl>     <dbl> <int> <int> <int>    <int>          <int>    <int>
##  1       124        -1  2013     1    27     1419           1420     1754
##  2       130         0  2013    10     7     1350           1350     1736
##  3       124        -2  2013    10     7     1357           1359     1858
##  4       122        -3  2013    10    16      657            700     1258
##  5       194        -2  2013    11     1      658            700     1329
##  6       140        -3  2013     3    18     1844           1847       39
##  7       124        -5  2013     4    17     1635           1640     2049
##  8       179        -2  2013     4    18      558            600     1149
##  9       143        -5  2013     4    18      655            700     1213
## 10       127        -3  2013     5    22     1827           1830     2217
## # ℹ 19 more rows
## # ℹ 11 more variables: sched_arr_time <int>, carrier <chr>, flight <int>,
## #   tailnum <chr>, origin <chr>, dest <chr>, air_time <dbl>, distance <dbl>,
## #   hour <dbl>, minute <dbl>, time_hour <dttm>
# Were delayed by at least an hour, but made up over 30 minutes in flight
flights |> 
  filter(dep_delay >= 60 & dep_delay - arr_delay > 30) |>
  relocate(dep_delay, arr_delay)
## # A tibble: 1,844 × 19
##    dep_delay arr_delay  year month   day dep_time sched_dep_time arr_time
##        <dbl>     <dbl> <int> <int> <int>    <int>          <int>    <int>
##  1       285       246  2013     1     1     2205           1720       46
##  2       116        73  2013     1     1     2326           2130      131
##  3       162       128  2013     1     3     1503           1221     1803
##  4        99        66  2013     1     3     1839           1700     2056
##  5        65        28  2013     1     3     1850           1745     2148
##  6       102        67  2013     1     3     1941           1759     2246
##  7        65         1  2013     1     3     1950           1845     2228
##  8        60        24  2013     1     3     2015           1915     2135
##  9       177       141  2013     1     3     2257           2000       45
## 10       137       105  2013     1     4     1917           1700     2135
## # ℹ 1,834 more rows
## # ℹ 11 more variables: sched_arr_time <int>, carrier <chr>, flight <int>,
## #   tailnum <chr>, origin <chr>, dest <chr>, air_time <dbl>, distance <dbl>,
## #   hour <dbl>, minute <dbl>, time_hour <dttm>

Question 2

Sort flights to find the flights with longest departure delays. Find the flights that left earliest in the morning.

flights |> 
  arrange(desc(dep_delay)) |> 
  arrange(dep_time) |>
  relocate(dep_delay, dep_time)
## # A tibble: 336,776 × 19
##    dep_delay dep_time  year month   day sched_dep_time arr_time sched_arr_time
##        <dbl>    <int> <int> <int> <int>          <int>    <int>          <int>
##  1       271        1  2013     4    10           1930      106           2101
##  2       266        1  2013     5    22           1935      154           2140
##  3       251        1  2013     6    24           1950      105           2130
##  4       212        1  2013     7     1           2029      236           2359
##  5       181        1  2013     1    31           2100      124           2225
##  6       181        1  2013     2    11           2100      111           2225
##  7       153        1  2013     3    18           2128      247           2355
##  8       151        1  2013     6    25           2130      249             14
##  9        76        1  2013     2    24           2245      121           2354
## 10        72        1  2013     1    13           2249      108           2357
## # ℹ 336,766 more rows
## # ℹ 11 more variables: arr_delay <dbl>, carrier <chr>, flight <int>,
## #   tailnum <chr>, origin <chr>, dest <chr>, air_time <dbl>, distance <dbl>,
## #   hour <dbl>, minute <dbl>, time_hour <dttm>

Question 3

Sort flights to find the fastest flights. (Hint: Try including a math calculation inside of your function.)

flights |> 
  mutate(speed = distance / (air_time / 60)) |>
  arrange(desc(speed)) |>
  relocate(speed)
## # A tibble: 336,776 × 20
##    speed  year month   day dep_time sched_dep_time dep_delay arr_time
##    <dbl> <int> <int> <int>    <int>          <int>     <dbl>    <int>
##  1  703.  2013     5    25     1709           1700         9     1923
##  2  650.  2013     7     2     1558           1513        45     1745
##  3  648   2013     5    13     2040           2025        15     2225
##  4  641.  2013     3    23     1914           1910         4     2045
##  5  591.  2013     1    12     1559           1600        -1     1849
##  6  564   2013    11    17      650            655        -5     1059
##  7  557.  2013     2    21     2355           2358        -3      412
##  8  556.  2013    11    17      759            800        -1     1212
##  9  554.  2013    11    16     2003           1925        38       17
## 10  554.  2013    11    16     2349           2359       -10      402
## # ℹ 336,766 more rows
## # ℹ 12 more variables: sched_arr_time <int>, arr_delay <dbl>, carrier <chr>,
## #   flight <int>, tailnum <chr>, origin <chr>, dest <chr>, air_time <dbl>,
## #   distance <dbl>, hour <dbl>, minute <dbl>, time_hour <dttm>

Question 4

Was there a flight on every day of 2013?

# using count
flights |>
  count(year, month, day, sort = TRUE) |>
  arrange(n)
## # A tibble: 365 × 4
##     year month   day     n
##    <int> <int> <int> <int>
##  1  2013    11    28   634
##  2  2013    11    29   661
##  3  2013     1    19   674
##  4  2013    10    12   676
##  5  2013     1    26   680
##  6  2013     8    31   680
##  7  2013     2     2   682
##  8  2013     9    28   682
##  9  2013     2     9   684
## 10  2013    10    19   684
## # ℹ 355 more rows
# using distinct
flights |> 
  distinct(year, month, day)
## # A tibble: 365 × 3
##     year month   day
##    <int> <int> <int>
##  1  2013     1     1
##  2  2013     1     2
##  3  2013     1     3
##  4  2013     1     4
##  5  2013     1     5
##  6  2013     1     6
##  7  2013     1     7
##  8  2013     1     8
##  9  2013     1     9
## 10  2013     1    10
## # ℹ 355 more rows
# Yes, there was a flight every day as the resulting tibble has 365 rows

Question 5

Which flights traveled the farthest distance? Which traveled the least distance?

# Flights that traveled the farthest distance:
flights |> 
  arrange(desc(distance)) |>
  relocate(distance)
## # A tibble: 336,776 × 19
##    distance  year month   day dep_time sched_dep_time dep_delay arr_time
##       <dbl> <int> <int> <int>    <int>          <int>     <dbl>    <int>
##  1     4983  2013     1     1      857            900        -3     1516
##  2     4983  2013     1     2      909            900         9     1525
##  3     4983  2013     1     3      914            900        14     1504
##  4     4983  2013     1     4      900            900         0     1516
##  5     4983  2013     1     5      858            900        -2     1519
##  6     4983  2013     1     6     1019            900        79     1558
##  7     4983  2013     1     7     1042            900       102     1620
##  8     4983  2013     1     8      901            900         1     1504
##  9     4983  2013     1     9      641            900      1301     1242
## 10     4983  2013     1    10      859            900        -1     1449
## # ℹ 336,766 more rows
## # ℹ 11 more variables: sched_arr_time <int>, arr_delay <dbl>, carrier <chr>,
## #   flight <int>, tailnum <chr>, origin <chr>, dest <chr>, air_time <dbl>,
## #   hour <dbl>, minute <dbl>, time_hour <dttm>
# Flights that traveled the shortest distance
flights |> 
  arrange(distance) |>
  relocate(distance)
## # A tibble: 336,776 × 19
##    distance  year month   day dep_time sched_dep_time dep_delay arr_time
##       <dbl> <int> <int> <int>    <int>          <int>     <dbl>    <int>
##  1       17  2013     7    27       NA            106        NA       NA
##  2       80  2013     1     3     2127           2129        -2     2222
##  3       80  2013     1     4     1240           1200        40     1333
##  4       80  2013     1     4     1829           1615       134     1937
##  5       80  2013     1     4     2128           2129        -1     2218
##  6       80  2013     1     5     1155           1200        -5     1241
##  7       80  2013     1     6     2125           2129        -4     2224
##  8       80  2013     1     7     2124           2129        -5     2212
##  9       80  2013     1     8     2127           2130        -3     2304
## 10       80  2013     1     9     2126           2129        -3     2217
## # ℹ 336,766 more rows
## # ℹ 11 more variables: sched_arr_time <int>, arr_delay <dbl>, carrier <chr>,
## #   flight <int>, tailnum <chr>, origin <chr>, dest <chr>, air_time <dbl>,
## #   hour <dbl>, minute <dbl>, time_hour <dttm>

Question 6

Does it matter what order you used filter() and arrange() if you’re using both? Why/why not? Think about the results and how much work the functions would have to do.

If arrange() is used before filter(), the sorting will be applied to all the rows in the dataset, including those that will be filtered out later. This can result in unnecessary work for the computer and slower performance. On the other hand, if filter() is used before arrange(), the sorting will only be applied to the remaining rows after filtering, which can be more efficient and faster.