Exercises (columns)

Question 1

Compare dep_time, sched_dep_time, and dep_delay. How would you expect those three numbers to be related?

The dep_time should be sched_dep_time + dep_delay.

flights |> 
  relocate(dep_time, sched_dep_time, dep_delay)

## # A tibble: 336,776 × 19
##    dep_time sched_dep_time dep_delay  year month   day arr_time sched_arr_time
##       <int>          <int>     <dbl> <int> <int> <int>    <int>          <int>
##  1      517            515         2  2013     1     1      830            819
##  2      533            529         4  2013     1     1      850            830
##  3      542            540         2  2013     1     1      923            850
##  4      544            545        -1  2013     1     1     1004           1022
##  5      554            600        -6  2013     1     1      812            837
##  6      554            558        -4  2013     1     1      740            728
##  7      555            600        -5  2013     1     1      913            854
##  8      557            600        -3  2013     1     1      709            723
##  9      557            600        -3  2013     1     1      838            846
## 10      558            600        -2  2013     1     1      753            745
## # ℹ 336,766 more rows
## # ℹ 11 more variables: arr_delay <dbl>, carrier <chr>, flight <int>,
## #   tailnum <chr>, origin <chr>, dest <chr>, air_time <dbl>, distance <dbl>,
## #   hour <dbl>, minute <dbl>, time_hour <dttm>

Question 2

Brainstorm as many ways as possible to select dep_time, dep_delay, arr_time, and arr_delay from flights.

You can select them by name, or the columns that starts with dep or with arr.

flights |> 
  select(dep_time, dep_delay, arr_time, arr_delay)

## # A tibble: 336,776 × 4
##    dep_time dep_delay arr_time arr_delay
##       <int>     <dbl>    <int>     <dbl>
##  1      517         2      830        11
##  2      533         4      850        20
##  3      542         2      923        33
##  4      544        -1     1004       -18
##  5      554        -6      812       -25
##  6      554        -4      740        12
##  7      555        -5      913        19
##  8      557        -3      709       -14
##  9      557        -3      838        -8
## 10      558        -2      753         8
## # ℹ 336,766 more rows

flights |> 
  select(starts_with("dep"), starts_with("arr"))

## # A tibble: 336,776 × 4
##    dep_time dep_delay arr_time arr_delay
##       <int>     <dbl>    <int>     <dbl>
##  1      517         2      830        11
##  2      533         4      850        20
##  3      542         2      923        33
##  4      544        -1     1004       -18
##  5      554        -6      812       -25
##  6      554        -4      740        12
##  7      555        -5      913        19
##  8      557        -3      709       -14
##  9      557        -3      838        -8
## 10      558        -2      753         8
## # ℹ 336,766 more rows

Question 3

What happens if you specify the name of the same variable multiple times in a select() call?

It will only show the column at the first position it was specified.

flights |> 
  select(dep_time, dep_delay, arr_time, arr_delay, dep_time)

## # A tibble: 336,776 × 4
##    dep_time dep_delay arr_time arr_delay
##       <int>     <dbl>    <int>     <dbl>
##  1      517         2      830        11
##  2      533         4      850        20
##  3      542         2      923        33
##  4      544        -1     1004       -18
##  5      554        -6      812       -25
##  6      554        -4      740        12
##  7      555        -5      913        19
##  8      557        -3      709       -14
##  9      557        -3      838        -8
## 10      558        -2      753         8
## # ℹ 336,766 more rows

Question 4

What does the any_of() function do? Why might it be helpful in conjunction with this vector?

variables <- c("year", "month", "day", "dep_delay", "arr_delay")

any_of() doesn’t check for missing variables. It is especially useful with negative selections, when you would like to make sure a variable is removed.

The order of selected columns is determined by the order in the vector.

variables <- c("year", "month", "day", "dep_delay", "arr_delay")

flights |> 
  select(any_of(variables))

## # A tibble: 336,776 × 5
##     year month   day dep_delay arr_delay
##    <int> <int> <int>     <dbl>     <dbl>
##  1  2013     1     1         2        11
##  2  2013     1     1         4        20
##  3  2013     1     1         2        33
##  4  2013     1     1        -1       -18
##  5  2013     1     1        -6       -25
##  6  2013     1     1        -4        12
##  7  2013     1     1        -5        19
##  8  2013     1     1        -3       -14
##  9  2013     1     1        -3        -8
## 10  2013     1     1        -2         8
## # ℹ 336,766 more rows

Question 5

Does the result of running the following code surprise you? How do the select helpers deal with upper and lower case by default? How can you change that default?

flights |> select(contains("TIME"))

## # A tibble: 336,776 × 6
##    dep_time sched_dep_time arr_time sched_arr_time air_time time_hour          
##       <int>          <int>    <int>          <int>    <dbl> <dttm>             
##  1      517            515      830            819      227 2013-01-01 05:00:00
##  2      533            529      850            830      227 2013-01-01 05:00:00
##  3      542            540      923            850      160 2013-01-01 05:00:00
##  4      544            545     1004           1022      183 2013-01-01 05:00:00
##  5      554            600      812            837      116 2013-01-01 06:00:00
##  6      554            558      740            728      150 2013-01-01 05:00:00
##  7      555            600      913            854      158 2013-01-01 06:00:00
##  8      557            600      709            723       53 2013-01-01 06:00:00
##  9      557            600      838            846      140 2013-01-01 06:00:00
## 10      558            600      753            745      138 2013-01-01 06:00:00
## # ℹ 336,766 more rows

It is suprising because “TIME” is in uppercase, so it seems that contains ignores the case.

To change this default behavior, set ignore.case = FALSE.

flights |> 
  select(contains("TIME", ignore.case = FALSE))

## # A tibble: 336,776 × 0

Question 6

Rename air_time to air_time_min to indicate units of measurement and move it to the beginning of the data frame.

flights |>
  rename(air_time_min = air_time) |>
  relocate(air_time_min)

## # A tibble: 336,776 × 19
##    air_time_min  year month   day dep_time sched_dep_time dep_delay arr_time
##           <dbl> <int> <int> <int>    <int>          <int>     <dbl>    <int>
##  1          227  2013     1     1      517            515         2      830
##  2          227  2013     1     1      533            529         4      850
##  3          160  2013     1     1      542            540         2      923
##  4          183  2013     1     1      544            545        -1     1004
##  5          116  2013     1     1      554            600        -6      812
##  6          150  2013     1     1      554            558        -4      740
##  7          158  2013     1     1      555            600        -5      913
##  8           53  2013     1     1      557            600        -3      709
##  9          140  2013     1     1      557            600        -3      838
## 10          138  2013     1     1      558            600        -2      753
## # ℹ 336,766 more rows
## # ℹ 11 more variables: sched_arr_time <int>, arr_delay <dbl>, carrier <chr>,
## #   flight <int>, tailnum <chr>, origin <chr>, dest <chr>, distance <dbl>,
## #   hour <dbl>, minute <dbl>, time_hour <dttm>

Question 7

Why doesn’t the following work, and what does the error mean?

flights |> 
  select(tailnum) |> 
  arrange(arr_delay)
#> Error in `arrange()`:
#> ℹ In argument: `..1 = arr_delay`.
#> Caused by error:
#> ! object 'arr_delay' not found

The code doesn’t work because the select() will give a tibble with only the tailnum column, so the arrange will not find the column arr_delay.