Exercises (columns)
Question 1
Compare dep_time
, sched_dep_time
, and dep_delay
. How would you expect those three numbers to be related?
The dep_time
should be sched_dep_time + dep_delay
.
## # A tibble: 336,776 × 19
## dep_time sched_dep_time dep_delay year month day arr_time sched_arr_time
## <int> <int> <dbl> <int> <int> <int> <int> <int>
## 1 517 515 2 2013 1 1 830 819
## 2 533 529 4 2013 1 1 850 830
## 3 542 540 2 2013 1 1 923 850
## 4 544 545 -1 2013 1 1 1004 1022
## 5 554 600 -6 2013 1 1 812 837
## 6 554 558 -4 2013 1 1 740 728
## 7 555 600 -5 2013 1 1 913 854
## 8 557 600 -3 2013 1 1 709 723
## 9 557 600 -3 2013 1 1 838 846
## 10 558 600 -2 2013 1 1 753 745
## # ℹ 336,766 more rows
## # ℹ 11 more variables: arr_delay <dbl>, carrier <chr>, flight <int>,
## # tailnum <chr>, origin <chr>, dest <chr>, air_time <dbl>, distance <dbl>,
## # hour <dbl>, minute <dbl>, time_hour <dttm>
Question 2
Brainstorm as many ways as possible to select dep_time
, dep_delay
, arr_time
, and arr_delay
from flights
.
You can select them by name, or the columns that starts with dep or with arr.
## # A tibble: 336,776 × 4
## dep_time dep_delay arr_time arr_delay
## <int> <dbl> <int> <dbl>
## 1 517 2 830 11
## 2 533 4 850 20
## 3 542 2 923 33
## 4 544 -1 1004 -18
## 5 554 -6 812 -25
## 6 554 -4 740 12
## 7 555 -5 913 19
## 8 557 -3 709 -14
## 9 557 -3 838 -8
## 10 558 -2 753 8
## # ℹ 336,766 more rows
## # A tibble: 336,776 × 4
## dep_time dep_delay arr_time arr_delay
## <int> <dbl> <int> <dbl>
## 1 517 2 830 11
## 2 533 4 850 20
## 3 542 2 923 33
## 4 544 -1 1004 -18
## 5 554 -6 812 -25
## 6 554 -4 740 12
## 7 555 -5 913 19
## 8 557 -3 709 -14
## 9 557 -3 838 -8
## 10 558 -2 753 8
## # ℹ 336,766 more rows
Question 3
What happens if you specify the name of the same variable multiple times in a select()
call?
It will only show the column at the first position it was specified.
## # A tibble: 336,776 × 4
## dep_time dep_delay arr_time arr_delay
## <int> <dbl> <int> <dbl>
## 1 517 2 830 11
## 2 533 4 850 20
## 3 542 2 923 33
## 4 544 -1 1004 -18
## 5 554 -6 812 -25
## 6 554 -4 740 12
## 7 555 -5 913 19
## 8 557 -3 709 -14
## 9 557 -3 838 -8
## 10 558 -2 753 8
## # ℹ 336,766 more rows
Question 4
What does the any_of()
function do? Why might it be helpful in conjunction with this vector?
variables <- c("year", "month", "day", "dep_delay", "arr_delay")
any_of()
doesn’t check for missing variables. It is especially useful with negative selections, when you would like to make sure a variable is removed.
The order of selected columns is determined by the order in the vector.
variables <- c("year", "month", "day", "dep_delay", "arr_delay")
flights |>
select(any_of(variables))
## # A tibble: 336,776 × 5
## year month day dep_delay arr_delay
## <int> <int> <int> <dbl> <dbl>
## 1 2013 1 1 2 11
## 2 2013 1 1 4 20
## 3 2013 1 1 2 33
## 4 2013 1 1 -1 -18
## 5 2013 1 1 -6 -25
## 6 2013 1 1 -4 12
## 7 2013 1 1 -5 19
## 8 2013 1 1 -3 -14
## 9 2013 1 1 -3 -8
## 10 2013 1 1 -2 8
## # ℹ 336,766 more rows
Question 5
Does the result of running the following code surprise you? How do the select helpers deal with upper and lower case by default? How can you change that default?
## # A tibble: 336,776 × 6
## dep_time sched_dep_time arr_time sched_arr_time air_time time_hour
## <int> <int> <int> <int> <dbl> <dttm>
## 1 517 515 830 819 227 2013-01-01 05:00:00
## 2 533 529 850 830 227 2013-01-01 05:00:00
## 3 542 540 923 850 160 2013-01-01 05:00:00
## 4 544 545 1004 1022 183 2013-01-01 05:00:00
## 5 554 600 812 837 116 2013-01-01 06:00:00
## 6 554 558 740 728 150 2013-01-01 05:00:00
## 7 555 600 913 854 158 2013-01-01 06:00:00
## 8 557 600 709 723 53 2013-01-01 06:00:00
## 9 557 600 838 846 140 2013-01-01 06:00:00
## 10 558 600 753 745 138 2013-01-01 06:00:00
## # ℹ 336,766 more rows
It is suprising because “TIME” is in uppercase, so it seems that contains
ignores the case.
To change this default behavior, set ignore.case = FALSE.
## # A tibble: 336,776 × 0
Question 6
Rename air_time
to air_time_min
to indicate units of measurement and move it to the beginning of the data frame.
## # A tibble: 336,776 × 19
## air_time_min year month day dep_time sched_dep_time dep_delay arr_time
## <dbl> <int> <int> <int> <int> <int> <dbl> <int>
## 1 227 2013 1 1 517 515 2 830
## 2 227 2013 1 1 533 529 4 850
## 3 160 2013 1 1 542 540 2 923
## 4 183 2013 1 1 544 545 -1 1004
## 5 116 2013 1 1 554 600 -6 812
## 6 150 2013 1 1 554 558 -4 740
## 7 158 2013 1 1 555 600 -5 913
## 8 53 2013 1 1 557 600 -3 709
## 9 140 2013 1 1 557 600 -3 838
## 10 138 2013 1 1 558 600 -2 753
## # ℹ 336,766 more rows
## # ℹ 11 more variables: sched_arr_time <int>, arr_delay <dbl>, carrier <chr>,
## # flight <int>, tailnum <chr>, origin <chr>, dest <chr>, distance <dbl>,
## # hour <dbl>, minute <dbl>, time_hour <dttm>
Question 7
Why doesn’t the following work, and what does the error mean?
flights |>
select(tailnum) |>
arrange(arr_delay)
#> Error in `arrange()`:
#> ℹ In argument: `..1 = arr_delay`.
#> Caused by error:
#> ! object 'arr_delay' not found
The code doesn’t work because the select()
will give a tibble with only the tailnum
column, so the arrange will not find the column arr_delay
.