Add new variables with mutate()
mutate() adds new columns that are computed from the values of existing columns. The resulting data frame contains both the existing and the new columns.
Compute the gain, how much time a delayed flight made up in the air, and the speed in miles per hour:
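The call that produces the output below is not shown, so here is a minimal sketch of it. It assumes that flights is the data set from the nycflights13 package and that dplyr is loaded. The gain formula matches the one used later in this section; the speed formula is an assumption based on distance being recorded in miles and air_time in minutes.

```r
library(dplyr)
library(nycflights13)  # assumed source of the flights data frame

flights |>
  mutate(
    # minutes made up in the air: positive when the flight arrived less late than it left
    gain = dep_delay - arr_delay,
    # miles per minute, scaled to miles per hour (assumed units)
    speed = distance / air_time * 60
  )
```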
## # A tibble: 336,776 × 21
## year month day dep_time sched_dep_time dep_delay arr_time sched_arr_time
## <int> <int> <int> <int> <int> <dbl> <int> <int>
## 1 2013 1 1 517 515 2 830 819
## 2 2013 1 1 533 529 4 850 830
## 3 2013 1 1 542 540 2 923 850
## 4 2013 1 1 544 545 -1 1004 1022
## 5 2013 1 1 554 600 -6 812 837
## 6 2013 1 1 554 558 -4 740 728
## 7 2013 1 1 555 600 -5 913 854
## 8 2013 1 1 557 600 -3 709 723
## 9 2013 1 1 557 600 -3 838 846
## 10 2013 1 1 558 600 -2 753 745
## # ℹ 336,766 more rows
## # ℹ 13 more variables: arr_delay <dbl>, carrier <chr>, flight <int>,
## # tailnum <chr>, origin <chr>, dest <chr>, air_time <dbl>, distance <dbl>,
## # hour <dbl>, minute <dbl>, time_hour <dttm>, gain <dbl>, speed <dbl>
By default, mutate() adds new columns on the right-hand side of the data frame. Use the .before argument to add them to the left-hand side instead:
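A sketch of a call that places the new columns first, under the same assumptions as before (dplyr loaded, flights taken from nycflights13, speed formula inferred from the units). Passing the position 1 to .before is one way to do this:

```r
library(dplyr)
library(nycflights13)  # assumed source of the flights data frame

flights |>
  mutate(
    gain = dep_delay - arr_delay,
    speed = distance / air_time * 60,
    .before = 1  # insert the new columns before the first existing column
  )
```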
## # A tibble: 336,776 × 21
## gain speed year month day dep_time sched_dep_time dep_delay arr_time
## <dbl> <dbl> <int> <int> <int> <int> <int> <dbl> <int>
## 1 -9 370. 2013 1 1 517 515 2 830
## 2 -16 374. 2013 1 1 533 529 4 850
## 3 -31 408. 2013 1 1 542 540 2 923
## 4 17 517. 2013 1 1 544 545 -1 1004
## 5 19 394. 2013 1 1 554 600 -6 812
## 6 -16 288. 2013 1 1 554 558 -4 740
## 7 -24 404. 2013 1 1 555 600 -5 913
## 8 11 259. 2013 1 1 557 600 -3 709
## 9 5 405. 2013 1 1 557 600 -3 838
## 10 -10 319. 2013 1 1 558 600 -2 753
## # ℹ 336,766 more rows
## # ℹ 12 more variables: sched_arr_time <int>, arr_delay <dbl>, carrier <chr>,
## # flight <int>, tailnum <chr>, origin <chr>, dest <chr>, air_time <dbl>,
## # distance <dbl>, hour <dbl>, minute <dbl>, time_hour <dttm>
Use the .after argument to place the new variables after a given variable instead. Both .before and .after accept a variable name as well as a position.
Add the new variables after day:
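A sketch of a call that uses a variable name rather than a position, again assuming dplyr is loaded, flights comes from nycflights13, and speed is computed from miles and minutes:

```r
library(dplyr)
library(nycflights13)  # assumed source of the flights data frame

flights |>
  mutate(
    gain = dep_delay - arr_delay,
    speed = distance / air_time * 60,
    .after = day  # insert the new columns directly after the day column
  )
```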
## # A tibble: 336,776 × 21
## year month day gain speed dep_time sched_dep_time dep_delay arr_time
## <int> <int> <int> <dbl> <dbl> <int> <int> <dbl> <int>
## 1 2013 1 1 -9 370. 517 515 2 830
## 2 2013 1 1 -16 374. 533 529 4 850
## 3 2013 1 1 -31 408. 542 540 2 923
## 4 2013 1 1 17 517. 544 545 -1 1004
## 5 2013 1 1 19 394. 554 600 -6 812
## 6 2013 1 1 -16 288. 554 558 -4 740
## 7 2013 1 1 -24 404. 555 600 -5 913
## 8 2013 1 1 11 259. 557 600 -3 709
## 9 2013 1 1 5 405. 557 600 -3 838
## 10 2013 1 1 -10 319. 558 600 -2 753
## # ℹ 336,766 more rows
## # ℹ 12 more variables: sched_arr_time <int>, arr_delay <dbl>, carrier <chr>,
## # flight <int>, tailnum <chr>, origin <chr>, dest <chr>, air_time <dbl>,
## # distance <dbl>, hour <dbl>, minute <dbl>, time_hour <dttm>
You can use the .keep argument to control which variables are kept after a mutate() operation. Setting .keep = "used" retains only the columns that were used or created in the mutate() step. In the example below, those are dep_delay, arr_delay, air_time, gain, hours, and gain_per_hour.
flights |>
  mutate(
    gain = dep_delay - arr_delay,
    hours = air_time / 60,
    gain_per_hour = gain / hours,
    .keep = "used"
  )
## # A tibble: 336,776 × 6
## dep_delay arr_delay air_time gain hours gain_per_hour
## <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 2 11 227 -9 3.78 -2.38
## 2 4 20 227 -16 3.78 -4.23
## 3 2 33 160 -31 2.67 -11.6
## 4 -1 -18 183 17 3.05 5.57
## 5 -6 -25 116 19 1.93 9.83
## 6 -4 12 150 -16 2.5 -6.4
## 7 -5 19 158 -24 2.63 -9.11
## 8 -3 -14 53 11 0.883 12.5
## 9 -3 -8 140 5 2.33 2.14
## 10 -2 8 138 -10 2.3 -4.35
## # ℹ 336,766 more rows
If you don’t assign the result of the computation back to flights, the new variables will only be displayed, not stored. Consider whether to overwrite flights with more variables or create a new object for future use.
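As a sketch of the second option, the result can be stored under a new name so that the original data frame stays untouched. The name flights_gain is hypothetical and chosen only for illustration; the example assumes dplyr is loaded and flights comes from nycflights13.

```r
library(dplyr)
library(nycflights13)  # assumed source of the flights data frame

# Store the result in a new object (hypothetical name) instead of only printing it
flights_gain <- flights |>
  mutate(gain = dep_delay - arr_delay)

"gain" %in% names(flights)       # the original data frame is unchanged
"gain" %in% names(flights_gain)  # the new object carries the extra column
```

Assigning to flights instead of flights_gain would overwrite the original with the wider data frame.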