+ - 0:00:00
Notes for current slide
Notes for next slide

R for Data Science

Chapter 5: Data Transformation

Silvia Canelón

@spcanelon

2020-08-21

1 / 59

Outline

2 / 59

Outline

  • Introduction
3 / 59

Outline

  • Introduction

  • Filter rows with filter()

4 / 59

Outline

  • Introduction

  • Filter rows with filter()

  • Arrange rows with arrange()

5 / 59

Outline

  • Introduction

  • Filter rows with filter()

  • Arrange rows with arrange()

  • Select columns with select()

6 / 59

Outline

  • Introduction

  • Filter rows with filter()

  • Arrange rows with arrange()

  • Select columns with select()

  • Add new variables with mutate()

7 / 59

Outline

  • Introduction

  • Filter rows with filter()

  • Arrange rows with arrange()

  • Select columns with select()

  • Add new variables with mutate()

  • Grouped summaries with summarize()

8 / 59

Outline

  • Introduction

  • Filter rows with filter()

  • Arrange rows with arrange()

  • Select columns with select()

  • Add new variables with mutate()

  • Grouped summaries with summarize()

  • Grouped mutate (and filters)

9 / 59

Introduction

Data transformation helps you get the data in exactly the right form you need.

10 / 59

Introduction

Data transformation helps you get the data in exactly the right form you need.

  • create new variables
  • create summaries
  • rename variables
  • reorder observations
  • ...and more!
11 / 59

Introduction

Data transformation helps you get the data in exactly the right form you need.

  • create new variables
  • create summaries
  • rename variables
  • reorder observations
  • ...and more!

This chapter uses the nycflights13 📦

  • Data set includes 336,776 flights that departed from New York City in 2013
  • US Bureau of Transportation Statistics
12 / 59

Introduction

flights
## # A tibble: 336,776 x 19
## year month day dep_time sched_dep_time dep_delay arr_time sched_arr_time arr_delay carrier flight tailnum origin
## <int> <int> <int> <int> <int> <dbl> <int> <int> <dbl> <chr> <int> <chr> <chr>
## 1 2013 1 1 517 515 2 830 819 11 UA 1545 N14228 EWR
## 2 2013 1 1 533 529 4 850 830 20 UA 1714 N24211 LGA
## 3 2013 1 1 542 540 2 923 850 33 AA 1141 N619AA JFK
## 4 2013 1 1 544 545 -1 1004 1022 -18 B6 725 N804JB JFK
## 5 2013 1 1 554 600 -6 812 837 -25 DL 461 N668DN LGA
## 6 2013 1 1 554 558 -4 740 728 12 UA 1696 N39463 EWR
## 7 2013 1 1 555 600 -5 913 854 19 B6 507 N516JB EWR
## 8 2013 1 1 557 600 -3 709 723 -14 EV 5708 N829AS LGA
## 9 2013 1 1 557 600 -3 838 846 -8 B6 79 N593JB JFK
## 10 2013 1 1 558 600 -2 753 745 8 AA 301 N3ALAA LGA
## # … with 336,766 more rows, and 6 more variables: dest <chr>, air_time <dbl>, distance <dbl>, hour <dbl>,
## # minute <dbl>, time_hour <dttm>
13 / 59

Introduction

head(flights, 4)
## # A tibble: 4 x 19
## year month day dep_time sched_dep_time dep_delay arr_time sched_arr_time arr_delay carrier flight tailnum origin
## <int> <int> <int> <int> <int> <dbl> <int> <int> <dbl> <chr> <int> <chr> <chr>
## 1 2013 1 1 517 515 2 830 819 11 UA 1545 N14228 EWR
## 2 2013 1 1 533 529 4 850 830 20 UA 1714 N24211 LGA
## 3 2013 1 1 542 540 2 923 850 33 AA 1141 N619AA JFK
## 4 2013 1 1 544 545 -1 1004 1022 -18 B6 725 N804JB JFK
## # … with 6 more variables: dest <chr>, air_time <dbl>, distance <dbl>, hour <dbl>, minute <dbl>, time_hour <dttm>
14 / 59

Introduction

head(flights, 4)
## # A tibble: 4 x 19
## year month day dep_time sched_dep_time dep_delay arr_time sched_arr_time arr_delay carrier flight tailnum origin
## <int> <int> <int> <int> <int> <dbl> <int> <int> <dbl> <chr> <int> <chr> <chr>
## 1 2013 1 1 517 515 2 830 819 11 UA 1545 N14228 EWR
## 2 2013 1 1 533 529 4 850 830 20 UA 1714 N24211 LGA
## 3 2013 1 1 542 540 2 923 850 33 AA 1141 N619AA JFK
## 4 2013 1 1 544 545 -1 1004 1022 -18 B6 725 N804JB JFK
## # … with 6 more variables: dest <chr>, air_time <dbl>, distance <dbl>, hour <dbl>, minute <dbl>, time_hour <dttm>
  • int: integers
  • dbl: doubles, or real numbers
  • chr: character vectors, or strings
  • dttm: date-times (a date + a time)
15 / 59

Introduction

head(flights, 4)
## # A tibble: 4 x 19
## year month day dep_time sched_dep_time dep_delay arr_time sched_arr_time arr_delay carrier flight tailnum origin
## <int> <int> <int> <int> <int> <dbl> <int> <int> <dbl> <chr> <int> <chr> <chr>
## 1 2013 1 1 517 515 2 830 819 11 UA 1545 N14228 EWR
## 2 2013 1 1 533 529 4 850 830 20 UA 1714 N24211 LGA
## 3 2013 1 1 542 540 2 923 850 33 AA 1141 N619AA JFK
## 4 2013 1 1 544 545 -1 1004 1022 -18 B6 725 N804JB JFK
## # … with 6 more variables: dest <chr>, air_time <dbl>, distance <dbl>, hour <dbl>, minute <dbl>, time_hour <dttm>
  • int: integers
  • dbl: doubles, or real numbers
  • chr: character vectors, or strings
  • dttm: date-times (a date + a time)
  • lgl: logical, vectors with only TRUE or FALSE
  • fctr: factors, categorical variables
  • date: dates
16 / 59

Introduction

dplyr basics

  • Pick observations by their values (filter()).
  • Reorder the rows (arrange()).
  • Pick variables by their names (select()).
  • Create new variables with functions of existing variables (mutate()).
  • Collapse many values down to a single summary (summarise()).
17 / 59

Introduction

dplyr basics

  • Pick observations by their values (filter()).
  • Reorder the rows (arrange()).
  • Pick variables by their names (select()).
  • Create new variables with functions of existing variables (mutate()).
  • Collapse many values down to a single summary (summarise()).
  • + group_by() which gets the above functions to operate group-by-group rather than on the entire dataset

How do they work?

  1. First argument is data frame
  2. Other arguments describe what to do with the data frame
  3. Result is a new data frame!
18 / 59

Filter rows with filter()

Lets you subset observations based on their values.

19 / 59

Filter rows with filter()

Lets you subset observations based on their values.

filter(flights, month == 1, day == 1)
## # A tibble: 842 x 19
## year month day dep_time sched_dep_time dep_delay arr_time sched_arr_time arr_delay carrier flight tailnum origin
## <int> <int> <int> <int> <int> <dbl> <int> <int> <dbl> <chr> <int> <chr> <chr>
## 1 2013 1 1 517 515 2 830 819 11 UA 1545 N14228 EWR
## 2 2013 1 1 533 529 4 850 830 20 UA 1714 N24211 LGA
## 3 2013 1 1 542 540 2 923 850 33 AA 1141 N619AA JFK
## 4 2013 1 1 544 545 -1 1004 1022 -18 B6 725 N804JB JFK
## 5 2013 1 1 554 600 -6 812 837 -25 DL 461 N668DN LGA
## 6 2013 1 1 554 558 -4 740 728 12 UA 1696 N39463 EWR
## 7 2013 1 1 555 600 -5 913 854 19 B6 507 N516JB EWR
## 8 2013 1 1 557 600 -3 709 723 -14 EV 5708 N829AS LGA
## 9 2013 1 1 557 600 -3 838 846 -8 B6 79 N593JB JFK
## 10 2013 1 1 558 600 -2 753 745 8 AA 301 N3ALAA LGA
## # … with 832 more rows, and 6 more variables: dest <chr>, air_time <dbl>, distance <dbl>, hour <dbl>, minute <dbl>,
## # time_hour <dttm>
20 / 59

Comparisons

R provides the standard suite of comparison operators:

  • >
  • >=
  • <
  • <=
  • != (not equal)
  • == (equal)
21 / 59

Logical operators

  • & is "and"
  • | is "or"
  • ! is "not"

22 / 59

Missing values

  • Missing values a.k.a. NAs a.k.a. "not availables" represent an unknown value
23 / 59

Missing values

  • Missing values a.k.a. NAs a.k.a. "not availables" represent an unknown value

  • They are contagious! 🦠 ...meaning almost any operation involving an unknown value will also be unknown

24 / 59

Missing values

  • Missing values a.k.a. NAs a.k.a. "not availables" represent an unknown value

  • They are contagious! 🦠 ...meaning almost any operation involving an unknown value will also be unknown

Also important because...

filter() only includes rows where the condition is TRUE (i.e. it includes both FALSE and NA values)

25 / 59

Missing values

  • Missing values a.k.a. NAs a.k.a. "not availables" represent an unknown value

  • They are contagious! 🦠 ...meaning almost any operation involving an unknown value will also be unknown

Also important because...

filter() only includes rows where the condition is TRUE (i.e. it includes both FALSE and NA values)

If you want to preserve missing values you must ask for them explicitly:

filter(df, is.na(x) | x > 1)
26 / 59

Ex: Find all flights that were operated by United, American, or Delta

## # A tibble: 2 x 19
## year month day dep_time sched_dep_time dep_delay arr_time sched_arr_time arr_delay carrier flight tailnum origin
## <int> <int> <int> <int> <int> <dbl> <int> <int> <dbl> <chr> <int> <chr> <chr>
## 1 2013 1 1 517 515 2 830 819 11 UA 1545 N14228 EWR
## 2 2013 1 1 533 529 4 850 830 20 UA 1714 N24211 LGA
## # … with 6 more variables: dest <chr>, air_time <dbl>, distance <dbl>, hour <dbl>, minute <dbl>, time_hour <dttm>
27 / 59

Ex: Find all flights that were operated by United, American, or Delta

## # A tibble: 2 x 19
## year month day dep_time sched_dep_time dep_delay arr_time sched_arr_time arr_delay carrier flight tailnum origin
## <int> <int> <int> <int> <int> <dbl> <int> <int> <dbl> <chr> <int> <chr> <chr>
## 1 2013 1 1 517 515 2 830 819 11 UA 1545 N14228 EWR
## 2 2013 1 1 533 529 4 850 830 20 UA 1714 N24211 LGA
## # … with 6 more variables: dest <chr>, air_time <dbl>, distance <dbl>, hour <dbl>, minute <dbl>, time_hour <dttm>
## [1] "UA" "AA" "B6" "DL" "EV" "MQ" "US" "WN" "VX" "FL" "AS" "9E" "F9" "HA" "YV" "OO"
28 / 59

Ex: Find all flights that were operated by United, American, or Delta

## # A tibble: 2 x 19
## year month day dep_time sched_dep_time dep_delay arr_time sched_arr_time arr_delay carrier flight tailnum origin
## <int> <int> <int> <int> <int> <dbl> <int> <int> <dbl> <chr> <int> <chr> <chr>
## 1 2013 1 1 517 515 2 830 819 11 UA 1545 N14228 EWR
## 2 2013 1 1 533 529 4 850 830 20 UA 1714 N24211 LGA
## # … with 6 more variables: dest <chr>, air_time <dbl>, distance <dbl>, hour <dbl>, minute <dbl>, time_hour <dttm>
## [1] "UA" "AA" "B6" "DL" "EV" "MQ" "US" "WN" "VX" "FL" "AS" "9E" "F9" "HA" "YV" "OO"
# option 1
flights %>% filter(carrier = "UA" | "AA" | "DL")
# option 2
flights %>% filter(carrier == "UA" | "AA" | "DL")
# option 3
flights %>% filter(carrier == "UA" | carrier == "AA" | carrier == "DL")
# option 4
flights %>% filter(carrier %in% c("UA", "AA", "DL"))
29 / 59

Ex: Find all flights that were operated by United, American, or Delta

flights %>% filter(carrier = "UA" | "AA" | "DL")

Error: Problem with filter() input ..1. x Input ..1 is named. ℹ This usually means that you've used = instead of ==. ℹ Did you mean carrier == "UA" | "AA" | "DL"?

30 / 59

Ex: Find all flights that were operated by United, American, or Delta

flights %>% filter(carrier == "UA" | "AA" | "DL")

Error: Problem with filter() input ..1. x operations are possible only for numeric, logical or complex types ℹ Input ..1 is carrier == "UA" | "AA" | "DL".

31 / 59

Ex: Find all flights that were operated by United, American, or Delta

flights %>% filter(carrier == "UA" | carrier == "AA" | carrier == "DL")
## # A tibble: 139,504 x 19
## year month day dep_time sched_dep_time dep_delay arr_time sched_arr_time arr_delay carrier flight tailnum origin
## <int> <int> <int> <int> <int> <dbl> <int> <int> <dbl> <chr> <int> <chr> <chr>
## 1 2013 1 1 517 515 2 830 819 11 UA 1545 N14228 EWR
## 2 2013 1 1 533 529 4 850 830 20 UA 1714 N24211 LGA
## 3 2013 1 1 542 540 2 923 850 33 AA 1141 N619AA JFK
## 4 2013 1 1 554 600 -6 812 837 -25 DL 461 N668DN LGA
## 5 2013 1 1 554 558 -4 740 728 12 UA 1696 N39463 EWR
## 6 2013 1 1 558 600 -2 753 745 8 AA 301 N3ALAA LGA
## 7 2013 1 1 558 600 -2 924 917 7 UA 194 N29129 JFK
## 8 2013 1 1 558 600 -2 923 937 -14 UA 1124 N53441 EWR
## 9 2013 1 1 559 600 -1 941 910 31 AA 707 N3DUAA LGA
## 10 2013 1 1 559 600 -1 854 902 -8 UA 1187 N76515 EWR
## # … with 139,494 more rows, and 6 more variables: dest <chr>, air_time <dbl>, distance <dbl>, hour <dbl>,
## # minute <dbl>, time_hour <dttm>
32 / 59

Ex: Find all flights that were operated by United, American, or Delta

flights %>% filter(carrier %in% c("UA", "AA", "DL"))
## # A tibble: 139,504 x 19
## year month day dep_time sched_dep_time dep_delay arr_time sched_arr_time arr_delay carrier flight tailnum origin
## <int> <int> <int> <int> <int> <dbl> <int> <int> <dbl> <chr> <int> <chr> <chr>
## 1 2013 1 1 517 515 2 830 819 11 UA 1545 N14228 EWR
## 2 2013 1 1 533 529 4 850 830 20 UA 1714 N24211 LGA
## 3 2013 1 1 542 540 2 923 850 33 AA 1141 N619AA JFK
## 4 2013 1 1 554 600 -6 812 837 -25 DL 461 N668DN LGA
## 5 2013 1 1 554 558 -4 740 728 12 UA 1696 N39463 EWR
## 6 2013 1 1 558 600 -2 753 745 8 AA 301 N3ALAA LGA
## 7 2013 1 1 558 600 -2 924 917 7 UA 194 N29129 JFK
## 8 2013 1 1 558 600 -2 923 937 -14 UA 1124 N53441 EWR
## 9 2013 1 1 559 600 -1 941 910 31 AA 707 N3DUAA LGA
## 10 2013 1 1 559 600 -1 854 902 -8 UA 1187 N76515 EWR
## # … with 139,494 more rows, and 6 more variables: dest <chr>, air_time <dbl>, distance <dbl>, hour <dbl>,
## # minute <dbl>, time_hour <dttm>
33 / 59

Arrange rows with arrange()

Changes the order of the rows

Takes a data frame and orders it by a set of column names (or more complicated expressions)

34 / 59

Arrange rows with arrange()

Changes the order of the rows

Takes a data frame and orders it by a set of column names (or more complicated expressions)

  • orders a column in ascending order by default
  • desc() can be used to reorder a column in descending order
35 / 59

Arrange rows with arrange()

Changes the order of the rows

Takes a data frame and orders it by a set of column names (or more complicated expressions)

  • orders a column in ascending order by default
  • desc() can be used to reorder a column in descending order
  • Note: missing values (NAs) are always sorted at the end
36 / 59

Ex: Sort flights to find the most delayed flights. Find the flights that left earliest.

flights
## # A tibble: 336,776 x 19
## year month day dep_time sched_dep_time dep_delay arr_time sched_arr_time arr_delay carrier flight tailnum origin
## <int> <int> <int> <int> <int> <dbl> <int> <int> <dbl> <chr> <int> <chr> <chr>
## 1 2013 1 1 517 515 2 830 819 11 UA 1545 N14228 EWR
## 2 2013 1 1 533 529 4 850 830 20 UA 1714 N24211 LGA
## 3 2013 1 1 542 540 2 923 850 33 AA 1141 N619AA JFK
## 4 2013 1 1 544 545 -1 1004 1022 -18 B6 725 N804JB JFK
## 5 2013 1 1 554 600 -6 812 837 -25 DL 461 N668DN LGA
## 6 2013 1 1 554 558 -4 740 728 12 UA 1696 N39463 EWR
## 7 2013 1 1 555 600 -5 913 854 19 B6 507 N516JB EWR
## 8 2013 1 1 557 600 -3 709 723 -14 EV 5708 N829AS LGA
## 9 2013 1 1 557 600 -3 838 846 -8 B6 79 N593JB JFK
## 10 2013 1 1 558 600 -2 753 745 8 AA 301 N3ALAA LGA
## # … with 336,766 more rows, and 6 more variables: dest <chr>, air_time <dbl>, distance <dbl>, hour <dbl>,
## # minute <dbl>, time_hour <dttm>
37 / 59

Sort flights to find the most delayed flights

arrange(flights, desc(dep_delay))
## # A tibble: 3 x 19
## year month day dep_time sched_dep_time dep_delay arr_time sched_arr_time arr_delay carrier flight tailnum origin
## <int> <int> <int> <int> <int> <dbl> <int> <int> <dbl> <chr> <int> <chr> <chr>
## 1 2013 1 9 641 900 1301 1242 1530 1272 HA 51 N384HA JFK
## 2 2013 6 15 1432 1935 1137 1607 2120 1127 MQ 3535 N504MQ JFK
## 3 2013 1 10 1121 1635 1126 1239 1810 1109 MQ 3695 N517MQ EWR
## # … with 6 more variables: dest <chr>, air_time <dbl>, distance <dbl>, hour <dbl>, minute <dbl>, time_hour <dttm>
38 / 59

Sort flights to find the most delayed flights

arrange(flights, desc(dep_delay))
## # A tibble: 3 x 19
## year month day dep_time sched_dep_time dep_delay arr_time sched_arr_time arr_delay carrier flight tailnum origin
## <int> <int> <int> <int> <int> <dbl> <int> <int> <dbl> <chr> <int> <chr> <chr>
## 1 2013 1 9 641 900 1301 1242 1530 1272 HA 51 N384HA JFK
## 2 2013 6 15 1432 1935 1137 1607 2120 1127 MQ 3535 N504MQ JFK
## 3 2013 1 10 1121 1635 1126 1239 1810 1109 MQ 3695 N517MQ EWR
## # … with 6 more variables: dest <chr>, air_time <dbl>, distance <dbl>, hour <dbl>, minute <dbl>, time_hour <dttm>

Find the flights that left earliest

arrange(flights, dep_delay)
## # A tibble: 3 x 19
## year month day dep_time sched_dep_time dep_delay arr_time sched_arr_time arr_delay carrier flight tailnum origin
## <int> <int> <int> <int> <int> <dbl> <int> <int> <dbl> <chr> <int> <chr> <chr>
## 1 2013 12 7 2040 2123 -43 40 2352 48 B6 97 N592JB JFK
## 2 2013 2 3 2022 2055 -33 2240 2338 -58 DL 1715 N612DL LGA
## 3 2013 11 10 1408 1440 -32 1549 1559 -10 EV 5713 N825AS LGA
## # … with 6 more variables: dest <chr>, air_time <dbl>, distance <dbl>, hour <dbl>, minute <dbl>, time_hour <dttm>
39 / 59

Select columns with select()

Sometimes we get data sets that have hundreds or thousands of variables 😬

40 / 59

Select columns with select()

Sometimes we get data sets that have hundreds or thousands of variables 😬

select() gives us a way to narrow down to only those variables we're interested in! 🎉

41 / 59

Select columns with select()

Sometimes we get data sets that have hundreds or thousands of variables 😬

select() gives us a way to narrow down to only those variables we're interested in! 🎉

Helper functions include

  • starts_with("abc"): matches names that begin with “abc”.
  • ends_with("xyz"): matches names that end with “xyz”.
  • contains("ijk"): matches names that contain “ijk”.
  • matches("(.)\\1"): selects variables that match a regular expression. This one matches any variables that contain repeated characters (more in Chapter 14: Strings)
  • num_range("x", 1:3): matches x1, x2 and x3.
42 / 59

Select columns with select()

Sometimes we get data sets that have hundreds or thousands of variables 😬

select() gives us a way to narrow down to only those variables we're interested in! 🎉

Helper functions include

  • starts_with("abc"): matches names that begin with “abc”.
  • ends_with("xyz"): matches names that end with “xyz”.
  • contains("ijk"): matches names that contain “ijk”.
  • matches("(.)\\1"): selects variables that match a regular expression. This one matches any variables that contain repeated characters (more in Chapter 14: Strings)
  • num_range("x", 1:3): matches x1, x2 and x3.
  • everything(): useful when you want to move a few variables to the start of the data frame
43 / 59

Ex: Brainstorm as many ways as possible to select dep_time, dep_delay, arr_time, and arr_delay from flights.

flights
## # A tibble: 336,776 x 19
## year month day dep_time sched_dep_time dep_delay arr_time sched_arr_time arr_delay carrier flight tailnum origin
## <int> <int> <int> <int> <int> <dbl> <int> <int> <dbl> <chr> <int> <chr> <chr>
## 1 2013 1 1 517 515 2 830 819 11 UA 1545 N14228 EWR
## 2 2013 1 1 533 529 4 850 830 20 UA 1714 N24211 LGA
## 3 2013 1 1 542 540 2 923 850 33 AA 1141 N619AA JFK
## 4 2013 1 1 544 545 -1 1004 1022 -18 B6 725 N804JB JFK
## 5 2013 1 1 554 600 -6 812 837 -25 DL 461 N668DN LGA
## 6 2013 1 1 554 558 -4 740 728 12 UA 1696 N39463 EWR
## 7 2013 1 1 555 600 -5 913 854 19 B6 507 N516JB EWR
## 8 2013 1 1 557 600 -3 709 723 -14 EV 5708 N829AS LGA
## 9 2013 1 1 557 600 -3 838 846 -8 B6 79 N593JB JFK
## 10 2013 1 1 558 600 -2 753 745 8 AA 301 N3ALAA LGA
## # … with 336,766 more rows, and 6 more variables: dest <chr>, air_time <dbl>, distance <dbl>, hour <dbl>,
## # minute <dbl>, time_hour <dttm>
44 / 59

Ex: Brainstorm as many ways as possible to select dep_time, dep_delay, arr_time, and arr_delay from flights.

select(flights, dep_time, dep_delay, arr_time, arr_delay)
## # A tibble: 336,776 x 4
## dep_time dep_delay arr_time arr_delay
## <int> <dbl> <int> <dbl>
## 1 517 2 830 11
## 2 533 4 850 20
## 3 542 2 923 33
## 4 544 -1 1004 -18
## 5 554 -6 812 -25
## 6 554 -4 740 12
## 7 555 -5 913 19
## 8 557 -3 709 -14
## 9 557 -3 838 -8
## 10 558 -2 753 8
## # … with 336,766 more rows
45 / 59

Ex: Brainstorm as many ways as possible to select dep_time, dep_delay, arr_time, and arr_delay from flights.

select(flights, starts_with("dep") | starts_with("arr"))
## # A tibble: 336,776 x 4
## dep_time dep_delay arr_time arr_delay
## <int> <dbl> <int> <dbl>
## 1 517 2 830 11
## 2 533 4 850 20
## 3 542 2 923 33
## 4 544 -1 1004 -18
## 5 554 -6 812 -25
## 6 554 -4 740 12
## 7 555 -5 913 19
## 8 557 -3 709 -14
## 9 557 -3 838 -8
## 10 558 -2 753 8
## # … with 336,766 more rows
46 / 59

Ex: Brainstorm as many ways as possible to select dep_time, dep_delay, arr_time, and arr_delay from flights.

select(flights, ends_with("time") | ends_with("delay"))
## # A tibble: 336,776 x 7
## dep_time sched_dep_time arr_time sched_arr_time air_time dep_delay arr_delay
## <int> <int> <int> <int> <dbl> <dbl> <dbl>
## 1 517 515 830 819 227 2 11
## 2 533 529 850 830 227 4 20
## 3 542 540 923 850 160 2 33
## 4 544 545 1004 1022 183 -1 -18
## 5 554 600 812 837 116 -6 -25
## 6 554 558 740 728 150 -4 12
## 7 555 600 913 854 158 -5 19
## 8 557 600 709 723 53 -3 -14
## 9 557 600 838 846 140 -3 -8
## 10 558 600 753 745 138 -2 8
## # … with 336,766 more rows

Not quite! Why?

47 / 59

Add new variables with mutate()

Helps you add new columns that are functions of existing columns

48 / 59

Add new variables with mutate()

Helps you add new columns that are functions of existing columns

  • always adds columns at the end of your dataset
49 / 59

Add new variables with mutate()

Helps you add new columns that are functions of existing columns

  • always adds columns at the end of your dataset
  • if you want to only keep the new variables, you can use transmute()
50 / 59

Add new variables with mutate()

Can be used with arithmetic operators

  • +
  • -
  • *
  • /
  • ^
  • sum()
51 / 59

Add new variables with mutate()

Can be used with arithmetic operators

  • +
  • -
  • *
  • /
  • ^
  • sum()

modular arithmetic

  • %/% (integer division)
  • %% (remainder)
  • log()
  • log2()
  • log10()
52 / 59

Add new variables with mutate()

Can be used with arithmetic operators

  • +
  • -
  • *
  • /
  • ^
  • sum()

modular arithmetic

  • %/% (integer division)
  • %% (remainder)
  • log()
  • log2()
  • log10()
  • logical comparisons: <, <=, >, >=, !=
  • offsets: lead() and lag()
  • ranking: min_rank(), row_number(), dense_rank(), and more!
53 / 59

Grouped summaries with summarize()

Collapses a data frame into a single row and more useful when combined with group_by()

54 / 59

Grouped summaries with summarize()

Collapses a data frame into a single row and more useful when combined with group_by()

Arguably most useful when used with the pipe %>% operator

55 / 59

Grouped summaries with summarize()

Collapses a data frame into a single row and more useful when combined with group_by()

Arguably most useful when used with the pipe %>% operator

Go from this:

by_dest <- group_by(flights, dest)
delay <- summarize(by_dest, count = n(),
dis = mean(distance, na.rm = TRUE),
delay = mean(arr_del, na.rm = TRUE))
56 / 59

Grouped summaries with summarize()

Collapses a data frame into a single row and more useful when combined with group_by()

Arguably most useful when used with the pipe %>% operator

Go from this:

by_dest <- group_by(flights, dest)
delay <- summarize(by_dest, count = n(),
dis = mean(distance, na.rm = TRUE),
delay = mean(arr_del, na.rm = TRUE))

To this! 🎉

flights %>%
group_by(dest) %>%
summarize(by_dest, count = n(),
dis = mean(distance, na.rm = TRUE),
delay = mean(arr_del, na.rm = TRUE))

Spot the differences!

57 / 59

Grouped mutate (and filters)

Most useful in combination with summarize()

58 / 59

Grouped mutate (and filters)

Most useful in combination with summarize()

For example:

popular_dests <- flights %>%
group_by(dest) %>%
filter(n() > 365)

What does it do?

59 / 59

Outline

2 / 59
Paused

Help

Keyboard shortcuts

, , Pg Up, k Go to previous slide
, , Pg Dn, Space, j Go to next slide
Home Go to first slide
End Go to last slide
Number + Return Go to specific slide
b / m / f Toggle blackout / mirrored / fullscreen mode
c Clone slideshow
p Toggle presenter mode
t Restart the presentation timer
?, h Toggle this help
Esc Back to slideshow