class: center, middle, inverse, title-slide # R for Data Science ## Chapter 5: Data Transformation ### Silvia Canelón ###
@spcanelon
### 2020-08-21 --- # Outline -- - Introduction -- - Filter rows with `filter()` -- - Arrange rows with `arrange()` -- - Select columns with `select()` -- - Add new variables with `mutate()` -- - Grouped summaries with `summarize()` -- - Grouped mutate (and filters) --- ## Introduction ### Data transformation helps you get the data in exactly the right form you need. -- - create new variables - create summaries - rename variables - reorder observations - ...and more! -- ### This chapter uses the `nycflights13` 📦 - Data set includes 336,776 flights that departed from New York City in 2013 - US Bureau of Transportation Statistics --- ## Introduction ```r flights ``` ``` ## # A tibble: 336,776 x 19 ## year month day dep_time sched_dep_time dep_delay arr_time sched_arr_time arr_delay carrier flight tailnum origin ## <int> <int> <int> <int> <int> <dbl> <int> <int> <dbl> <chr> <int> <chr> <chr> ## 1 2013 1 1 517 515 2 830 819 11 UA 1545 N14228 EWR ## 2 2013 1 1 533 529 4 850 830 20 UA 1714 N24211 LGA ## 3 2013 1 1 542 540 2 923 850 33 AA 1141 N619AA JFK ## 4 2013 1 1 544 545 -1 1004 1022 -18 B6 725 N804JB JFK ## 5 2013 1 1 554 600 -6 812 837 -25 DL 461 N668DN LGA ## 6 2013 1 1 554 558 -4 740 728 12 UA 1696 N39463 EWR ## 7 2013 1 1 555 600 -5 913 854 19 B6 507 N516JB EWR ## 8 2013 1 1 557 600 -3 709 723 -14 EV 5708 N829AS LGA ## 9 2013 1 1 557 600 -3 838 846 -8 B6 79 N593JB JFK ## 10 2013 1 1 558 600 -2 753 745 8 AA 301 N3ALAA LGA ## # … with 336,766 more rows, and 6 more variables: dest <chr>, air_time <dbl>, distance <dbl>, hour <dbl>, ## # minute <dbl>, time_hour <dttm> ``` --- ## Introduction ```r head(flights, 4) ``` ``` ## # A tibble: 4 x 19 ## year month day dep_time sched_dep_time dep_delay arr_time sched_arr_time arr_delay carrier flight tailnum origin ## <int> <int> <int> <int> <int> <dbl> <int> <int> <dbl> <chr> <int> <chr> <chr> ## 1 2013 1 1 517 515 2 830 819 11 UA 1545 N14228 EWR ## 2 2013 1 1 533 529 4 850 830 20 UA 1714 N24211 LGA ## 3 2013 1 1 542 540 2 923 850 33 AA 1141 N619AA JFK ## 4 2013 1 1 544 545 -1 1004 1022 -18 B6 725 N804JB JFK ## # … with 6 more variables: dest <chr>, air_time <dbl>, distance <dbl>, hour <dbl>, minute <dbl>, time_hour <dttm> ``` -- .pull-left[ - `int`: integers - `dbl`: doubles, or real numbers - `chr`: character vectors, or strings - `dttm`: date-times (a date + a time) ] -- .pull-right[ - `lgl`: logical, vectors with only TRUE or FALSE - `fctr`: factors, categorical variables - `date`: dates ] --- ## Introduction ### **dplyr** basics - Pick observations by their values (`filter()`). - Reorder the rows (`arrange()`). - Pick variables by their names (`select()`). - Create new variables with functions of existing variables (`mutate()`). - Collapse many values down to a single summary (`summarise()`). -- - **+** `group_by()` which gets the above functions to operate group-by-group rather than on the entire dataset ### How do they work? 1. First argument is data frame 2. Other arguments describe what to do with the data frame 3. Result is a new data frame! --- ## Filter rows with `filter()` Lets you subset observations based on their values. -- ```r filter(flights, month == 1, day == 1) ``` ``` ## # A tibble: 842 x 19 ## year month day dep_time sched_dep_time dep_delay arr_time sched_arr_time arr_delay carrier flight tailnum origin ## <int> <int> <int> <int> <int> <dbl> <int> <int> <dbl> <chr> <int> <chr> <chr> ## 1 2013 1 1 517 515 2 830 819 11 UA 1545 N14228 EWR ## 2 2013 1 1 533 529 4 850 830 20 UA 1714 N24211 LGA ## 3 2013 1 1 542 540 2 923 850 33 AA 1141 N619AA JFK ## 4 2013 1 1 544 545 -1 1004 1022 -18 B6 725 N804JB JFK ## 5 2013 1 1 554 600 -6 812 837 -25 DL 461 N668DN LGA ## 6 2013 1 1 554 558 -4 740 728 12 UA 1696 N39463 EWR ## 7 2013 1 1 555 600 -5 913 854 19 B6 507 N516JB EWR ## 8 2013 1 1 557 600 -3 709 723 -14 EV 5708 N829AS LGA ## 9 2013 1 1 557 600 -3 838 846 -8 B6 79 N593JB JFK ## 10 2013 1 1 558 600 -2 753 745 8 AA 301 N3ALAA LGA ## # … with 832 more rows, and 6 more variables: dest <chr>, air_time <dbl>, distance <dbl>, hour <dbl>, minute <dbl>, ## # time_hour <dttm> ``` --- ### Comparisons R provides the standard suite of comparison operators: - `>` - `>=` - `<` - `<=` - `!=` (not equal) - `==` (equal) --- ### Logical operators - `&` is "and" - `|` is "or" - `!` is "not" .center[ <img src="https://d33wubrfki0l68.cloudfront.net/01f4b6d39d2be8269740a3ad7946faa79f7243cf/8369a/diagrams/transform-logical.png" height="400" /> ] --- ### Missing values - Missing values a.k.a. `NA`s a.k.a. "not availables" represent an unknown value -- - They are contagious! 🦠 ...meaning almost any operation involving an unknown value will also be unknown -- ### Also important because... `filter()` only includes rows where the condition is `TRUE` (i.e. it includes both `FALSE` and `NA` values) -- If you want to preserve missing values you must ask for them explicitly: ```r filter(df, is.na(x) | x > 1) ``` --- ### Ex: Find all flights that were operated by United, American, or Delta ``` ## # A tibble: 2 x 19 ## year month day dep_time sched_dep_time dep_delay arr_time sched_arr_time arr_delay carrier flight tailnum origin ## <int> <int> <int> <int> <int> <dbl> <int> <int> <dbl> <chr> <int> <chr> <chr> ## 1 2013 1 1 517 515 2 830 819 11 UA 1545 N14228 EWR ## 2 2013 1 1 533 529 4 850 830 20 UA 1714 N24211 LGA ## # … with 6 more variables: dest <chr>, air_time <dbl>, distance <dbl>, hour <dbl>, minute <dbl>, time_hour <dttm> ``` -- ``` ## [1] "UA" "AA" "B6" "DL" "EV" "MQ" "US" "WN" "VX" "FL" "AS" "9E" "F9" "HA" "YV" "OO" ``` -- ```r # option 1 flights %>% filter(carrier = "UA" | "AA" | "DL") # option 2 flights %>% filter(carrier == "UA" | "AA" | "DL") # option 3 flights %>% filter(carrier == "UA" | carrier == "AA" | carrier == "DL") # option 4 flights %>% filter(carrier %in% c("UA", "AA", "DL")) ``` --- #### Ex: Find all flights that were operated by United, American, or Delta ```r flights %>% filter(carrier = "UA" | "AA" | "DL") ``` > Error: Problem with `filter()` input `..1`. x Input `..1` is named. ℹ This usually means that you've used `=` instead of `==`. ℹ Did you mean `carrier == "UA" | "AA" | "DL"`? --- #### Ex: Find all flights that were operated by United, American, or Delta ```r flights %>% filter(carrier == "UA" | "AA" | "DL") ``` > Error: Problem with `filter()` input `..1`. x operations are possible only for numeric, logical or complex types ℹ Input `..1` is `carrier == "UA" | "AA" | "DL"`. --- #### Ex: Find all flights that were operated by United, American, or Delta ```r flights %>% filter(carrier == "UA" | carrier == "AA" | carrier == "DL") ``` ``` ## # A tibble: 139,504 x 19 ## year month day dep_time sched_dep_time dep_delay arr_time sched_arr_time arr_delay carrier flight tailnum origin ## <int> <int> <int> <int> <int> <dbl> <int> <int> <dbl> <chr> <int> <chr> <chr> ## 1 2013 1 1 517 515 2 830 819 11 UA 1545 N14228 EWR ## 2 2013 1 1 533 529 4 850 830 20 UA 1714 N24211 LGA ## 3 2013 1 1 542 540 2 923 850 33 AA 1141 N619AA JFK ## 4 2013 1 1 554 600 -6 812 837 -25 DL 461 N668DN LGA ## 5 2013 1 1 554 558 -4 740 728 12 UA 1696 N39463 EWR ## 6 2013 1 1 558 600 -2 753 745 8 AA 301 N3ALAA LGA ## 7 2013 1 1 558 600 -2 924 917 7 UA 194 N29129 JFK ## 8 2013 1 1 558 600 -2 923 937 -14 UA 1124 N53441 EWR ## 9 2013 1 1 559 600 -1 941 910 31 AA 707 N3DUAA LGA ## 10 2013 1 1 559 600 -1 854 902 -8 UA 1187 N76515 EWR ## # … with 139,494 more rows, and 6 more variables: dest <chr>, air_time <dbl>, distance <dbl>, hour <dbl>, ## # minute <dbl>, time_hour <dttm> ``` --- #### Ex: Find all flights that were operated by United, American, or Delta ```r flights %>% filter(carrier %in% c("UA", "AA", "DL")) ``` ``` ## # A tibble: 139,504 x 19 ## year month day dep_time sched_dep_time dep_delay arr_time sched_arr_time arr_delay carrier flight tailnum origin ## <int> <int> <int> <int> <int> <dbl> <int> <int> <dbl> <chr> <int> <chr> <chr> ## 1 2013 1 1 517 515 2 830 819 11 UA 1545 N14228 EWR ## 2 2013 1 1 533 529 4 850 830 20 UA 1714 N24211 LGA ## 3 2013 1 1 542 540 2 923 850 33 AA 1141 N619AA JFK ## 4 2013 1 1 554 600 -6 812 837 -25 DL 461 N668DN LGA ## 5 2013 1 1 554 558 -4 740 728 12 UA 1696 N39463 EWR ## 6 2013 1 1 558 600 -2 753 745 8 AA 301 N3ALAA LGA ## 7 2013 1 1 558 600 -2 924 917 7 UA 194 N29129 JFK ## 8 2013 1 1 558 600 -2 923 937 -14 UA 1124 N53441 EWR ## 9 2013 1 1 559 600 -1 941 910 31 AA 707 N3DUAA LGA ## 10 2013 1 1 559 600 -1 854 902 -8 UA 1187 N76515 EWR ## # … with 139,494 more rows, and 6 more variables: dest <chr>, air_time <dbl>, distance <dbl>, hour <dbl>, ## # minute <dbl>, time_hour <dttm> ``` --- ## Arrange rows with `arrange()` ### Changes the order of the rows Takes a data frame and orders it by a set of column names (or more complicated expressions) -- - orders a column in ascending order by default - `desc()` can be used to reorder a column in `desc`ending order -- - **Note:** missing values (`NA`s) are always sorted at the end --- ### Ex: Sort flights to find the most delayed flights. Find the flights that left earliest. ```r flights ``` ``` ## # A tibble: 336,776 x 19 ## year month day dep_time sched_dep_time dep_delay arr_time sched_arr_time arr_delay carrier flight tailnum origin ## <int> <int> <int> <int> <int> <dbl> <int> <int> <dbl> <chr> <int> <chr> <chr> ## 1 2013 1 1 517 515 2 830 819 11 UA 1545 N14228 EWR ## 2 2013 1 1 533 529 4 850 830 20 UA 1714 N24211 LGA ## 3 2013 1 1 542 540 2 923 850 33 AA 1141 N619AA JFK ## 4 2013 1 1 544 545 -1 1004 1022 -18 B6 725 N804JB JFK ## 5 2013 1 1 554 600 -6 812 837 -25 DL 461 N668DN LGA ## 6 2013 1 1 554 558 -4 740 728 12 UA 1696 N39463 EWR ## 7 2013 1 1 555 600 -5 913 854 19 B6 507 N516JB EWR ## 8 2013 1 1 557 600 -3 709 723 -14 EV 5708 N829AS LGA ## 9 2013 1 1 557 600 -3 838 846 -8 B6 79 N593JB JFK ## 10 2013 1 1 558 600 -2 753 745 8 AA 301 N3ALAA LGA ## # … with 336,766 more rows, and 6 more variables: dest <chr>, air_time <dbl>, distance <dbl>, hour <dbl>, ## # minute <dbl>, time_hour <dttm> ``` --- #### Sort flights to find the most delayed flights ```r arrange(flights, desc(dep_delay)) ``` ``` ## # A tibble: 3 x 19 ## year month day dep_time sched_dep_time dep_delay arr_time sched_arr_time arr_delay carrier flight tailnum origin ## <int> <int> <int> <int> <int> <dbl> <int> <int> <dbl> <chr> <int> <chr> <chr> ## 1 2013 1 9 641 900 1301 1242 1530 1272 HA 51 N384HA JFK ## 2 2013 6 15 1432 1935 1137 1607 2120 1127 MQ 3535 N504MQ JFK ## 3 2013 1 10 1121 1635 1126 1239 1810 1109 MQ 3695 N517MQ EWR ## # … with 6 more variables: dest <chr>, air_time <dbl>, distance <dbl>, hour <dbl>, minute <dbl>, time_hour <dttm> ``` -- #### Find the flights that left earliest ```r arrange(flights, dep_delay) ``` ``` ## # A tibble: 3 x 19 ## year month day dep_time sched_dep_time dep_delay arr_time sched_arr_time arr_delay carrier flight tailnum origin ## <int> <int> <int> <int> <int> <dbl> <int> <int> <dbl> <chr> <int> <chr> <chr> ## 1 2013 12 7 2040 2123 -43 40 2352 48 B6 97 N592JB JFK ## 2 2013 2 3 2022 2055 -33 2240 2338 -58 DL 1715 N612DL LGA ## 3 2013 11 10 1408 1440 -32 1549 1559 -10 EV 5713 N825AS LGA ## # … with 6 more variables: dest <chr>, air_time <dbl>, distance <dbl>, hour <dbl>, minute <dbl>, time_hour <dttm> ``` --- ## Select columns with `select()` Sometimes we get data sets that have hundreds or thousands of variables 😬 -- `select()` gives us a way to narrow down to only those variables we're interested in! 🎉 -- #### Helper functions include - `starts_with("abc")`: matches names that begin with “abc”. - `ends_with("xyz")`: matches names that end with “xyz”. - `contains("ijk")`: matches names that contain “ijk”. - `matches("(.)\\1")`: selects variables that match a regular expression. This one matches any variables that contain repeated characters (more in **Chapter 14: Strings**) - `num_range("x", 1:3)`: matches x1, x2 and x3. -- - ⭐ `everything()`: useful when you want to move a few variables to the start of the data frame --- ### Ex: Brainstorm as many ways as possible to select `dep_time`, `dep_delay`, `arr_time`, and `arr_delay` from flights. ```r flights ``` ``` ## # A tibble: 336,776 x 19 ## year month day dep_time sched_dep_time dep_delay arr_time sched_arr_time arr_delay carrier flight tailnum origin ## <int> <int> <int> <int> <int> <dbl> <int> <int> <dbl> <chr> <int> <chr> <chr> ## 1 2013 1 1 517 515 2 830 819 11 UA 1545 N14228 EWR ## 2 2013 1 1 533 529 4 850 830 20 UA 1714 N24211 LGA ## 3 2013 1 1 542 540 2 923 850 33 AA 1141 N619AA JFK ## 4 2013 1 1 544 545 -1 1004 1022 -18 B6 725 N804JB JFK ## 5 2013 1 1 554 600 -6 812 837 -25 DL 461 N668DN LGA ## 6 2013 1 1 554 558 -4 740 728 12 UA 1696 N39463 EWR ## 7 2013 1 1 555 600 -5 913 854 19 B6 507 N516JB EWR ## 8 2013 1 1 557 600 -3 709 723 -14 EV 5708 N829AS LGA ## 9 2013 1 1 557 600 -3 838 846 -8 B6 79 N593JB JFK ## 10 2013 1 1 558 600 -2 753 745 8 AA 301 N3ALAA LGA ## # … with 336,766 more rows, and 6 more variables: dest <chr>, air_time <dbl>, distance <dbl>, hour <dbl>, ## # minute <dbl>, time_hour <dttm> ``` --- #### Ex: Brainstorm as many ways as possible to select `dep_time`, `dep_delay`, `arr_time`, and `arr_delay` from flights. ```r select(flights, dep_time, dep_delay, arr_time, arr_delay) ``` ``` ## # A tibble: 336,776 x 4 ## dep_time dep_delay arr_time arr_delay ## <int> <dbl> <int> <dbl> ## 1 517 2 830 11 ## 2 533 4 850 20 ## 3 542 2 923 33 ## 4 544 -1 1004 -18 ## 5 554 -6 812 -25 ## 6 554 -4 740 12 ## 7 555 -5 913 19 ## 8 557 -3 709 -14 ## 9 557 -3 838 -8 ## 10 558 -2 753 8 ## # … with 336,766 more rows ``` --- #### Ex: Brainstorm as many ways as possible to select `dep_time`, `dep_delay`, `arr_time`, and `arr_delay` from flights. ```r select(flights, starts_with("dep") | starts_with("arr")) ``` ``` ## # A tibble: 336,776 x 4 ## dep_time dep_delay arr_time arr_delay ## <int> <dbl> <int> <dbl> ## 1 517 2 830 11 ## 2 533 4 850 20 ## 3 542 2 923 33 ## 4 544 -1 1004 -18 ## 5 554 -6 812 -25 ## 6 554 -4 740 12 ## 7 555 -5 913 19 ## 8 557 -3 709 -14 ## 9 557 -3 838 -8 ## 10 558 -2 753 8 ## # … with 336,766 more rows ``` --- #### Ex: Brainstorm as many ways as possible to select `dep_time`, `dep_delay`, `arr_time`, and `arr_delay` from flights. ```r select(flights, ends_with("time") | ends_with("delay")) ``` ``` ## # A tibble: 336,776 x 7 ## dep_time sched_dep_time arr_time sched_arr_time air_time dep_delay arr_delay ## <int> <int> <int> <int> <dbl> <dbl> <dbl> ## 1 517 515 830 819 227 2 11 ## 2 533 529 850 830 227 4 20 ## 3 542 540 923 850 160 2 33 ## 4 544 545 1004 1022 183 -1 -18 ## 5 554 600 812 837 116 -6 -25 ## 6 554 558 740 728 150 -4 12 ## 7 555 600 913 854 158 -5 19 ## 8 557 600 709 723 53 -3 -14 ## 9 557 600 838 846 140 -3 -8 ## 10 558 600 753 745 138 -2 8 ## # … with 336,766 more rows ``` ## Not quite! Why? --- ## Add new variables with `mutate()` Helps you add new columns that are functions of existing columns -- - always adds columns at the end of your dataset -- - if you want to only keep the new variables, you can use `transmute()` --- ### Add new variables with `mutate()` .pull-left[ Can be used with _arithmetic operators_ - `+` - `-` - `*` - `/` - `^` - `sum()` ] -- .pull-right[ _modular arithmetic_ - `%/%` (integer division) - `%%` (remainder) - `log()` - `log2()` - `log10()` ] -- .pull-left[ - _logical comparisons_: `<`, `<=`, `>`, `>=`, `!=` - _offsets_: `lead()` and `lag()` - _ranking_: `min_rank()`, `row_number()`, `dense_rank()`, and more! ] --- ## Grouped summaries with `summarize()` Collapses a data frame into a single row and more useful when combined with `group_by()` -- Arguably **most** useful when used with the pipe `%>%` operator -- Go from this: ```r by_dest <- group_by(flights, dest) delay <- summarize(by_dest, count = n(), dis = mean(distance, na.rm = TRUE), delay = mean(arr_del, na.rm = TRUE)) ``` -- To this! 🎉 ```r flights %>% group_by(dest) %>% summarize(by_dest, count = n(), dis = mean(distance, na.rm = TRUE), delay = mean(arr_del, na.rm = TRUE)) ``` #### Spot the differences! --- ## Grouped mutate (and filters) Most useful in combination with `summarize()` -- For example: ```r popular_dests <- flights %>% group_by(dest) %>% filter(n() > 365) ``` ### What does it do?