Introduction
Filter rows with filter()
Introduction
Filter rows with filter()
Arrange rows with arrange()
Introduction
Filter rows with filter()
Arrange rows with arrange()
Select columns with select()
Introduction
Filter rows with filter()
Arrange rows with arrange()
Select columns with select()
Add new variables with mutate()
Introduction
Filter rows with filter()
Arrange rows with arrange()
Select columns with select()
Add new variables with mutate()
Grouped summaries with summarize()
Introduction
Filter rows with filter()
Arrange rows with arrange()
Select columns with select()
Add new variables with mutate()
Grouped summaries with summarize()
Grouped mutate (and filters)
nycflights13
📦flights
## # A tibble: 336,776 x 19## year month day dep_time sched_dep_time dep_delay arr_time sched_arr_time arr_delay carrier flight tailnum origin## <int> <int> <int> <int> <int> <dbl> <int> <int> <dbl> <chr> <int> <chr> <chr> ## 1 2013 1 1 517 515 2 830 819 11 UA 1545 N14228 EWR ## 2 2013 1 1 533 529 4 850 830 20 UA 1714 N24211 LGA ## 3 2013 1 1 542 540 2 923 850 33 AA 1141 N619AA JFK ## 4 2013 1 1 544 545 -1 1004 1022 -18 B6 725 N804JB JFK ## 5 2013 1 1 554 600 -6 812 837 -25 DL 461 N668DN LGA ## 6 2013 1 1 554 558 -4 740 728 12 UA 1696 N39463 EWR ## 7 2013 1 1 555 600 -5 913 854 19 B6 507 N516JB EWR ## 8 2013 1 1 557 600 -3 709 723 -14 EV 5708 N829AS LGA ## 9 2013 1 1 557 600 -3 838 846 -8 B6 79 N593JB JFK ## 10 2013 1 1 558 600 -2 753 745 8 AA 301 N3ALAA LGA ## # … with 336,766 more rows, and 6 more variables: dest <chr>, air_time <dbl>, distance <dbl>, hour <dbl>,## # minute <dbl>, time_hour <dttm>
head(flights, 4)
## # A tibble: 4 x 19## year month day dep_time sched_dep_time dep_delay arr_time sched_arr_time arr_delay carrier flight tailnum origin## <int> <int> <int> <int> <int> <dbl> <int> <int> <dbl> <chr> <int> <chr> <chr> ## 1 2013 1 1 517 515 2 830 819 11 UA 1545 N14228 EWR ## 2 2013 1 1 533 529 4 850 830 20 UA 1714 N24211 LGA ## 3 2013 1 1 542 540 2 923 850 33 AA 1141 N619AA JFK ## 4 2013 1 1 544 545 -1 1004 1022 -18 B6 725 N804JB JFK ## # … with 6 more variables: dest <chr>, air_time <dbl>, distance <dbl>, hour <dbl>, minute <dbl>, time_hour <dttm>
head(flights, 4)
## # A tibble: 4 x 19## year month day dep_time sched_dep_time dep_delay arr_time sched_arr_time arr_delay carrier flight tailnum origin## <int> <int> <int> <int> <int> <dbl> <int> <int> <dbl> <chr> <int> <chr> <chr> ## 1 2013 1 1 517 515 2 830 819 11 UA 1545 N14228 EWR ## 2 2013 1 1 533 529 4 850 830 20 UA 1714 N24211 LGA ## 3 2013 1 1 542 540 2 923 850 33 AA 1141 N619AA JFK ## 4 2013 1 1 544 545 -1 1004 1022 -18 B6 725 N804JB JFK ## # … with 6 more variables: dest <chr>, air_time <dbl>, distance <dbl>, hour <dbl>, minute <dbl>, time_hour <dttm>
int
: integers dbl
: doubles, or real numberschr
: character vectors, or strings dttm
: date-times (a date + a time)head(flights, 4)
## # A tibble: 4 x 19## year month day dep_time sched_dep_time dep_delay arr_time sched_arr_time arr_delay carrier flight tailnum origin## <int> <int> <int> <int> <int> <dbl> <int> <int> <dbl> <chr> <int> <chr> <chr> ## 1 2013 1 1 517 515 2 830 819 11 UA 1545 N14228 EWR ## 2 2013 1 1 533 529 4 850 830 20 UA 1714 N24211 LGA ## 3 2013 1 1 542 540 2 923 850 33 AA 1141 N619AA JFK ## 4 2013 1 1 544 545 -1 1004 1022 -18 B6 725 N804JB JFK ## # … with 6 more variables: dest <chr>, air_time <dbl>, distance <dbl>, hour <dbl>, minute <dbl>, time_hour <dttm>
int
: integers dbl
: doubles, or real numberschr
: character vectors, or strings dttm
: date-times (a date + a time)lgl
: logical, vectors with only TRUE or FALSEfctr
: factors, categorical variablesdate
: datesfilter()
).arrange()
).select()
).mutate()
).summarise()
).filter()
).arrange()
).select()
).mutate()
).summarise()
).group_by()
which gets the above functions to operate group-by-group rather than on the entire datasetfilter()
Lets you subset observations based on their values.
filter()
Lets you subset observations based on their values.
filter(flights, month == 1, day == 1)
## # A tibble: 842 x 19## year month day dep_time sched_dep_time dep_delay arr_time sched_arr_time arr_delay carrier flight tailnum origin## <int> <int> <int> <int> <int> <dbl> <int> <int> <dbl> <chr> <int> <chr> <chr> ## 1 2013 1 1 517 515 2 830 819 11 UA 1545 N14228 EWR ## 2 2013 1 1 533 529 4 850 830 20 UA 1714 N24211 LGA ## 3 2013 1 1 542 540 2 923 850 33 AA 1141 N619AA JFK ## 4 2013 1 1 544 545 -1 1004 1022 -18 B6 725 N804JB JFK ## 5 2013 1 1 554 600 -6 812 837 -25 DL 461 N668DN LGA ## 6 2013 1 1 554 558 -4 740 728 12 UA 1696 N39463 EWR ## 7 2013 1 1 555 600 -5 913 854 19 B6 507 N516JB EWR ## 8 2013 1 1 557 600 -3 709 723 -14 EV 5708 N829AS LGA ## 9 2013 1 1 557 600 -3 838 846 -8 B6 79 N593JB JFK ## 10 2013 1 1 558 600 -2 753 745 8 AA 301 N3ALAA LGA ## # … with 832 more rows, and 6 more variables: dest <chr>, air_time <dbl>, distance <dbl>, hour <dbl>, minute <dbl>,## # time_hour <dttm>
R provides the standard suite of comparison operators:
>
>=
<
<=
!=
(not equal)==
(equal)&
is "and"|
is "or"!
is "not"NA
s a.k.a. "not availables" represent an unknown valueMissing values a.k.a. NA
s a.k.a. "not availables" represent an unknown value
They are contagious! 🦠 ...meaning almost any operation involving an unknown value will also be unknown
Missing values a.k.a. NA
s a.k.a. "not availables" represent an unknown value
They are contagious! 🦠 ...meaning almost any operation involving an unknown value will also be unknown
filter()
only includes rows where the condition is TRUE
(i.e. it includes both FALSE
and NA
values)
Missing values a.k.a. NA
s a.k.a. "not availables" represent an unknown value
They are contagious! 🦠 ...meaning almost any operation involving an unknown value will also be unknown
filter()
only includes rows where the condition is TRUE
(i.e. it includes both FALSE
and NA
values)
If you want to preserve missing values you must ask for them explicitly:
filter(df, is.na(x) | x > 1)
## # A tibble: 2 x 19## year month day dep_time sched_dep_time dep_delay arr_time sched_arr_time arr_delay carrier flight tailnum origin## <int> <int> <int> <int> <int> <dbl> <int> <int> <dbl> <chr> <int> <chr> <chr> ## 1 2013 1 1 517 515 2 830 819 11 UA 1545 N14228 EWR ## 2 2013 1 1 533 529 4 850 830 20 UA 1714 N24211 LGA ## # … with 6 more variables: dest <chr>, air_time <dbl>, distance <dbl>, hour <dbl>, minute <dbl>, time_hour <dttm>
## # A tibble: 2 x 19## year month day dep_time sched_dep_time dep_delay arr_time sched_arr_time arr_delay carrier flight tailnum origin## <int> <int> <int> <int> <int> <dbl> <int> <int> <dbl> <chr> <int> <chr> <chr> ## 1 2013 1 1 517 515 2 830 819 11 UA 1545 N14228 EWR ## 2 2013 1 1 533 529 4 850 830 20 UA 1714 N24211 LGA ## # … with 6 more variables: dest <chr>, air_time <dbl>, distance <dbl>, hour <dbl>, minute <dbl>, time_hour <dttm>
## [1] "UA" "AA" "B6" "DL" "EV" "MQ" "US" "WN" "VX" "FL" "AS" "9E" "F9" "HA" "YV" "OO"
## # A tibble: 2 x 19## year month day dep_time sched_dep_time dep_delay arr_time sched_arr_time arr_delay carrier flight tailnum origin## <int> <int> <int> <int> <int> <dbl> <int> <int> <dbl> <chr> <int> <chr> <chr> ## 1 2013 1 1 517 515 2 830 819 11 UA 1545 N14228 EWR ## 2 2013 1 1 533 529 4 850 830 20 UA 1714 N24211 LGA ## # … with 6 more variables: dest <chr>, air_time <dbl>, distance <dbl>, hour <dbl>, minute <dbl>, time_hour <dttm>
## [1] "UA" "AA" "B6" "DL" "EV" "MQ" "US" "WN" "VX" "FL" "AS" "9E" "F9" "HA" "YV" "OO"
# option 1flights %>% filter(carrier = "UA" | "AA" | "DL")# option 2flights %>% filter(carrier == "UA" | "AA" | "DL")# option 3flights %>% filter(carrier == "UA" | carrier == "AA" | carrier == "DL")# option 4flights %>% filter(carrier %in% c("UA", "AA", "DL"))
flights %>% filter(carrier = "UA" | "AA" | "DL")
Error: Problem with
filter()
input..1
. x Input..1
is named. ℹ This usually means that you've used=
instead of==
. ℹ Did you meancarrier == "UA" | "AA" | "DL"
?
flights %>% filter(carrier == "UA" | "AA" | "DL")
Error: Problem with
filter()
input..1
. x operations are possible only for numeric, logical or complex types ℹ Input..1
iscarrier == "UA" | "AA" | "DL"
.
flights %>% filter(carrier == "UA" | carrier == "AA" | carrier == "DL")
## # A tibble: 139,504 x 19## year month day dep_time sched_dep_time dep_delay arr_time sched_arr_time arr_delay carrier flight tailnum origin## <int> <int> <int> <int> <int> <dbl> <int> <int> <dbl> <chr> <int> <chr> <chr> ## 1 2013 1 1 517 515 2 830 819 11 UA 1545 N14228 EWR ## 2 2013 1 1 533 529 4 850 830 20 UA 1714 N24211 LGA ## 3 2013 1 1 542 540 2 923 850 33 AA 1141 N619AA JFK ## 4 2013 1 1 554 600 -6 812 837 -25 DL 461 N668DN LGA ## 5 2013 1 1 554 558 -4 740 728 12 UA 1696 N39463 EWR ## 6 2013 1 1 558 600 -2 753 745 8 AA 301 N3ALAA LGA ## 7 2013 1 1 558 600 -2 924 917 7 UA 194 N29129 JFK ## 8 2013 1 1 558 600 -2 923 937 -14 UA 1124 N53441 EWR ## 9 2013 1 1 559 600 -1 941 910 31 AA 707 N3DUAA LGA ## 10 2013 1 1 559 600 -1 854 902 -8 UA 1187 N76515 EWR ## # … with 139,494 more rows, and 6 more variables: dest <chr>, air_time <dbl>, distance <dbl>, hour <dbl>,## # minute <dbl>, time_hour <dttm>
flights %>% filter(carrier %in% c("UA", "AA", "DL"))
## # A tibble: 139,504 x 19## year month day dep_time sched_dep_time dep_delay arr_time sched_arr_time arr_delay carrier flight tailnum origin## <int> <int> <int> <int> <int> <dbl> <int> <int> <dbl> <chr> <int> <chr> <chr> ## 1 2013 1 1 517 515 2 830 819 11 UA 1545 N14228 EWR ## 2 2013 1 1 533 529 4 850 830 20 UA 1714 N24211 LGA ## 3 2013 1 1 542 540 2 923 850 33 AA 1141 N619AA JFK ## 4 2013 1 1 554 600 -6 812 837 -25 DL 461 N668DN LGA ## 5 2013 1 1 554 558 -4 740 728 12 UA 1696 N39463 EWR ## 6 2013 1 1 558 600 -2 753 745 8 AA 301 N3ALAA LGA ## 7 2013 1 1 558 600 -2 924 917 7 UA 194 N29129 JFK ## 8 2013 1 1 558 600 -2 923 937 -14 UA 1124 N53441 EWR ## 9 2013 1 1 559 600 -1 941 910 31 AA 707 N3DUAA LGA ## 10 2013 1 1 559 600 -1 854 902 -8 UA 1187 N76515 EWR ## # … with 139,494 more rows, and 6 more variables: dest <chr>, air_time <dbl>, distance <dbl>, hour <dbl>,## # minute <dbl>, time_hour <dttm>
arrange()
Takes a data frame and orders it by a set of column names (or more complicated expressions)
arrange()
Takes a data frame and orders it by a set of column names (or more complicated expressions)
desc()
can be used to reorder a column in desc
ending orderarrange()
Takes a data frame and orders it by a set of column names (or more complicated expressions)
desc()
can be used to reorder a column in desc
ending orderNA
s) are always sorted at the endflights
## # A tibble: 336,776 x 19## year month day dep_time sched_dep_time dep_delay arr_time sched_arr_time arr_delay carrier flight tailnum origin## <int> <int> <int> <int> <int> <dbl> <int> <int> <dbl> <chr> <int> <chr> <chr> ## 1 2013 1 1 517 515 2 830 819 11 UA 1545 N14228 EWR ## 2 2013 1 1 533 529 4 850 830 20 UA 1714 N24211 LGA ## 3 2013 1 1 542 540 2 923 850 33 AA 1141 N619AA JFK ## 4 2013 1 1 544 545 -1 1004 1022 -18 B6 725 N804JB JFK ## 5 2013 1 1 554 600 -6 812 837 -25 DL 461 N668DN LGA ## 6 2013 1 1 554 558 -4 740 728 12 UA 1696 N39463 EWR ## 7 2013 1 1 555 600 -5 913 854 19 B6 507 N516JB EWR ## 8 2013 1 1 557 600 -3 709 723 -14 EV 5708 N829AS LGA ## 9 2013 1 1 557 600 -3 838 846 -8 B6 79 N593JB JFK ## 10 2013 1 1 558 600 -2 753 745 8 AA 301 N3ALAA LGA ## # … with 336,766 more rows, and 6 more variables: dest <chr>, air_time <dbl>, distance <dbl>, hour <dbl>,## # minute <dbl>, time_hour <dttm>
arrange(flights, desc(dep_delay))
## # A tibble: 3 x 19## year month day dep_time sched_dep_time dep_delay arr_time sched_arr_time arr_delay carrier flight tailnum origin## <int> <int> <int> <int> <int> <dbl> <int> <int> <dbl> <chr> <int> <chr> <chr> ## 1 2013 1 9 641 900 1301 1242 1530 1272 HA 51 N384HA JFK ## 2 2013 6 15 1432 1935 1137 1607 2120 1127 MQ 3535 N504MQ JFK ## 3 2013 1 10 1121 1635 1126 1239 1810 1109 MQ 3695 N517MQ EWR ## # … with 6 more variables: dest <chr>, air_time <dbl>, distance <dbl>, hour <dbl>, minute <dbl>, time_hour <dttm>
arrange(flights, desc(dep_delay))
## # A tibble: 3 x 19## year month day dep_time sched_dep_time dep_delay arr_time sched_arr_time arr_delay carrier flight tailnum origin## <int> <int> <int> <int> <int> <dbl> <int> <int> <dbl> <chr> <int> <chr> <chr> ## 1 2013 1 9 641 900 1301 1242 1530 1272 HA 51 N384HA JFK ## 2 2013 6 15 1432 1935 1137 1607 2120 1127 MQ 3535 N504MQ JFK ## 3 2013 1 10 1121 1635 1126 1239 1810 1109 MQ 3695 N517MQ EWR ## # … with 6 more variables: dest <chr>, air_time <dbl>, distance <dbl>, hour <dbl>, minute <dbl>, time_hour <dttm>
arrange(flights, dep_delay)
## # A tibble: 3 x 19## year month day dep_time sched_dep_time dep_delay arr_time sched_arr_time arr_delay carrier flight tailnum origin## <int> <int> <int> <int> <int> <dbl> <int> <int> <dbl> <chr> <int> <chr> <chr> ## 1 2013 12 7 2040 2123 -43 40 2352 48 B6 97 N592JB JFK ## 2 2013 2 3 2022 2055 -33 2240 2338 -58 DL 1715 N612DL LGA ## 3 2013 11 10 1408 1440 -32 1549 1559 -10 EV 5713 N825AS LGA ## # … with 6 more variables: dest <chr>, air_time <dbl>, distance <dbl>, hour <dbl>, minute <dbl>, time_hour <dttm>
select()
Sometimes we get data sets that have hundreds or thousands of variables 😬
select()
Sometimes we get data sets that have hundreds or thousands of variables 😬
select()
gives us a way to narrow down to only those variables we're interested in! 🎉
select()
Sometimes we get data sets that have hundreds or thousands of variables 😬
select()
gives us a way to narrow down to only those variables we're interested in! 🎉
starts_with("abc")
: matches names that begin with “abc”.ends_with("xyz")
: matches names that end with “xyz”.contains("ijk")
: matches names that contain “ijk”.matches("(.)\\1")
: selects variables that match a regular expression. This one matches any variables that contain repeated characters (more in Chapter 14: Strings)num_range("x", 1:3)
: matches x1, x2 and x3.select()
Sometimes we get data sets that have hundreds or thousands of variables 😬
select()
gives us a way to narrow down to only those variables we're interested in! 🎉
starts_with("abc")
: matches names that begin with “abc”.ends_with("xyz")
: matches names that end with “xyz”.contains("ijk")
: matches names that contain “ijk”.matches("(.)\\1")
: selects variables that match a regular expression. This one matches any variables that contain repeated characters (more in Chapter 14: Strings)num_range("x", 1:3)
: matches x1, x2 and x3.everything()
: useful when you want to move a few variables to the start of the data framedep_time
, dep_delay
, arr_time
, and arr_delay
from flights.flights
## # A tibble: 336,776 x 19## year month day dep_time sched_dep_time dep_delay arr_time sched_arr_time arr_delay carrier flight tailnum origin## <int> <int> <int> <int> <int> <dbl> <int> <int> <dbl> <chr> <int> <chr> <chr> ## 1 2013 1 1 517 515 2 830 819 11 UA 1545 N14228 EWR ## 2 2013 1 1 533 529 4 850 830 20 UA 1714 N24211 LGA ## 3 2013 1 1 542 540 2 923 850 33 AA 1141 N619AA JFK ## 4 2013 1 1 544 545 -1 1004 1022 -18 B6 725 N804JB JFK ## 5 2013 1 1 554 600 -6 812 837 -25 DL 461 N668DN LGA ## 6 2013 1 1 554 558 -4 740 728 12 UA 1696 N39463 EWR ## 7 2013 1 1 555 600 -5 913 854 19 B6 507 N516JB EWR ## 8 2013 1 1 557 600 -3 709 723 -14 EV 5708 N829AS LGA ## 9 2013 1 1 557 600 -3 838 846 -8 B6 79 N593JB JFK ## 10 2013 1 1 558 600 -2 753 745 8 AA 301 N3ALAA LGA ## # … with 336,766 more rows, and 6 more variables: dest <chr>, air_time <dbl>, distance <dbl>, hour <dbl>,## # minute <dbl>, time_hour <dttm>
dep_time
, dep_delay
, arr_time
, and arr_delay
from flights.select(flights, dep_time, dep_delay, arr_time, arr_delay)
## # A tibble: 336,776 x 4## dep_time dep_delay arr_time arr_delay## <int> <dbl> <int> <dbl>## 1 517 2 830 11## 2 533 4 850 20## 3 542 2 923 33## 4 544 -1 1004 -18## 5 554 -6 812 -25## 6 554 -4 740 12## 7 555 -5 913 19## 8 557 -3 709 -14## 9 557 -3 838 -8## 10 558 -2 753 8## # … with 336,766 more rows
dep_time
, dep_delay
, arr_time
, and arr_delay
from flights.select(flights, starts_with("dep") | starts_with("arr"))
## # A tibble: 336,776 x 4## dep_time dep_delay arr_time arr_delay## <int> <dbl> <int> <dbl>## 1 517 2 830 11## 2 533 4 850 20## 3 542 2 923 33## 4 544 -1 1004 -18## 5 554 -6 812 -25## 6 554 -4 740 12## 7 555 -5 913 19## 8 557 -3 709 -14## 9 557 -3 838 -8## 10 558 -2 753 8## # … with 336,766 more rows
dep_time
, dep_delay
, arr_time
, and arr_delay
from flights.select(flights, ends_with("time") | ends_with("delay"))
## # A tibble: 336,776 x 7## dep_time sched_dep_time arr_time sched_arr_time air_time dep_delay arr_delay## <int> <int> <int> <int> <dbl> <dbl> <dbl>## 1 517 515 830 819 227 2 11## 2 533 529 850 830 227 4 20## 3 542 540 923 850 160 2 33## 4 544 545 1004 1022 183 -1 -18## 5 554 600 812 837 116 -6 -25## 6 554 558 740 728 150 -4 12## 7 555 600 913 854 158 -5 19## 8 557 600 709 723 53 -3 -14## 9 557 600 838 846 140 -3 -8## 10 558 600 753 745 138 -2 8## # … with 336,766 more rows
mutate()
Helps you add new columns that are functions of existing columns
mutate()
Helps you add new columns that are functions of existing columns
mutate()
Helps you add new columns that are functions of existing columns
transmute()
mutate()
Can be used with arithmetic operators
+
-
*
/
^
sum()
mutate()
Can be used with arithmetic operators
+
-
*
/
^
sum()
modular arithmetic
%/%
(integer division)%%
(remainder)log()
log2()
log10()
mutate()
Can be used with arithmetic operators
+
-
*
/
^
sum()
modular arithmetic
%/%
(integer division)%%
(remainder)log()
log2()
log10()
<
, <=
, >
, >=
, !=
lead()
and lag()
min_rank()
, row_number()
, dense_rank()
, and more!summarize()
Collapses a data frame into a single row and more useful when combined with group_by()
summarize()
Collapses a data frame into a single row and more useful when combined with group_by()
Arguably most useful when used with the pipe %>%
operator
summarize()
Collapses a data frame into a single row and more useful when combined with group_by()
Arguably most useful when used with the pipe %>%
operator
Go from this:
by_dest <- group_by(flights, dest)delay <- summarize(by_dest, count = n(), dis = mean(distance, na.rm = TRUE), delay = mean(arr_del, na.rm = TRUE))
summarize()
Collapses a data frame into a single row and more useful when combined with group_by()
Arguably most useful when used with the pipe %>%
operator
Go from this:
by_dest <- group_by(flights, dest)delay <- summarize(by_dest, count = n(), dis = mean(distance, na.rm = TRUE), delay = mean(arr_del, na.rm = TRUE))
To this! 🎉
flights %>% group_by(dest) %>% summarize(by_dest, count = n(), dis = mean(distance, na.rm = TRUE), delay = mean(arr_del, na.rm = TRUE))
Most useful in combination with summarize()
Most useful in combination with summarize()
For example:
popular_dests <- flights %>% group_by(dest) %>% filter(n() > 365)
Keyboard shortcuts
↑, ←, Pg Up, k | Go to previous slide |
↓, →, Pg Dn, Space, j | Go to next slide |
Home | Go to first slide |
End | Go to last slide |
Number + Return | Go to specific slide |
b / m / f | Toggle blackout / mirrored / fullscreen mode |
c | Clone slideshow |
p | Toggle presenter mode |
t | Restart the presentation timer |
?, h | Toggle this help |
Esc | Back to slideshow |