class: center, middle, inverse, title-slide

# Advanced R
## Chapter 3 - Vectors
### Orry Messer 4 R4DS Reading Group, Cohort 3
### @orrymr
### 16/08/2020

---

# Introduction

- Vectors are the most important family of data types in base R
- Vectors come in two (delicious) flavours:

.pull-left[
<center>
<b> Atomic Vectors </b>
<br/>
All elements must have the same type
</center>
]

.pull-right[
<center>
<b> Lists </b>
<br/>
Elements can have different types
</center>
]

- `NULL`?
  - Not a vector, but closely related: it serves the role of a generic zero-length vector (we will get to that)

<img src="./img/1.PNG" width="25%" style="display: block; margin: auto;" />

- Attributes (a named list of arbitrary metadata). Two particularly important attributes:
  - dimension (turns vectors into matrices and arrays)
  - class (powers S3)
      - factors, dates, times, data frames and tibbles are all S3 objects!

---

# Outline

- 3.2 Atomic Vectors
- 3.3 Attributes
- 3.4 S3 Atomic Vectors
- 3.5 Lists
- 3.6 Data Frames and Tibbles
- 3.7 `NULL`

---

# Atomic Vectors

.pull-left[
- Four primary types of atomic vectors:
  - logical
  - integer
  - double
  - character
]

.pull-right[
- Two rare types:
  - complex
  - raw
]

<img src="./img/2.PNG" width="25%" style="display: block; margin: auto;" />

```r
lgl_var <- c(TRUE, FALSE)
int_var <- c(1L, 6L, 10L)
dbl_var <- c(1, 2.5, 4.5)
chr_var <- c('these are', "some strings")
```

- Atomic vectors: all elements have the same type. Use `typeof()` to determine the type... of.

---

# NA's, Testing and Coercion

- `NA`s (which R uses for missing values) are infectious: most computations involving an `NA` return `NA`.
- Test whether a vector is of a given type with `is.*()`, for example `is.integer()`
- Atomic vectors need the same type across the entire vector.
- So, when different types are combined, they are coerced in a fixed order (the type furthest to the left wins): character -> double -> integer -> logical

```r
c(TRUE)
```

```
## [1] TRUE
```

```r
c(TRUE, 42L)
```

```
## [1]  1 42
```

```r
c(TRUE, 42L, 3.14)
```

```
## [1]  1.00 42.00  3.14
```

```r
c(TRUE, 42L, 3.14, "elephant")
```

```
## [1] "TRUE"     "42"       "3.14"     "elephant"
```

---

# NA vs NULL?
- `NULL`
  - Has a unique type (`NULL`)
  - Length 0
  - Can't have attributes
  - Used to represent an empty vector
  - Used to represent an absent vector (such as in a function argument)
- `NA`
  - `NA` indicates that an <i>element</i> of a vector is absent
  - Confusingly, SQL `NULL` is equivalent to R's `NA`

---

# Attributes

- Name-value pairs that attach metadata to an object
- Get/Set individual attributes with `attr()`, thusly:

```r
a <- 1:3
attr(a, "x") <- "abcdef"
attr(a, "x")
```

```
## [1] "abcdef"
```

- Get/Set en masse with `attributes()`/`structure()`, respectively:

```r
a <- structure(
  1:3,
  x = "abcdef",
  y = "why?"
)
attributes(a)
```

```
## $x
## [1] "abcdef"
##
## $y
## [1] "why?"
```

---

# Attributes (Generally) Ephemeral (1)

- Using the variable `a` defined on the last slide:

```r
attributes(a)
```

```
## $x
## [1] "abcdef"
##
## $y
## [1] "why?"
```

```r
attributes(a[1])
```

```
## NULL
```

```r
attributes(sum(a))
```

```
## NULL
```

---

# Attributes (Generally) Ephemeral (2)

- Only 2 attributes are routinely preserved:
  - <b>names</b>, which is itself a character vector giving each element a name
  - <b>dim</b>, which is itself an integer vector, used to turn vectors into matrices/arrays.
- To preserve other attributes, you need to create your own S3 class

---

# names()

- 3 ways to name a vector:

```r
# When creating it:
x <- c(a = 1, b = 2, c = 3)

# By assigning a character vector to names()
x <- 1:3
names(x) <- c("a", "b", "c")

# Inline, with setNames():
x <- setNames(1:3, c("a", "b", "c"))
```

---

# dim()

- Adding a `dim` attribute to a vector allows it to behave like a 2-dimensional <b>matrix</b> or a multi-dimensional <b>array</b>.

```r
# Two scalar arguments specify row and column sizes
a <- matrix(1:6, nrow = 2, ncol = 3)
dim(a)
```

```
## [1] 2 3
```

```r
b <- array(1:12, c(2, 3, 2))
dim(b)
```

```
## [1] 2 3 2
```

```r
c <- 1:6
dim(c) <- c(3,2)
```

- A vector without a `dim` attribute set is often thought of as 1-dimensional, but actually has `NULL` dimensions.
- You can also have matrices with a single row or single column, or arrays with a single dimension.

---

# S3 Atomic Vectors

- Having a class attribute turns an object into an S3 object
  - Means it will behave differently from a regular vector when passed to a <b>generic</b> function
- 4 important S3 vectors in base R
  - factor
  - Date
  - POSIXct
  - difftime

<img src="./img/3.PNG" width="25%" style="display: block; margin: auto;" />

---

# Factors (1)

- Used to store categorical data
- Can only contain predefined values
- Built on top of an integer vector, with two attributes: `class` = "factor" and `levels`, which defines the allowed values.

```r
x <- factor(c("a", "b", "b", "a"))
x
```

```
## [1] a b b a
## Levels: a b
```

```r
typeof(x)
```

```
## [1] "integer"
```

```r
attributes(x)
```

```
## $levels
## [1] "a" "b"
##
## $class
## [1] "factor"
```

---

# Factors (2)

- Ordered factors: the order of the levels is meaningful

```r
grade <- ordered(c("b", "b", "a", "c"), levels = c("c", "b", "a"))
grade
```

```
## [1] b b a c
## Levels: c < b < a
```

---

# Dates

- Built on top of double vectors
- Have `class` = "Date". No other attributes.
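
- Because a Date is just a double carrying a `class` attribute, one can be assembled by hand. A minimal sketch, where the day count `18494` and the name `d` are chosen purely for illustration:

```r
# A Date is a double counting days since 1970-01-01, plus class = "Date".
# 18494 is an illustrative day count, not a value taken from any API.
d <- structure(18494, class = "Date")

typeof(d)      # storage type of the underlying vector
attributes(d)  # the class attribute is the only metadata
format(d)      # dispatches to the Date method for formatting
```

- The same building blocks show up when inspecting a date obtained from the system clock: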
```r
the_day_this_slide_was_rendered <- Sys.Date()
the_day_this_slide_was_rendered
```

```
## [1] "2020-08-20"
```

```r
typeof(the_day_this_slide_was_rendered)
```

```
## [1] "double"
```

```r
attributes(the_day_this_slide_was_rendered)
```

```
## $class
## [1] "Date"
```

```r
unclass(the_day_this_slide_was_rendered) # Days since 1970-01-01
```

```
## [1] 18494
```

---

# Date-times (1)

- Like dates, also built on double vectors
- Two storage formats: POSIXct and POSIXlt
- We'll focus on POSIXct

```r
then_ct <- as.POSIXct("2018-08-01 22:00", tz = "UTC")
then_ct
```

```
## [1] "2018-08-01 22:00:00 UTC"
```

```r
typeof(then_ct) # Let's not forget, it was built on a double vector
```

```
## [1] "double"
```

```r
attributes(then_ct)
```

```
## $class
## [1] "POSIXct" "POSIXt"
##
## $tzone
## [1] "UTC"
```

---

# Date-times (2)

- The `tzone` attribute controls how the date-time is formatted
- Why multiple classes?

---

# Durations

- Represent the amount of time between dates/date-times
- Built on top of doubles
- Have a `units` attribute that determines how the number should be interpreted

```r
one_week_1 <- as.difftime(1, units = "weeks")
one_week_1
```

```
## Time difference of 1 weeks
```

```r
attributes(one_week_1)
```

```
## $class
## [1] "difftime"
##
## $units
## [1] "weeks"
```

```r
one_week_2 <- as.difftime(7, units = "days")
one_week_2
```

```
## Time difference of 7 days
```

```r
attributes(one_week_2)
```

```
## $class
## [1] "difftime"
##
## $units
## [1] "days"
```

---

# Lists (1)

- Each element can be any type

<img src="./img/4.PNG" width="50%" style="display: block; margin: auto;" />

- Technically, though, each element is the same type, because each is just a reference to another object (Section 2.3.3)
- Because a list is made up of references, its total size may be smaller than you expect:

```r
lobstr::obj_size(mtcars)
```

```
## 7,208 B
```

```r
l2 <- list(mtcars, mtcars, mtcars, mtcars)
lobstr::obj_size(l2)
```

```
## 7,288 B
```

---

# Lists (2)

- Recursive

```r
l3 <- list(list(list(1)))
```

<img src="./img/5.PNG" width="20%" style="display: block; margin: auto;" />

```r
l4 <- list(list(1, 2), c(3, 4))
str(l4)
```

```
## List of 2
##  $ :List of 2
##   ..$ : num 1
##   ..$ : num 2
##  $ : num [1:2] 3 4
```

---

# Lists (3)

```r
# If given a combination of atomic vectors and lists, c() will coerce the
# vectors to lists before combining them
l5 <- c(list(1, 2), c(3, 4))
str(l5) # NB: it's a list, even though we called c()
```

```
## List of 4
##  $ : num 1
##  $ : num 2
##  $ : num 3
##  $ : num 4
```

```r
l6 <- c(c(1, 2), c(3, 4))
str(l6) # Still an atomic vector...
```

```
##  num [1:4] 1 2 3 4
```

- `typeof()` of a list is `"list"`.
- Test for a list with `is.list()`
- Coerce to a list with `as.list()`
- List-matrices and list-arrays exist. (Remember, we previously created arrays/matrices from atomic vectors)

---

# Data frames and tibbles

- Data frames and tibbles are lists of vectors
- They are S3 vectors (see the "class" attribute)

<img src="./img/6.PNG" width="20%" style="display: block; margin: auto;" />

```r
df1 <- data.frame(x = 1:3, y = letters[1:3])
attributes(df1)
```

```
## $names
## [1] "x" "y"
##
## $class
## [1] "data.frame"
##
## $row.names
## [1] 1 2 3
```

---

### Tibbles (1)

- Frustration with data frames led to tibbles

```r
library(tibble) # provides tibble()

df2 <- tibble(x = 1:3, y = letters[1:3]) # still a list of vectors
attributes(df2)
```

```
## $names
## [1] "x" "y"
##
## $row.names
## [1] 1 2 3
##
## $class
## [1] "tbl_df"     "tbl"        "data.frame"
```

---

### Tibbles (2)

- Lazy and surly
  - Lazy
      - Don't coerce input (which is why data frames needed `stringsAsFactors = FALSE` before R 4.0.0, when it became the default)
      - Don't automatically convert non-syntactic names:

```r
names(data.frame(`1` = 1))
```

```
## [1] "X1"
```

```r
names(tibble(`1` = 1))
```

```
## [1] "1"
```

- Tibbles do not support row names
- Tibbles have a nicer print method
- Subsetting: `[` always returns a tibble and `$` doesn't do partial matching

---

### List Columns (1)

- Data frames support list columns, but they take extra work: add the column after creation, or wrap the list in `I()`:

```r
# Option 1: add the list column after creating the data frame
df <- data.frame(x = 1:3)
df$y <- list(1:2, 1:3, 1:4)

# Option 2: wrap the list in I() so data.frame() treats it as is
data.frame(
  x = 1:3,
  y = I(list(1:2, 1:3, 1:4))
)
```

```
##   x          y
## 1 1       1, 2
## 2 2    1, 2, 3
## 3 3 1, 2, 3, 4
```

---

### List Columns (2)

- Easier with tibbles:

```r
tibble(
  x = 1:3,
  y = list(1:2, 1:3, 1:4)
)
```

```
## # A tibble: 3 x 2
##       x y
##   <int> <list>
## 1     1 <int [2]>
## 2     2 <int [3]>
## 3     3 <int [4]>
```

- Can also have matrix / array / data frame columns

---
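
### List Columns (3)

- Matrix and data frame columns can be sketched in base R as well. This is a minimal illustration, not code from the original deck: the names `dfm`, `y` and `z` are made up, and the only requirement assumed is that each column has as many rows as the data frame.

```r
# Start from an ordinary data frame with one atomic column
dfm <- data.frame(x = 1:3 * 10)

# A matrix column: 3 rows, so it lines up with the 3 rows of dfm
dfm$y <- matrix(1:9, nrow = 3)

# A data frame column: again 3 rows
dfm$z <- data.frame(a = 3:1, b = letters[1:3], stringsAsFactors = FALSE)

# Inspect the nested structure
str(dfm)
```

- `dfm` still counts as having three columns even though printing it shows more, so code that assumes every column is a plain vector may be surprised.

---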