
Advanced R

Chapter 3 - Vectors

Orry Messer 4 R4DS Reading Group, Cohort 3

@orrymr

16/08/2020

1 / 27

Introduction

  • Vectors are the most important family of data types in base R

  • Vectors come in two (delicious) flavours:
    • Atomic vectors: all elements must have the same type
    • Lists: elements can have different types
  • NULL? - Not a vector (but closely related - serves role of generic zero length vector, but we will get to that)

  • Attributes (named list of arbitrary metadata). Two particularly important attributes:
    • dimension (turns vectors into matrices and arrays)
    • class (powers S3)
  • factors, dates, times, data frames and tibbles are all S3 objects!
2 / 27

Outline

  • 3.2 Atomic Vectors

  • 3.3 Attributes

  • 3.4 S3 Atomic Vectors

  • 3.5 Lists

  • 3.6 Data Frames and Tibbles

  • 3.7 NULL

3 / 27

Atomic Vectors

  • Four primary types of atomic vectors:
    • logical
    • integer
    • double
    • character
  • Two rare types:
    • complex
    • raw

lgl_var <- c(TRUE, FALSE)
int_var <- c(1L, 6L, 10L)
dbl_var <- c(1, 2.5, 4.5)
chr_var <- c('these are', "some strings")
  • In all four, every element has the same type. Use typeof() to determine the type of a vector.
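As a quick check of the four primary types, typeof() can be applied to each of the variables defined above. This is a minimal sketch; the vapply() wrapper and the name `types` are just one convenient way to collect the results and are not part of the original slides.

```r
lgl_var <- c(TRUE, FALSE)
int_var <- c(1L, 6L, 10L)
dbl_var <- c(1, 2.5, 4.5)
chr_var <- c("these are", "some strings")

# Apply typeof() to each vector and collect the answers as a character vector
types <- vapply(
  list(lgl = lgl_var, int = int_var, dbl = dbl_var, chr = chr_var),
  typeof,
  character(1)
)
types
```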
4 / 27

NA's, Testing and Coercion

  • NAs (which R uses for missing values) are infectious: most computations involving NA return NA.
  • Test vectors of a given type by using is.*() - for example, is.integer()
  • For atomic vectors, need same type across the entire vector.
    • So, when combining different types, elements are coerced to the most flexible type present, in a fixed order (most to least flexible): character -> double -> integer -> logical
c(TRUE)
## [1] TRUE
c(TRUE, 42L)
## [1] 1 42
c(TRUE, 42L, 3.14)
## [1] 1.00 42.00 3.14
c(TRUE, 42L, 3.14, "elephant")
## [1] "TRUE" "42" "3.14" "elephant"
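The points about NA and is.*() can be illustrated with a short sketch. The vector `x` is a made-up example, not from the original slides.

```r
x <- c(1L, NA, 3L)

x * 10L      # the NA propagates: 10 NA 30
x == NA      # comparing with NA gives NA, so this cannot find missing values
is.na(x)     # the reliable test: FALSE TRUE FALSE

# A few identities hold for every possible input, so NA does not propagate
NA ^ 0       # 1
NA | TRUE    # TRUE
NA & FALSE   # FALSE

# is.*() tests the type of the whole vector
is.integer(x)  # TRUE
is.double(x)   # FALSE
```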
5 / 27

NA vs NULL?

  • NULL

    • Has unique type (NULL)
    • Length 0
    • Can't have attributes
    • Used for representing empty vector
    • Represent absent vector (such as in a function argument)
  • NA

    • NA indicates that an element of a vector is absent
    • Confusingly, SQL's NULL is equivalent to R's NA
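A small sketch of the difference, using only base R. The function `describe()` is a hypothetical helper invented here to show NULL standing in for an absent argument.

```r
typeof(NULL)      # "NULL"
length(NULL)      # 0
is.null(c())      # TRUE: c() with no arguments returns NULL

typeof(NA)        # "logical"
length(NA)        # 1: NA is an element, not an absent vector
length(c(1, NA))  # 2

# Hypothetical helper: NULL as the default for an optional argument
describe <- function(x, label = NULL) {
  if (is.null(label)) "no label" else paste("label:", label)
}
describe(1)
describe(1, "a")
```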
6 / 27

Attributes

  • Name-value pairs that attach metadata to an object
  • Get/Set individual attributes with attr(), thusly:
a <- 1:3
attr(a, "x") <- "abcdef"
attr(a, "x")
## [1] "abcdef"
  • Get/Set en masse with attributes()/structure(), respectively:
a <- structure(
1:3,
x = "abcdef",
y = "why?"
)
attributes(a)
## $x
## [1] "abcdef"
##
## $y
## [1] "why?"
7 / 27

Attributes (Generally) Ephemeral (1)

  • Using the variable a defined on the previous slide:
attributes(a)
## $x
## [1] "abcdef"
##
## $y
## [1] "why?"
attributes(a[1])
## NULL
attributes(sum(a))
## NULL
8 / 27

Attributes (Generally) Ephemeral (2)

  • Only 2 attributes are routinely preserved:
    • names, which is itself a character vector giving each element a name
    • dim, which is itself an integer vector, used to turn vectors into matrices/arrays.
  • To preserve other attributes, need to create your own S3 class
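A minimal sketch of the contrast between names and a custom attribute. The attribute name `note` is made up for illustration.

```r
x <- structure(1:3, names = c("a", "b", "c"), note = "metadata")

names(attributes(x))       # both attributes are present
names(attributes(x[1:2]))  # after subsetting only names survives
attributes(sum(x))         # NULL: nothing survives
```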
9 / 27

names()

  • 3 ways to name a vector:
# When creating it:
x <- c(a = 1, b = 2, c = 3)
# By assigning a character vector to names()
x <- 1:3
names(x) <- c("a", "b", "c")
# Inline, with setNames():
x <- setNames(1:3, c("a", "b", "c"))
10 / 27

dim()

  • Adding a dim attribute to a vector allows it to behave like a 2-dimensional matrix or a multi-dimensional array.
# Two scalar arguments specify row and column sizes
a <- matrix(1:6, nrow = 2, ncol = 3)
dim(a)
## [1] 2 3
b <- array(1:12, c(2, 3, 2))
dim(b)
## [1] 2 3 2
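# You can also modify an object in place by setting dim()
# (named z rather than c to avoid shadowing the c() function)
z <- 1:6
dim(z) <- c(3, 2)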
  • A vector without a dim attribute set is often thought of as 1-dimensional, but actually has NULL dimensions.
  • You also can have matrices with a single row or single column, or arrays with a single dimension.
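The last two bullets can be checked directly. This sketch compares a plain vector with a column matrix, a row matrix, and a one-dimensional array; they print similarly but carry different dim attributes.

```r
dim(1:3)                     # NULL: a plain vector has no dim attribute

col_mat <- matrix(1:3, ncol = 1)  # column vector
row_mat <- matrix(1:3, nrow = 1)  # row vector
arr_1d  <- array(1:3, 3)          # one-dimensional array

dim(col_mat)   # 3 1
dim(row_mat)   # 1 3
dim(arr_1d)    # 3

# str() shows the differences compactly
str(1:3)
str(col_mat)
str(row_mat)
str(arr_1d)
```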
11 / 27

S3 Atomic Vectors

  • Having a class attribute turns an object into an S3 object
  • Means it will behave differently from regular vector when passed into generic function
  • 4 important S3 vectors in base R
    • factor
    • Date
    • POSIXct
    • difftime

12 / 27

Factors (1)

  • Used to store categorical data
  • Can only contain predefined values
  • built on top of integer vector, with two attributes: class = "factor" and levels which define allowed values.
x <- factor(c("a", "b", "b", "a"))
x
## [1] a b b a
## Levels: a b
typeof(x)
## [1] "integer"
attributes(x)
## $levels
## [1] "a" "b"
##
## $class
## [1] "factor"
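A short sketch of what "predefined values" means in practice. The `sizes` data is made up for illustration: levels that never occur are still counted, and values outside the levels become NA.

```r
sizes <- factor(
  c("small", "large", "small"),
  levels = c("small", "medium", "large")
)

table(sizes)       # "medium" appears with a count of 0
as.integer(sizes)  # the underlying integer codes: 1 3 1

# Assigning a value that is not a level produces NA (with a warning)
sizes2 <- sizes
suppressWarnings(sizes2[2] <- "huge")
sizes2
```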
13 / 27

Factors (2)

  • Ordered factors - order is meaningful
grade <- ordered(c("b", "b", "a", "c"), levels = c("c", "b", "a"))
grade
## [1] b b a c
## Levels: c < b < a
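Because the order is meaningful, ordered factors support comparisons and sorting by level. A minimal sketch using the same `grade` vector:

```r
grade <- ordered(c("b", "b", "a", "c"), levels = c("c", "b", "a"))

class(grade)   # "ordered" "factor"
grade < "a"    # TRUE TRUE FALSE TRUE
max(grade)     # a, the highest level present
sort(grade)    # c b b a, sorted by level order rather than alphabetically
```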
14 / 27

Dates

  • Built on top of double vectors
  • Have class = "Date". No other attributes.
the_day_this_slide_was_rendered <- Sys.Date()
the_day_this_slide_was_rendered
## [1] "2020-08-20"
typeof(the_day_this_slide_was_rendered)
## [1] "double"
attributes(the_day_this_slide_was_rendered)
## $class
## [1] "Date"
unclass(the_day_this_slide_was_rendered) # Days since 1970-01-01
## [1] 18494
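Since Sys.Date() changes every day, a fixed date makes the underlying double easier to verify. A small sketch with dates chosen for illustration:

```r
d <- as.Date("1970-02-01")
unclass(d)   # 31 days after 1970-01-01

# Arithmetic works on the underlying number of days
d + 7        # "1970-02-08"

# Going the other way: the number from the slide back to a Date
as.Date(18494, origin = "1970-01-01")   # "2020-08-20"
```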
15 / 27

Date-times (1)

  • Like dates, also built on double vectors
    • 2 ways: POSIXct vs POSIXlt
    • We'll focus on POSIXct
then_ct <- as.POSIXct("2018-08-01 22:00", tz = "UTC")
then_ct
## [1] "2018-08-01 22:00:00 UTC"
typeof(then_ct) # Let's not forget, it was built on a double vector
## [1] "double"
attributes(then_ct)
## $class
## [1] "POSIXct" "POSIXt"
##
## $tzone
## [1] "UTC"
16 / 27

Date-times (2)

  • tzone attribute controls how date-time is formatted
  • why multiple classes?
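A sketch of both points, reusing `then_ct` from the previous slide. The variable name `tokyo` and the choice of time zone are illustrative. On the second bullet, one observable fact is that POSIXct and POSIXlt both inherit from POSIXt, which allows methods to be shared between the two representations.

```r
then_ct <- as.POSIXct("2018-08-01 22:00", tz = "UTC")

# Changing tzone changes only how the instant is displayed
tokyo <- structure(then_ct, tzone = "Asia/Tokyo")
tokyo
as.numeric(then_ct) == as.numeric(tokyo)   # TRUE: same underlying number

# Both date-time representations share the POSIXt parent class
inherits(then_ct, "POSIXt")                # TRUE
inherits(as.POSIXlt(then_ct), "POSIXt")    # TRUE
```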
17 / 27

Durations

  • Represent amount of time between dates/date-times
  • Built on top of doubles
  • Have a units attribute that determines how the underlying number should be interpreted
one_week_1 <- as.difftime(1, units = "weeks")
one_week_1
## Time difference of 1 weeks
attributes(one_week_1)
## $class
## [1] "difftime"
##
## $units
## [1] "weeks"
one_week_2 <- as.difftime(7, units = "days")
one_week_2
## Time difference of 7 days
attributes(one_week_2)
## $class
## [1] "difftime"
##
## $units
## [1] "days"
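The two objects above store different numbers (1 and 7) but describe the same duration. A minimal sketch that converts both to a common unit:

```r
one_week_1 <- as.difftime(1, units = "weeks")
one_week_2 <- as.difftime(7, units = "days")

# The stored numbers differ...
as.numeric(unclass(one_week_1))   # 1
as.numeric(unclass(one_week_2))   # 7

# ...but expressed in the same unit they agree
as.numeric(one_week_1, units = "days")   # 7
as.numeric(one_week_2, units = "days")   # 7

# units<- converts in place
units(one_week_1) <- "days"
one_week_1
```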
18 / 27

Lists (1)

  • Each element can be any type

  • Although technically, each element is the same type, because it's just a reference (Section 2.3.3)
  • Because made up of references, total size may be smaller than you expect:
lobstr::obj_size(mtcars)
## 7,208 B
l2 <- list(mtcars, mtcars, mtcars, mtcars)
lobstr::obj_size(l2)
## 7,288 B
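For the first bullet, a small base-R sketch of a list whose elements have different types. The list `l1` is a made-up example.

```r
l1 <- list(
  1:3,
  "a",
  c(TRUE, FALSE, TRUE),
  c(2.3, 5.9)
)

typeof(l1)                        # "list"
vapply(l1, typeof, character(1))  # "integer" "character" "logical" "double"
str(l1)
```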
19 / 27

Lists (2)

  • Recursive
l3 <- list(list(list(1)))

l4 <- list(list(1, 2), c(3, 4))
str(l4)
## List of 2
## $ :List of 2
## ..$ : num 1
## ..$ : num 2
## $ : num [1:2] 3 4
20 / 27

Lists (3)

# Given a combination of atomic vectors and lists, c() coerces the vectors to lists before combining them
l5 <- c(list(1, 2), c(3, 4))
str(l5) # NB: it's a list, even though we called c()
## List of 4
## $ : num 1
## $ : num 2
## $ : num 3
## $ : num 4
l6 <- c(c(1, 2), c(3, 4))
str(l6) # Still an atomic vector...
## num [1:4] 1 2 3 4
  • typeof() of a list is "list".
  • is.list() - test for list
  • coerce to list with as.list()
  • list-matrices and list-arrays exist. (Remember, we previously created arrays/matrices from atomic vectors)
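A short sketch of the bullets above: the difference between list() and as.list(), and a list-matrix built by adding a dim attribute to a list. The variable `l` is illustrative.

```r
length(list(1:3))      # 1: one element holding the whole vector
length(as.list(1:3))   # 3: one element per value
unlist(as.list(1:3))   # back to an atomic vector: 1 2 3

# A list-matrix: a list with a dim attribute
l <- list(1:3, "a", TRUE, 1.0)
dim(l) <- c(2, 2)
l[[1, 1]]              # 1 2 3
is.list(l)             # TRUE
is.matrix(l)           # TRUE
```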
21 / 27

Data frames and tibbles

  • Data frames and tibbles are lists of vectors
  • They are S3 vectors (see the "class" attribute)

df1 <- data.frame(x = 1:3, y = letters[1:3])
attributes(df1)
## $names
## [1] "x" "y"
##
## $class
## [1] "data.frame"
##
## $row.names
## [1] 1 2 3
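Because a data frame is a list of vectors, list-like and matrix-like queries both work, with the added constraint that every column must have the same length. A minimal sketch; the name `res` is illustrative.

```r
df1 <- data.frame(x = 1:3, y = letters[1:3])

typeof(df1)   # "list"
length(df1)   # 2: the number of columns, as for a list
ncol(df1)     # 2
nrow(df1)     # 3

# Columns of incompatible lengths are rejected
res <- tryCatch(
  data.frame(x = 1:4, y = 1:3),
  error = function(e) conditionMessage(e)
)
res
```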
22 / 27

Tibbles (1)

  • Frustration with data frames led to tibbles
library(tibble)
df2 <- tibble(x = 1:3, y = letters[1:3]) # still a list of vectors
attributes(df2)
## $names
## [1] "x" "y"
##
## $row.names
## [1] 1 2 3
##
## $class
## [1] "tbl_df" "tbl" "data.frame"
23 / 27

Tibbles (2)

  • Lazy and surly
  • Lazy
    • Don't coerce input (before R 4.0.0, data.frame() converted strings to factors unless you set stringsAsFactors = FALSE)
    • Don't automatically convert non-syntactic names:
names(data.frame(`1` = 1))
## [1] "X1"
names(tibble(`1` = 1))
## [1] "1"
  • tibbles do not support row names
  • tibbles have a nicer print method
  • subsetting: [ always returns tibble & $ doesn't do partial matching
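For contrast, the base data frame behaviours that tibbles change can be seen without loading any package. This sketch uses a made-up data frame `df`.

```r
df <- data.frame(xyz = 1:3, abc = 4:6)

# $ does partial matching on data frames: "xy" silently matches "xyz"
df$xy

# [ with a single column drops to a vector unless drop = FALSE
is.data.frame(df[, "xyz"])                # FALSE
is.data.frame(df[, "xyz", drop = FALSE])  # TRUE

# Name conversion can be switched off in base R with check.names = FALSE
names(data.frame(`1` = 1, check.names = FALSE))  # "1"
```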
24 / 27

List Columns (1)

  • Data frames support list columns, but you must either add the column after creation or wrap the list in I() inside data.frame():
df <- data.frame(x = 1:3)
df$y <- list(1:2, 1:3, 1:4)
data.frame(
x = 1:3,
y = I(list(1:2, 1:3, 1:4))
)
##   x          y
## 1 1       1, 2
## 2 2    1, 2, 3
## 3 3 1, 2, 3, 4
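To see why I() is needed, this sketch tries the same call without it. data.frame() then treats the list as a list of separate columns and fails because their lengths differ. The names `res` and `df_ok` are illustrative.

```r
# Without I(): each list element is treated as its own column
res <- tryCatch(
  data.frame(x = 1:3, y = list(1:2, 1:3, 1:4)),
  error = function(e) e
)
inherits(res, "error")   # TRUE

# With I(): the list is kept as a single column
df_ok <- data.frame(x = 1:3, y = I(list(1:2, 1:3, 1:4)))
is.list(df_ok$y)         # TRUE
df_ok$y[[3]]             # 1 2 3 4
```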
25 / 27

List Columns (2)

  • Easier with tibbles:
tibble(
x = 1:3,
y = list(1:2, 1:3, 1:4)
)
## # A tibble: 3 x 2
##       x y
##   <int> <list>
## 1     1 <int [2]>
## 2     2 <int [3]>
## 3     3 <int [4]>
  • Can also have matrix / array / data frame columns
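A minimal base-R sketch of matrix and data frame columns, added after creation. The name `dfm` is illustrative; the only requirement is that the number of rows matches the data frame.

```r
dfm <- data.frame(x = 1:3 * 10)

# A matrix column and a data frame column, each with 3 rows
dfm$y <- matrix(1:9, nrow = 3)
dfm$z <- data.frame(a = 3:1, b = letters[1:3], stringsAsFactors = FALSE)

ncol(dfm)      # 3: x, y and z each count as one column
dim(dfm$y)     # 3 3
str(dfm)
```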
26 / 27