Chapter 4 Subsetting

4.1 Introduction

There are three subsetting operators: [, [[, and $. What is the distinction between an operator and a function? When you look up the help page, it brings up the same page for all three extraction methods. What are their distinctions, and do their definitions change based on what you’re subsetting? Can we make a table?

| | [ | [[ | $ |
|---|---|---|---|
| Atomic | Returns a vector with one element | Same as [ | Nope! |
| List | Returns a list | Returns a single element from within the list | Returns a single element from the list (can only be used when the list element has a name) |
| Matrix | Returns a vector | Returns a vector or single value | Nope! |
| Data frame | Returns a vector or data frame | Returns a vector/list/matrix or single value | Returns a vector/list/matrix using the column name |
| Tibble | Returns a tibble | Returns a vector or single value | Returns the structure of the column: tibble/list/matrix |

If we think of everything as sets (which have zero, one, or many elements), a set with one element has only itself and the empty set (NULL) as subsets. Before you subset using [ or [[, count the elements in the set. If it has zero elements, you are done. If it has one element, [ will return the set itself; to go further you need [[ to return its contents. If it has more than one element, [ will return the elements you select. You can read more about subsetting here
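To make the "container versus contents" idea concrete, here is a minimal sketch using a one-element list. The object name x and its contents are made up for illustration.

```r
# A list with a single named element (hypothetical example data)
x <- list(a = 1:3)

wrapped  <- x["a"]    # [ returns the set itself: still a list of length 1
contents <- x[["a"]]  # [[ goes one level further and returns the contents

is.list(wrapped)      # TRUE
is.list(contents)     # FALSE
contents              # the integer vector 1 2 3

# Selecting nothing gives back an empty container of the same type
nothing <- x[0]
length(nothing)       # 0
```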

4.2.1 Selecting multiple elements

Why is numeric(0) “helpful for test data”?

This is more of a general comment: make sure your code doesn’t crash when it is given vectors of zero length (or data frames with zero rows).
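A small sketch of what "zero length" looks like in practice. The values in x and df are made up; the point is only that x[0] gives a convenient empty input to test your own functions against.

```r
x <- c(2.1, 4.2, 3.3, 5.4)   # hypothetical data

empty <- x[0]                # numeric(0): same type, no elements
length(empty)                # 0
sum(empty)                   # 0
mean(empty)                  # NaN

# A classic way for code to misbehave on empty input:
bad_index  <- 1:length(empty)   # counts down: 1 0
good_index <- seq_along(empty)  # integer(0), so a for loop runs zero times

# The same idea for data frames: a filter that matches nothing
df <- data.frame(a = 1:3, b = letters[1:3])
zero_rows <- df[df$a > 10, ]
nrow(zero_rows)              # 0
```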

Why is subsetting with factors “not a good idea”?

Hadley’s notes say that subsetting with a factor uses the underlying integer codes rather than the character levels, so a factor whose values all share one level will just select the first element. Separately, subsetting a factor vector keeps all of the original levels unless you explicitly drop the unused ones.
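Both points can be seen in a short sketch. The vectors x, f, and g below are invented for illustration.

```r
x <- c(a = 1, b = 2, c = 3)   # hypothetical named vector
f <- factor("c")              # one level, so its integer code is 1

by_factor <- x[f]                 # uses as.integer(f), i.e. position 1
by_name   <- x[as.character(f)]   # uses the label "c"

by_factor   # element a, not c
by_name     # element c

# Subsetting a factor keeps its unused levels unless we drop them
g   <- factor(c("lo", "hi", "lo"))
sub <- g[g == "lo"]
levels(sub)                        # still "hi" "lo"
levels(droplevels(sub))            # "lo"
levels(g[g == "lo", drop = TRUE])  # "lo"
```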

4.2.2 Lists

We’ve been talking about $ as a shorthand for [[. Using the example list x <- list(1:3, "a", 4:6) can we use x$1 as shorthand for x[[1]]?

The “shorthand” refers to using the name of an element to extract it. If we give 1:3 a name, such as test = 1:3, then x$test and x[["test"]] return the same vector, so comparing them element by element gives:

## [1] TRUE TRUE TRUE

As such, $ is a shorthand for x[["name_of_vector"]] and not shorthand for x[[index]]
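A sketch of the comparison, assuming the list from the question with its first element renamed to test:

```r
x <- list(test = 1:3, "a", 4:6)

by_dollar <- x$test        # shorthand for the line below
by_name   <- x[["test"]]
by_index  <- x[[1]]        # position still works, but only with [[

by_dollar == by_name       # TRUE TRUE TRUE

# x$1 is not valid R: the parser stops with an "unexpected numeric constant"
# error. Quoting it as x$`1` parses, but it looks for an element *named* "1",
# which this list does not have, so the result is NULL.
no_such_name <- x$`1`
```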

4.3.1 [[

The book states:

While you must use [[ when working with lists, I’d also recommend using it with atomic vectors whenever you want to extract a single value. For example, instead of writing:

It’s better to write
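The two snippets being contrasted are missing from these notes. As a sketch of the contrast, assuming a loop that fills out from x using some two-argument function fun (the running-total fun below is a made-up stand-in, as are the values of x), the [ and [[ versions would look roughly like this:

```r
x   <- c(1, 2, 3, 4)                                 # hypothetical input
fun <- function(value, previous) value + previous    # hypothetical stand-in

# Version using [
out <- numeric(length(x))
out[1] <- x[1]
for (i in 2:length(x)) {
  out[i] <- fun(x[i], out[i - 1])
}
out_single <- out

# Version using [[, which signals "exactly one element" at every step
out <- numeric(length(x))
out[[1]] <- x[[1]]
for (i in 2:length(x)) {
  out[[i]] <- fun(x[[i]], out[[i - 1]])
}
out_double <- out

identical(out_single, out_double)   # same result for atomic vectors
```

For atomic vectors the two loops compute the same thing; the difference is that [[ makes the single-element intent explicit.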

Why? Can we see this in action by giving x, out, and fun real life values?

If we have an atomic vector df_x whose third element is "Book", we can use either [ or [[ to extract that element:

## [1] "Book"
## [1] "Book"

But when we extract an element from a list, [ and [[ no longer give the same result: [ keeps the list wrapper (and the element’s name, C), while [[ returns the bare value.

## $C
## [1] "Book"
## [1] "Book"

Because using [[ returns “one element of this vector” in both cases, it makes sense to default to [[ instead of [ since it will reliably return a single element.
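A sketch that reproduces the outputs above. Only the third element ("Book") and the list name C appear in the notes; the other values and the name list_x are assumptions.

```r
# Atomic vector: [ and [[ agree
df_x <- c("Pen", "Paper", "Book")      # first two values are made up
vec_single <- df_x[3]
vec_double <- df_x[[3]]
identical(vec_single, vec_double)      # TRUE, both are "Book"

# List: [ returns a list of length 1, [[ returns the element itself
list_x <- list(A = "Pen", B = "Paper", C = "Book")
list_single <- list_x[3]
list_double <- list_x[[3]]

is.list(list_single)                   # TRUE
names(list_single)                     # "C"
list_double                            # "Book"
```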

4.3.5 Exercise

The question asks to describe the upper.tri function - let’s dig into it!

##       [,1]  [,2]  [,3]  [,4]  [,5]
## [1,] FALSE  TRUE  TRUE  TRUE  TRUE
## [2,] FALSE FALSE  TRUE  TRUE  TRUE
## [3,] FALSE FALSE FALSE  TRUE  TRUE
## [4,] FALSE FALSE FALSE FALSE  TRUE
## [5,] FALSE FALSE FALSE FALSE FALSE

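We see that it returns a logical matrix that is TRUE in the upper triangle (excluding the diagonal by default). But I wanted to walk through how this function actually works, and what the solution manual means by leveraging .row(dim(x)) <= .col(dim(x)). Note that <= includes the diagonal, which corresponds to diag = TRUE; the default uses <.

The functions .row() and .col() take the dimensions of a matrix and return an integer matrix of that shape in which each entry is its own row number or column number, respectively. Comparing the two element by element gives the logical matrix: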

##      [,1] [,2] [,3] [,4] [,5]
## [1,]    1    1    1    1    1
## [2,]    2    2    2    2    2
## [3,]    3    3    3    3    3
## [4,]    4    4    4    4    4
## [5,]    5    5    5    5    5
##      [,1] [,2] [,3] [,4] [,5]
## [1,]    1    2    3    4    5
## [2,]    1    2    3    4    5
## [3,]    1    2    3    4    5
## [4,]    1    2    3    4    5
## [5,]    1    2    3    4    5
##       [,1]  [,2]  [,3]  [,4] [,5]
## [1,]  TRUE  TRUE  TRUE  TRUE TRUE
## [2,] FALSE  TRUE  TRUE  TRUE TRUE
## [3,] FALSE FALSE  TRUE  TRUE TRUE
## [4,] FALSE FALSE FALSE  TRUE TRUE
## [5,] FALSE FALSE FALSE FALSE TRUE
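The outputs above can be reproduced with a sketch like the following, assuming a 5 x 5 matrix x (its contents do not matter, only its dimensions):

```r
x <- matrix(1:25, nrow = 5)   # hypothetical 5 x 5 matrix
d <- dim(x)

upper.tri(x)                  # strict upper triangle (diag = FALSE)

rows <- .row(d)               # every entry is its row number
cols <- .col(d)               # every entry is its column number
with_diag <- rows <= cols     # TRUE on and above the diagonal

# The comparison matches upper.tri() with and without the diagonal
identical(upper.tri(x, diag = TRUE), rows <= cols)
identical(upper.tri(x), rows < cols)

# Typical use: pull out the values above the diagonal
above <- x[upper.tri(x)]
length(above)                 # 10 values for a 5 x 5 matrix
```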

Is there a high-level meaning to a . before a function name? Does it mark internal functions? [see: ?row vs ?.row]

Objects whose names begin with . are hidden by default: they do not appear in the R (or RStudio) environment pane, or in the output of ls(), unless you call ls(all.names = TRUE). Read more here and [here](https://stackoverflow.com/questions/7526467/what-does-the-dot-mean-in-r-personal-preference-naming-convention-or-more)
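A quick sketch of the hiding behavior, using a scratch environment so nothing is added to the global workspace. The object names are made up.

```r
e <- new.env()
assign("visible_obj", 1, envir = e)
assign(".hidden_obj", 2, envir = e)

default_listing <- ls(e)                    # dot-prefixed names are skipped
full_listing    <- ls(e, all.names = TRUE)  # both names are listed

default_listing
full_listing

# The hidden object is still there and can be used normally
get(".hidden_obj", envir = e)
```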

4.5.7 Logical subsetting

“Remember to use the vector Boolean operators & and |, not the short-circuiting scalar operators && and ||, which are more useful inside if statements.”

Can we go over the difference between & and && (and | vs ||)? I use brute force to figure out which ones I need…

&& and || only ever return a single (scalar, length-1 vector) TRUE or FALSE value, whereas | and & return a vector after doing element-by-element comparisons.

The only place in R you routinely use a scalar TRUE/FALSE value is in the conditional of an if statement, so you’ll often see && or || used in idioms like: if (length(x) > 0 && any(is.na(x))) { do.something() }

In most other instances you’ll be working with vectors and use & and | instead.
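A small sketch of the two use cases side by side, with a made-up vector x:

```r
x <- c(1, NA, 3)   # hypothetical data

# Element-wise: & compares position by position and returns a vector,
# which is what you want inside [ for logical subsetting
keep <- (x > 0) & !is.na(x)
keep          # TRUE FALSE TRUE
x[keep]       # 1 3

# Scalar: && combines two length-1 conditions, which is what if() needs
has_missing <- length(x) > 0 && any(is.na(x))
if (has_missing) {
  message("x contains missing values")
}
```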

&& and || also short-circuit, which can be surprising but can also be a big performance gain in some cases:

  • || will not evaluate the second argument when the first is TRUE
  • && will not evaluate the second argument when the first is FALSE
## [1] "true_one evaluated."
## [1] "true_two evaluated."
## [1] TRUE
## [1] "true_one evaluated."
## [1] "true_two evaluated."
## [1] FALSE
## [1] "true_one evaluated."
## [1] FALSE
## [1] "true_one evaluated."
## [1] FALSE
## [1] "true_one evaluated."
## [1] TRUE
## [1] "true_one evaluated."
## [1] TRUE
## [1] "true_one evaluated."
## [1] "true_two evaluated."
## [1] TRUE
## [1] "true_one evaluated."
## [1] "true_two evaluated."
## [1] FALSE
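The functions that produced the output above are not shown in these notes. As a sketch of the same experiment, assuming a hypothetical helper flag() that records when it is evaluated and then returns a given value:

```r
trace_env <- new.env()

# Hypothetical helper: note that this argument was evaluated, then return value
flag <- function(label, value) {
  trace_env$calls <- c(trace_env$calls, label)
  value
}

# || stops as soon as the first argument is TRUE
trace_env$calls <- character(0)
r_or <- flag("first", TRUE) || flag("second", FALSE)
calls_or <- trace_env$calls      # only "first"

# && stops as soon as the first argument is FALSE
trace_env$calls <- character(0)
r_and <- flag("first", FALSE) && flag("second", TRUE)
calls_and <- trace_env$calls     # only "first"

# | always evaluates both sides
trace_env$calls <- character(0)
r_vec <- flag("first", TRUE) | flag("second", FALSE)
calls_vec <- trace_env$calls     # "first" and "second"
```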

Read more about Special Primitives here

4.5.8 Boolean algebra

The unwhich() function takes a vector of integer positions (as returned by which()) and turns it back into a logical vector - would this ever be useful? How?

XXX
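The question is still open, but to make the direction of the conversion concrete, here is a sketch of unwhich() written to mirror the book's helper (the exact body is an assumption, and the vectors a and b are made up). One place it could matter is moving between set operations on positions and Boolean operations on logical vectors:

```r
# Assumed definition: build a logical vector of length n that is TRUE
# at the positions given in x
unwhich <- function(x, n) {
  out <- rep_len(FALSE, n)
  out[x] <- TRUE
  out
}

a <- c(FALSE, TRUE, FALSE, TRUE)
b <- c(TRUE,  TRUE, FALSE, FALSE)

pos_a <- which(a)                         # 2 4
round_trip <- unwhich(pos_a, length(a))   # back to a

# intersect() on positions corresponds to & on logical vectors
via_sets <- unwhich(intersect(which(a), which(b)), length(a))
identical(via_sets, a & b)
```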

“x[-which(y)] is not equivalent to x[!y]: if y is all FALSE, which(y) will be integer(0) and -integer(0) is still integer(0), so you’ll get no values, instead of all values.”

Can we come up with an example for this, plugging in values for x and y?

## logical(0)
## [1]  TRUE FALSE
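One pair of values that is consistent with the output above (the original values are not shown, so x and y here are assumptions):

```r
x <- c(TRUE, FALSE)
y <- c(FALSE, FALSE)      # all FALSE, the problem case

positions <- which(y)     # integer(0): nothing is TRUE
negated   <- -positions   # still integer(0)

via_which <- x[-which(y)] # selects nothing: logical(0)
via_not   <- x[!y]        # selects everything: TRUE FALSE

via_which
via_not
```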