Chapter 4 Subsetting
4.1 Introduction
“There are three subsetting operators: `[`, `[[`, and `$`.” What is the distinction between an operator and a function? When you look up the help page, it brings up the same page for all three extraction methods. What are their distinctions, and do their definitions change based on what you’re subsetting? Can we make a table?
| | `[` | `[[` | `$` |
|---|---|---|---|
| ATOMIC | RETURNS VECTOR WITH ONE ELEMENT | SAME AS `[` | NOPE! |
| LIST | RETURNS A LIST | RETURNS SINGLE ELEMENT FROM WITHIN LIST | RETURNS SINGLE ELEMENT FROM LIST [CAN ONLY USE WHEN LIST VECTOR HAS A NAME] |
| MATRIX | RETURNS A VECTOR | RETURNS A VECTOR OR SINGLE VALUE | NOPE! |
| DATA FRAME | RETURNS A VECTOR OR DATA FRAME | RETURNS VECTOR/LIST/MATRIX OR SINGLE VALUE | RETURNS VECTOR/LIST/MATRIX USING COLUMN NAME |
| TIBBLE | RETURNS A TIBBLE | RETURNS A VECTOR OR SINGLE VALUE | RETURNS THE STR OF THE COLUMN - TIBBLE/LIST/MATRIX |
If we think of everything as sets (which have zero, one, or many elements), a set with one element has only two subsets: itself and the empty (NULL) set. Before you subset using `[` or `[[`, count the elements in the set. If it has zero elements, you are done. If it has one element, `[` will return the set itself; to go further you need to use `[[` to return its contents. If there is more than one element in the set, then `[` will return those elements. You can read more about subsetting here.
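The counting rule above can be tried on a small list. This is only a sketch: the list `s` and its contents are made up for illustration and do not come from the book.

```r
# Hypothetical one-element list, used only to illustrate the rule
s <- list(a = 1:3)

length(s)              # one element in the "set"

inner <- s["a"]        # `[` keeps the container: still a list of length one
class(inner)

contents <- s[["a"]]   # `[[` reaches inside and returns the contents
class(contents)

empty <- s[0]          # the zero-element subset: an empty list
length(empty)
```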
4.2.1 Selecting multiple elements
Why is `numeric(0)` “helpful for test data?”
This is more of a general comment that one should make sure one’s code doesn’t crash with vectors of zero length (or data frames with zero rows).
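One way to act on this is to feed a function a zero-length input on purpose. The sketch below assumes a hypothetical helper, `safe_mean()`, that is not part of the book or of base R.

```r
x <- c(2.1, 4.2, 3.3, 5.4)

# Subsetting with a zero-length index gives a zero-length vector
none <- x[numeric(0)]
length(none)

# Hypothetical helper that handles the empty case explicitly
safe_mean <- function(v) {
  if (length(v) == 0) {
    return(NA_real_)
  }
  mean(v)
}

safe_mean(none)   # empty input takes the guarded branch
safe_mean(x)      # ordinary input

# The same idea for data frames: zero rows, columns kept
empty_df <- data.frame(a = 1:3)[0, , drop = FALSE]
nrow(empty_df)
```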
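Why is subsetting with factors “not a good idea”?
Hadley’s notes say that subsetting with a factor uses the underlying integer vector of level codes, not the character levels, so the result can silently differ from what the labels suggest. For example, if every value in the factor has the same level, every code is 1 and you just get the first element back. Separately, subsetting a factor vector keeps all of the original factor levels unless you explicitly drop the unused ones.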
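Both points can be checked with a small example. The vectors below are invented for illustration.

```r
x <- c(a = 10, b = 20, c = 30)

# Levels are sorted, so the levels are "b", "c" and the codes are 2, 1
f <- factor(c("c", "b"))
as.integer(f)

by_factor <- x[f]               # uses the codes 2 and 1, not the labels
by_name <- x[as.character(f)]   # uses the labels "c" and "b"

names(by_factor)
names(by_name)

# Subsetting a factor keeps unused levels until they are dropped
g <- factor(c("x", "y", "z"))
sub <- g[1:2]
levels(sub)
levels(droplevels(sub))
```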
4.2.2 lists
We’ve been talking about `$` as a shorthand for `[[`. Using the example list `x <- list(1:3, "a", 4:6)`, can we use `x$1` as shorthand for `x[[1]]`?
The “shorthand” refers to using the name of the vector to extract the vector. If we give `1:3` a name, such as `test = 1:3`, then `x$test` and `x[["test"]]` return the same vector.
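The code for this comparison is not shown here. A plausible reconstruction, assuming the example list is rebuilt with its first element named `test`:

```r
# Assumed reconstruction: same list as before, first element named `test`
x <- list(test = 1:3, "a", 4:6)

# Element-wise comparison of the two ways of extracting by name
x$test == x[["test"]]
```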
## [1] TRUE TRUE TRUE
As such, `$` is a shorthand for `x[["name_of_vector"]]` and not shorthand for `x[[index]]`.
4.3.1 [[
The book states:
While you must use [[ when working with lists, I’d also recommend using it with atomic vectors whenever you want to extract a single value. For example, instead of writing:
It’s better to write:
Why? Can we see this in action by giving `x`, `out`, and `fun` real-life values?
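One possible way to make this concrete is sketched below. The values of `x` and the body of `fun` are assumptions chosen for illustration (a running total); only the names `x`, `out`, and `fun` come from the question.

```r
# Assumed inputs: a short numeric vector and a function that adds
# the current value to the previous result (a running total)
x <- c(1, 2, 3, 4)
fun <- function(current, previous) current + previous

out <- numeric(length(x))
out[[1]] <- x[[1]]

# `[[` makes it explicit that each step reads and writes one value
for (i in 2:length(x)) {
  out[[i]] <- fun(x[[i]], out[[i - 1]])
}

out
```

With these inputs the loop reproduces `cumsum(x)`. Writing `x[i]` would also work for an atomic vector, but `[[` raises an error if an index ever matches zero or several elements, instead of passing them on silently.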
If we have a vector `df_x`, we can use `[` or `[[` to extract its third element.
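The definition of `df_x` is not shown here, so the values below are assumptions. Only the third element, `"Book"`, is taken from the printed output.

```r
# Assumed values; only the third element is known from the output
df_x <- c("Pen", "Paper", "Book")

df_x[3]     # `[` on an atomic vector: a length-one vector
df_x[[3]]   # `[[` on an atomic vector: the same single value
```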
## [1] "Book"
## [1] "Book"
But in the case where we want to extract an element from a list, `[` and `[[` no longer give us the same results.
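A sketch of the list case follows. The list name `df_list` and the first two elements are assumptions; the name `C` and the value `"Book"` are taken from the printed output.

```r
# Assumed list; only the third element (name C, value "Book") is known
df_list <- list(A = "Pen", B = "Paper", C = "Book")

df_list[3]     # `[` returns a list of length one, still named C
df_list[[3]]   # `[[` returns the character vector inside it
```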
## $C
## [1] "Book"
## [1] "Book"
Because using `[[` returns “one element of this vector” in both cases, it makes sense to default to `[[` instead of `[`, since it will reliably return a single element.
4.3.5 Exercise
The question asks us to describe the `upper.tri` function - let’s dig into it!
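The call that produced the output is not shown here. Any 5 x 5 matrix gives the same logical pattern, so the sketch below assumes a simple one.

```r
# Assumed input: any 5 x 5 matrix works, the values do not matter
x <- matrix(1:25, nrow = 5)

# TRUE strictly above the diagonal, FALSE elsewhere
upper.tri(x)
```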
## [,1] [,2] [,3] [,4] [,5]
## [1,] FALSE TRUE TRUE TRUE TRUE
## [2,] FALSE FALSE TRUE TRUE TRUE
## [3,] FALSE FALSE FALSE TRUE TRUE
## [4,] FALSE FALSE FALSE FALSE TRUE
## [5,] FALSE FALSE FALSE FALSE FALSE
We see that it returns a logical matrix that is `TRUE` in the upper triangle of the matrix. But I wanted to walk through how this function actually works and what is meant in the solution manual by leveraging `.row(dim(x)) <= .col(dim(x))`.
# ?upper.tri
function (x, diag = FALSE)
{
  d <- dim(x)
  # if x does not have exactly two dimensions,
  # coerce it to a matrix first
  if (length(d) != 2L)
    d <- dim(as.matrix(x))
  if (diag)
    # this is our subsetting logical!
    .row(d) <= .col(d)
  else .row(d) < .col(d)
}
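The functions `.row()` and `.col()` each take a dimension vector and return a matrix of integers giving every cell’s row number or column number, respectively. Comparing the two matrices gives the logical matrix used for subsetting.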
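A sketch of the three calls, assuming the dimensions of a 5 x 5 matrix:

```r
# Dimension vector of a 5 x 5 matrix (assumed to match the example)
d <- c(5L, 5L)

.row(d)              # each cell holds its row number
.col(d)              # each cell holds its column number
.row(d) <= .col(d)   # TRUE on and above the diagonal
```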
## [,1] [,2] [,3] [,4] [,5]
## [1,] 1 1 1 1 1
## [2,] 2 2 2 2 2
## [3,] 3 3 3 3 3
## [4,] 4 4 4 4 4
## [5,] 5 5 5 5 5
## [,1] [,2] [,3] [,4] [,5]
## [1,] 1 2 3 4 5
## [2,] 1 2 3 4 5
## [3,] 1 2 3 4 5
## [4,] 1 2 3 4 5
## [5,] 1 2 3 4 5
## [,1] [,2] [,3] [,4] [,5]
## [1,] TRUE TRUE TRUE TRUE TRUE
## [2,] FALSE TRUE TRUE TRUE TRUE
## [3,] FALSE FALSE TRUE TRUE TRUE
## [4,] FALSE FALSE FALSE TRUE TRUE
## [5,] FALSE FALSE FALSE FALSE TRUE
Is there a high-level meaning to a `.` before a function name? Does this refer to internal functions? [see: `?row` vs `?.row`]
Objects in the global environment prefixed with `.` are hidden in the R (and RStudio) environment panes, so functions prefixed as such are not visible unless you run `ls(all.names = TRUE)`. Read more here and [here](https://stackoverflow.com/questions/7526467/what-does-the-dot-mean-in-r-personal-preference-naming-convention-or-more)
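The hiding behavior can be seen without touching the global environment. This sketch uses a scratch environment; the object names are made up.

```r
# Scratch environment so the example leaves the workspace untouched
e <- new.env()
assign("visible", 1, envir = e)
assign(".hidden", 2, envir = e)

ls(e)                     # dot-prefixed names are left out
ls(e, all.names = TRUE)   # dot-prefixed names are included
```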
4.3.3 Missing and OOB
Let’s walk through examples of each
LOGICAL ATOMIC
LIST
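The examples for these two cases are not shown here. Below is a sketch of a few missing and out-of-bounds lookups for both; the objects `lgl` and `lst` and the helper `try_subset()` are hypothetical.

```r
lgl <- c(TRUE, FALSE)
lst <- list(a = 1, b = "x")

# Hypothetical helper: returns the result, or "error" if subsetting fails
try_subset <- function(expr) tryCatch(expr, error = function(e) "error")

# Logical atomic vector
lgl[3]                  # out of bounds with `[`: NA
lgl[NA]                 # missing logical index is recycled: NA NA
try_subset(lgl[[3]])    # out of bounds with `[[`: an error

# List
lst[3]                  # out of bounds with `[`: a list holding one NULL
lst[["z"]]              # unknown name with `[[`: NULL
try_subset(lst[[3]])    # out-of-bounds position with `[[`: an error
```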
4.5.7 Logical subsetting
“Remember to use the vector Boolean operators `&` and `|`, not the short-circuiting scalar operators `&&` and `||`, which are more useful inside if statements.”
Can we go over the difference between `&` and `&&` (and `|` vs `||`)? I use brute force to figure out which ones I need…
`&&` and `||` only ever return a single (scalar, length-1 vector) `TRUE` or `FALSE` value, whereas `|` and `&` return a vector after doing element-by-element comparisons.
The only place in R you routinely use a scalar `TRUE`/`FALSE` value is in the conditional of an `if` statement, so you’ll often see `&&` or `||` used in idioms like: `if (length(x) > 0 && any(is.na(x))) { do.something() }`
In most other instances you’ll be working with vectors and use `&` and `|` instead.
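A short side-by-side sketch of the two families of operators, using made-up vectors:

```r
x <- c(TRUE, FALSE, TRUE)
y <- c(TRUE, TRUE, FALSE)

# Vectorised: one result per pair of elements
x & y   # TRUE FALSE FALSE
x | y   # TRUE TRUE TRUE

# Scalar: both sides are length one, as an `if` condition needs
v <- c(1, NA, 3)
has_na <- length(v) > 0 && any(is.na(v))
if (has_na) {
  message("v contains missing values")
}
```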
Using `&&` or `||` results in some unexpected behavior, which could be a big performance gain in some cases:
`||` will not evaluate the second argument when the first is `TRUE`
`&&` will not evaluate the second argument when the first is `FALSE`
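true_one <- function() { print("true_one evaluated."); TRUE }
true_two <- function() { print("true_two evaluated."); TRUE }
# The right-hand side is only evaluated when it is needed.
# Before R 4.3.0, && and || silently used only the first element of each side;
# from R 4.3.0 on, length-2 operands like these raise an error instead.
c(TRUE, true_one()) && c(TRUE, true_two())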
## [1] "true_one evaluated."
## [1] "true_two evaluated."
## [1] TRUE
## [1] "true_one evaluated."
## [1] "true_two evaluated."
## [1] FALSE
## [1] "true_one evaluated."
## [1] FALSE
## [1] "true_one evaluated."
## [1] FALSE
## [1] "true_one evaluated."
## [1] TRUE
## [1] "true_one evaluated."
## [1] TRUE
## [1] "true_one evaluated."
## [1] "true_two evaluated."
## [1] TRUE
## [1] "true_one evaluated."
## [1] "true_two evaluated."
## [1] FALSE
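Short-circuiting can also be checked with length-one operands, which works on any R version. This is a minimal sketch; `noisy()` and `log_env` are hypothetical helpers that record which operands were actually evaluated.

```r
# Hypothetical logger: records each call, then returns the given value
log_env <- new.env()
log_env$calls <- character(0)
noisy <- function(name, value) {
  log_env$calls <- c(log_env$calls, name)
  value
}

r1 <- noisy("a", FALSE) && noisy("b", TRUE)   # "b" is never evaluated
r2 <- noisy("c", TRUE) || noisy("d", TRUE)    # "d" is never evaluated
r3 <- noisy("e", TRUE) && noisy("f", TRUE)    # both sides are evaluated

log_env$calls
```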
Read more about Special Primitives here
4.5.8 Boolean algebra
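The `unwhich()` function takes a vector of integer positions and turns it back into a logical vector (the reverse of `which()`, which takes a logical vector and returns positions) - would this ever be useful? How?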
XXX
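For reference, here is a sketch of the round trip between `which()` and `unwhich()`. The definition of `unwhich()` follows the book's helper as best recalled, so treat it as an assumption; the vector `y` is made up.

```r
# Assumed definition: positions plus a total length give a logical vector
unwhich <- function(x, n) {
  out <- rep_len(FALSE, n)
  out[x] <- TRUE
  out
}

y <- c(FALSE, TRUE, TRUE, FALSE, TRUE)

pos <- which(y)                   # logical -> integer positions
back <- unwhich(pos, length(y))   # integer positions -> logical

pos
identical(back, y)
```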
“`x[-which(y)]` is not equivalent to `x[!y]`: if `y` is all FALSE, `which(y)` will be `integer(0)` and `-integer(0)` is still `integer(0)`, so you’ll get no values, instead of all values.”
Can we come up with an example for this, plugging in values for `x` and `y`?
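The values used are not shown here. The sketch below assumes a logical `x` of length two and an all-`FALSE` `y`, which is consistent with the printed output.

```r
# Assumed values: y is all FALSE, so which(y) is integer(0)
x <- c(TRUE, FALSE)
y <- c(FALSE, FALSE)

x[-which(y)]   # -integer(0) selects nothing
x[!y]          # !y is all TRUE, so every element is kept
```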
## logical(0)
## [1] TRUE FALSE