Section 4.1: Introduction
Section 4.2: Selecting multiple elements
Section 4.3: Selecting a single element
Section 4.4: Subsetting and assignment
Section 4.5: Applications (Using subsetting to solve problems)
Interrelated concepts to internalise:
There are 3 subsetting operators: [[, [, and $
Subsetting operators interact differently with various vector types (e.g. atomic vectors, lists, factors, matrices, and data frames)
Subsetting and assignment can be combined ("subassignment")
Subsetting complements str(): str() shows you all the pieces of an object, while subsetting lets you pull out only the pieces you are interested in.
The RStudio Viewer, opened with View(my_object), is often useful for working out which pieces you want to subset.
Use [ to select any number of elements from a vector.
Assume we have a simple vector: x <- c(2.1, 4.2, 3.3, 5.4)
x[c(3, 1)]
## [1] 3.3 2.1
x[-c(3, 1)]
## [1] 4.2 5.4
Logical vectors select the elements where the corresponding logical value is TRUE:
x[c(TRUE, TRUE, FALSE, FALSE)]
## [1] 2.1 4.2
x[x > 3]
## [1] 4.2 3.3 5.4
Because shorter logical vectors are recycled, x[c(TRUE, FALSE)] is equivalent to x[c(TRUE, FALSE, TRUE, FALSE)]
x[]
## [1] 2.1 4.2 3.3 5.4
x[0]
## numeric(0)
(y <- setNames(x, letters[1:4]))
##   a   b   c   d 
## 2.1 4.2 3.3 5.4
y[c("d", "c", "a")]
##   d   c   a 
## 5.4 3.3 2.1
Subsetting a list with [ always returns a list; [[ and $, as described in Section 4.3, let you pull out elements of a list.
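As a minimal sketch of that difference, the snippet below uses a made-up list called lst (a hypothetical name, not taken from the slides) and compares what [ and [[ hand back.

```r
# Hypothetical example list, used only for illustration
lst <- list(a = 1:3, b = "text", c = list(10, 20))

sub_list <- lst["a"]     # [ keeps the list wrapper: a list of length 1
element  <- lst[["a"]]   # [[ extracts the element itself: an integer vector
same     <- lst$a        # $ is shorthand for [["a"]]

str(sub_list)
str(element)
```

Checking is.list(sub_list) against is.list(element) is a quick way to see which operator was used.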
The most common way of subsetting matrices (2D) and arrays (>2D) is a simple generalisation of 1D subsetting
a <- matrix(1:9, nrow = 3)
colnames(a) <- c("A", "B", "C")
a[1:2, ]
##      A B C
## [1,] 1 4 7
## [2,] 2 5 8
a[c(TRUE, FALSE, TRUE), c("B", "A")]
##      B A
## [1,] 4 1
## [2,] 6 3
Consider the matrix below:
(vals = matrix(1:25, ncol = 5, byrow = TRUE))
##      [,1] [,2] [,3] [,4] [,5]
## [1,]    1    2    3    4    5
## [2,]    6    7    8    9   10
## [3,]   11   12   13   14   15
## [4,]   16   17   18   19   20
## [5,]   21   22   23   24   25
vals[c(4, 15)]
## [1] 16 23
# Each row of select is a (row, column) pair
select <- matrix(ncol = 2, byrow = TRUE, c(
  1, 1,
  3, 1,
  2, 4
))
vals[select]
## [1] 1 11 9
Data frames have the characteristics of both lists and matrices.
When subsetting with a single index, they behave like lists and index the columns, so df[1:2] selects the first two columns.
When subsetting with two indices, they behave like matrices, so df[1:3, ] selects the first three rows (and all the columns).
Given
df <- data.frame(x = 1:3, y = 3:1, z = letters[1:3])
what is the output of each of the following?
df[df$x == 2, ]
df[c("x", "z")]
df[, c("x", "z")]
str(df["x"])
str(df[, "x"])
By default, subsetting a matrix or data frame with a single number, a single name, or a logical vector containing a single TRUE will simplify the returned output, i.e. it will return an object with lower dimensionality.
To preserve the original dimensionality, you must use drop = FALSE
a <- matrix(1:4, nrow = 2)
str(a[1, ])
## int [1:2] 1 3
str(a[1, , drop = FALSE])
## int [1, 1:2] 1 3
df <- data.frame(a = 1:2, b = 1:2)
str(df[, "a"])
## int [1:2] 1 2
str(df[, "a", drop = FALSE])
## 'data.frame':    2 obs. of  1 variable:
##  $ a: int  1 2
There are two other subsetting operators: [[ and $.
[[ is used for extracting single items, while x$y is a useful shorthand for x[["y"]]
[[
The primary use case for [[ is when working with lists, because subsetting a list with [ always gives you a list back.
If list x is a train carrying objects, then x[[5]] is the object in car 5; x[4:6] is a train of cars 4-6. (@RLangTip)
x <- list(1:3, "a", 4:6)
If you use a vector with [[, it will subset recursively, i.e. x[[c(1, 2)]] is equivalent to x[[1]][[2]].
$
$ is a shorthand operator: x$y is roughly equivalent to x[["y"]].
A common mistake with $ is to use it when you have the name of a column stored in a variable: if var <- "cyl", mtcars$var doesn't work because it is translated to mtcars[["var"]]. Instead use mtcars[[var]]
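The mistake is easy to reproduce. The sketch below uses the built-in mtcars data set; the names wrong and right are only illustrative.

```r
var <- "cyl"

wrong <- mtcars$var      # looks for a column literally called "var"
right <- mtcars[[var]]   # looks up the column whose name is stored in var

is.null(wrong)           # no such column, so $ silently returns NULL
head(right)
```

Because $ returns NULL rather than raising an error here, the bug can go unnoticed until a later step fails.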
One important difference between $ and [[ is that $ does (left-to-right) partial matching.
x <- list(abc = 1)
x$a
## [1] 1
x[["a"]]
## NULL
options(warnPartialMatchDollar = TRUE)
x$a  # still returns 1, but now with a warning about the partial match
## [1] 1
Remember: For data frames, you can also avoid this problem by using tibbles, which never do partial matching.
S4 objects have two additional subsetting operators: @ (equivalent to $) and slot() (equivalent to [[).
@ is more restrictive than $ in that it will return an error if the slot does not exist.
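A small sketch of both operators, assuming a toy S4 class named Person that is defined here purely for illustration (it does not appear in the slides):

```r
library(methods)

# Hypothetical S4 class, defined only for this example
setClass("Person", representation(name = "character", age = "numeric"))
p <- new("Person", name = "Shel", age = 30)

p@name           # @ plays the role of $
slot(p, "age")   # slot() plays the role of [[

# Asking for a slot that does not exist raises an error instead of returning NULL
res <- tryCatch(slot(p, "height"), error = function(e) "error")
res
```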
Subassignment: Combining subsetting operators with assignments to modify selected values in an input vector.
The basic form is x[i] <- value
Ensure that:
length(value) == length(x[i])
# wafanyikazi$new_var <- 1:10000
# Error in `$<-.data.frame`(`*tmp*`, new_var, value = 1:10000) :
#   replacement has 10000 rows, data has 500
i is unique
To remove a component from a list, use x[[i]] <- NULL
departments <- list("data", "operations", "finance")
departments
## [[1]]
## [1] "data"
## 
## [[2]]
## [1] "operations"
## 
## [[3]]
## [1] "finance"
departments[[3]] <- NULL
departments
## [[1]]
## [1] "data"
## 
## [[2]]
## [1] "operations"
To add a literal NULL, use x[i] <- list(NULL)
departments[3] <- list(NULL)
departments
## [[1]]
## [1] "data"
## 
## [[2]]
## [1] "operations"
## 
## [[3]]
## NULL
Lookup tables (character subsetting)
Matching and merging by hand (integer subsetting)
Random samples and bootstraps (integer subsetting)
Ordering (integer subsetting)
Expanding aggregated counts (integer subsetting)
Removing columns from data frames (character subsetting)
Selecting rows based on a condition (logical subsetting)
Boolean algebra versus sets (logical and integer subsetting)
Character matching
x <- c("m", "f", "u", "f", "f", "m", "m")
lookup <- c(m = "Male", f = "Female", u = NA)
lookup[x] ## Is this the same as saying look for x in the vector lookup? Is it also the same as using an ifelse function?
##        m        f        u        f        f        m        m 
##   "Male" "Female"       NA "Female" "Female"   "Male"   "Male"
We can exclude names in the results using:
unname(lookup[x])
## [1] "Male" "Female" NA "Female" "Female" "Male" "Male"
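To explore the question raised in the code comment, here is a rough comparison of the lookup-table approach with nested ifelse() calls. The names via_lookup and via_ifelse are illustrative only.

```r
x <- c("m", "f", "u", "f", "f", "m", "m")
lookup <- c(m = "Male", f = "Female", u = NA)

# Character subsetting: each value of x is used as a name to look up
via_lookup <- unname(lookup[x])

# The same mapping spelled out with nested ifelse() calls
via_ifelse <- ifelse(x == "m", "Male", ifelse(x == "f", "Female", NA))

identical(via_lookup, via_ifelse)
```

For this small mapping the two agree, but the lookup vector scales more gracefully: adding a category means adding one named element rather than another level of nesting.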
grades <- c(1, 2, 2, 3, 1)
info <- data.frame(
  grade = 3:1,
  desc = c("Excellent", "Good", "Poor"),
  fail = c(FALSE, FALSE, TRUE)
)
head(info)
##   grade      desc  fail
## 1     3 Excellent FALSE
## 2     2      Good FALSE
## 3     1      Poor  TRUE
Suppose we want to duplicate the info table so that we have a row for each value in grades.
match(needles, haystack) returns the position of each needle in the haystack.
What is the position of the needles [grades elements: (1, 2, 2, 3, 1)] in the haystack [info$grade: (3, 2, 1)]?
id <- match(grades, info$grade)
id
## [1] 3 2 2 1 3
info[id, ]
##     grade      desc  fail
## 3       1      Poor  TRUE
## 2       2      Good FALSE
## 2.1     2      Good FALSE
## 1       3 Excellent FALSE
## 3.1     1      Poor  TRUE
When matching on multiple columns, you will need to first collapse them into a single column (e.g. with interaction()).
## insert intersection code here
But the dplyr *_join() functions would be your best friends at this point.
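A minimal sketch of the collapse-then-match approach, assuming two made-up data frames (info2 and wanted are hypothetical names with invented values) keyed on both grade and term:

```r
# Hypothetical lookup table keyed on two columns
info2 <- data.frame(
  grade = c(1, 1, 2, 2),
  term  = c("a", "b", "a", "b"),
  desc  = c("Poor-a", "Poor-b", "Good-a", "Good-b")
)
# Hypothetical combinations we want to look up
wanted <- data.frame(grade = c(2, 1, 2), term = c("b", "a", "a"))

# Collapse the two key columns into a single key, then match as before
key_info   <- interaction(info2$grade, info2$term)
key_wanted <- interaction(wanted$grade, wanted$term)
id <- match(key_wanted, key_info)

info2[id, ]
```

match() compares factors by their labels, so the differing level sets of the two keys do not matter here.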
You can use integer indices to randomly sample or bootstrap a vector or data frame.
Use sample(n) to generate a random permutation of 1:n, and then use the results to subset the values.
Simulate a dataframe
df <- data.frame(
  names = c("John", "Teresa", "Shel", "Christine", "Brenda"),
  gender = c("M", "F", "F", "F", "F"),
  rshp = c("Father", "Mother", "Self", "Sister", "Sister")
)
df
##       names gender   rshp
## 1      John      M Father
## 2    Teresa      F Mother
## 3      Shel      F   Self
## 4 Christine      F Sister
## 5    Brenda      F Sister
Reorder the dataframe randomly
df[sample(nrow(df)), ]
##       names gender   rshp
## 1      John      M Father
## 5    Brenda      F Sister
## 4 Christine      F Sister
## 3      Shel      F   Self
## 2    Teresa      F Mother
Select two random rows
df[sample(nrow(df), 2), ]
##    names gender   rshp
## 5 Brenda      F Sister
## 3   Shel      F   Self
Select 7 bootstrap replicates
df[sample(nrow(df), 7, replace = TRUE), ]
##      names gender   rshp
## 5   Brenda      F Sister
## 3     Shel      F   Self
## 5.1 Brenda      F Sister
## 3.1   Shel      F   Self
## 3.2   Shel      F   Self
## 1     John      M Father
## 1.1   John      M Father
order() takes a vector as its input and returns an integer vector describing how to order the subsetted vector.
fam <- c("John", "Teresa", "Shel", "Christine", "Brenda")
order(fam) ## orders alphabetically (in ascending order by default)
## [1] 5 4 1 3 2
fam[order(fam)]
## [1] "Brenda"    "Christine" "John"      "Shel"      "Teresa"
## We can also order the vector in descending order
fam[order(fam, decreasing = TRUE)]
## [1] "Teresa"    "Shel"      "John"      "Christine" "Brenda"
NB: By default, any missing values will be put at the end of the vector; however, you can remove them with na.last = NA or put them at the front with na.last = FALSE.
# us <- c("Me", "You", NA)
# us[order(us)]
# us[order(us, na.last = FALSE)]
Use order() to order the values in a variable, or the variables themselves, in a data frame.
# Randomly reorder df
df2 <- df[sample(nrow(df)), 3:1]
df2
##     rshp gender     names
## 4 Sister      F Christine
## 2 Mother      F    Teresa
## 3   Self      F      Shel
## 1 Father      M      John
## 5 Sister      F    Brenda
# Order by one variable
df[order(df$gender), ]
##       names gender   rshp
## 2    Teresa      F Mother
## 3      Shel      F   Self
## 4 Christine      F Sister
## 5    Brenda      F Sister
## 1      John      M Father
# Order the variables themselves
df[, order(names(df))]
##   gender     names   rshp
## 1      M      John Father
## 2      F    Teresa Mother
## 3      F      Shel   Self
## 4      F Christine Sister
## 5      F    Brenda Sister
You can sort vectors directly with sort(), or use dplyr::arrange() to sort a data frame.
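As a quick sanity check in base R only, sort() and subsetting by order() should give the same result for a plain vector. The variable names below are illustrative.

```r
fam <- c("John", "Teresa", "Shel", "Christine", "Brenda")

sorted_direct  <- sort(fam)          # sorts the values directly
sorted_indexed <- fam[order(fam)]    # computes positions, then subsets

identical(sorted_direct, sorted_indexed)
```

order() remains the more flexible tool, since the same index vector can reorder other vectors or the rows of a data frame.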
df <- data.frame(x = c(2, 4, 1), y = c(9, 11, 6), n = c(3, 5, 1))
df
##   x  y n
## 1 2  9 3
## 2 4 11 5
## 3 1  6 1
rep(1:nrow(df), df$n)
## [1] 1 1 1 2 2 2 2 2 3
df[rep(1:nrow(df), df$n), ]
##     x  y n
## 1   2  9 3
## 1.1 2  9 3
## 1.2 2  9 3
## 2   4 11 5
## 2.1 4 11 5
## 2.2 4 11 5
## 2.3 4 11 5
## 2.4 4 11 5
## 3   1  6 1
Method 1: Set individual columns to NULL
df <- data.frame(
  names = c("John", "Teresa", "Shel", "Christine", "Brenda"),
  gender = c("M", "F", "F", "F", "F"),
  rshp = c("Father", "Mother", "Self", "Sister", "Sister")
)
df
##       names gender   rshp
## 1      John      M Father
## 2    Teresa      F Mother
## 3      Shel      F   Self
## 4 Christine      F Sister
## 5    Brenda      F Sister
## create a copy of the data frame
df2 <- df
## drop a column
df2$gender <- NULL
df2
##       names   rshp
## 1      John Father
## 2    Teresa Mother
## 3      Shel   Self
## 4 Christine Sister
## 5    Brenda Sister
Method 2: Subset to return only the columns you want
df[c("names", "rshp")]
##       names   rshp
## 1      John Father
## 2    Teresa Mother
## 3      Shel   Self
## 4 Christine Sister
## 5    Brenda Sister
Method 3: Use set operations to work out which columns to keep. This is useful when you only know the columns that you don't want.
to_keep <- setdiff(names(df), "gender")
to_keep
## [1] "names" "rshp"
df[to_keep]
##       names   rshp
## 1      John Father
## 2    Teresa Mother
## 3      Shel   Self
## 4 Christine Sister
## 5    Brenda Sister
library(rChambua)
head(wafanyikazi, n = 3)
##     Sid Gender Age       Department   Role Income Marital_Status  County
## 1 10715   Male  31          Finance    Mid   5991         Single  Kisumu
## 2 17041   Male  48 Research Analyst Junior   3387       Divorced   Wajir
## 3 16232   Male  35       Operations Junior   3170        Married Mombasa
##   Leave_Days Promotion
## 1         11        No
## 2          8       Yes
## 3          0        No
Select all juniors
df1 <- wafanyikazi[wafanyikazi$Role == "Junior", ]
head(df1, 3)
##     Sid Gender Age       Department   Role Income Marital_Status  County
## 2 17041   Male  48 Research Analyst Junior   3387       Divorced   Wajir
## 3 16232   Male  35       Operations Junior   3170        Married Mombasa
## 5 13463 Female  43        Associate Junior   1651        Married Nairobi
##   Leave_Days Promotion
## 2          8       Yes
## 3          0        No
## 5          2       Yes
Select females who come from Nyeri county
df2 <- wafanyikazi[wafanyikazi$Gender == "Female" & wafanyikazi$County == "Nyeri", ]
head(df2, 3)
##      Sid Gender Age Department   Role Income Marital_Status County Leave_Days
## 4  19576 Female  41    Finance Senior   5557        Married  Nyeri          8
## 34 18997 Female  32  Associate Senior   5340         Single  Nyeri         14
## 67 19891 Female  48 Operations Senior   9029        Married  Nyeri         14
##    Promotion
## 4         No
## 34        No
## 67        No
De Morgan's laws:
!(X & Y) is the same as !X | !Y
!(X | Y) is the same as !X & !Y
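Both laws can be checked directly on logical vectors. The sketch below uses two small made-up vectors, X and Y, that cover all four TRUE/FALSE combinations.

```r
# All four combinations of TRUE and FALSE
X <- c(TRUE, TRUE, FALSE, FALSE)
Y <- c(TRUE, FALSE, TRUE, FALSE)

law1 <- identical(!(X & Y), !X | !Y)
law2 <- identical(!(X | Y), !X & !Y)

c(law1, law2)
```

In practice these identities help simplify row filters: a condition written as !(a & b) can be rewritten as !a | !b when that reads more clearly.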
Two types of subsetting:
Integer subsetting (set operations). This is effective when you want to find the first (or last) TRUE, or when you have very few TRUEs and very many FALSEs; a set representation may be faster and require less storage.
Logical subsetting (Boolean algebra).
which() allows you to convert a Boolean representation to an integer representation.
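For instance, finding the first and last TRUE becomes simple once the logical vector has been converted to positions. The vector flags below is made up for illustration.

```r
# Hypothetical logical vector
flags <- c(FALSE, FALSE, TRUE, FALSE, TRUE, FALSE)

idx <- which(flags)              # integer positions of the TRUE values
first_true <- idx[1]
last_true  <- idx[length(idx)]

c(first = first_true, last = last_true)
```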
which(df$names %in% "John")
## [1] 1
You can create a function that does the reverse i.e. converts an integer representation to a Boolean representation.
Do we really need to do this?
unwhich <- function(x, n) {
  out <- rep_len(FALSE, n)  # start with all FALSE
  out[x] <- TRUE            # flip the requested positions to TRUE
  out
}
unwhich(which(df$names %in% "John"), length(df$names))
## [1] TRUE FALSE FALSE FALSE FALSE
When we can just do this?
df$names %in% "John"
## [1] TRUE FALSE FALSE FALSE FALSE
Create two logical vectors (x1, y1) and their integer equivalents (x2, y2):
(x1 <- 1:10 %% 2 == 0)
## [1] FALSE TRUE FALSE TRUE FALSE TRUE FALSE TRUE FALSE TRUE
(x2 <- which(x1))
## [1] 2 4 6 8 10
(y1 <- 1:10 %% 5 == 0)
## [1] FALSE FALSE FALSE FALSE TRUE FALSE FALSE FALSE FALSE TRUE
(y2 <- which(y1))
## [1] 5 10
x1 & y1
## [1] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE TRUE
intersect(x2, y2)
## [1] 10
x1 | y1
## [1] FALSE TRUE FALSE TRUE TRUE TRUE FALSE TRUE FALSE TRUE
union(x2, y2)
## [1] 2 4 6 8 10 5
x1 & !y1
## [1] FALSE TRUE FALSE TRUE FALSE TRUE FALSE TRUE FALSE FALSE
setdiff(x2, y2)
## [1] 2 4 6 8
xor(x1, y1)
## [1] FALSE TRUE FALSE TRUE TRUE TRUE FALSE TRUE FALSE FALSE
setdiff(union(x2, y2), intersect(x2, y2))
## [1] 2 4 6 8 5
# Read in the data
df <- rChambua::wafanyikazi
head(df, n = 3)
##     Sid Gender Age       Department   Role Income Marital_Status  County
## 1 10715   Male  31          Finance    Mid   5991         Single  Kisumu
## 2 17041   Male  48 Research Analyst Junior   3387       Divorced   Wajir
## 3 16232   Male  35       Operations Junior   3170        Married Mombasa
##   Leave_Days Promotion
## 1         11        No
## 2          8       Yes
## 3          0        No
# Permute columns
df1 <- df[, sample(names(df))]
head(df1, n = 3)
##     Role Marital_Status Leave_Days Gender   Sid Promotion       Department
## 1    Mid         Single         11   Male 10715        No          Finance
## 2 Junior       Divorced          8   Male 17041       Yes Research Analyst
## 3 Junior        Married          0   Male 16232        No       Operations
##    County Age Income
## 1  Kisumu  31   5991
## 2   Wajir  48   3387
## 3 Mombasa  35   3170
# Permute rows and columns
df2 <- df[sample(nrow(df)), sample(names(df))]
head(df2, n = 3)
##     Leave_Days   Role Marital_Status Age Income   Sid Promotion Gender  County
## 224         20 Junior        Married  33   7727 17704       Yes   Male Nairobi
## 405         18 Senior       Divorced  30   6425 17770        No   Male    Lamu
## 463         11 Junior       Divorced  29   3604 15549       Yes Female   Taita
##     Department
## 224    Finance
## 405       Data
## 463  Associate
# Generate a vector of the first and last row ids
first_last_ids <- c(1, nrow(df))
first_last_ids
## [1] 1 500
# Sample m (2) rows from the data frame, excluding the first and last rows
original_ids <- 1:nrow(df)
other_ids <- sample(original_ids[!original_ids %in% first_last_ids], 2)
other_ids
## [1] 289 488
# Combine the first row, the sampled rows in between, and the last row
final_ids <- c(first_last_ids[1], other_ids, first_last_ids[2])
final_ids
## [1] 1 289 488 500
# Subset the data to only these specific rows
df3 <- df[final_ids, ]
df3
##       Sid Gender Age Department   Role Income Marital_Status    County
## 1   10715   Male  31    Finance    Mid   5991         Single    Kisumu
## 289 14070   Male  24    Finance Junior   9680         Single Kirinyaga
## 488 19363 Female  38  Associate Junior   8378       Divorced     Nyeri
## 500 16114 Female  22    Finance Junior   2736       Divorced   Mombasa
##     Leave_Days Promotion
## 1           11        No
## 289          2       Yes
## 488          0        No
## 500         24       Yes
# Put the columns in alphabetical order
df4 <- df[, order(names(df))]
head(df4)
##   Age  County       Department Gender Income Leave_Days Marital_Status
## 1  31  Kisumu          Finance   Male   5991         11         Single
## 2  48   Wajir Research Analyst   Male   3387          8       Divorced
## 3  35 Mombasa       Operations   Male   3170          0        Married
## 4  41   Nyeri          Finance Female   5557          8        Married
## 5  43 Nairobi        Associate Female   1651          2        Married
## 6  30   Taita          Finance Female   6859          9         Single
##   Promotion   Role   Sid
## 1        No    Mid 10715
## 2       Yes Junior 17041
## 3        No Junior 16232
## 4        No Senior 19576
## 5       Yes Junior 13463
## 6       Yes Junior 19788
...