3.7 Data frames and tibbles
Credit: Advanced R by Hadley Wickham
3.7.1 Data frame
A data frame is a:
- Named list of vectors (the names are the column names)
- Attributes: (column) names and row.names
- Class: "data.frame"
# Construct
df <- data.frame(
col1 = c(1, 2, 3), # named atomic vector
col2 = c("un", "deux", "trois") # another named atomic vector
# , stringsAsFactors = FALSE # the default since R 4.0.0
)
# Inspect
df
#> col1 col2
#> 1 1 un
#> 2 2 deux
#> 3 3 trois
# Deconstruct
# - type
typeof(df)
#> [1] "list"
# - attributes
attributes(df)
#> $names
#> [1] "col1" "col2"
#>
#> $class
#> [1] "data.frame"
#>
#> $row.names
#> [1] 1 2 3
rownames(df)
#> [1] "1" "2" "3"
colnames(df)
#> [1] "col1" "col2"
names(df) # Same as colnames(df)
#> [1] "col1" "col2"
nrow(df)
#> [1] 3
ncol(df)
#> [1] 2
length(df) # Same as ncol(df)
#> [1] 2
Unlike other lists, the length of each vector must be the same (i.e. as many vector elements as rows in the data frame).
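The equal-length rule can be checked with a minimal sketch. The object names l and res are illustrative only. Note that data.frame() does recycle shorter vectors whose length divides the number of rows evenly, so the error only appears when the lengths cannot be reconciled.

```r
# A plain list accepts elements of different lengths
l <- list(col1 = 1:3, col2 = c("un", "deux"))
lengths(l)

# A data frame rejects lengths that cannot be recycled to a common row count
res <- tryCatch(
  data.frame(col1 = 1:3, col2 = c("un", "deux")),
  error = function(e) conditionMessage(e)
)
res
#> [1] "arguments imply differing number of rows: 3, 2"
```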
3.7.2 Tibble
Created to relieve some of the frustrations and pain points created by data frames, tibbles are data frames that are:
- Lazy (do less)
- Surly (complain more)
3.7.2.1 Lazy
Tibbles do not:
- Coerce strings
- Transform non-syntactic names
- Recycle vectors of length greater than 1
Coerce strings
chr_col <- c("don't", "factor", "me", "bro")
# data frame
df <- data.frame(
a = chr_col,
# before R 4.0.0, this was the default
stringsAsFactors = TRUE
)
# tibble
tbl <- tibble::tibble(
a = chr_col
)
# contrast the structure
str(df$a)
#> Factor w/ 4 levels "bro","don't",..: 2 3 4 1
str(tbl$a)
#> chr [1:4] "don't" "factor" "me" "bro"
Transform non-syntactic names
# data frame
df <- data.frame(
`1` = c(1, 2, 3)
)
# tibble
tbl <- tibble::tibble(
`1` = c(1, 2, 3)
)
# contrast the names
names(df)
#> [1] "X1"
names(tbl)
#> [1] "1"
Recycle vectors of length greater than 1
# data frame
df <- data.frame(
col1 = c(1, 2, 3, 4),
col2 = c(1, 2)
)
# tibble
tbl <- tibble::tibble(
col1 = c(1, 2, 3, 4),
col2 = c(1, 2)
)
#> Error in `tibble::tibble()`:
#> ! Tibble columns must have compatible sizes.
#> • Size 4: Existing data.
#> • Size 2: Column `col2`.
#> ℹ Only values of size one are recycled.
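For contrast, here is a short sketch of what the data frame call above silently produced, and of the one kind of recycling a tibble does accept (a length-one value). The value 0 is an arbitrary choice for illustration.

```r
# data frame: col2 is recycled twice to fill four rows
df <- data.frame(
  col1 = c(1, 2, 3, 4),
  col2 = c(1, 2)
)
df$col2
#> [1] 1 2 1 2

# tibble: only a length-one value is recycled
tbl <- tibble::tibble(
  col1 = c(1, 2, 3, 4),
  col2 = 0
)
tbl$col2
#> [1] 0 0 0 0
```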
3.7.2.2 Surly
Tibbles do only what they’re asked and complain if what they’re asked doesn’t make sense:
- Subsetting always yields a tibble
- Complains if cannot find column
Subsetting always yields a tibble
# data frame
df <- data.frame(
col1 = c(1, 2, 3, 4)
)
# tibble
tbl <- tibble::tibble(
col1 = c(1, 2, 3, 4)
)
# contrast
df_col <- df[, "col1"]
str(df_col)
#> num [1:4] 1 2 3 4
tbl_col <- tbl[, "col1"]
str(tbl_col)
#> tibble [4 × 1] (S3: tbl_df/tbl/data.frame)
#> $ col1: num [1:4] 1 2 3 4
# to select a vector, do one of these instead
tbl_col_1 <- tbl[["col1"]]
str(tbl_col_1)
#> num [1:4] 1 2 3 4
tbl_col_2 <- dplyr::pull(tbl, col1)
str(tbl_col_2)
#> num [1:4] 1 2 3 4
Complains if cannot find column
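A minimal sketch of this behaviour, assuming the tibble package is installed. The names msg and res are illustrative, and the warning is captured rather than printed so the example runs quietly.

```r
df <- data.frame(col1 = c(1, 2, 3, 4))
tbl <- tibble::tibble(col1 = c(1, 2, 3, 4))

# data frame: `$` partially matches "col" to "col1" without comment
df$col
#> [1] 1 2 3 4

# tibble: no partial matching; the lookup returns NULL and raises a warning
msg <- NULL
res <- withCallingHandlers(
  tbl$col,
  warning = function(w) {
    msg <<- conditionMessage(w)   # keep the warning text for inspection
    invokeRestart("muffleWarning")
  }
)
res
#> NULL
msg
```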
3.7.2.3 One more difference
tibble() allows you to refer to variables created during construction:
tibble::tibble(
x = 1:3,
y = x * 2 # x refers to the line above
)
#> # A tibble: 3 × 2
#> x y
#> <int> <dbl>
#> 1 1 2
#> 2 2 4
#> 3 3 6
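By contrast, data.frame() evaluates its arguments in the calling environment, so the same pattern fails there. This is a sketch that assumes no object called x_new exists in that environment; x_new and y_new are hypothetical names chosen to avoid clashes.

```r
# data.frame() cannot see columns defined earlier in the same call
res <- tryCatch(
  data.frame(
    x_new = 1:3,
    y_new = x_new * 2
  ),
  error = function(e) conditionMessage(e)
)
res
#> [1] "object 'x_new' not found"
```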
Side Quest: Row Names
- character vector containing only unique values
- get and set with rownames()
- can use them to subset rows
df3 <- data.frame(
age = c(35, 27, 18),
hair = c("blond", "brown", "black"),
row.names = c("Bob", "Susan", "Sam")
)
df3
#> age hair
#> Bob 35 blond
#> Susan 27 brown
#> Sam 18 black
rownames(df3)
#> [1] "Bob" "Susan" "Sam"
df3["Bob", ]
#> age hair
#> Bob 35 blond
rownames(df3) <- c("Susan", "Bob", "Sam")
rownames(df3)
#> [1] "Susan" "Bob" "Sam"
df3["Bob", ]
#> age hair
#> Bob 27 brown
There are three reasons why row names are undesirable:
- Metadata is data, so storing it in a different way to the rest of the data is fundamentally a bad idea.
- Row names are a poor abstraction for labelling rows because they only work when a row can be identified by a single string. This fails in many cases.
- Row names must be unique, so any duplication of rows (e.g. from bootstrapping) will create new row names.
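One common way around these problems is to move the row names into an ordinary column. The sketch below uses the rownames argument of tibble::as_tibble(); the column name "name" is an arbitrary choice.

```r
df3 <- data.frame(
  age = c(35, 27, 18),
  hair = c("blond", "brown", "black"),
  row.names = c("Bob", "Susan", "Sam")
)

# Store the row names as a regular column instead of metadata
tbl3 <- tibble::as_tibble(df3, rownames = "name")
names(tbl3)
#> [1] "name" "age"  "hair"
tbl3$name
#> [1] "Bob"   "Susan" "Sam"
```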
3.7.4 Subsetting
Two undesirable subsetting behaviours:
- When you subset columns with df[, vars], you will get a vector if vars selects one variable, otherwise you’ll get a data frame, unless you always remember to use df[, vars, drop = FALSE].
- When you attempt to extract a single column with df$x and there is no column x, a data frame will instead select any variable that starts with x. If no variable starts with x, df$x will return NULL.
Tibbles tweak these behaviours so that a [ always returns a tibble, and a $ doesn’t do partial matching and warns if it can’t find a variable (this is what makes tibbles surly).
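Both data frame behaviours can be reproduced with base R alone. In this sketch the column names xyz and abc are made up for illustration.

```r
df <- data.frame(xyz = 1:3, abc = letters[1:3])

# Selecting one column with [ drops to a vector by default
one_col <- df[, "xyz"]
is.data.frame(one_col)
#> [1] FALSE

# drop = FALSE keeps the data frame structure
kept <- df[, "xyz", drop = FALSE]
is.data.frame(kept)
#> [1] TRUE

# $ partially matches: "x" is completed to "xyz"
df$x
#> [1] 1 2 3

# No column starts with "q", so the result is NULL
df$q
#> NULL
```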
3.7.5 Testing
Whether data frame: is.data.frame(). Note: both data frames and tibbles are data frames.
Whether tibble: tibble::is_tibble(). Note: only tibbles are tibbles. Vanilla data frames are not.
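A quick check of both tests on one object of each kind, assuming the tibble package is installed:

```r
df <- data.frame(x = 1:3)
tbl <- tibble::tibble(x = 1:3)

is.data.frame(df)
#> [1] TRUE
is.data.frame(tbl)
#> [1] TRUE

tibble::is_tibble(df)
#> [1] FALSE
tibble::is_tibble(tbl)
#> [1] TRUE
```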
3.7.7 List Columns
List-columns are allowed in data frames, but you have to do a little extra work: either add the list-column after creation or wrap the list in I().
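The two data frame routes can be sketched as follows, with a tibble for comparison, since tibble() accepts a list directly. The object names are illustrative.

```r
# Option 1: add the list-column after creation
df <- data.frame(x = 1:3)
df$y <- list(1:2, 1:3, 1:4)

# Option 2: wrap the list in I() so data.frame() treats it as one column
df2 <- data.frame(
  x = 1:3,
  y = I(list(1:2, 1:3, 1:4))
)

# tibble: a list can be passed as a column without extra work
tbl <- tibble::tibble(
  x = 1:3,
  y = list(1:2, 1:3, 1:4)
)

# Each row of y holds a vector of a different length
lengths(df$y)
#> [1] 2 3 4
```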
3.7.8 Matrix and data frame columns
- As long as the number of rows matches the data frame, it’s also possible to have a matrix or data frame as a column of a data frame.
- As with list-columns, you must either add the column after creation or wrap it in I()
dfm <- data.frame(
x = 1:3 * 10,
y = I(matrix(1:9, nrow = 3))
)
dfm$z <- data.frame(a = 3:1, b = letters[1:3], stringsAsFactors = FALSE)
str(dfm)
#> 'data.frame': 3 obs. of 3 variables:
#> $ x: num 10 20 30
#> $ y: 'AsIs' int [1:3, 1:3] 1 2 3 4 5 6 7 8 9
#> $ z:'data.frame': 3 obs. of 2 variables:
#> ..$ a: int 3 2 1
#> ..$ b: chr "a" "b" "c"
dfm$y
#> [,1] [,2] [,3]
#> [1,] 1 4 7
#> [2,] 2 5 8
#> [3,] 3 6 9
dfm$z
#> a b
#> 1 3 a
#> 2 2 b
#> 3 1 c
3.7.9 Exercises
- Can you have a data frame with zero rows? What about zero columns?
Answer(s)
Yes, you can create these data frames easily; either during creation or via subsetting. Even both dimensions can be zero. Create a 0-row, 0-column, or an empty data frame directly:
data.frame(a = integer(), b = logical())
#> [1] a b
#> <0 rows> (or 0-length row.names)
data.frame(row.names = 1:3) # or data.frame()[1:3, ]
#> data frame with 0 columns and 3 rows
data.frame()
#> data frame with 0 columns and 0 rows
Create similar data frames by subsetting the respective dimension with 0, NULL, FALSE, or a valid 0-length atomic vector (logical(0), character(0), integer(0), double(0)). Negative integer sequences would also work. Subsetting with a zero, as in df[0, ] or df[, 0], drops every row or column while keeping the other dimension.
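A minimal sketch of zero-subsetting, using a small made-up data frame rather than any particular dataset:

```r
df <- data.frame(a = 1:3, b = c(TRUE, FALSE, TRUE))

# Zero rows, columns kept
dim(df[0, ])
#> [1] 0 2

# Zero columns, rows kept
dim(df[, 0])
#> [1] 3 0

# Both dimensions empty
dim(df[0, 0])
#> [1] 0 0
```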
- What happens if you attempt to set rownames that are not unique?
Answer(s)
Matrices can have duplicated row names, so this does not cause problems.
Data frames, however, require unique rownames and you get different results depending on how you attempt to set them. If you set them directly or via row.names(), you get an error:
data.frame(row.names = c("x", "y", "y"))
#> Error in data.frame(row.names = c("x", "y", "y")): duplicate row.names: y
df <- data.frame(x = 1:3)
row.names(df) <- c("x", "y", "y")
#> Warning: non-unique value when setting 'row.names': 'y'
#> Error in `.rowNamesDF<-`(x, value = value): duplicate 'row.names' are not allowed
If you use subsetting, [ automatically deduplicates by appending a numeric suffix to repeated row names.
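The deduplication by [ can be seen by selecting the same row several times. This sketch assigns unique row names first; the name dup is illustrative.

```r
df <- data.frame(x = 1:3)
row.names(df) <- c("x", "y", "z")

# Select row 1 three times; repeated row names get numeric suffixes
dup <- df[c(1, 1, 1), , drop = FALSE]
row.names(dup)
#> [1] "x"   "x.1" "x.2"
```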
- If df is a data frame, what can you say about t(df), and t(t(df))? Perform some experiments, making sure to try different column types.
Answer(s)
Both t(df) and t(t(df)) will return matrices:
df <- data.frame(x = 1:3, y = letters[1:3])
is.matrix(df)
#> [1] FALSE
is.matrix(t(df))
#> [1] TRUE
is.matrix(t(t(df)))
#> [1] TRUE
The dimensions follow the usual transposition rules: a 3 × 2 data frame becomes a 2 × 3 matrix, and transposing again restores 3 × 2.
Because the output is a matrix, every column is coerced to the same type. (This is implemented within t.data.frame() via as.matrix(), which is described below.)
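The dimension and coercion claims can be checked on the same df used in the is.matrix() calls:

```r
df <- data.frame(x = 1:3, y = letters[1:3])

# Transposition swaps the dimensions, and swaps them back
dim(df)
#> [1] 3 2
dim(t(df))
#> [1] 2 3
dim(t(t(df)))
#> [1] 3 2

# The integer column x is coerced to character alongside y
typeof(t(df))
#> [1] "character"
```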
- What does as.matrix() do when applied to a data frame with columns of different types? How does it differ from data.matrix()?
Answer(s)
The type of the result of as.matrix() depends on the types of the input columns (see ?as.matrix):
The method for data frames will return a character matrix if there is only atomic columns and any non-(numeric/logical/complex) column, applying as.vector to factors and format to other non-character columns. Otherwise the usual coercion hierarchy (logical < integer < double < complex) will be used, e.g. all-logical data frames will be coerced to a logical matrix, mixed logical-integer will give an integer matrix, etc.
On the other hand, data.matrix() will always return a numeric matrix (see ?data.matrix):
Return the matrix obtained by converting all the variables in a data frame to numeric mode and then binding them together as the columns of a matrix. Factors and ordered factors are replaced by their internal codes. […] Character columns are first converted to factors and then to integers.
We can illustrate and compare the mechanics of these functions using a concrete example. as.matrix() makes it possible to retrieve most of the original information from the data frame but leaves us with characters. To retrieve all information from data.matrix()’s output, we would need a lookup table for each column.
df_coltypes <- data.frame(
a = c("a", "b"),
b = c(TRUE, FALSE),
c = c(1L, 0L),
d = c(1.5, 2),
e = factor(c("f1", "f2"))
)
as.matrix(df_coltypes)
#> a b c d e
#> [1,] "a" "TRUE" "1" "1.5" "f1"
#> [2,] "b" "FALSE" "0" "2.0" "f2"
data.matrix(df_coltypes)
#> a b c d e
#> [1,] 1 1 1 1.5 1
#> [2,] 2 0 0 2.0 2