3.7 Data frames and tibbles

Credit: Advanced R by Hadley Wickham

3.7.1 Data frame

A data frame is a:

  • Named list of vectors (the list names are the column names)
  • Attributes:
    • (column) names
    • row.names
    • class: “data.frame”

# Construct
df <- data.frame(
  col1 = c(1, 2, 3),              # named atomic vector
  col2 = c("un", "deux", "trois") # another named atomic vector
  # ,stringsAsFactors = FALSE # the default since R 4.0.0
)

# Inspect
df
#>   col1  col2
#> 1    1    un
#> 2    2  deux
#> 3    3 trois

# Deconstruct
# - type
typeof(df)
#> [1] "list"
# - attributes
attributes(df)
#> $names
#> [1] "col1" "col2"
#> 
#> $class
#> [1] "data.frame"
#> 
#> $row.names
#> [1] 1 2 3
rownames(df)
#> [1] "1" "2" "3"
colnames(df)
#> [1] "col1" "col2"
names(df) # Same as colnames(df)
#> [1] "col1" "col2"

nrow(df) 
#> [1] 3
ncol(df)
#> [1] 2
length(df) # Same as ncol(df)
#> [1] 2

Unlike other lists, the length of each vector must be the same (i.e. as many vector elements as rows in the data frame).
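
The equal-length constraint can be checked with a small base R sketch. The names lst and res are illustrative only. A plain list accepts elements of different lengths, while data.frame() refuses lengths it cannot reconcile to a common number of rows:

```r
# A plain list can hold vectors of different lengths.
lst <- list(a = 1:3, b = 1:2)   # hypothetical example object
lengths(lst)
#> a b 
#> 3 2 

# data.frame() rejects the same input: lengths 3 and 2 cannot be reconciled.
res <- tryCatch(
  data.frame(a = 1:3, b = 1:2),
  error = function(e) conditionMessage(e)  # keep the message instead of stopping
)
res
#> [1] "arguments imply differing number of rows: 3, 2"
```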

3.7.2 Tibble

Designed to relieve some of the frustrations and pain points of data frames, tibbles are data frames that are:

  • Lazy (do less)
  • Surly (complain more)

3.7.2.1 Lazy

Tibbles do not:

  • Coerce strings
  • Transform non-syntactic names
  • Recycle vectors of length greater than 1

Coerce strings

chr_col <- c("don't", "factor", "me", "bro")

# data frame
df <- data.frame(
  a = chr_col,
  # before R 4.0.0, this was the default
  stringsAsFactors = TRUE
)

# tibble
tbl <- tibble::tibble(
  a = chr_col
)

# contrast the structure
str(df$a)
#>  Factor w/ 4 levels "bro","don't",..: 2 3 4 1
str(tbl$a)
#>  chr [1:4] "don't" "factor" "me" "bro"

Transform non-syntactic names

# data frame
df <- data.frame(
  `1` = c(1, 2, 3)
)

# tibble
tbl <- tibble::tibble(
  `1` = c(1, 2, 3)
)

# contrast the names
names(df)
#> [1] "X1"
names(tbl)
#> [1] "1"

Recycle vectors of length greater than 1

# data frame
df <- data.frame(
  col1 = c(1, 2, 3, 4),
  col2 = c(1, 2)
)

# tibble
tbl <- tibble::tibble(
  col1 = c(1, 2, 3, 4),
  col2 = c(1, 2)
)
#> Error in `tibble::tibble()`:
#> ! Tibble columns must have compatible sizes.
#> • Size 4: Existing data.
#> • Size 2: Column `col2`.
#> ℹ Only values of size one are recycled.
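
For contrast, the data.frame() call above succeeds without any message, because base R recycles a shorter column whose length divides the number of rows evenly. A quick check of what ends up in the column (df_recycled is an illustrative name):

```r
df_recycled <- data.frame(
  col1 = c(1, 2, 3, 4),
  col2 = c(1, 2)        # silently recycled to length 4
)
df_recycled$col2
#> [1] 1 2 1 2
```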

3.7.2.2 Surly

Tibbles do only what they’re asked and complain if what they’re asked doesn’t make sense:

  • Subsetting always yields a tibble
  • Complains if cannot find column

Subsetting always yields a tibble

# data frame
df <- data.frame(
  col1 = c(1, 2, 3, 4)
)

# tibble
tbl <- tibble::tibble(
  col1 = c(1, 2, 3, 4)
)

# contrast
df_col <- df[, "col1"]
str(df_col)
#>  num [1:4] 1 2 3 4
tbl_col <- tbl[, "col1"]
str(tbl_col)
#> tibble [4 × 1] (S3: tbl_df/tbl/data.frame)
#>  $ col1: num [1:4] 1 2 3 4

# to select a vector, do one of these instead
tbl_col_1 <- tbl[["col1"]]
str(tbl_col_1)
#>  num [1:4] 1 2 3 4
tbl_col_2 <- dplyr::pull(tbl, col1)
str(tbl_col_2)
#>  num [1:4] 1 2 3 4

Complains if cannot find column

names(df)
#> [1] "col1"
df$col
#> [1] 1 2 3 4

names(tbl)
#> [1] "col1"
tbl$col
#> Warning: Unknown or uninitialised column: `col`.
#> NULL
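
The partial matching comes from `$`. As a small base R sketch (not part of the original notes), `[[` matches names exactly unless told otherwise, so it is one way to avoid the surprise with a plain data frame:

```r
df <- data.frame(col1 = c(1, 2, 3, 4))

df$col                        # `$` partially matches "col" to "col1"
#> [1] 1 2 3 4
df[["col"]]                   # `[[` requires an exact match by default
#> NULL
df[["col", exact = FALSE]]    # partial matching has to be requested
#> [1] 1 2 3 4
```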

3.7.2.3 One more difference

tibble() allows you to refer to variables created earlier in the same call:

tibble::tibble(
  x = 1:3,
  y = x * 2 # x refers to the column created on the line above
)
#> # A tibble: 3 × 2
#>       x     y
#>   <int> <dbl>
#> 1     1     2
#> 2     2     4
#> 3     3     6

Side Quest: Row Names

  • character vector containing only unique values
  • get and set with rownames()
  • can use them to subset rows

df3 <- data.frame(
  age = c(35, 27, 18),
  hair = c("blond", "brown", "black"),
  row.names = c("Bob", "Susan", "Sam")
)
df3
#>       age  hair
#> Bob    35 blond
#> Susan  27 brown
#> Sam    18 black

rownames(df3)
#> [1] "Bob"   "Susan" "Sam"
df3["Bob", ]
#>     age  hair
#> Bob  35 blond

rownames(df3) <- c("Susan", "Bob", "Sam")
rownames(df3)
#> [1] "Susan" "Bob"   "Sam"
df3["Bob", ]
#>     age  hair
#> Bob  27 brown

There are three reasons why row names are undesirable:

  1. Metadata is data, so storing it in a different way to the rest of the data is fundamentally a bad idea.
  2. Row names are a poor abstraction for labelling rows because they only work when a row can be identified by a single string. This fails in many cases.
  3. Row names must be unique, so any duplication of rows (e.g. from bootstrapping) will create new row names.
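
One way to act on the first point is to move the row names into an ordinary column. A minimal base R sketch, where the column name `name` and the object df3_col are assumptions made for illustration:

```r
df3 <- data.frame(
  age = c(35, 27, 18),
  hair = c("blond", "brown", "black"),
  row.names = c("Bob", "Susan", "Sam")
)

# Copy the row names into a regular column, then reset them to the default.
df3_col <- cbind(name = rownames(df3), df3)   # `name` is a hypothetical column name
rownames(df3_col) <- NULL

names(df3_col)
#> [1] "name" "age"  "hair"
df3_col[df3_col$name == "Bob", "age"]         # look up by value, not by row name
#> [1] 35
```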

3.7.3 Printing

Data frames and tibbles print differently. Note that tibble::as_tibble() also drops the row names by default:

df3
#>       age  hair
#> Susan  35 blond
#> Bob    27 brown
#> Sam    18 black
tibble::as_tibble(df3)
#> # A tibble: 3 × 2
#>     age hair 
#>   <dbl> <chr>
#> 1    35 blond
#> 2    27 brown
#> 3    18 black

3.7.4 Subsetting

Two undesirable subsetting behaviours:

  1. When you subset columns with df[, vars], you will get a vector if vars selects one variable, otherwise you’ll get a data frame, unless you always remember to use df[, vars, drop = FALSE].
  2. When you attempt to extract a single column with df$x and there is no column x, a data frame will instead select any variable that starts with x. If no variable starts with x, df$x will return NULL.

Tibbles tweak these behaviours so that a [ always returns a tibble, and a $ doesn’t do partial matching and warns if it can’t find a variable (this is what makes tibbles surly).
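
The first behaviour is easy to reproduce in base R. The sketch below uses an illustrative data frame and shows how drop = FALSE, or list-style subsetting with a single index, keeps the result a data frame:

```r
df <- data.frame(x = 1:3, y = c("a", "b", "c"))

one_col    <- df[, "x"]                 # simplified to a vector
one_col_df <- df[, "x", drop = FALSE]   # stays a data frame
one_col_ls <- df["x"]                   # list-style subsetting also keeps it

is.data.frame(one_col)
#> [1] FALSE
is.data.frame(one_col_df)
#> [1] TRUE
is.data.frame(one_col_ls)
#> [1] TRUE
```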

3.7.5 Testing

Whether data frame: is.data.frame(). Note: both data frames and tibbles are data frames, so this returns TRUE for either.

Whether tibble: tibble::is_tibble(). Note: only tibbles are tibbles. Vanilla data frames are not.

3.7.6 Coercion

  • To data frame: as.data.frame()
  • To tibble: tibble::as_tibble()
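
The testing and coercion functions can be tried together in a short round trip. This is a sketch only: the requireNamespace() guard is an addition so that the code still runs where the tibble package is not installed, and the object names are illustrative.

```r
df <- data.frame(x = 1:3)
is.data.frame(df)
#> [1] TRUE

# The tibble part only runs if the tibble package is available.
if (requireNamespace("tibble", quietly = TRUE)) {
  tbl <- tibble::as_tibble(df)       # data frame -> tibble
  print(is.data.frame(tbl))          # a tibble is still a data frame
  print(tibble::is_tibble(tbl))      # and it is a tibble
  print(tibble::is_tibble(df))       # a plain data frame is not

  back <- as.data.frame(tbl)         # tibble -> plain data frame
  print(class(back))
}
```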

3.7.7 List Columns

List-columns are allowed in data frames, but you have to do a little extra work: either add the list-column after creation or wrap the list in I().

df4 <- data.frame(x = 1:3)
df4$y <- list(1:2, 1:3, 1:4)
df4
#>   x          y
#> 1 1       1, 2
#> 2 2    1, 2, 3
#> 3 3 1, 2, 3, 4

df5 <- data.frame(
  x = 1:3, 
  y = I(list(1:2, 1:3, 1:4))
)
df5
#>   x          y
#> 1 1       1, 2
#> 2 2    1, 2, 3
#> 3 3 1, 2, 3, 4
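
To see why the extra work is needed, the sketch below passes the list directly to data.frame(), which treats each list element as a separate column and fails. It then shows how to reach into a list-column once it exists. The name bad is illustrative:

```r
# Without I(), each list element is treated as its own column.
bad <- tryCatch(
  data.frame(x = 1:3, y = list(1:2, 1:3, 1:4)),
  error = function(e) conditionMessage(e)
)
grepl("differing number of rows", bad)
#> [1] TRUE

# Adding the list-column after creation works.
df4 <- data.frame(x = 1:3)
df4$y <- list(1:2, 1:3, 1:4)

lengths(df4$y)   # each cell holds a vector of a different length
#> [1] 2 3 4
df4$y[[3]]       # extract the contents of a single cell
#> [1] 1 2 3 4
```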

3.7.8 Matrix and data frame columns

  • As long as the number of rows matches the data frame, it’s also possible to have a matrix or data frame as a column of a data frame.
  • As with list-columns, you must either add the column after creation or wrap it in I().

dfm <- data.frame(
  x = 1:3 * 10,
  y = I(matrix(1:9, nrow = 3))
)

dfm$z <- data.frame(a = 3:1, b = letters[1:3], stringsAsFactors = FALSE)

str(dfm)
#> 'data.frame':    3 obs. of  3 variables:
#>  $ x: num  10 20 30
#>  $ y: 'AsIs' int [1:3, 1:3] 1 2 3 4 5 6 7 8 9
#>  $ z:'data.frame':   3 obs. of  2 variables:
#>   ..$ a: int  3 2 1
#>   ..$ b: chr  "a" "b" "c"
dfm$y
#>      [,1] [,2] [,3]
#> [1,]    1    4    7
#> [2,]    2    5    8
#> [3,]    3    6    9
dfm$z
#>   a b
#> 1 3 a
#> 2 2 b
#> 3 1 c
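
The example above uses I() for the matrix. As a complementary sketch, a matrix can also be added after creation, and the result shows that the whole matrix counts as one column of the data frame. dfm2 is an illustrative name:

```r
dfm2 <- data.frame(x = 1:3 * 10)
dfm2$y <- matrix(1:9, nrow = 3)   # added after creation, no I() needed

ncol(dfm2)      # the matrix counts as a single column
#> [1] 2
dim(dfm2$y)     # but it keeps its own dimensions
#> [1] 3 3
dfm2$y[2, 3]    # row 2, column 3 of the matrix column
#> [1] 8
```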

3.7.9 Exercises

  1. Can you have a data frame with zero rows? What about zero columns?
Answer(s)

Yes, you can create these data frames easily, either during creation or via subsetting. Both dimensions can even be zero. Create a 0-row, 0-column, or an empty data frame directly:

data.frame(a = integer(), b = logical())
#> [1] a b
#> <0 rows> (or 0-length row.names)

data.frame(row.names = 1:3)  # or data.frame()[1:3, ]
#> data frame with 0 columns and 3 rows

data.frame()
#> data frame with 0 columns and 0 rows

Create similar data frames via subsetting the respective dimension with either 0, NULL, FALSE or a valid 0-length atomic (logical(0), character(0), integer(0), double(0)). Negative integer sequences would also work. The following example uses a zero:

mtcars[0, ]
#>  [1] mpg  cyl  disp hp   drat wt   qsec vs   am   gear carb
#> <0 rows> (or 0-length row.names)

mtcars[ , 0]  # or mtcars[0]
#> data frame with 0 columns and 32 rows

mtcars[0, 0]
#> data frame with 0 columns and 0 rows
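
A few of the other zero-length selectors mentioned above can be checked the same way. This sketch only inspects the dimensions of each result:

```r
dim(mtcars[FALSE, ])          # logical FALSE selects no rows
#> [1]  0 11
dim(mtcars[integer(0), ])     # a 0-length integer vector selects no rows
#> [1]  0 11
dim(mtcars[, character(0)])   # a 0-length character vector selects no columns
#> [1] 32  0
```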

  2. What happens if you attempt to set rownames that are not unique?
Answer(s)

Matrices can have duplicated row names, so this does not cause problems.

Data frames, however, require unique rownames and you get different results depending on how you attempt to set them. If you set them directly or via row.names(), you get an error:

data.frame(row.names = c("x", "y", "y"))
#> Error in data.frame(row.names = c("x", "y", "y")): duplicate row.names: y

df <- data.frame(x = 1:3)
row.names(df) <- c("x", "y", "y")
#> Warning: non-unique value when setting 'row.names': 'y'
#> Error in `.rowNamesDF<-`(x, value = value): duplicate 'row.names' are not allowed

If you use subsetting, [ automatically deduplicates:

row.names(df) <- c("x", "y", "z")
df[c(1, 1, 1), , drop = FALSE]
#>     x
#> x   1
#> x.1 1
#> x.2 1
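
The suffixes follow the same pattern that base R's make.unique() produces, so calling it directly shows the naming scheme in isolation. Treat the link between `[` and make.unique() as an assumption rather than something these notes establish:

```r
# make.unique() appends .1, .2, ... to repeated values.
make.unique(c("x", "x", "x"))
#> [1] "x"   "x.1" "x.2"

# The row names after duplicating a row match that pattern.
df <- data.frame(x = 1:3, row.names = c("x", "y", "z"))
rownames(df[c(1, 1, 1), , drop = FALSE])
#> [1] "x"   "x.1" "x.2"
```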

  3. If df is a data frame, what can you say about t(df), and t(t(df))? Perform some experiments, making sure to try different column types.
Answer(s)

Both t(df) and t(t(df)) will return matrices:

df <- data.frame(x = 1:3, y = letters[1:3])
is.matrix(df)
#> [1] FALSE
is.matrix(t(df))
#> [1] TRUE
is.matrix(t(t(df)))
#> [1] TRUE

The dimensions will respect the typical transposition rules:

dim(df)
#> [1] 3 2
dim(t(df))
#> [1] 2 3
dim(t(t(df)))
#> [1] 3 2

Because the output is a matrix, every column is coerced to the same type. (t.data.frame() does this via as.matrix(), which is described in the next exercise.)

df
#>   x y
#> 1 1 a
#> 2 2 b
#> 3 3 c
t(df)
#>   [,1] [,2] [,3]
#> x "1"  "2"  "3" 
#> y "a"  "b"  "c"
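
Following the exercise's hint to try different column types, the sketch below uses an all-numeric data frame (df_num is an illustrative name). The common type is then double rather than character, and the round trip still does not give back a data frame:

```r
df_num <- data.frame(x = 1:3, y = c(1.5, 2.5, 3.5))

typeof(t(df_num))              # integer and double columns meet at double
#> [1] "double"
is.data.frame(t(t(df_num)))    # transposing twice does not restore the data frame
#> [1] FALSE
dim(t(t(df_num)))
#> [1] 3 2
```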

  4. What does as.matrix() do when applied to a data frame with columns of different types? How does it differ from data.matrix()?
Answer(s)

The type of the result of as.matrix() depends on the types of the input columns (see ?as.matrix):

The method for data frames will return a character matrix if there is only atomic columns and any non-(numeric/logical/complex) column, applying as.vector to factors and format to other non-character columns. Otherwise the usual coercion hierarchy (logical < integer < double < complex) will be used, e.g. all-logical data frames will be coerced to a logical matrix, mixed logical-integer will give an integer matrix, etc.

On the other hand, data.matrix() will always return a numeric matrix (see ?data.matrix):

Return the matrix obtained by converting all the variables in a data frame to numeric mode and then binding them together as the columns of a matrix. Factors and ordered factors are replaced by their internal codes. […] Character columns are first converted to factors and then to integers.

We can illustrate and compare the mechanics of these functions using a concrete example. as.matrix() makes it possible to retrieve most of the original information from the data frame but leaves us with characters. To retrieve all information from data.matrix()’s output, we would need a lookup table for each column.

df_coltypes <- data.frame(
  a = c("a", "b"),
  b = c(TRUE, FALSE),
  c = c(1L, 0L),
  d = c(1.5, 2),
  e = factor(c("f1", "f2"))
)

as.matrix(df_coltypes)
#>      a   b       c   d     e   
#> [1,] "a" "TRUE"  "1" "1.5" "f1"
#> [2,] "b" "FALSE" "0" "2.0" "f2"
data.matrix(df_coltypes)
#>      a b c   d e
#> [1,] 1 1 1 1.5 1
#> [2,] 2 0 0 2.0 2
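
The "lookup table" idea can be sketched for the two non-numeric columns. Here the factor levels serve as the table; building the levels for column a with factor() is an assumption that mirrors the documented character-to-factor conversion:

```r
df_coltypes <- data.frame(
  a = c("a", "b"),
  b = c(TRUE, FALSE),
  c = c(1L, 0L),
  d = c(1.5, 2),
  e = factor(c("f1", "f2"))
)
m <- data.matrix(df_coltypes)

# Factor column: the codes index into the original levels.
levels(df_coltypes$e)[m[, "e"]]
#> [1] "f1" "f2"

# Character column: rebuild the levels the conversion would have used.
levels(factor(df_coltypes$a))[m[, "a"]]
#> [1] "a" "b"
```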