Evaluation

Learning objectives:

  • Learn evaluation basics
  • Learn about quosures and data mask
  • Understand tidy evaluation
library(rlang)
library(purrr)

A bit of a recap

  • Metaprogramming: separating our description of the action from the action itself
    • Separate the code from its evaluation.
  • Quasiquotation: combine code written by the function’s author with code (expressions) written by the function’s user.
    • Unquotation: gives the user the ability to evaluate parts of a quoted argument.
    • Evaluation: gives the developer the ability to evaluate quoted expressions in custom environments.

Non-standard evaluation (NSE): dealing with expressions as function inputs in base R.

Tidy evaluation: quasiquotation, quosures and data masks.

Evaluation basics

We use base::eval() to evaluate, run, or execute expressions. It requires two arguments:

  • expr: the object to evaluate, either an expression or a symbol.
  • envir: the environment in which to evaluate the expression, i.e. where to look up values. Defaults to the current environment (technically parent.frame(), the environment from which eval() was called).
sumexpr <- expr(x + y)
x <- 10
y <- 40
eval(sumexpr)
#> [1] 50
eval(sumexpr, envir = env(x = 1000, y = 10))
#> [1] 1010
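Note that envir can also be a list or data frame, in which case names are looked up there first and any missing names fall through to the enclosing environment (the enclos argument, which then defaults to the caller). A minimal base-R sketch, foreshadowing the data masks covered later:

```r
sumexpr <- quote(x + y)
x <- 10

# Names are looked up in the list first:
eval(sumexpr, envir = list(x = 1, y = 2))
#> [1] 3

# Missing names fall through to the enclosing environment (x = 10 here):
eval(sumexpr, envir = data.frame(y = 40))
#> [1] 50
```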

Application: reimplementing source()

What do we need?

  • Read the file being sourced.
  • Parse its expressions (think: do we need to quote them?)
  • Evaluate each expression, saving the result
  • Return the result of the last expression (invisibly)
source2 <- function(path, env = caller_env()) {
  file <- paste(readLines(path, warn = FALSE), collapse = "\n")
  exprs <- parse_exprs(file)

  res <- NULL
  for (i in seq_along(exprs)) {
    res <- eval(exprs[[i]], env)
  }

  invisible(res)
}

The real source is much more complex.
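A quick way to try source2() is to write a small script to a temporary file and evaluate it (the definition is repeated here so the sketch is self-contained):

```r
library(rlang)

source2 <- function(path, env = caller_env()) {
  file <- paste(readLines(path, warn = FALSE), collapse = "\n")
  exprs <- parse_exprs(file)

  res <- NULL
  for (i in seq_along(exprs)) {
    res <- eval(exprs[[i]], env)
  }

  invisible(res)
}

# A two-expression script: an assignment, then a value
tmp <- tempfile(fileext = ".R")
writeLines(c("x <- 2", "x * 21"), tmp)

print(source2(tmp))
#> [1] 42
x   # the assignment happened in the caller's environment
#> [1] 2
```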

Quosures (1)

eval() needs both the code and the environment to run the code in.

quosures are a data structure from rlang containing both the expression and the environment.

Quosure = quoting + closure, because it quotes the expression and encloses the environment.

Three ways to create quosures, in the upcoming slides.

Quosures (2)

Used mostly for learning: new_quosure() creates a quosure from its components (an expression and an environment).

q1 <- rlang::new_quosure(expr(x + y), 
                         env(x = 1, y = 10))

Quosures (3)

Used in the real world: enquo() and enquos(), to capture user-supplied expressions. They capture the expression together with the environment where it was written.

foo <- function(x) enquo(x)
quo_foo <- foo(a + b)
get_expr(quo_foo)
#> a + b
get_env(quo_foo)
#> <environment: R_GlobalEnv>

Quosures (4)

Almost never used: quo() and quos(), the quosure analogues of expr() and exprs().
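For completeness, a quick sketch of quo(): like expr() it quotes the developer's own code, but the result also carries the environment where it was created:

```r
library(rlang)

q <- quo(x + y)
class(q)
#> [1] "quosure" "formula"

# The quosure carries the environment where quo() was called
identical(get_env(q), current_env())
#> [1] TRUE
```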

Evaluation: eval_tidy()

Quosures exist so that you can pass a single object to rlang::eval_tidy(), instead of the expression-environment pair required by base::eval():

rlang::eval_tidy(q1)
#> [1] 11

Get the quosure components if you need them:

rlang::get_expr(q1)
#> x + y
rlang::get_env(q1)
#> <environment: 0x0000027140d5c6c8>

Or set them:

q1 <- set_env(q1, env(x = 3, y = 4))
eval_tidy(q1)
#> [1] 7

Quosures and …

Quosures are just a convenience, but they are essential when it comes to working with ..., because you can have different arguments from ... associated with different environments.

g <- function(...) {
  ## Creating our quosures from ...
  enquos(...)
}

createQuos <- function(...) {
  ## symbol from the function environment
  x <- 1
  g(..., f = x)
}
## symbol from the global environment
x <- 0
qs <- createQuos(global = x)
qs
#> <list_of<quosure>>
#> 
#> $global
#> <quosure>
#> expr: ^x
#> env:  global
#> 
#> $f
#> <quosure>
#> expr: ^x
#> env:  0x00000271438a6388

Other facts about quosures

Formulas were the predecessor and inspiration for quosures because they also capture an expression and an environment.

f <- ~runif(3)
str(f)
#> Class 'formula'  language ~runif(3)
#>   ..- attr(*, ".Environment")=<environment: R_GlobalEnv>

(This creates weird problems when functions return formulas, which may drag the entire global environment along with them. Have 10 GB worth of objects in there? They all get carried along!)

There was an early version of tidy evaluation with formulas, but there’s no easy way to implement quasiquotation with them.

They are actually call objects

q4 <- new_quosure(expr(x + y + z))
class(q4)
#> [1] "quosure" "formula"
is.call(q4)
#> [1] TRUE

with an attribute to store the environment

attr(q4, ".Environment")
#> <environment: R_GlobalEnv>

Nested quosures

With quasiquotation we can embed quosures in larger expressions.

q2 <- new_quosure(expr(x), env(x = 1))
q3 <- new_quosure(expr(x), env(x = 100))

nq <- expr(!!q2 + !!q3)

And evaluate them:

eval_tidy(nq)
#> [1] 101

For display it is better to use expr_print(x) (note the subtle color differences indicating that the different xs come from different environments).

expr_print(nq)
#> (^x) + (^x)
nq
#> (~x) + ~x

Data mask (1)

A data mask is a data frame (or list) supplied as the evaluation context: the evaluated code looks up names in the data frame first, before the quosure's environment.

Used in packages like dplyr and ggplot2.

Supply the data mask as a second argument to eval_tidy():

q1 <- new_quosure(expr(x * y), env(x = 100))
df <- data.frame(y = 1:10)

eval_tidy(q1, df)
#>  [1]  100  200  300  400  500  600  700  800  900 1000

Data mask (2)

Reimplementing base::with() for the purposes of working with a data.frame:

with2 <- function(data, expr) {
  expr <- enquo(expr)
  eval_tidy(expr, data)
}

But any objects that are not part of the data mask must exist in the environment:

x <- 100
with(df, x * y)
#>  [1]  100  200  300  400  500  600  700  800  900 1000
with2(df, x * y)
#>  [1]  100  200  300  400  500  600  700  800  900 1000

Data mask (3)

Also doable with base::eval() instead of rlang::eval_tidy(), but a strict base implementation needs base::substitute() instead of enquo() (just as substitute() stands in for enexpr()), and we must specify the environment explicitly.

with3 <- function(data, expr) {
  expr <- substitute(expr)
  eval(expr, data, caller_env())
}
with3(df, x*y)
#>  [1]  100  200  300  400  500  600  700  800  900 1000

Reference ambiguity

When you write

with2(df, x * y)

do you mean x (and y) as columns in df, or as objects in the environment?

(devtools::check() finds these and complains about ‘no visible binding for global variable’.)

An object value can come from the env or from the data

q1 <- new_quosure(expr(x * y + x), env = env(x = 1))
df <- data.frame(y = 1:5, 
                 x = 10)

eval_tidy(q1, df)
#> [1] 20 30 40 50 60

rlang pronouns: .data$ and .env$

rlang provides pronouns:

  • .data$x: take x from the data mask
  • .env$x: take x from the environment
q1 <- new_quosure(expr(.data$x * y + .env$x), env = env(x = 1))
eval_tidy(q1, df)
#> [1] 11 21 31 41 51

Application: reimplementing base::subset()

base::subset() works like dplyr::filter(): it selects rows of a data frame given an expression.

What do we need?

  • Quote the expression to filter
  • Figure out which rows in the data frame pass the filter
  • Subset and return the data frame

subset2()

subset2 <- function(data, rows) {
  rows <- enquo(rows)
  rows_val <- eval_tidy(rows, data)
  stopifnot("rows argument does not evaluate to a subsetting condition" = 
    is.logical(rows_val) && length(rows_val)==nrow(data))

  data[rows_val, , drop = FALSE]
}
sample_df <- data.frame(a = 1:5, b = 5:1, c = c(5, 3, 2, 4, 1))

# Shorthand for sample_df[sample_df$b == sample_df$c, ]
subset2(sample_df, b == c)
#>   a b c
#> 1 1 5 5
#> 5 5 1 1
subset2(sample_df, 42)
#> Error in `subset2()`:
#> ! rows argument does not evaluate to a subsetting condition
subset2(sample_df, TRUE)
#> Error in `subset2()`:
#> ! rows argument does not evaluate to a subsetting condition

select() columns (1)

Let us implement one more feature of base::subset(): selecting columns via its select argument:

df <- data.frame(a = 1, b = 2, c = 3, d = 4, e = 5)
subset(df, select = b:d)
#>   b c d
#> 1 2 3 4

select2() columns (2)

vars <- as.list(set_names(seq_along(df), names(df)))
str(vars)
#> List of 5
#>  $ a: int 1
#>  $ b: int 2
#>  $ c: int 3
#>  $ d: int 4
#>  $ e: int 5

select2() columns (3)

(base trivia: why do you need drop = FALSE?)

select2 <- function(data, ...) {
  dots <- enquos(...)

  vars <- as.list(set_names(seq_along(data), names(data)))
  cols <- unlist(map(dots, eval_tidy, vars))

  data[, cols, drop = FALSE]
}
select2(df, b:d)
#>   b c d
#> 1 2 3 4
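(A sketch answering the trivia: with a single index, [ drops the data-frame class by default and returns a bare vector; drop = FALSE keeps the one-column data frame.)

```r
df <- data.frame(a = 1, b = 2, c = 3, d = 4, e = 5)

df[, "b"]                  # drops to a plain vector
#> [1] 2

df[, "b", drop = FALSE]    # stays a one-column data frame
#>   b
#> 1 2
```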

tidyselect helpers: the tidyselect package provides helper functions (e.g. starts_with(), last_col()) that implement this kind of column selection robustly.

Using tidy evaluation

Most of the time we won't call eval_tidy() ourselves, but we often end up calling a function that uses it (so we are both developer AND user).

Use case: resample and subset

We have a function that resamples a dataset:

resample <- function(df, n) {
  idx <- sample(nrow(df), n, replace = TRUE)
  df[idx, , drop = FALSE]
}
resample(sample_df, 10)
#>     a b c
#> 2   2 4 3
#> 2.1 2 4 3
#> 5   5 1 1
#> 3   3 3 2
#> 4   4 2 4
#> 2.2 2 4 3
#> 2.3 2 4 3
#> 5.1 5 1 1
#> 2.4 2 4 3
#> 3.1 3 3 2

But we also want to subset, so we want a function that allows us to resample and subset (with subset2()) in a single step.

Resample and subset (1)

First attempt:

subsample <- function(df, cond, n = nrow(df)) {
  df <- subset2(df, cond)
  resample(df, n)
}
subsample(sample_df, b == c, 3)
#> Error:
#> ! object 'b' not found

What happened?

Resample and subset (2)

subsample <- function(df, cond, n = nrow(df)) {
  df <- subset2(df, cond)
  resample(df, n)
}

subsample() doesn’t quote any arguments, so cond is evaluated normally. The caller environment does not have b and c (or, worse, if it does, the intent of the condition goes out the window.)

b <- 5
c <- c(2,3,4)
subsample(sample_df, b == c, 3)
#> Error in `subset2()`:
#> ! rows argument does not evaluate to a subsetting condition

Resample and subset (3)

So we have to quote cond and unquote it when we pass it to subset2()

subsample <- function(df, cond, n = nrow(df)) {
  cond <- enquo(cond)
  df <- subset2(df, !!cond)
  resample(df, n)
}
subsample(sample_df, b == c, 3)
#>     a b c
#> 5   5 1 1
#> 1   1 5 5
#> 1.1 1 5 5

Ambiguity with “known” column names

Be careful! Potential ambiguity:

threshold_x <- function(df, val) {
  subset2(df, x >= val)
}

What would happen if x exists in the calling environment but doesn’t exist in df? Or if val also exists in df?

So, as developers of threshold_x() and users of subset2(), we have to add disambiguating pronouns:

threshold_x <- function(df, val) {
  subset2(df, .data$x >= .env$val)
}

Tidy eval, summarized

Thresholding on steroids:

threshold_expr <- function(df, expr, val) {
  expr <- enquo(expr)
  subset2(df, !!expr >= !!val)
}

Note that expr cannot be evaluated in the data mask alone: the data mask does not include functions or operators like + or ==; those are found through the quosure’s environment.
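A usage sketch of threshold_expr(), repeating subset2() from earlier (without its validation) so the example is self-contained:

```r
library(rlang)

subset2 <- function(data, rows) {
  rows <- enquo(rows)
  rows_val <- eval_tidy(rows, data)
  data[rows_val, , drop = FALSE]
}

threshold_expr <- function(df, expr, val) {
  expr <- enquo(expr)
  subset2(df, !!expr >= !!val)
}

sample_df <- data.frame(a = 1:5, b = 5:1, c = c(5, 3, 2, 4, 1))

threshold_expr(sample_df, b, 3)
#>   a b c
#> 1 1 5 5
#> 2 2 4 3
#> 3 3 3 2

# Arbitrary expressions work too:
threshold_expr(sample_df, a * b, 6)
#>   a b c
#> 2 2 4 3
#> 3 3 3 2
#> 4 4 2 4
```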

Just remember:

As a general rule of thumb, as a function author it’s your responsibility to avoid ambiguity with any expressions that you create; it’s the user’s responsibility to avoid ambiguity in expressions that they create.

Base evaluation: substitute()

subset_base <- function(data, rows) {
  rows <- substitute(rows)      # instead of enquo()
  rows_val <- eval(rows, data,  # instead of 
                caller_env())   #  eval_tidy()
  stopifnot(is.logical(rows_val))

  data[rows_val, , drop = FALSE]
}

The base world forces us to evaluate in the caller environment rather than in the environment where the expression was written (as a quosure would): a loss of flexibility.
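The loss shows up as soon as subset_base() is wrapped in another function: the condition is no longer evaluated where the user wrote it. A sketch of the failure (the wrapper name subsample_base is made up for illustration; subset_base is repeated so the example is self-contained):

```r
library(rlang)

subset_base <- function(data, rows) {
  rows <- substitute(rows)
  rows_val <- eval(rows, data, caller_env())
  data[rows_val, , drop = FALSE]
}

sample_df <- data.frame(a = 1:5, b = 5:1, c = c(5, 3, 2, 4, 1))

# Works at the top level ...
subset_base(sample_df, b == c)
#>   a b c
#> 1 1 5 5
#> 5 5 1 1

# ... but not when wrapped: substitute() captures the symbol `cond`,
# and evaluating `cond` forces the promise in an environment where
# b and c don't exist (or, worse, mean something else).
subsample_base <- function(df, cond) {
  subset_base(df, cond)
}
try(subsample_base(sample_df, b == c))
```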

Base evaluation: match.call()

match.call() captures the entire call. A number of base functions (and packages that stick to base R, like survey) use it:

lm2 <- function(formula, data) {
  lm(formula, data)
}
lm2(mpg ~ disp, mtcars)
#> 
#> Call:
#> lm(formula = formula, data = data)
#> 
#> Coefficients:
#> (Intercept)         disp  
#>    29.59985     -0.04122
lm2(mpg ~ disp, mtcars)$call
#> lm(formula = formula, data = data)

Fixing match.call()

lm() itself cannot do better: the call it received really was lm(formula, data), so it cannot recover the original arguments. To overcome this, we capture the arguments as expressions, build the call to lm() using unquoting, and then evaluate that call.

lm3 <- function(formula, data, env = caller_env()) {
  formula <- enexpr(formula)
  data <- enexpr(data)

  lm_call <- expr(lm(!!formula, data = !!data))
  expr_print(lm_call)
  eval(lm_call, env)
}

lm3(mpg ~ disp, mtcars)
#> lm(mpg ~ disp, data = mtcars)
#> 
#> Call:
#> lm(formula = mpg ~ disp, data = mtcars)
#> 
#> Coefficients:
#> (Intercept)         disp  
#>    29.59985     -0.04122