Functionals

Learning objectives:

  • Define functionals.
  • Use the purrr::map() family of functionals.
  • Use the purrr::walk() family of functionals.
  • Use the purrr::reduce() and purrr::accumulate() family of functionals.
  • Use purrr::safely() and purrr::possibly() to deal with failure.

9.1. Introduction

9.2. map()

9.3. purrr style

9.4. map_ variants

9.5. reduce() and accumulate() family of functions

  • Some functions that weren’t covered

What are functionals

Introduction

Functionals are functions that take a function as input and return a vector as output. Functionals that you have probably used before include apply(), lapply() and tapply().

  • alternatives to loops

  • rule of thumb: prefer a functional to a for loop, a for loop to while, and while to repeat

Benefits

  • encourages function logic to be separated from iteration logic

  • can collapse into vectors/data frames easily
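As a small illustration of separating function logic from iteration logic, the sketch below computes column means of mtcars twice: once with a for loop and once with the base R functional vapply(). The variable names means_loop and means_fun are made up for this example.

```r
# For-loop version: the iteration bookkeeping is mixed with the computation.
means_loop <- numeric(ncol(mtcars))
names(means_loop) <- names(mtcars)
for (i in seq_along(mtcars)) {
  means_loop[[i]] <- mean(mtcars[[i]])
}

# Functional version: vapply() handles the iteration; we only supply mean().
means_fun <- vapply(mtcars, mean, numeric(1))

# Both approaches give the same named numeric vector.
all.equal(means_loop, means_fun)
```

The functional version also states up front (via numeric(1)) what each call must return.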

Map

map() has two main arguments, a vector and a function. It applies the function to each element of the vector and returns a list. Additional arguments can be passed along to the function via `...`.

simple_map <- function(x, f, ...) {
  out <- vector("list", length(x))
  for (i in seq_along(x)) {
    out[[i]] <- f(x[[i]], ...)
  }
  out
}
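A quick usage sketch for simple_map(), assuming the definition above (repeated here so the example runs on its own). The argument name times is hypothetical and only illustrates how extra arguments travel through `...`.

```r
simple_map <- function(x, f, ...) {
  out <- vector("list", length(x))
  for (i in seq_along(x)) {
    out[[i]] <- f(x[[i]], ...)
  }
  out
}

# `times = 2` is forwarded to the function through `...`
doubled <- simple_map(1:3, function(x, times) x * times, times = 2)
str(doubled)

# Same result as base R's lapply()
identical(doubled, lapply(1:3, function(x) x * 2))
```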

Benefit of using the map function in purrr

  • purrr::map() is equivalent to lapply()

  • returns a list and is the most general

  • the length of the input == the length of the output

  • map() is more flexible, with additional arguments allowed

  • map() has a host of extensions

Atomic vectors

  • has 4 variants to return atomic vectors
    • map_chr()
    • map_dbl()
    • map_int()
    • map_lgl()
triple <- function(x) x * 3
map(.x=1:3, .f=triple)
#> [[1]]
#> [1] 3
#> 
#> [[2]]
#> [1] 6
#> 
#> [[3]]
#> [1] 9
map_dbl(.x=1:3, .f=triple)
#> [1] 3 6 9
map_lgl(.x=c(1, NA, 3), .f=is.na)
#> [1] FALSE  TRUE FALSE
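The typed variants are strict about what each call returns. The sketch below wraps two failing calls in tryCatch() so the script keeps running; the names bad_length and bad_type are made up, and the exact error messages depend on the purrr version, so they are not shown.

```r
library(purrr)

# Works: every call returns a single double
map_dbl(1:3, function(x) x * 1.5)
#> [1] 1.5 3.0 4.5

# Fails: each call returns a vector of length 2
bad_length <- tryCatch(
  map_dbl(1:3, function(x) c(x, x)),
  error = function(e) "error"
)

# Fails: 1.5 cannot be converted to an integer without losing information
bad_type <- tryCatch(
  map_int(1:3, function(x) x * 1.5),
  error = function(e) "error"
)

c(bad_length, bad_type)
```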

Anonymous functions and shortcuts

Anonymous functions

map_dbl(.x=mtcars, .f=function(x) mean(x, na.rm = TRUE)) |> 
  head()
#>        mpg        cyl       disp         hp       drat         wt 
#>  20.090625   6.187500 230.721875 146.687500   3.596563   3.217250
  • the formula shortcut uses a “twiddle” (~) to define an anonymous function
  • inside the formula, .x refers to the current element of the input: map(.x = ..., .f = ~ ...)
map_dbl(.x=mtcars,  .f=~mean(.x, na.rm = TRUE))
  • can be simplified further as
map_dbl(.x=mtcars, .f=mean, na.rm = TRUE)
#>        mpg        cyl       disp         hp       drat         wt       qsec 
#>  20.090625   6.187500 230.721875 146.687500   3.596563   3.217250  17.848750 
#>         vs         am       gear       carb 
#>   0.437500   0.406250   3.687500   2.812500
  • what happens when we try a handful of variants of the task at hand? (how many unique values are there for each variable?)

Note that .x is the name of the first argument in map() (.f is the name of the second argument).

# the task
map_dbl(mtcars, function(x) length(unique(x)))
#>  mpg  cyl disp   hp drat   wt qsec   vs   am gear carb 
#>   25    3   27   22   22   29   30    2    2    3    6
map_dbl(mtcars, function(unicorn) length(unique(unicorn)))
#>  mpg  cyl disp   hp drat   wt qsec   vs   am gear carb 
#>   25    3   27   22   22   29   30    2    2    3    6
map_dbl(mtcars, ~length(unique(.x)))
#>  mpg  cyl disp   hp drat   wt qsec   vs   am gear carb 
#>   25    3   27   22   22   29   30    2    2    3    6
map_dbl(mtcars, ~length(unique(..1)))
#>  mpg  cyl disp   hp drat   wt qsec   vs   am gear carb 
#>   25    3   27   22   22   29   30    2    2    3    6
map_dbl(mtcars, ~length(unique(.)))
#>  mpg  cyl disp   hp drat   wt qsec   vs   am gear carb 
#>   25    3   27   22   22   29   30    2    2    3    6
# not the task
map_dbl(mtcars, length)
#>  mpg  cyl disp   hp drat   wt qsec   vs   am gear carb 
#>   32   32   32   32   32   32   32   32   32   32   32
# length(unique) is 1 (the length of a function object), so this is the same as map_dbl(mtcars, 1)
map_dbl(mtcars, length(unique))
#>    mpg    cyl   disp     hp   drat     wt   qsec     vs     am   gear   carb 
#>  21.00   6.00 160.00 110.00   3.90   2.62  16.46   0.00   1.00   4.00   4.00
map_dbl(mtcars, 1)
#>    mpg    cyl   disp     hp   drat     wt   qsec     vs     am   gear   carb 
#>  21.00   6.00 160.00 110.00   3.90   2.62  16.46   0.00   1.00   4.00   4.00
# error
map_dbl(mtcars, length(unique()))
#> Error in unique.default(): argument "x" is missing, with no default
map_dbl(mtcars, ~length(unique(x)))
#> Error in `map_dbl()`:
#> ℹ In index: 1.
#> ℹ With name: mpg.
#> Caused by error in `.f()`:
#> ! object 'x' not found
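Since these notes already use the native pipe, the R 4.1+ lambda shorthand \(x) is also available. A minimal sketch, assuming R >= 4.1 and purrr are installed; the variable names are made up for the comparison.

```r
library(purrr)

# Base R lambda shorthand (R >= 4.1)
n_unique_lambda <- map_dbl(mtcars, \(x) length(unique(x)))

# purrr formula shortcut
n_unique_formula <- map_dbl(mtcars, ~ length(unique(.x)))

# Both spellings describe the same anonymous function
identical(n_unique_lambda, n_unique_formula)
```

The \(x) form lets you pick the argument name, which avoids the "object 'x' not found" mistake shown above.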

Modify

Sometimes we want the output to have the same type as the input. In that case we can use modify() rather than map().

df <- data.frame(x=1:3,y=6:4)

map(df, .f=~.x*3)
#> $x
#> [1] 3 6 9
#> 
#> $y
#> [1] 18 15 12
modify(.x=df,.f=~.x*3)
#>   x  y
#> 1 3 18
#> 2 6 15
#> 3 9 12

Note that modify() always returns the same type as its input (which is not necessarily true of map()). Additionally, modify() does not change df in place; it returns a modified copy.

df
#>   x y
#> 1 1 6
#> 2 2 5
#> 3 3 4
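To keep the modified values, assign the result back. A minimal sketch using the same toy data frame:

```r
library(purrr)

df <- data.frame(x = 1:3, y = 6:4)

# modify() returns a new data frame; assign it to keep the change
df <- modify(df, function(col) col * 3)

is.data.frame(df)
df$x
```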

purrr style

mtcars |> 
  map(head, 20) |> # pull first 20 of each column
  map_dbl(mean) |> # mean of each vector
  head()
#>       mpg       cyl      disp        hp      drat        wt 
#>  20.13000   6.20000 233.93000 136.20000   3.54500   3.39845

An example from tidytuesday

tt <- tidytuesdayR::tt_load("2020-06-30")

# filter data & exclude columns with lots of NAs
list_df <- 
  map(
    .x = tt[1:3], 
    .f = 
      ~ .x |> 
      filter(issue <= 152 | issue > 200) |> 
      mutate(timeframe = ifelse(issue <= 152, "first 5 years", "last 5 years")) |> 
      select_if(~mean(is.na(.x)) < 0.2) 
  )


# write to global environment
iwalk(
  .x = list_df,
  .f = ~ assign(x = .y, value = .x, envir = globalenv())
)

map_*() variants

There are many variants

map2_*()

  • raise each value .x to the power of 2
map_dbl(
  .x = 1:5, 
  .f = function(x) x ^ 2
)
#> [1]  1  4  9 16 25
  • raise each value .x to the power of another value .y
map2_dbl(
  .x = 1:5, 
  .y = 2:6, 
  .f = ~ (.x ^ .y)
)
#> [1]     1     8    81  1024 15625
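map2() iterates over .x and .y in parallel, so their lengths need to agree, except that a length-1 input is recycled. A small sketch (the name mismatch is made up, and the error is caught so the script continues):

```r
library(purrr)

# A length-1 .y is recycled to match .x
squares <- map2_dbl(1:3, 2, function(x, y) x ^ y)
squares
#> [1] 1 4 9

# Lengths 3 and 2 cannot be matched up, so this is an error
mismatch <- tryCatch(
  map2_dbl(1:3, 1:2, function(x, y) x ^ y),
  error = function(e) "error"
)
mismatch
```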

The benefits of using the map family over the apply family of functions

  • It is written in C
  • It preserves names
  • We always know the return value type
  • We can apply the function for multiple input values
  • We can pass additional arguments into the function

walk()

  • We use walk() when we want to call a function for its side effect(s) rather than its return value, like generating plots, write.csv(), or ggsave(). If you don’t want a return value, map() will print more info than you may want.
map(1:3, ~cat(.x, "\n"))
#> 1 
#> 2 
#> 3
#> [[1]]
#> NULL
#> 
#> [[2]]
#> NULL
#> 
#> [[3]]
#> NULL
  • for these cases, use walk() instead
walk(1:3, ~cat(.x, "\n"))
#> 1 
#> 2 
#> 3

cat() does have a return value (NULL); it is just returned invisibly.

cat("hello")
#> hello
(cat("hello"))
#> hello
#> NULL

We can use pwalk() to save a list of plots to disk. Note that the “p” in pwalk() means that we have more than 1 (or 2) variables to pipe into the function. Also note that the name of the first argument in all of the “p” functions is now .l (instead of .x).

plots <- mtcars |>  
  split(mtcars$cyl) |>  
  map(~ggplot(.x, aes(mpg,wt)) +
        geom_point())

paths <- stringr::str_c(names(plots), '.png')

pwalk(.l = list(paths,plots), .f = ggsave, path = tempdir())
#> Saving 7 x 5 in image
#> Saving 7 x 5 in image
#> Saving 7 x 5 in image
pmap(.l = list(paths,plots), .f = ggsave, path = tempdir())
#> Saving 7 x 5 in image
#> Saving 7 x 5 in image
#> Saving 7 x 5 in image
#> [[1]]
#> [1] "C:\\Users\\jonth\\AppData\\Local\\Temp\\RtmpSMVSZz/4.png"
#> 
#> [[2]]
#> [1] "C:\\Users\\jonth\\AppData\\Local\\Temp\\RtmpSMVSZz/6.png"
#> 
#> [[3]]
#> [1] "C:\\Users\\jonth\\AppData\\Local\\Temp\\RtmpSMVSZz/8.png"
  • walk(), walk2() and pwalk() all invisibly return .x, the first argument. This makes them suitable for use in the middle of pipelines.

  • note: it can look as if walk() returns something other than .x (or .l), but it does return its first argument invisibly. Hadley says:

purrr provides the walk family of functions that ignore the return values of the .f and instead return .x invisibly.

In the first cat() example, the NULL values printed by map() are the return values of cat(). walk() discards those and returns .x (here 1:3) invisibly, which is why nothing extra is printed.
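One way to check what walk() returns is to capture its value. A minimal sketch (the name out is made up):

```r
library(purrr)

# cat() prints as a side effect; walk() hands back its input
out <- walk(1:3, function(x) cat(x, "\n"))

# The captured value is the original input, not the NULLs from cat()
identical(out, 1:3)
```

Because the input comes back unchanged, a walk() step can sit in the middle of a pipeline and pass the data along to the next step.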

imap()

  • imap() is like map2() except that .y is derived from names(.x) if named or seq_along(.x) if not.

  • These two produce the same result

imap_chr(.x = mtcars, 
         .f = ~ paste(.y, "has a mean of", round(mean(.x), 1))) |> 
head()
#>                        mpg                        cyl 
#>   "mpg has a mean of 20.1"    "cyl has a mean of 6.2" 
#>                       disp                         hp 
#> "disp has a mean of 230.7"   "hp has a mean of 146.7" 
#>                       drat                         wt 
#>   "drat has a mean of 3.6"     "wt has a mean of 3.2"
map2_chr(.x = mtcars, 
         .y = names(mtcars),
         .f = ~ paste(.y, "has a mean of", round(mean(.x), 1))) |> 
head()
#>                        mpg                        cyl 
#>   "mpg has a mean of 20.1"    "cyl has a mean of 6.2" 
#>                       disp                         hp 
#> "disp has a mean of 230.7"   "hp has a mean of 146.7" 
#>                       drat                         wt 
#>   "drat has a mean of 3.6"     "wt has a mean of 3.2"
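For an unnamed input, the second argument is the position rather than a name. A small sketch with a made-up character vector:

```r
library(purrr)

# No names on the input, so the second argument is the index 1, 2, 3
labelled <- imap_chr(
  c("a", "b", "c"),
  function(value, index) paste0(index, ": ", value)
)
labelled
#> [1] "1: a" "2: b" "3: c"
```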

pmap()

  • you can pass a named list or dataframe as arguments to a function

  • for example runif() has the parameters n, min and max

params <- tibble::tribble(
  ~ n, ~ min, ~ max,
   1L,     1,    10,
   2L,    10,   100,
   3L,   100,  1000
)

pmap(params, runif)
#> [[1]]
#> [1] 9.52234
#> 
#> [[2]]
#> [1] 49.53679 46.47017
#> 
#> [[3]]
#> [1] 488.8100 796.6801 282.7772
  • could also be
list(
  n = 1:3, 
  min = 10 ^ (0:2), 
  max = 10 ^ (1:3)
) |> 
pmap(runif)
#> [[1]]
#> [1] 5.246834
#> 
#> [[2]]
#> [1] 73.39068 30.57879
#> 
#> [[3]]
#> [1] 169.6667 950.1126 820.9357
  • I like to use expand_grid() when I want all possible parameter combinations.
expand_grid(n = 1:3,
            min = 10 ^ (0:1),
            max = 10 ^ (1:2))
#> # A tibble: 12 × 3
#>        n   min   max
#>    <int> <dbl> <dbl>
#>  1     1     1    10
#>  2     1     1   100
#>  3     1    10    10
#>  4     1    10   100
#>  5     2     1    10
#>  6     2     1   100
#>  7     2    10    10
#>  8     2    10   100
#>  9     3     1    10
#> 10     3     1   100
#> 11     3    10    10
#> 12     3    10   100
expand_grid(n = 1:3,
            min = 10 ^ (0:1),
            max = 10 ^ (1:2)) |> 
pmap(runif)
#> [[1]]
#> [1] 9.474848
#> 
#> [[2]]
#> [1] 10.63548
#> 
#> [[3]]
#> [1] 10
#> 
#> [[4]]
#> [1] 92.44257
#> 
#> [[5]]
#> [1] 7.165047 6.201947
#> 
#> [[6]]
#> [1] 64.79074 16.54110
#> 
#> [[7]]
#> [1] 10 10
#> 
#> [[8]]
#> [1] 62.12314 52.31713
#> 
#> [[9]]
#> [1] 6.806213 5.541865 8.580469
#> 
#> [[10]]
#> [1]  7.10806 51.56879 85.70133
#> 
#> [[11]]
#> [1] 10 10 10
#> 
#> [[12]]
#> [1] 74.48871 11.65879 58.31278
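When the list (or data frame) is named, pmap() matches elements to the function's arguments by name, so their order does not matter. A sketch with the elements deliberately shuffled; the name args is made up:

```r
library(purrr)

# Elements are in a different order than runif(n, min, max) expects,
# but they are matched by name
args <- list(
  max = c(10, 100),
  n   = c(1L, 2L),
  min = c(1, 10)
)

draws <- pmap(args, runif)

# One draw for the first parameter set, two for the second
lengths(draws)
```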

reduce() family

The reduce() function is a powerful functional that allows you to abstract away from a sequence of functions that are applied in a fixed direction.

reduce() takes a vector as its first argument, a function as its second argument, and an optional .init argument last. It will then apply the function repeatedly to the vector until there is only a single element left.


Let me really quickly demonstrate reduce() in action.

Say you wanted to add up the numbers 1 through 5 using only the plus operator +. You could do something like:

1 + 2 + 3 + 4 + 5
#> [1] 15

Which is the same as:

reduce(1:5, `+`)
#> [1] 15

And if you want the start value to be something other than the first element of the vector, pass that value to the .init argument:

identical(
  0.5 + 1 + 2 + 3 + 4 + 5,
  reduce(1:5, `+`, .init = 0.5)
)
#> [1] TRUE
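The behaviour described above can be sketched in a few lines of base R. This is a simplified stand-in written for these notes, not purrr's actual implementation; simple_reduce() and its init argument are names chosen for the illustration.

```r
simple_reduce <- function(x, f, init) {
  # If a start value is given, put it in front of the input
  if (!missing(init)) {
    x <- c(list(init), as.list(x))
  }
  if (length(x) == 0) {
    stop("`x` is empty and no `init` was supplied")
  }
  # Carry a running result forward, combining it with each later element
  out <- x[[1]]
  for (i in seq_along(x)[-1]) {
    out <- f(out, x[[i]])
  }
  out
}

simple_reduce(1:5, `+`)
#> [1] 15
simple_reduce(1:5, `+`, init = 0.5)
#> [1] 15.5
# With a start value, an empty input is fine: the start value is returned
simple_reduce(integer(), `+`, init = 0)
#> [1] 0
```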

ggplot2 example with reduce

ggplot(mtcars, aes(hp, mpg)) + 
  geom_point(size = 8, alpha = .5, color = "yellow") +
  geom_point(size = 4, alpha = .5, color = "red") +
  geom_point(size = 2, alpha = .5, color = "blue")

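Let us use the reduce2() function. reduce2() iterates over two vectors, so the function receives three arguments: the accumulated value (..1, which starts as the .init value), the current element of the first vector (..2), and the current element of the second vector (..3).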

reduce2(
  c(8, 4, 2),
  c("yellow", "red", "blue"),
  ~ ..1 + geom_point(size = ..2, alpha = .5, color = ..3),
  .init = ggplot(mtcars, aes(hp, mpg))
)
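A plot-free sketch of the same three-argument pattern, using strings. Without .init, the first element of the first vector is the starting value, so the second vector must be one element shorter. The argument names acc, x and sep are made up.

```r
library(purrr)

# acc: accumulated string, x: next letter, sep: next separator
joined <- reduce2(
  c("a", "b", "c", "d"),
  c("-", ".", "-"),
  function(acc, x, sep) paste(acc, x, sep = sep)
)
joined
#> [1] "a-b.c-d"
```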

# A list of three tibbles that share a name column
df <- list(
  age = tibble(name = 'john', age = 30),
  sex = tibble(name = c('john', 'mary'), sex = c('M', 'F')),
  trt = tibble(name = 'Mary', treatment = 'A')
)

df
#> $age
#> # A tibble: 1 × 2
#>   name    age
#>   <chr> <dbl>
#> 1 john     30
#> 
#> $sex
#> # A tibble: 2 × 2
#>   name  sex
#>   <chr> <chr>
#> 1 john  M
#> 2 mary  F
#> 
#> $trt
#> # A tibble: 1 × 2
#>   name  treatment
#>   <chr> <chr>
#> 1 Mary  A
# 'Mary' and 'mary' are different join keys, so both rows are kept
df |> reduce(.f = full_join)
#> Joining with `by = join_by(name)`
#> Joining with `by = join_by(name)`
#> # A tibble: 3 × 4
#>   name    age sex   treatment
#>   <chr> <dbl> <chr> <chr>
#> 1 john     30 M     NA
#> 2 mary     NA F     NA
#> 3 Mary     NA NA    A
reduce(.x = df,.f = full_join)
#> Joining with `by = join_by(name)`
#> Joining with `by = join_by(name)`
#> # A tibble: 3 × 4
#>   name    age sex   treatment
#>   <chr> <dbl> <chr> <chr>
#> 1 john     30 M     NA
#> 2 mary     NA F     NA
#> 3 Mary     NA NA    A
  • to see all intermediate steps, use accumulate()
set.seed(1234)
accumulate(1:5, `+`)
#> [1]  1  3  6 10 15
accumulate2(
  c(8, 4, 2),
  c("yellow", "red", "blue"),
  ~ ..1 + geom_point(size = ..2, alpha = .5, color = ..3),
  .init = ggplot(mtcars, aes(hp, mpg))
)
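#> (output: a list of four ggplot objects, [[1]] to [[4]]: the initial empty plot, then the plot after each added layer; the rendered plots are not reproduced here)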
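A numeric sketch of accumulate() that is easier to inspect than a list of plots. With `+` it gives the same values as cumsum(), and supplying .init adds the start value as an extra first element. The variable names are made up.

```r
library(purrr)

running <- accumulate(1:5, `+`)
# Same values as the base R cumulative sum
all.equal(running, cumsum(1:5))

# With .init, the start value becomes the first element of the result
with_init <- accumulate(1:3, `+`, .init = 0)
with_init
#> [1] 0 1 3 6
```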

map_df*() variants

  • map_dfr() = row bind the results

  • map_dfc() = column bind the results

  • Note that map_dfr() has been superseded by map() |> list_rbind() and map_dfc() has been superseded by map() |> list_cbind()

col_stats <- function(n) {
  head(mtcars, n) |> 
    summarise_all(mean) |> 
    mutate_all(floor) |> 
    mutate(n = paste("N =", n))
}

map((1:2) * 10, col_stats)
#> [[1]]
#>   mpg cyl disp  hp drat wt qsec vs am gear carb      n
#> 1  20   5  208 122    3  3   18  0  0    3    2 N = 10
#> 
#> [[2]]
#>   mpg cyl disp  hp drat wt qsec vs am gear carb      n
#> 1  20   6  233 136    3  3   18  0  0    3    2 N = 20
map_dfr((1:2) * 10, col_stats)
#>   mpg cyl disp  hp drat wt qsec vs am gear carb      n
#> 1  20   5  208 122    3  3   18  0  0    3    2 N = 10
#> 2  20   6  233 136    3  3   18  0  0    3    2 N = 20
map((1:2) * 10, col_stats) |> list_rbind()
#>   mpg cyl disp  hp drat wt qsec vs am gear carb      n
#> 1  20   5  208 122    3  3   18  0  0    3    2 N = 10
#> 2  20   6  233 136    3  3   18  0  0    3    2 N = 20
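list_rbind() can also record which list element each row came from through its names_to argument. A minimal sketch, assuming purrr >= 1.0; the data and the column name id are made up.

```r
library(purrr)

# A named input gives a named list of one-row data frames
pieces <- map(c(a = 1, b = 2), function(v) data.frame(value = v * 10))

# names_to stores the list names in a new column
combined <- list_rbind(pieces, names_to = "id")
combined
```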

pluck()

  • pluck() will pull a single element from a list

I like the example from the book because the starting object is not particularly easy to work with (as many JSON objects might not be).

my_list <- list(
  list(-1, x = 1, y = c(2), z = "a"),
  list(-2, x = 4, y = c(5, 6), z = "b"),
  list(-3, x = 8, y = c(9, 10, 11))
)
my_list
#> [[1]]
#> [[1]][[1]]
#> [1] -1
#> 
#> [[1]]$x
#> [1] 1
#> 
#> [[1]]$y
#> [1] 2
#> 
#> [[1]]$z
#> [1] "a"
#> 
#> 
#> [[2]]
#> [[2]][[1]]
#> [1] -2
#> 
#> [[2]]$x
#> [1] 4
#> 
#> [[2]]$y
#> [1] 5 6
#> 
#> [[2]]$z
#> [1] "b"
#> 
#> 
#> [[3]]
#> [[3]][[1]]
#> [1] -3
#> 
#> [[3]]$x
#> [1] 8
#> 
#> [[3]]$y
#> [1]  9 10 11

Notice that the “first element” means something different in standard pluck() versus mapped pluck().

pluck(my_list, 1)
#> [[1]]
#> [1] -1
#> 
#> $x
#> [1] 1
#> 
#> $y
#> [1] 2
#> 
#> $z
#> [1] "a"
map(my_list, pluck, 1)
#> [[1]]
#> [1] -1
#> 
#> [[2]]
#> [1] -2
#> 
#> [[3]]
#> [1] -3
map_dbl(my_list, pluck, 1)
#> [1] -1 -2 -3

The map() functions also have shortcuts for extracting elements from vectors (powered by purrr::pluck()). Note that map(my_list, 3) is a shortcut for map(my_list, pluck, 3).

# Select by name
map_dbl(my_list, "x")
#> [1] 1 4 8
# Or by position
map_dbl(my_list, 1)
#> [1] -1 -2 -3
# Or by both
map_dbl(my_list, list("y", 1))
#> [1] 2 5 9
# You'll get an error if you try to retrieve an inside item that doesn't have 
# a consistent format and you want a numeric output
map_dbl(my_list, list("y"))
#> Error in `map_dbl()`:
#> ℹ In index: 2.
#> Caused by error:
#> ! Result must be length 1, not 2.
# You'll get an error if a component doesn't exist:
map_chr(my_list, "z")
#> Error in `map_chr()`:
#> ℹ In index: 3.
#> Caused by error:
#> ! Result must be length 1, not 0.

# Unless you supply a .default value
map_chr(my_list, "z", .default = NA)
#> [1] "a" "b" NA
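pluck() itself accepts several indices to reach into nested elements, as well as a .default for missing ones. A short sketch using the same my_list as above (repeated so the example runs on its own):

```r
library(purrr)

my_list <- list(
  list(-1, x = 1, y = c(2), z = "a"),
  list(-2, x = 4, y = c(5, 6), z = "b"),
  list(-3, x = 8, y = c(9, 10, 11))
)

# Second element, then "y", then the second value of y
pluck(my_list, 2, "y", 2)
#> [1] 6

# A missing component gives NULL rather than an error
pluck(my_list, 3, "z")
#> NULL

# ... unless a default is supplied
pluck(my_list, 3, "z", .default = NA)
#> [1] NA
```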

Not covered: flatten()

  • flatten() removes one level of hierarchy from a list; the typed variants such as flatten_int() return an atomic vector.
my_list <-
  list(
    a = 1:3,
    b = list(1:3)
  )

my_list
#> $a
#> [1] 1 2 3
#> 
#> $b
#> $b[[1]]
#> [1] 1 2 3
map_if(my_list, is.list, pluck)
#> $a
#> [1] 1 2 3
#> 
#> $b
#> $b[[1]]
#> [1] 1 2 3
map_if(my_list, is.list, flatten_int)
#> $a
#> [1] 1 2 3
#> 
#> $b
#> [1] 1 2 3
map_if(my_list, is.list, flatten_int) |> 
  flatten_int()
#> [1] 1 2 3 1 2 3

Dealing with Failures

Safely

safely() is an adverb. It takes a function (a verb) and returns a modified version. In this case, the modified function will never throw an error. Instead it always returns a list with two elements.

  • result is the original result. If there is an error this will be NULL

  • error is an error object. If the operation was successful the “error” will be NULL.

A <- list(1, 10, "a")

map(.x = A, .f = safely(log))
#> [[1]]
#> [[1]]$result
#> [1] 0
#> 
#> [[1]]$error
#> NULL
#> 
#> 
#> [[2]]
#> [[2]]$result
#> [1] 2.302585
#> 
#> [[2]]$error
#> NULL
#> 
#> 
#> [[3]]
#> [[3]]$result
#> NULL
#> 
#> [[3]]$error
#> <simpleError in .Primitive("log")(x, base): non-numeric argument to mathematical function>
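The nested result/error structure is easier to work with once the successes are separated from the failures. One possible way to do that, using only functions shown in these notes (the names res and ok are made up):

```r
library(purrr)

A <- list(1, 10, "a")
res <- map(A, safely(log))

# TRUE where the call succeeded (no error object was recorded)
ok <- map_lgl(res, function(r) is.null(r$error))
ok
#> [1]  TRUE  TRUE FALSE

# Results of the successful calls only
good <- map_dbl(res[ok], "result")
good

# Inputs that failed, for later inspection
A[!ok]
```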

Possibly

possibly() always succeeds. It is simpler than safely(), because you can give it a default value to return when there is an error.

A <- list(1,10,"a")

map_dbl(.x = A, .f = possibly(log, otherwise = NA_real_) )
#> [1] 0.000000 2.302585       NA