18.2 Explicit missing values

When data is entered by hand, a missing value sometimes indicates that the value in the previous row has been repeated (or carried forward). We can fill in these missing values with tidyr::fill():

treatment <- tibble::tribble(
  ~person,           ~treatment, ~response,
  "Derrick Whitmore", 1,         7,
  NA,                 2,         10,
  NA,                 3,         NA,
  "Katherine Burke",  1,         4
)

print(treatment)

## # A tibble: 4 × 3
##   person           treatment response
##   <chr>                <dbl>    <dbl>
## 1 Derrick Whitmore         1        7
## 2 <NA>                     2       10
## 3 <NA>                     3       NA
## 4 Katherine Burke          1        4

treatment |>
  tidyr::fill(
    dplyr::everything(),
    .direction = "down"
  )

## # A tibble: 4 × 3
##   person           treatment response
##   <chr>                <dbl>    <dbl>
## 1 Derrick Whitmore         1        7
## 2 Derrick Whitmore         2       10
## 3 Derrick Whitmore         3       10
## 4 Katherine Burke          1        4
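
Note that everything() also filled the missing response in row 3 with the 10 from row 2, which may not be appropriate for a measured value. A minimal sketch, assuming only the person column follows the carry-forward convention, restricts the fill to that column (the name `filled` is just illustrative):

```r
# Same data as above, recreated so the block runs on its own.
treatment <- tibble::tribble(
  ~person,            ~treatment, ~response,
  "Derrick Whitmore", 1,          7,
  NA,                 2,          10,
  NA,                 3,          NA,
  "Katherine Burke",  1,          4
)

# Fill only `person`; the missing `response` in row 3 stays NA.
filled <- treatment |>
  tidyr::fill(person, .direction = "down")

filled
```

Whether a missing response should stay NA or be carried forward depends on how the data was recorded, so this is a judgment call rather than a rule.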

Sometimes missing values should be represented by some fixed and known value, most commonly 0. You can use dplyr::coalesce() to replace them:

x <- c(1, 4, 5, 7, NA)
dplyr::coalesce(x, 0)

## [1] 1 4 5 7 0

# Given several vectors, coalesce() takes the first non-missing
# value at each position.
y <- c(1, 2, NA, NA, 5)
z <- c(NA, NA, 3, 4, 5)
dplyr::coalesce(y, z)

## [1] 1 2 3 4 5
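
The two calls above can be combined into a single fallback chain. The following is a small sketch in which `primary` and `backup` are hypothetical vectors: each position takes the first non-missing value from left to right, and the trailing 0 acts as a final default.

```r
# Hypothetical measurements from two sources.
primary <- c(1, NA, NA, 4)
backup  <- c(NA, 2, NA, 5)

# Prefer `primary`, fall back to `backup`, and use 0 if both are missing.
result <- dplyr::coalesce(primary, backup, 0)

result
# 1 2 0 4
```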

If we need to replace NA in several columns at once, each with its own value, tidyr::replace_na() is more convenient:

df <- tibble::tibble(x = c(1, 2, NA), y = c("a", NA, "b"))

df

## # A tibble: 3 × 2
##       x y    
##   <dbl> <chr>
## 1     1 a    
## 2     2 <NA> 
## 3    NA b

df |> tidyr::replace_na(list(x = 0, y = "unknown"))

## # A tibble: 3 × 2
##       x y      
##   <dbl> <chr>  
## 1     1 a      
## 2     2 unknown
## 3     0 b
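
tidyr::replace_na() also works on a single vector, which is handy when the same replacement applies to several columns. As a rough sketch, assuming the numeric columns x and y should both default to 0 while z is left alone (the data and the name `out` are made up for illustration):

```r
df <- tibble::tibble(
  x = c(1, 2, NA),
  y = c(NA, 5, 6),
  z = c("a", NA, "b")
)

# Apply the vector form of replace_na() to each listed column.
out <- df |>
  dplyr::mutate(
    dplyr::across(c(x, y), \(col) tidyr::replace_na(col, 0))
  )

out
```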

On the other hand, sometimes a concrete value actually represents a missing value. This typically arises in data generated by older software that doesn’t have a proper way to represent missing values, so it must instead use some special value like 99 or -999.

If possible, handle this when reading in the data, for example by using the na argument to readr::read_csv(), e.g., readr::read_csv(path, na = "99").
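
Here is a hedged sketch of that approach. The inline CSV text and the column names are invented for illustration; I() marks the string as literal data rather than a file path. Because the na argument replaces the default of c("", "NA") rather than adding to it, the sketch lists the defaults alongside "99".

```r
# Inline CSV text stands in for a real file.
csv_text <- I("id,score\n1,12\n2,99\n3,NA\n4,7\n")

scores <- readr::read_csv(
  csv_text,
  na = c("", "NA", "99"),   # keep the defaults and add the sentinel
  show_col_types = FALSE    # suppress the column specification message
)

scores
```

The na strings apply to every column, so a legitimate 99 elsewhere in the file would also become NA. If that is a risk, cleaning the affected column after reading may be safer.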

If you discover the problem later, or your data source doesn’t provide a way to handle it on read, you can use dplyr::na_if():

x <- c(1, 4, 5, 7, -99)
dplyr::na_if(x, -99)

## [1]  1  4  5  7 NA
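
dplyr::na_if() compares against a single value. When a data source uses more than one sentinel, one possible approach (a sketch, with `sentinels` as a hypothetical list of codes) is to test membership with %in% instead:

```r
x <- c(1, -99, 5, -999, 7)

# Hypothetical set of codes that all mean "missing".
sentinels <- c(-99, -999)

# Replace any sentinel with NA; NA_real_ keeps the result a double vector.
cleaned <- dplyr::if_else(x %in% sentinels, NA_real_, x)

cleaned
# 1 NA 5 NA 7
```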

R has one special type of missing value called NaN (pronounced “nan”), or not a number. NaN occurs when a mathematical operation has an indeterminate result:

0 / 0

## [1] NaN

0 * Inf

## [1] NaN

Inf - Inf

## [1] NaN

sqrt(-1)

## Warning in sqrt(-1): NaNs produced

## [1] NaN

NaN generally behaves just like NA:

x <- c(NA, NaN)

x * 10

## [1]  NA NaN

x == 1

## [1] NA NA

is.na(x)

## [1] TRUE TRUE

In the rare case you need to distinguish an NA from a NaN, you can use is.nan(x):

is.nan(x)

## [1] FALSE  TRUE
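
Because is.na() is TRUE for both kinds of value, counting them separately takes a combination of the two tests. A small sketch (the variable names are illustrative):

```r
x <- c(1, NA, NaN, 0 / 0)

# NaN values only.
n_nan <- sum(is.nan(x))

# Values that are NA but not NaN.
n_na_only <- sum(is.na(x) & !is.nan(x))

c(nan = n_nan, na_only = n_na_only)
```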