13.3 Important types of atomic vector

  • The four most important types of atomic vectors: logical, integer, double, and character.

Note: Raw and complex vectors are rarely used in data analysis, so they are not part of this discussion.

13.3.1 Logical

  • Simplest type of atomic vector –> can take only three possible values: FALSE, TRUE, and NA.
  • They are usually constructed with comparison operators.
  • Or you can create them by hand with c():
1:10 %% 3 == 0

c(TRUE, TRUE, FALSE, NA)

Qn: what exactly is %% doing here? It is the modulo operator: x %% y is the remainder after dividing x by y, so 1:10 %% 3 == 0 is TRUE for the multiples of 3.
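To make the question about %% concrete, here is a small base R sketch. The variable names (x, remainder, divisible_by_3) are only for illustration and are not from the original notes.

```r
# %% is the modulo (remainder) operator: x %% y is what is left over
# after dividing x by y.
x <- 1:10
remainder <- x %% 3
remainder
# 1 2 0 1 2 0 1 2 0 1

# A remainder of 0 means "divisible by 3", so this is TRUE at 3, 6 and 9.
divisible_by_3 <- remainder == 0
which(divisible_by_3)
# 3 6 9
```

The comparison turns the numeric remainders into a logical vector, which is why 1:10 %% 3 == 0 is a handy way to build one.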

13.3.2 Numeric

  • As learnt, integer and double are known collectively as numeric vectors.
  • In R, numbers are doubles by default.
  • To create an integer, place L after the number.
typeof(1)

typeof(1L)

1.5L

Note: 1.5L triggers a warning because integers can only hold whole numbers; R ignores the L and returns the double 1.5.

  • Two important differences between doubles and integers:
    1. Doubles are approximations. They represent floating point numbers that can’t be precisely represented with a fixed amount of memory. Therefore, we should consider all doubles as approximations. E.g., the square root of two:
(x3 <- sqrt(2) ^ 2)

options(scipen = 999)
x3 - 2

Side note: options(scipen = 999) turns off scientific notation, as I often misread e- and e+.

  • When working with floating point numbers, it is common that the calculations include some approximation.
    • Hence, when we compare floating point numbers we should use dplyr::near() instead of ==, as it allows for some numerical tolerance: two numbers count as equal when they differ by less than a very small amount.
dplyr::near(1.745, 2)
# checks whether 1.745 and 2 are equal within the tolerance
# FALSE
  2. Integers have one special value: NA, while doubles have four: NA, NaN, Inf and -Inf.
    • NaN, Inf and -Inf can arise during division:
c(-1, 0, 1) / 0
  • To check for these special values, let’s use the helper functions is.finite(), is.infinite(), and is.nan() instead of using ==.
is.finite(c(-1, 0, 1) / 0)
is.infinite(c(-1, 0, 1) / 0)
is.na(c(-1, 0, 1) / 0)
is.nan(c(-1, 0, 1) / 0)
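A short base R sketch of why the helper functions are preferred over == for the special values. The variable names are illustrative only.

```r
x <- c(-1, 0, 1) / 0   # -Inf NaN Inf

# Comparing with == does not work for NaN: the result is NA, not TRUE.
eq_nan <- x == NaN
eq_nan
# NA NA NA

# The helpers give a definite TRUE/FALSE for every element.
is.nan(x)       # FALSE  TRUE FALSE
is.infinite(x)  #  TRUE FALSE  TRUE
is.finite(x)    # FALSE FALSE FALSE

# is.na() is TRUE for NaN as well as NA, so it cannot tell them apart;
# is.nan() is TRUE only for NaN.
is.na(c(NA, NaN))   # TRUE  TRUE
is.nan(c(NA, NaN))  # FALSE TRUE
```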

13.3.3 Character

  • Most complex atomic vector because each element of a character vector is a string, and a string contains an arbitrary amount of data.
  • R uses a global string pool.
    • Implying that each unique string is only stored in memory once.
    • And every use of the string points to that representation.
      • This reduces the amount of memory needed by duplicated strings. To see this, let’s use pryr::object_size():
x <- "This is a reasonably long string."
pryr::object_size(x)

y <- rep(x, 1000)
pryr::object_size(y)

y doesn’t take up 1,000x as much memory as x, because each element of y is just a pointer to that same string.

A pointer is 8 bytes, so 1000 pointers to a 152 B string take about 8 * 1000 + 152 = 8,152 B, in line with the roughly 8.14 kB reported for y.
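If pryr is not installed, the same effect can be sketched with base R. This assumes utils::object.size() is an acceptable stand-in for pryr::object_size(); the exact byte counts depend on the R version and platform, so none are shown here.

```r
x <- "This is a reasonably long string."
y <- rep(x, 1000)

# Sizes in bytes (as plain numbers so they are easy to compare).
size_x <- as.numeric(utils::object.size(x))
size_y <- as.numeric(utils::object.size(y))

# If every element held its own copy of the string, y would need about
# 1000 times the memory of x. It needs far less.
size_y < 1000 * size_x
size_y / size_x
```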

13.3.4 Missing values

  • Each type of atomic vector has its own missing value:
NA            # logical

NA_integer_   # integer

NA_real_      # double

NA_character_ # character
  • But we don’t need to know about these different types since we can always use NA and it’ll be converted to the correct type using the implicit coercion rules.

  • However, there are some functions that are strict about their inputs, so it’s useful to have this knowledge sitting in your back pocket so you can be specific when needed.
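A base R sketch of both points: plain NA is coerced to the surrounding type, but a strict function can reject it. vapply() is used here only as one example of a strict function; it is my choice for illustration, not one named in the notes.

```r
# Each NA constant has a fixed type.
typeof(NA)             # "logical"
typeof(NA_integer_)    # "integer"
typeof(NA_real_)       # "double"
typeof(NA_character_)  # "character"

# A plain NA is coerced to the type of the vector it is combined with.
typeof(c(1L, NA))      # "integer"
typeof(c(1.5, NA))     # "double"
typeof(c("a", NA))     # "character"

# vapply() insists that each result matches the declared type, so a
# logical NA is rejected where a character value is expected.
strict <- try(vapply(1:2, function(i) NA, character(1)), silent = TRUE)
inherits(strict, "try-error")   # TRUE

# Being specific with NA_character_ satisfies the check.
ok <- vapply(1:2, function(i) NA_character_, character(1))
typeof(ok)                      # "character"
```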

13.3.5 Exercises

  1. Describe the difference between is.finite(x) and !is.infinite(x).
(x <- c(-1/0, 0/0, 1/0, 5, 5L, NA))
is.finite(x)
is.infinite(x)
!is.infinite(x)

is.finite() is TRUE only for non-missing, finite numbers; it is FALSE for -Inf, Inf, NaN, and NA.

is.infinite() is TRUE only for -Inf and Inf. Hence, !is.infinite() is FALSE for -Inf and Inf and TRUE for everything else, including NaN and NA. The two expressions therefore disagree on NaN and NA.
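A small base R sketch that picks out the elements on which the two expressions disagree. The names finite and not_infinite are illustrative only.

```r
x <- c(-1/0, 0/0, 1/0, 5, 5L, NA)   # -Inf NaN Inf 5 5 NA

finite       <- is.finite(x)     # FALSE FALSE FALSE TRUE TRUE FALSE
not_infinite <- !is.infinite(x)  # FALSE  TRUE FALSE TRUE TRUE  TRUE

# Positions where the two results differ: the NaN and the NA.
which(finite != not_infinite)
# 2 6
```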

  2. Read the source code for dplyr::near() (Hint: to see the source code, drop the ()). How does it work?
dplyr::near

It doesn’t check exact equality as I first thought; it checks whether two numbers are within a certain tolerance (tol) of each other. The default tolerance is .Machine$double.eps^0.5, the square root of the machine epsilon, where .Machine$double.eps is the smallest positive number x such that 1 + x != 1. (Good to know!!)
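The tolerance check can be sketched in base R. near_sketch() is a hypothetical stand-in written for these notes; it mirrors what the printed source appears to do, but it is not dplyr's function itself.

```r
# Hypothetical stand-in for dplyr::near(): TRUE when x and y differ by
# less than tol.
near_sketch <- function(x, y, tol = .Machine$double.eps^0.5) {
  abs(x - y) < tol
}

sqrt(2)^2 == 2             # FALSE: the two doubles differ in the last bits
near_sketch(sqrt(2)^2, 2)  # TRUE: the difference is far below the tolerance
near_sketch(1.745, 2)      # FALSE: 0.255 apart is a real difference

.Machine$double.eps        # about 2.2e-16
.Machine$double.eps^0.5    # about 1.5e-08
```

Because the comparison is vectorised through abs() and <, the sketch works element-wise on vectors as well as on single numbers.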

  3. A logical vector can take 3 possible values. How many possible values can an integer vector take? How many possible values can a double take? Use Google to do some research.
help(integer)
help(double)

For integer vectors, R uses a 32-bit representation, i.e., it can represent \(2^{32}\) different values. But one of these values is set aside for NA_integer_.

.Machine$integer.max
.Machine$integer.max + 1L

The range of integer values in R is \(\pm(2^{31} - 1)\). The maximum integer is \(2^{31} - 1\) rather than \(2^{32}\) because one bit is used for the sign \((\pm)\) and one value is reserved for NA_integer_.

If a calculation produces an integer beyond that range, R returns NA with a warning.
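The integer limits can be checked directly in base R. The variable name overflow is illustrative; suppressWarnings() is used only to keep the overflow warning out of the output.

```r
# The largest integer is 2^31 - 1.
.Machine$integer.max == 2^31 - 1           # TRUE

# One step further overflows: R warns and returns NA.
overflow <- suppressWarnings(.Machine$integer.max + 1L)
is.na(overflow)                            # TRUE
typeof(overflow)                           # "integer"

# Adding a double instead of an integer avoids the overflow, because the
# result is then a double.
typeof(.Machine$integer.max + 1)           # "double"
.Machine$integer.max + 1 == 2^31           # TRUE
```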

For double vectors, R uses a 64-bit representation, i.e., a double can take at most \(2^{64}\) distinct values. But some of those values are assigned to special values: -Inf, Inf, NA_real_, and NaN.

.Machine$double.xmax
  4. Brainstorm at least four functions that allow you to convert a double to an integer. How do they differ? Be precise.

The functions that convert a double to an integer differ in how they deal with the fractional part of the double.

  • Round down, towards \(-\infty\) i.e., taking the floor of a number –> floor().
  • Round up, towards \(\infty\) i.e., taking the ceiling of a number –> ceiling().
  • Round towards zero –> trunc() and as.integer().
  • Round away from zero.
  • Round to the nearest integer. Ties are numbers with a fractional part of exactly 0.5, and they can be broken in several ways:
    • Round half down, towards \(-\infty\).
    • Round half up, towards \(\infty\)
    • Round half towards zero
    • Round half away from zero
    • Round half towards the even integer –> round().
    • Round half towards the odd integer.
library(tibble)

# Compare how each function treats the fractional part.
tibble(
  x = c(1.8, 1.5, 1.2, 0.8, 0.5, 0.2,
        -0.2, -0.5, -0.8, -1.2, -1.5, -1.8),
  `Round down` = floor(x),
  `Round up` = ceiling(x),
  `Round towards zero` = trunc(x),
  `Nearest, round half to even` = round(x) # 0.5 is rounded to 0
)
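None of the functions in the table rounds half away from zero, so here is a base R sketch of that rule next to round(). round_half_away() is a hypothetical helper written for these notes, not a function from R or the tidyverse.

```r
x <- c(2.5, 1.5, 0.5, -0.5, -1.5, -2.5)

# round() sends exact ties to the nearest even integer.
half_even <- round(x)
half_even
# 2 2 0 0 -2 -2

# Hypothetical helper: round half away from zero.
round_half_away <- function(x) sign(x) * floor(abs(x) + 0.5)
half_away <- round_half_away(x)
half_away
# 3 2 1 -1 -2 -3

# floor(), ceiling(), trunc() and round() still return doubles;
# as.integer() changes the type and truncates towards zero.
typeof(floor(1.8))        # "double"
typeof(as.integer(1.8))   # "integer"
as.integer(c(1.8, -1.8))  # 1 -1
```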
  5. What functions from the readr package allow you to turn a string into logical, integer, and double vector?

parse_logical() parses logical values, which can appear as variations of TRUE/FALSE or 1/0.

library(readr)

parse_logical(c("TRUE", "FALSE", "1", "0", "true", "t", "NA"))

parse_integer() parses integer values.

parse_integer(c("1235", "0134", "NA"))

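If the string contains non-numeric characters such as commas, currency symbols, or decimal points, parse_integer() reports a parsing failure (a warning, with NA for that element), unlike parse_number(), which ignores all non-numeric characters before and after the first number. For doubles, parse_double() is the strict parser and parse_number() the flexible one.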

parse_integer(c("1000", "$1,000", "10.00"))
parse_number(c("1.0", "3.5", "$1,000.00", "NA", "ABCD12234.90", "1234ABC", "A123B", "A1B2C"))