13.3 Important types of atomic vector
- The four most important types of atomic vectors: logical, integer, double, and character.
Note: Raw and complex are rarely used during a data analysis, hence not part of this discussion.
13.3.1 Logical
- Simplest type of atomic vector –> can take only three possible values: FALSE, TRUE, and NA.
- They are constructed with comparison operators.
- Or you can create them by hand with c():
Qn: what exactly is %% doing here? I understand it is x modulus y…
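To make the %% question concrete, here is a minimal sketch. The vector x is only an illustrative choice, not something taken from these notes. %% returns the remainder after division, so comparing that remainder with 0 gives a logical vector that marks the multiples.

```r
x <- 1:10

# Remainder after dividing each element by 3
remainder <- x %% 3
remainder                    # 1 2 0 1 2 0 1 2 0 1

# A comparison operator turns the remainders into a logical vector
is_multiple_of_3 <- remainder == 0
is_multiple_of_3             # TRUE only at positions 3, 6 and 9

# A logical vector built by hand with c()
by_hand <- c(TRUE, TRUE, FALSE, NA)
typeof(by_hand)              # "logical"
```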
13.3.2 Numeric
- As learnt, integer and double are known collectively as numeric vectors.
- In R, numbers are doubles by default.
- To create an integer, place L after the number.
Note: running 1.5L gives a warning because integers can only hold whole numbers; R keeps the value as a double instead.
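A quick way to check the default type and the effect of the L suffix is typeof(). This is only a small sketch with made-up variable names:

```r
a <- 1     # no suffix: stored as a double
b <- 1L    # L suffix: stored as an integer

typeof(a)          # "double"
typeof(b)          # "integer"

# Both count as numeric vectors
is.numeric(a)      # TRUE
is.numeric(b)      # TRUE
```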
- Two important differences between doubles and integers:
- Doubles are approximations. They represent floating point numbers that can’t always be precisely represented with a fixed amount of memory. Therefore, we should consider all doubles as approximations. E.g., the square root of two:
Side note: I set options(scipen = 999) to avoid scientific notation, as I often get confused reading e- or e+.
- When working with floating point numbers, it is common that the calculations include some approximation.
- Hence, when we compare floating point numbers we should use dplyr::near() instead of ==, as it allows for some numerical tolerance. (what does this mean?)
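On the "numerical tolerance" question: the sketch below squares the square root of two and compares it with 2. The base R comparison against a small tolerance is my own stand-in for what dplyr::near() does by default. The dplyr call itself only runs if the package is installed.

```r
x <- sqrt(2) ^ 2

exact_equal <- x == 2        # FALSE: x is not exactly 2
difference  <- x - 2         # a tiny, non-zero rounding error

# "Numerical tolerance": treat the values as equal when the
# difference is smaller than a small threshold
tol <- .Machine$double.eps^0.5
within_tol <- abs(x - 2) < tol   # TRUE

# Same check via dplyr, if it is available
if (requireNamespace("dplyr", quietly = TRUE)) {
  dplyr::near(x, 2)
}
```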
- Integers have one special value: NA, while doubles have four: NA, NaN, Inf and -Inf. NaN, Inf and -Inf can arise during division:
- To check for these special values, let’s use the helper functions is.finite(), is.infinite(), and is.nan() instead of using ==.
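The special values and the helper functions can be seen together in a short sketch. The vector below is an illustrative choice. It also shows why == is a poor check: comparing with NaN returns NA rather than TRUE or FALSE.

```r
# Dividing by zero produces the three non-NA special values
x <- c(-1, 0, 1) / 0
x                  # -Inf NaN Inf

is.finite(x)       # FALSE FALSE FALSE
is.infinite(x)     # TRUE FALSE TRUE
is.nan(x)          # FALSE TRUE FALSE

# == does not help for NaN: the result is NA
nan_compare <- NaN == NaN
nan_compare        # NA
```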
13.3.3 Character
- Most complex type of atomic vector, because each element of a character vector is a string, and a string can contain an arbitrary amount of data.
- R uses a global string pool.
- Implying that each unique string is only stored in memory once.
- And every use of the string points to that representation.
- This reduces the amount of memory needed by duplicated strings. To see this, let’s use pryr::object_size():
x <- "This is a reasonably long string."
pryr::object_size(x)
y <- rep(x, 1000)
pryr::object_size(y)
y doesn’t take up 1,000x as much memory as x, because each element of y is just a pointer to that same string.
A pointer is 8 bytes, so 1000 pointers to a 152 B string take about 8 * 1000 + 152 = 8,152 B, i.e. roughly 8.14 kB.
13.3.4 Missing values
- Each type of atomic vector has its own missing value:
But we don’t need to know about these different types since we can always use NA and it’ll be converted to the correct type using the implicit coercion rules. However, there are some functions that are strict about their inputs, so it’s useful to have this knowledge sitting in your back pocket so you can be specific when needed.
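As a small sketch of the typed missing values and of the implicit coercion of a plain NA (the example vectors are my own):

```r
# Each type has its own missing value
typeof(NA)              # "logical"
typeof(NA_integer_)     # "integer"
typeof(NA_real_)        # "double"
typeof(NA_character_)   # "character"

# A plain NA is coerced to the type of the vector it sits in
dbl <- c(1.5, NA)
chr <- c("a", NA)
typeof(dbl)             # "double"
typeof(chr)             # "character"
```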
13.3.5 Exercises
- Describe the difference between is.finite(x) and !is.infinite(x).
is.finite() considers only non-missing numeric values to be finite; NA, NaN, -Inf, and Inf are all considered not to be finite.
is.infinite() considers only -Inf and Inf to be infinite. Hence, !is.infinite() is FALSE for -Inf and Inf, but TRUE for everything else, including NaN and NA. The two expressions therefore differ on NA and NaN: is.finite() returns FALSE for them, while !is.infinite() returns TRUE.
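The difference is easiest to see side by side. A minimal sketch with an illustrative vector that contains one value of each kind:

```r
x <- c(0, NA, NaN, Inf, -Inf)

finite       <- is.finite(x)     # TRUE FALSE FALSE FALSE FALSE
not_infinite <- !is.infinite(x)  # TRUE TRUE  TRUE  FALSE FALSE

# The two disagree exactly on NA and NaN (positions 2 and 3)
which(finite != not_infinite)
```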
- Read the source code for dplyr::near() (Hint: to see the source code, drop the ()). How does it work?
It doesn’t check exact equality as I first thought. Instead, it checks whether two numbers are within a certain tolerance (tol) of each other, i.e. abs(x - y) < tol. The default tolerance is .Machine$double.eps^0.5, the square root of the machine epsilon, where .Machine$double.eps is the smallest positive number x such that 1 + x != 1. (Good to know!!)
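To see the idea without loading dplyr, here is a hand-written equivalent. The name my_near is hypothetical and only used for illustration; it mirrors the tolerance check described above.

```r
# Hypothetical stand-in that mirrors the check described above
my_near <- function(x, y, tol = .Machine$double.eps^0.5) {
  abs(x - y) < tol
}

# The default tolerance is roughly 1.49e-08
.Machine$double.eps^0.5

exact  <- (0.1 + 0.2) == 0.3       # FALSE because of rounding error
close  <- my_near(0.1 + 0.2, 0.3)  # TRUE: difference is far below tol
apart  <- my_near(1, 1.001)        # FALSE: difference exceeds tol
```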
- A logical vector can take 3 possible values. How many possible values can an integer vector take? How many possible values can a double take? Use google to do some research.
For integer vectors, R uses a 32-bit representation, i.e., it can represent \(2^{32}\) different values with the integers. But one of these values is set aside for NA_integer_, leaving \(2^{32} - 1\) possible non-missing integer values.
The range of integer values represented in R is \(\pm (2^{31} - 1)\). The maximum integer is \(2^{31} - 1\) rather than \(2^{32}\) because one bit is used to represent the sign \((\pm)\) and one value is reserved for NA_integer_.
If a calculation produces an integer greater than that value, R returns NA with a warning.
For double vectors, R uses a 64-bit representation, i.e., they can hold up to \(2^{64}\) values. But some of those values are assigned to special values: -Inf, Inf, NA_real_, and NaN.
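The integer limit can be checked directly with .Machine. This is a small sketch; the warning is suppressed only so that the example runs quietly.

```r
max_int <- .Machine$integer.max
max_int                        # 2147483647

# The maximum equals 2^31 - 1
max_int == 2^31 - 1            # TRUE

# Going past the maximum with integer arithmetic overflows to NA
overflow <- suppressWarnings(max_int + 1L)
overflow                       # NA

# Adding a double instead silently switches to double arithmetic
typeof(max_int + 1)            # "double"
```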
- Brainstorm at least four functions that allow you to convert a double to an integer. How do they differ? Be precise.
The functions differ in how they deal with the fractional part of the double.
- Round down, towards \(-\infty\), i.e., taking the floor of a number –> floor().
- Round up, towards \(\infty\), i.e., taking the ceiling of a number –> ceiling().
- Round towards zero –> trunc() and as.integer().
- Round away from zero.
- Round to the nearest integer. Ties (numbers with a fractional part of 0.5) can be broken in several ways:
  - Round half down, towards \(-\infty\).
  - Round half up, towards \(\infty\).
  - Round half towards zero.
  - Round half away from zero.
  - Round half towards the even integer –> round().
  - Round half towards the odd integer.
library(tibble)  # provides tibble()

# Compare how each function treats the fractional part of x
tibble(
  x = c(1.8, 1.5, 1.2, 0.8, 0.5, 0.2,
        -0.2, -0.5, -0.8, -1.2, -1.5, -1.8),
  `Round down` = floor(x),
  `Round up` = ceiling(x),
  `Round towards zero` = trunc(x),
  `Nearest, round half to even` = round(x) # 0.5 is rounded to 0
)
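One detail the table does not show is the type of the result. The base R sketch below (with an illustrative vector of ties) suggests that floor(), ceiling(), trunc() and round() still return doubles, and only as.integer() actually changes the type.

```r
x <- c(-1.5, -0.5, 0.5, 1.5, 2.5)

rounded   <- round(x)        # ties go to the even integer: -2 0 0 2 2
truncated <- as.integer(x)   # fractional part dropped: -1 0 0 1 2

typeof(floor(x))             # "double"
typeof(rounded)              # "double"
typeof(truncated)            # "integer"
```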
- What functions from the readr package allow you to turn a string into logical, integer, and double vector?
parse_logical() parses logical values, which can appear as variations of TRUE/FALSE or 1/0.
parse_integer() parses integer values.
If the string contains any non-numeric characters, such as commas or decimal points, parse_integer() reports a parsing failure and returns NA, unlike parse_number() (formerly parse_numeric()), which ignores all the non-numeric characters before or after the first number.