3.3 More of {base} and {stats}

R’s {base} and {stats} packages provide many built-in functions for statistical analysis. For example, anova() can quickly compare two nested regression models.

anova(reg_fit, poly_fit)
## Analysis of Variance Table
## 
## Model 1: Volume ~ Girth + Height
## Model 2: Volume ~ Girth + I(Girth^2) + Height
##   Res.Df    RSS Df Sum of Sq      F   Pr(>F)    
## 1     28 421.92                                 
## 2     27 186.01  1    235.91 34.243 3.13e-06 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

The second-order term for Girth adds significant explanatory power to the model (F = 34.243, p = 3.13e-06). Formally, we reject the null hypothesis that the coefficient on the second-order Girth term is zero.
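To make the F statistic less of a black box, here is a minimal sketch that recomputes it by hand. It assumes reg_fit and poly_fit were fitted with lm() on the built-in trees data using the two formulas printed in the table; the names f_manual and t_stat are introduced here for illustration only.

```r
# Assumption: the models were fitted like this (formulas taken from the anova() output)
reg_fit  <- lm(Volume ~ Girth + Height, data = trees)
poly_fit <- lm(Volume ~ Girth + I(Girth^2) + Height, data = trees)

tab <- anova(reg_fit, poly_fit)
f_stat <- tab$F[2]

# F = (drop in RSS / number of added terms) / (RSS of larger model / its residual df)
f_manual <- ((tab$RSS[1] - tab$RSS[2]) / tab$Df[2]) / (tab$RSS[2] / tab$Res.Df[2])

# With a single added term, the F statistic equals the squared t statistic of that term
t_stat <- coef(summary(poly_fit))["I(Girth^2)", "t value"]

all.equal(f_stat, f_manual)
all.equal(f_stat, t_stat^2)
```

Because only one term is added, the anova() comparison and the t test reported by summary(poly_fit) give the same p-value. anova() becomes the more convenient tool when the larger model adds several terms at once.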

What is ANOVA?


Use base R statistical functions when someone tries to test your statistics knowledge.

Question: If \(U_1\) and \(U_2\) are i.i.d. (independent and identically distributed) \(Unif(0,1)\) random variables, what is the distribution of \(U_1 + U_2\)?

set.seed(42)
n <- 10000
u_1 <- runif(n)
u_2 <- runif(n)
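# Histogram on the density scale with a kernel density estimate overlaid
.hist <- function(x, ...) {
  hist(x, probability = TRUE, ...)
  # Extra arguments go to hist() only; titles or labels would make lines() warn
  lines(density(x), col = "blue", lwd = 2)
}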

layout(matrix(c(1,2,3,3), 2, 2, byrow = TRUE))
.hist(u_1)
.hist(u_2)
.hist(u_1 + u_2)

Answer: Evidently, the sum follows a triangular distribution on \([0, 2]\) with its peak at 1.
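The simulation can be checked against theory. Convolving two \(Unif(0,1)\) densities gives \(f(s) = s\) for \(0 \le s \le 1\) and \(f(s) = 2 - s\) for \(1 < s \le 2\), which is the same as \(\max(0, 1 - |s - 1|)\). The sketch below overlays that density on the simulated sum; dtri is a helper name made up for this example, not a base R function.

```r
set.seed(42)
n <- 10000
s <- runif(n) + runif(n)

# Triangular density on [0, 2] with peak at 1 (hypothetical helper)
dtri <- function(s) pmax(0, 1 - abs(s - 1))

hist(s, probability = TRUE, breaks = 40, main = "U_1 + U_2")
curve(dtri(x), from = 0, to = 2, add = TRUE, col = "red", lwd = 2)

# Sanity checks: the density integrates to 1,
# and the sample moments should be near the theoretical mean 1 and variance 1/6
integrate(dtri, 0, 2)$value
c(mean = mean(s), var = var(s))
```

The variance follows from independence: each uniform has variance 1/12, so the sum has variance 1/6.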


There are probably lots of functions that you didn’t even know you needed, and a few that you can run into by accident.

add_column <- function(data) {
  # Whoops! `df` should be `data`
  df %>% mutate(dummy = 1)
}

trees %>% add_column()
## Error in UseMethod("mutate"): no applicable method for 'mutate' applied to an object of class "function"

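No variable named df exists inside add_column(), so R resolves the name to stats::df(), the density function for the F distribution with df1 and df2 degrees of freedom. As a result, mutate() receives a function rather than a data frame.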

df
## function (x, df1, df2, ncp, log = FALSE) 
## {
##     if (missing(ncp)) 
##         .Call(C_df, x, df1, df2, log)
##     else .Call(C_dnf, x, df1, df2, ncp, log)
## }
## <bytecode: 0x55e28a6cf4f0>
## <environment: namespace:stats>
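One way to diagnose this kind of name clash is to ask R where a name resolves to. The sketch below uses only base R and utils functions. The helper add_dummy is a hypothetical, base R rewrite of add_column() that uses its own argument and checks its input; it is not the original author's fix.

```r
# Where does `df` come from when nothing shadows it?
find("df")                        # search-path entries that define `df`
environmentName(environment(df))  # namespace that owns the function
is.function(df)                   # it is a function, not a data frame

# Hypothetical corrected helper: refer to the argument `data`, and fail early
# with a clear message if something other than a data frame is passed in
add_dummy <- function(data) {
  stopifnot(is.data.frame(data))
  data$dummy <- 1
  data
}

head(add_dummy(trees), 3)
```

Avoiding variable names that collide with common functions (df, c, t, data) makes this kind of error easier to spot, since a misspelled or missing variable then fails with "object not found" instead of silently picking up a function.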