29.4 A bit of Mathematics:

Our model is a function of the observed data. This is what we aim to achieve:

\[Y=f(x)\]

In the simplest linear case, the function is built from coefficients and a predictor:

\[f(x)=\beta_{0}+\beta_{1}x\]

When we make a model, we attempt to replicate \(Y=f(x)\) by applying mathematical models to our observed data. The final objective could be the prediction of an outcome.

\(\hat{Y}\) is the result of our model. These fitted values differ slightly from the real values; the difference is the noise, or residuals, \(\epsilon\).

\[\hat{Y}=\hat{\beta}_{0}+\hat{\beta}_{1}x\]

The residuals are the difference between \(Y\) and \(\hat{Y}\):

\[Y-\hat{Y}=\epsilon\]
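The relationship between observed values, fitted values and residuals can be checked numerically. The following is a minimal sketch on simulated data; the true coefficients (4 and 2), the noise level and the seed are assumptions chosen for illustration, not values from the text.

```r
# Simulated data: coefficients, noise level and seed are illustrative assumptions
set.seed(123)
x <- rep(1:10, each = 3)
y <- 4 + 2 * x + rnorm(length(x), sd = 2)

# Fit a simple linear model y = b0 + b1 * x
fit <- lm(y ~ x)

# Fitted values: y_hat = b0_hat + b1_hat * x
b <- coef(fit)
y_hat <- b[["(Intercept)"]] + b[["x"]] * x

# Residuals: the difference between observed and fitted values
eps <- y - y_hat

# The manual calculation should match what lm() stores
all.equal(unname(fitted(fit)), y_hat)
all.equal(unname(residuals(fit)), eps)
```

Both comparisons are expected to return `TRUE`, since `fitted()` and `residuals()` implement the same two formulas.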

We want to make the residuals as small as possible by trying different types of models and assessing them at different parameter values. For this purpose we use metrics that summarize the size of the residuals, such as:

  • the r squared (rsq) \(R^2\)
  • the adjusted \(R^2\)
  • the residual sum of squares \(RSS\)
  • and others
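The first three metrics can be computed by hand from the residuals and compared with what `summary()` reports. This is a minimal sketch on simulated data (the coefficients, noise level and seed are assumptions for illustration); `rss`, `tss`, `rsq` and `adj_rsq` are names introduced here.

```r
# Simulated data: coefficients, noise level and seed are illustrative assumptions
set.seed(123)
x <- rep(1:10, each = 3)
y <- 4 + 2 * x + rnorm(length(x), sd = 2)
fit <- lm(y ~ x)

n <- length(y)  # number of observations
p <- 1          # number of predictors (excluding the intercept)

# Residual sum of squares: sum of squared differences between y and y_hat
rss <- sum(residuals(fit)^2)

# Total sum of squares: variation of y around its mean
tss <- sum((y - mean(y))^2)

# R squared: share of the variation in y explained by the model
rsq <- 1 - rss / tss

# Adjusted R squared: penalizes the number of predictors
adj_rsq <- 1 - (1 - rsq) * (n - 1) / (n - p - 1)

# Compare with the values reported by summary()
s <- summary(fit)
all.equal(rsq, s$r.squared)
all.equal(adj_rsq, s$adj.r.squared)
```

A smaller RSS means the fitted line is closer to the observations, while \(R^2\) rescales the same information relative to the total variation, so values nearer to 1 indicate a better fit.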

To write formulas in R Markdown, have a look at this resource: markdown-extensions


Here is a visualization of the simulated data `sim1` from the {modelr} package. We then look at how different slope values change the fit.

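library(tidyverse)
library(manipulate)
library(modelr)

# Load the simulated data set shipped with {modelr}
data(sim1)

# Fit a simple linear model of y on x
mod1 <- lm(y ~ x, data = sim1)

# Scatter plot with the fitted regression line
ggplot(sim1, aes(x, y)) +
  geom_point() +
  geom_smooth(method = "lm") +
  theme_bw()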

We can use the manipulate() function from the {manipulate} package to try different slope values and see how closely the resulting line matches the model line:

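# Interactive plot: move the slider to change the slope of the black line
# (manipulate() only works inside RStudio)
manipulate(
  ggplot(sim1, aes(x, y)) +
    geom_point() +
    geom_smooth(method = "lm") +
    geom_abline(intercept = coef(mod1)[1], slope = r) +
    theme_bw(),
  r = slider(min = 1, max = 3, step = 0.1)
)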
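Outside RStudio, the same exploration can be done without a slider by computing the residual sum of squares for each candidate slope. This is a sketch rather than part of the original example: it keeps the intercept fixed at the fitted value, as the slider plot does, and the names `slopes`, `rss` and `best` are introduced here for illustration.

```r
library(modelr)  # provides the sim1 data set

data(sim1)
mod1 <- lm(y ~ x, data = sim1)

# Keep the intercept fixed at the fitted value, as in the slider plot
b0 <- coef(mod1)[["(Intercept)"]]

# Same slope range as the slider: 1 to 3 in steps of 0.1
slopes <- seq(1, 3, by = 0.1)

# RSS of the line y = b0 + slope * x for each candidate slope
rss <- sapply(slopes, function(s) sum((sim1$y - (b0 + s * sim1$x))^2))

# Candidate slope with the smallest RSS, next to the slope estimated by lm()
best <- slopes[which.min(rss)]
best
coef(mod1)[["x"]]
```

The candidate with the smallest RSS should be the grid value closest to the slope that `lm()` estimates, which is the position where the slider line overlaps the model line.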