3.4 Relationship between variables: Linear models and correlation

Examples:

  1. The expression of a particular gene in the liver in relation to the dosage of a drug that the patient receives.
  2. DNA methylation of a certain locus in the genome in relation to the age of the sample donor.
  3. The relationship between histone modifications and gene expression.

The simple linear model relating a response $Y$ to a predictor $X$:

$$Y = \beta_0 + \beta_1 X + \epsilon$$

Approximation of the response:

$$Y \sim \beta_0 + \beta_1 X$$

Estimation of the coefficients:

$$Y = \hat{\beta}_0 + \hat{\beta}_1 X$$

More than one predictor:

$$Y = \beta_0 + \beta_1 X_1 + \beta_2 X_2 + \epsilon$$

$$Y = \beta_0 + \beta_1 X_1 + \beta_2 X_2 + \beta_3 X_3 + \ldots + \beta_n X_n + \epsilon$$


With two predictors and four observations, the model can be written in matrix form:

$$Y = \begin{bmatrix} 1 & X_{1,1} & X_{1,2} \\ 1 & X_{2,1} & X_{2,2} \\ 1 & X_{3,1} & X_{3,2} \\ 1 & X_{4,1} & X_{4,2} \end{bmatrix} \begin{bmatrix} \beta_0 \\ \beta_1 \\ \beta_2 \end{bmatrix} + \begin{bmatrix} \epsilon_1 \\ \epsilon_2 \\ \epsilon_3 \\ \epsilon_4 \end{bmatrix}$$

which is equivalent to the following set of equations:

$$Y_1 = \beta_0 + \beta_1 X_{1,1} + \beta_2 X_{1,2} + \epsilon_1$$
$$Y_2 = \beta_0 + \beta_1 X_{2,1} + \beta_2 X_{2,2} + \epsilon_2$$
$$Y_3 = \beta_0 + \beta_1 X_{3,1} + \beta_2 X_{3,2} + \epsilon_3$$
$$Y_4 = \beta_0 + \beta_1 X_{4,1} + \beta_2 X_{4,2} + \epsilon_4$$

In general, the matrix notation is:

$$Y = X\beta + \epsilon$$
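To make the matrix notation concrete, here is a minimal R sketch (the variable names and values are illustrative, not from the text) that builds a design matrix with a leading column of ones and generates responses via $Y = X\beta + \epsilon$:

# a toy design matrix: 4 observations, intercept column plus 2 predictors
set.seed(10)
X = cbind(1, matrix(runif(8, 1, 10), ncol = 2))
beta = c(10, 2, 3)        # beta0, beta1, beta2
eps = rnorm(4)            # error terms
Y = X %*% beta + eps      # the model in matrix form
Y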

3.4.1 The cost or loss function approach

We find the coefficients by minimizing the sum of squared residuals, which is an optimization procedure:

$$\min \sum_{i=1}^{n} (y_i - (\beta_0 + \beta_1 x_i))^2$$

The “gradient descent” algorithm:

  1. Pick a random starting point: random β values.

  2. Take the partial derivatives of the cost function with respect to each β to see which direction decreases the cost.

  3. Take a step in the direction that decreases the cost function.

    • The step size is a parameter to choose; there are many variants of the algorithm.
  4. Repeat steps 2 and 3 until convergence.

The algorithm usually converges to optimal β values; a minimal implementation is sketched below.
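The following toy example implements these steps in R. The learning rate and iteration count are illustrative choices, not values from the text; the learning rate must be small because x is not standardized here, which also makes convergence slow. The data are simulated the same way as in the example further below.

set.seed(32)
x = runif(50, 1, 100)
y = 10 + 2*x + rnorm(50, 0, 20)

gradDescent = function(x, y, lr = 1e-4, iters = 5e5){
  b0 = 0; b1 = 0                   # step 1: a starting point
  for(i in 1:iters){
    res = y - (b0 + b1*x)          # residuals at the current coefficients
    g0 = -2*mean(res)              # step 2: partial derivative w.r.t. b0
    g1 = -2*mean(res*x)            # step 2: partial derivative w.r.t. b1
    b0 = b0 - lr*g0                # step 3: move against the gradient,
    b1 = b1 - lr*g1                # lr is the step size
  }
  c(b0 = b0, b1 = b1)
}
gradDescent(x, y)  # should be close to coef(lm(y ~ x))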


Figure 3.1: Cost function landscape for linear regression with changing beta values. The optimization process tries to find the lowest point in this landscape by implementing a strategy for updating beta values toward the lowest point in the landscape.

3.4.2 The “maximum likelihood” approach

  • The response variable $y_i$ follows a normal distribution with mean $\beta_0 + \beta_1 x_i$ and variance $s^2$.

  • Find $\beta_0$ and $\beta_1$ that maximize the probability of observing all the response variables in the dataset given the explanatory variables.

  • The variance $s^2$ is assumed constant; it is an estimate of the population variance $\sigma^2$:

$$s^2 = \frac{\sum{\epsilon_i^2}}{n-2}$$

Probability of observing a single response $y_i$:

$$P(y_i) = \frac{1}{s\sqrt{2\pi}} \, e^{-\frac{1}{2} \left( \frac{y_i - (\beta_0 + \beta_1 x_i)}{s} \right)^2}$$

Likelihood function:

$$L = P(y_1)P(y_2)P(y_3) \ldots P(y_n) = \prod_{i=1}^{n} P(y_i)$$

Taking the log of the likelihood:

$$\ln(L) = -n \ln(s\sqrt{2\pi}) - \frac{1}{2s^2} \sum_{i=1}^{n} (y_i - (\beta_0 + \beta_1 x_i))^2$$

The second term contains the negative of the cost function scaled by $\frac{1}{2s^2}$, so maximizing the log-likelihood is equivalent to minimizing the sum of squared residuals. To optimize this function we would need to take the derivative of the function with respect to the parameters.
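The likelihood can also be maximized numerically. Below is a minimal sketch using R's optim() to minimize the negative log-likelihood; the function negLogLik, the starting values, and the log-parameterization of s are illustrative choices, not from the text.

# negative log-likelihood of the simple linear model;
# s is optimized on the log scale so that it stays positive
negLogLik = function(par, x, y){
  b0 = par[1]; b1 = par[2]; s = exp(par[3])
  -sum(dnorm(y, mean = b0 + b1*x, sd = s, log = TRUE))
}

set.seed(32)
x = runif(50, 1, 100)
y = 10 + 2*x + rnorm(50, 0, 20)

fit = optim(c(0, 0, 0), negLogLik, x = x, y = y,
            control = list(maxit = 5000))
c(b0 = fit$par[1], b1 = fit$par[2], s = exp(fit$par[3]))  # close to the lm() estimates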

3.4.3 Linear algebra and closed-form solution to linear regression

The residuals:

$$\epsilon_i = Y_i - (\beta_0 + \beta_1 x_i)$$

In matrix notation, the sum of squared residuals is:

$$\sum{\epsilon_i^2} = \epsilon^T \epsilon = (Y - X\beta)^T (Y - X\beta) = Y^T Y - 2\beta^T X^T Y + \beta^T X^T X \beta$$

Taking the derivative with respect to $\beta$ and setting it to zero yields the closed-form solution:

$$\hat{\beta} = (X^T X)^{-1} X^T Y$$

For simple regression with one predictor, this reduces to:

$$\hat{\beta}_1 = \frac{\sum{(x_i - \bar{X})(y_i - \bar{Y})}}{\sum{(x_i - \bar{X})^2}}$$

$$\hat{\beta}_0 = \bar{Y} - \hat{\beta}_1 \bar{X}$$

# set random number seed, so that the random numbers from the text
# are the same when you run the code.
set.seed(32)

# get 50 X values between 1 and 100
x = runif(50,1,100)

# set b0,b1 and variance (sigma)
b0 = 10
b1 = 2
sigma = 20
# simulate error terms from normal distribution
eps = rnorm(50,0,sigma)
# get y values from the linear equation plus the error terms
y = b0 + b1*x + eps
# fit the linear model
mod1 = lm(y ~ x)

# plot the data points
plot(x,y,pch=20,
     ylab="Gene Expression",xlab="Histone modification score")
# plot the linear fit
abline(mod1,col="blue")
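We can also verify the closed-form solution directly on this simulated data; a minimal sketch using the x and y defined above:

# closed-form solution on the simulated data
X = cbind(1, x)                     # design matrix with an intercept column
solve(t(X) %*% X) %*% t(X) %*% y    # matches coef(mod1)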

3.4.4 How to estimate the error of the coefficients


Figure 3.2: Regression coefficients vary with every random sample. The figure illustrates the variability of regression coefficients when regression is done using a sample of data points. Histograms depict this variability for b0 and b1 coefficients.
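The variability in Figure 3.2 can be reproduced with a short simulation: refit the model on many freshly simulated samples and look at the spread of the coefficient estimates. A minimal sketch, where the number of replicates is an arbitrary choice:

set.seed(32)
coefs = t(replicate(1000, {
  x.s = runif(50, 1, 100)               # draw a fresh sample
  y.s = 10 + 2*x.s + rnorm(50, 0, 20)
  coef(lm(y.s ~ x.s))                   # b0 and b1 for this sample
}))
apply(coefs, 2, sd)   # compare with the Std. Error column of summary(mod1)
hist(coefs[, 1], main = "b0")           # histograms as in Figure 3.2
hist(coefs[, 2], main = "b1")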

The estimate of the standard deviation of the residuals is called the residual standard error (RSE):

$$s = RSE$$

mod1=lm(y~x)
summary(mod1)
## 
## Call:
## lm(formula = y ~ x)
## 
## Residuals:
##    Min     1Q Median     3Q    Max 
## -38.50 -12.62   1.66  14.26  37.20 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  0.40951    6.43208   0.064     0.95    
## x            2.08742    0.09775  21.355   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 16.96 on 48 degrees of freedom
## Multiple R-squared:  0.9048, Adjusted R-squared:  0.9028 
## F-statistic:   456 on 1 and 48 DF,  p-value: < 2.2e-16
# get confidence intervals 
confint(mod1)
##                  2.5 %    97.5 %
## (Intercept) -12.523065 13.342076
## x             1.890882  2.283952
# pull out coefficients from the model
coef(mod1)
## (Intercept)           x 
##   0.4095054   2.0874167

3.4.5 Accuracy of the model

$$s = RSE = \sqrt{\frac{\sum{(y_i - \hat{Y}_i)^2}}{n - p}} = \sqrt{\frac{RSS}{n - p}}$$


$$R^2 = 1 - \frac{RSS}{TSS} = \frac{TSS - RSS}{TSS}$$

where $RSS$ is the residual sum of squares and $TSS = \sum_{i=1}^{n} (y_i - \bar{y})^2$ is the total sum of squares.


$$r_{xy} = \frac{cov(X, Y)}{\sigma_x \sigma_y} = \frac{\sum\limits_{i=1}^{n} (x_i - \bar{x})(y_i - \bar{y})}{\sqrt{\sum\limits_{i=1}^{n} (x_i - \bar{x})^2 \sum\limits_{i=1}^{n} (y_i - \bar{y})^2}}$$
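These quantities can be checked by hand against summary(mod1); a minimal sketch using the model fitted above:

rss = sum(residuals(mod1)^2)      # residual sum of squares
tss = sum((y - mean(y))^2)        # total sum of squares
sqrt(rss / df.residual(mod1))     # RSE; compare "Residual standard error"
1 - rss/tss                       # R-squared
cor(x, y)^2                       # equals R-squared in simple regression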


Figure 3.3: Correlation and covariance for different scatter plots.

$$H_0: \beta_1 = \beta_2 = \beta_3 = \ldots = \beta_p = 0$$

$$H_1: \text{at least one } \beta_i \neq 0$$

The F-statistic for a linear model:

$$F = \frac{(TSS - RSS)/(p - 1)}{RSS/(n - p)} \sim F(p - 1,\, n - p)$$
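For mod1 this can be computed by hand; a short sketch (here $p = 2$, counting the intercept):

rss = sum(residuals(mod1)^2)
tss = sum((y - mean(y))^2)
n = length(y); p = 2                       # p counts the intercept
f = ((tss - rss)/(p - 1)) / (rss/(n - p))
f                                          # compare the F-statistic in summary(mod1)
pf(f, p - 1, n - p, lower.tail = FALSE)    # the corresponding p-value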

3.4.6 Regression with categorical variables

# simulate gene expression values for two groups of 30 samples
set.seed(100)
gene1 = rnorm(30, mean = 4, sd = 2)
gene2 = rnorm(30, mean = 2, sd = 2)
gene.df = data.frame(exp = c(gene1, gene2),
                     group = c(rep(1, 30), rep(0, 30)))

# fit a linear model with the 0/1 group indicator as the predictor
mod2 = lm(exp ~ group, data = gene.df)
summary(mod2)
## 
## Call:
## lm(formula = exp ~ group, data = gene.df)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -4.7290 -1.0664  0.0122  1.3840  4.5629 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)   2.1851     0.3517   6.214 6.04e-08 ***
## group         1.8726     0.4973   3.765 0.000391 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 1.926 on 58 degrees of freedom
## Multiple R-squared:  0.1964, Adjusted R-squared:  0.1826 
## F-statistic: 14.18 on 1 and 58 DF,  p-value: 0.0003905
# plot the fitted model and the data points; needs the mosaic package
require(mosaic)
plotModel(mod2)
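With a single 0/1 predictor, the coefficients have a simple interpretation: the intercept is the mean of the baseline group, and the group coefficient is the difference between group means. A quick check:

mean(gene2)                 # the intercept: mean of the group coded 0
mean(gene1) - mean(gene2)   # the "group" coefficient: difference in group means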

# the same expression values with three groups encoded as a factor
gene.df = data.frame(exp = c(gene1, gene2, gene2),
                     group = c(rep("A", 30), rep("B", 30), rep("C", 30)))

mod3 = lm(exp ~ group, data = gene.df)
summary(mod3)
## 
## Call:
## lm(formula = exp ~ group, data = gene.df)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -4.7290 -1.0793 -0.0976  1.4844  4.5629 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)   4.0577     0.3781  10.731  < 2e-16 ***
## groupB       -1.8726     0.5348  -3.502 0.000732 ***
## groupC       -1.8726     0.5348  -3.502 0.000732 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 2.071 on 87 degrees of freedom
## Multiple R-squared:  0.1582, Adjusted R-squared:  0.1388 
## F-statistic: 8.174 on 2 and 87 DF,  p-value: 0.0005582
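Behind the scenes, R encodes the factor as dummy variables, using the first level (here "A") as the reference; a quick way to inspect the encoding:

unique(model.matrix(mod3))  # one row per group: A is the baseline, B and C get dummy columns

The reference level can be changed with relevel() if another baseline is preferred.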

3.4.7 Regression pitfalls

  • Non-linearity: the true relationship may not be a straight line.
  • Correlation of explanatory variables (collinearity), which makes coefficient estimates unstable.
  • Correlation of error terms, as in time series or repeated measurements.
  • Non-constant variance of error terms (heteroscedasticity).
  • Outliers and high leverage points, which can have an outsized effect on the fit.
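Several of these issues show up in R's built-in diagnostic plots for a fitted model; a minimal sketch using mod1 from above:

par(mfrow = c(2, 2))  # 2x2 grid for the four diagnostic plots
plot(mod1)            # residuals vs. fitted, Q-Q, scale-location, residuals vs. leverage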