3.4 Relationship between variables: Linear models and correlation
Examples:
- The expression of a particular gene in the liver in relation to the dosage of a drug that the patient receives.
- DNA methylation of a certain locus in the genome in relation to the age of the sample donor.
- The relationship between histone modifications and gene expression.
$$Y = \beta_0 + \beta_1 X + \epsilon$$

Approximation of the response:

$$Y \sim \beta_0 + \beta_1 X$$

Estimation of the coefficients:

$$Y = \hat{\beta}_0 + \hat{\beta}_1 X$$
More than one predictor:
$$Y = \beta_0 + \beta_1 X_1 + \beta_2 X_2 + \epsilon$$

$$Y = \beta_0 + \beta_1 X_1 + \beta_2 X_2 + \beta_3 X_3 + \dots + \beta_n X_n + \epsilon$$

In matrix notation, for four observations and two predictors:

$$Y = \begin{bmatrix} 1 & X_{1,1} & X_{1,2} \\ 1 & X_{2,1} & X_{2,2} \\ 1 & X_{3,1} & X_{3,2} \\ 1 & X_{4,1} & X_{4,2} \end{bmatrix} \begin{bmatrix} \beta_0 \\ \beta_1 \\ \beta_2 \end{bmatrix} + \begin{bmatrix} \epsilon_1 \\ \epsilon_2 \\ \epsilon_3 \\ \epsilon_4 \end{bmatrix}$$

This compactly encodes the four equations:

$$Y_1 = \beta_0 + \beta_1 X_{1,1} + \beta_2 X_{1,2} + \epsilon_1$$
$$Y_2 = \beta_0 + \beta_1 X_{2,1} + \beta_2 X_{2,2} + \epsilon_2$$
$$Y_3 = \beta_0 + \beta_1 X_{3,1} + \beta_2 X_{3,2} + \epsilon_3$$
$$Y_4 = \beta_0 + \beta_1 X_{4,1} + \beta_2 X_{4,2} + \epsilon_4$$

or, in compact form:

$$Y = X\beta + \epsilon$$
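In R, this design matrix, including the leading column of 1s for the intercept, can be built with the base function model.matrix(). A minimal sketch with made-up predictor values (the names X1 and X2 are illustrative only):

```r
# two hypothetical predictors measured on four observations
X1 = c(1.2, 0.5, 2.1, 1.7)
X2 = c(3.0, 1.1, 0.4, 2.2)
# design matrix: intercept column of 1s plus the predictors
model.matrix(~ X1 + X2)
```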
3.4.1 The cost or loss function approach
Minimize the sum of squared residuals via an optimization procedure:

$$\min \sum_{i} \left(y_i - (\beta_0 + \beta_1 x_i)\right)^2$$
The “gradient descent” algorithm:

1. Pick a random starting point: random β values.
2. Take the partial derivatives of the cost function to see which direction to move in the cost landscape.
3. Take a step in the direction that minimizes the cost function.
   - Step size is a parameter to choose; there are many variants.
4. Repeat steps 2 and 3 until convergence.

The algorithm usually converges to optimum β values.
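Below is a minimal gradient descent sketch for simple linear regression. The data simulation mirrors the example later in this section; the learning rate and iteration count are hand-picked assumptions, and too large a step size would make the updates diverge:

```r
set.seed(32)
x = runif(50, 1, 100)
y = 10 + 2*x + rnorm(50, 0, 20)

b0 = 0; b1 = 0        # step 1: starting beta values
learning.rate = 2e-4  # step size (assumed; must be small enough to converge)
for (i in 1:100000) {
  res = y - (b0 + b1*x)              # residuals at the current betas
  grad.b0 = -2 * mean(res)           # step 2: partial derivative w.r.t. b0
  grad.b1 = -2 * mean(res * x)       # step 2: partial derivative w.r.t. b1
  b0 = b0 - learning.rate * grad.b0  # step 3: move toward lower cost
  b1 = b1 - learning.rate * grad.b1
}
c(b0, b1)  # should be close to coef(lm(y ~ x))
```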

Figure 3.1: Cost function landscape for linear regression with changing beta values. The optimization process tries to find the lowest point in this landscape by implementing a strategy for updating beta values toward the lowest point in the landscape.
3.4.2 The “maximum likelihood” approach
The response variable $y_i$ follows a normal distribution with mean $\beta_0 + \beta_1 x_i$ and variance $s^2$. The task is to find the $\beta_0$ and $\beta_1$ that maximize the probability of observing all the response variables in the dataset given the explanatory variables. We assume constant variance $s^2$, which is our estimate of the population variance $\sigma^2$:

$$s^2 = \frac{\sum \epsilon_i^2}{n-2}$$

The probability of observing a response variable $y_i$:

$$P(y_i) = \frac{1}{s\sqrt{2\pi}} \, e^{-\frac{1}{2}\left(\frac{y_i - (\beta_0 + \beta_1 x_i)}{s}\right)^2}$$
The likelihood function:

$$L = P(y_1)P(y_2)P(y_3) \dots P(y_n) = \prod_{i=1}^{n} P(y_i)$$

Taking the log:

$$\ln(L) = -n \ln(s\sqrt{2\pi}) - \frac{1}{2s^2} \sum_{i=1}^{n} \left(y_i - (\beta_0 + \beta_1 x_i)\right)^2$$

The second term, $-\frac{1}{2s^2}\sum_{i=1}^{n}(y_i - (\beta_0 + \beta_1 x_i))^2$, is a scaled negative of the cost function, so maximizing the likelihood amounts to minimizing the sum of squared residuals. To optimize this function, we would need to take the derivative of the function with respect to the parameters.
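To make this concrete, the sketch below maximizes the log-likelihood numerically with optim() on simulated data. The starting values are arbitrary assumptions, and the standard deviation is parameterized on the log scale so it stays positive:

```r
set.seed(32)
x = runif(50, 1, 100)
y = 10 + 2*x + rnorm(50, 0, 20)

# negative log-likelihood of the data for parameters (b0, b1, log(s))
negLogLik = function(par) {
  -sum(dnorm(y, mean = par[1] + par[2]*x, sd = exp(par[3]), log = TRUE))
}
# minimizing the negative log-likelihood maximizes the likelihood
fit = optim(c(0, 1, log(10)), negLogLik)
c(b0 = fit$par[1], b1 = fit$par[2], s = exp(fit$par[3]))
```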
3.4.3 Linear algebra and closed-form solution to linear regression
$$\epsilon_i = Y_i - (\beta_0 + \beta_1 x_i)$$

$$\sum \epsilon_i^2 = \epsilon^T \epsilon = (Y - X\beta)^T (Y - X\beta)$$

$$= Y^T Y - 2\beta^T X^T Y + \beta^T X^T X \beta$$

Taking the derivative with respect to $\beta$ and setting it to zero gives the closed-form solution:

$$\hat{\beta} = (X^T X)^{-1} X^T Y$$

For simple regression with a single predictor, this reduces to:

$$\hat{\beta}_1 = \frac{\sum (x_i - \bar{X})(y_i - \bar{Y})}{\sum (x_i - \bar{X})^2}$$

$$\hat{\beta}_0 = \bar{Y} - \hat{\beta}_1 \bar{X}$$
# set the random number seed, so that the random numbers generated
# are the same each time you run the code
set.seed(32)
# get 50 X values between 1 and 100
x = runif(50,1,100)
# set b0, b1 and the standard deviation of the error terms (sigma)
b0 = 10
b1 = 2
sigma = 20
# simulate error terms from normal distribution
eps = rnorm(50,0,sigma)
# get y values from the linear equation and addition of error terms
y = b0 + b1*x+ eps
# fit a linear model with lm()
mod1=lm(y~x)
# plot the data points
plot(x,y,pch=20,
ylab="Gene Expression",xlab="Histone modification score")
# plot the linear fit
abline(mod1,col="blue")
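To verify the closed-form solution from above, we can compute $\hat{\beta} = (X^T X)^{-1} X^T Y$ directly on the simulated data; a minimal sketch, building the design matrix by hand with a column of 1s for the intercept:

```r
# build the design matrix: a column of 1s, then the x values
X = cbind(1, x)
# closed-form least-squares estimate: (X'X)^-1 X'y
beta.hat = solve(t(X) %*% X) %*% t(X) %*% y
beta.hat
# should agree with the coefficients estimated by lm()
coef(mod1)
```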
3.4.4 How to estimate the error of the coefficients

Figure 3.2: Regression coefficients vary with every random sample. The figure illustrates the variability of regression coefficients when regression is done using a sample of data points. Histograms depict this variability for b0 and b1 coefficients.
The Residual Standard Error (RSE) estimates $s$, the standard deviation of the error terms; for simple regression with two estimated coefficients:

$$s = RSE = \sqrt{\frac{\sum (y_i - \hat{Y}_i)^2}{n-2}}$$
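For the fit above, we can compute the RSE directly from the residuals (n = 50 observations, 2 estimated coefficients) and compare it to the value reported by summary() below:

```r
# RSE = sqrt(RSS / (n - 2))
sqrt(sum(residuals(mod1)^2) / (50 - 2))
```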
summary(mod1)
##
## Call:
## lm(formula = y ~ x)
##
## Residuals:
##    Min     1Q Median     3Q    Max 
## -38.50 -12.62   1.66  14.26  37.20 
##
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  0.40951    6.43208   0.064     0.95    
## x            2.08742    0.09775  21.355   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 16.96 on 48 degrees of freedom
## Multiple R-squared:  0.9048, Adjusted R-squared:  0.9028 
## F-statistic:   456 on 1 and 48 DF,  p-value: < 2.2e-16

# get confidence intervals for the coefficient estimates
confint(mod1)
##                  2.5 %    97.5 %
## (Intercept) -12.523065 13.342076
## x             1.890882  2.283952

# get the coefficient estimates
coef(mod1)
## (Intercept)           x 
##   0.4095054   2.0874167
3.4.5 Accuracy of the model
$$s = RSE = \sqrt{\frac{\sum (y_i - \hat{Y}_i)^2}{n-p}} = \sqrt{\frac{RSS}{n-p}}$$

$$R^2 = 1 - \frac{RSS}{TSS} = \frac{TSS - RSS}{TSS}$$

$$r_{xy} = \frac{cov(X,Y)}{\sigma_x \sigma_y} = \frac{\sum_{i=1}^{n}(x_i - \bar{x})(y_i - \bar{y})}{\sqrt{\sum_{i=1}^{n}(x_i - \bar{x})^2 \sum_{i=1}^{n}(y_i - \bar{y})^2}}$$

Figure 3.3: Correlation and covariance for different scatter plots.
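For simple linear regression, the squared correlation coefficient equals $R^2$, which we can check on the fit from earlier in this section:

```r
# correlation between the predictor and the response
cor(x, y)
# its square should match the R-squared of the model
cor(x, y)^2
summary(mod1)$r.squared
```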
To test whether the model as a whole explains anything, we compare it against the null hypothesis that all coefficients are zero:

$$H_0: \beta_1 = \beta_2 = \beta_3 = \dots = \beta_p = 0$$

$$H_1: \text{at least one } \beta_i \neq 0$$

The F-statistic for a linear model:

$$F = \frac{(TSS - RSS)/(p-1)}{RSS/(n-p)} = \frac{(TSS - RSS)/(p-1)}{RSE^2} \sim F(p-1,\, n-p)$$
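The same statistic can be computed by hand from RSS and TSS. A sketch for the single-predictor fit above (p = 2 coefficients, n = 50 observations):

```r
RSS = sum(residuals(mod1)^2)
TSS = sum((y - mean(y))^2)
# F = ((TSS - RSS)/(p - 1)) / (RSS/(n - p))
((TSS - RSS)/(2 - 1)) / (RSS/(50 - 2))
# compare to the F-statistic reported by summary()
summary(mod1)$fstatistic
```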
3.4.6 Regression with categorical variables
set.seed(100)
# simulate expression values for two genes:
# gene1 values come from a distribution with a higher mean
gene1=rnorm(30,mean=4,sd=2)
gene2=rnorm(30,mean=2,sd=2)
# combine the values in a data frame with a group variable,
# coding gene1 samples as 1 and gene2 samples as 0
gene.df=data.frame(exp=c(gene1,gene2),
                   group=c( rep(1,30),rep(0,30) ) )
# fit a linear model with the binary group variable as the predictor
mod2=lm(exp~group,data=gene.df)
summary(mod2)
##
## Call:
## lm(formula = exp ~ group, data = gene.df)
##
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -4.7290 -1.0664  0.0122  1.3840  4.5629 
##
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)   2.1851     0.3517   6.214 6.04e-08 ***
## group         1.8726     0.4973   3.765 0.000391 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 1.926 on 58 degrees of freedom
## Multiple R-squared:  0.1964, Adjusted R-squared:  0.1826 
## F-statistic: 14.18 on 1 and 58 DF,  p-value: 0.0003905
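In this two-group setting, the group coefficient is simply the difference between the two group means, which we can verify directly:

```r
# difference in group means equals the 'group' coefficient
mean(gene1) - mean(gene2)
coef(mod2)["group"]
```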
# now represent the same kind of data with three groups, A, B and C,
# where groups B and C reuse the gene2 values
gene.df=data.frame(exp=c(gene1,gene2,gene2),
                   group=c( rep("A",30),rep("B",30),rep("C",30) ) )
# fit a linear model with the multi-level categorical variable
mod3=lm(exp~group,data=gene.df)
summary(mod3)
##
## Call:
## lm(formula = exp ~ group, data = gene.df)
##
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -4.7290 -1.0793 -0.0976  1.4844  4.5629 
##
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)   4.0577     0.3781  10.731  < 2e-16 ***
## groupB       -1.8726     0.5348  -3.502 0.000732 ***
## groupC       -1.8726     0.5348  -3.502 0.000732 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 2.071 on 87 degrees of freedom
## Multiple R-squared:  0.1582, Adjusted R-squared:  0.1388 
## F-statistic: 8.174 on 2 and 87 DF,  p-value: 0.0005582
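By default, R takes the first factor level ("A" here) as the reference, so each coefficient is a difference from group A's mean. A sketch of how one might switch the reference level with relevel() (the model name mod4 is illustrative):

```r
# make group "B" the reference level and refit
gene.df$group = relevel(factor(gene.df$group), ref = "B")
mod4 = lm(exp ~ group, data = gene.df)
coef(mod4)  # coefficients are now differences from group B's mean
```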