PCA example
Borrowed from Introduction to Statistical Learning, USArrests is a dataset that is part of base R. The dataset is for crime statistics for the 50 US states in 1973. It contains arrests per 100,000 for three categories of crime and also the percentage of urban population.
## Murder Assault UrbanPop Rape
## 7.788 170.760 65.540 21.232
## Murder Assault UrbanPop Rape
## Murder 18.970465 291.0624 4.386204 22.99141
## Assault 291.062367 6945.1657 312.275102 519.26906
## UrbanPop 4.386204 312.2751 209.518776 55.76808
## Rape 22.991412 519.2691 55.768082 87.72916
The scale of these these is quite different, so lets center and scale the data:
According to the prescription, we should find the eigensystem:
## [1] 2.4802416 0.9897652 0.3565632 0.1734301
The largest eigenvalue (and hence first principle component) corresponds to the eigenvector:
## [1] -0.5358995 -0.5831836 -0.2781909 -0.5434321
verify its an eigenvector:
## [,1]
## Murder -0.5358995
## Assault -0.5831836
## UrbanPop -0.2781909
## Rape -0.5434321
Compare this to the prcomp method:
## PC1 PC2 PC3 PC4
## Murder -0.5358995 -0.4181809 0.3412327 0.64922780
## Assault -0.5831836 -0.1879856 0.2681484 -0.74340748
## UrbanPop -0.2781909 0.8728062 0.3780158 0.13387773
## Rape -0.5434321 0.1673186 -0.8177779 0.08902432
## [,1] [,2] [,3] [,4]
## [1,] -0.5358995 0.4181809 -0.3412327 0.64922780
## [2,] -0.5831836 0.1879856 -0.2681484 -0.74340748
## [3,] -0.2781909 -0.8728062 -0.3780158 0.13387773
## [4,] -0.5434321 -0.1673186 0.8177779 0.08902432
We could press on and compute the data coefficients manually, but in the interest of time, we can just use biplot
on the output from prcomp