12.5 The matrix decomposition

The svd() function computes the singular value decomposition (SVD) and returns three components: u, d, and v.

sX <- svd(X)
names(sX)
## [1] "d" "u" "v"
round(sX$v, 3)
##        [,1]   [,2]   [,3]   [,4]
## [1,] -0.536 -0.418  0.341  0.649
## [2,] -0.583 -0.188  0.268 -0.743
## [3,] -0.278  0.873  0.378  0.134
## [4,] -0.543  0.167 -0.818  0.089

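v holds the loadings, u holds the standardized scores (each column has unit length), and d is the vector of singular values, which are proportional to the standard deviations of the principal components (sdev = d / sqrt(n - 1)). Multiplying each column of u by the matching element of d recovers the scores: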

t(sX$d * t(sX$u)) %>% head
##            [,1]       [,2]        [,3]         [,4]
## [1,] -0.9756604 -1.1220012  0.43980366  0.154696581
## [2,] -1.9305379 -1.0624269 -2.01950027 -0.434175454
## [3,] -1.7454429  0.7384595 -0.05423025 -0.826264240
## [4,]  0.1399989 -1.1085423 -0.11342217 -0.180973554
## [5,] -2.4986128  1.5274267 -0.59254100 -0.338559240
## [6,] -1.4993407  0.9776297 -1.08400162  0.001450164
pcob$x %>% head
##             PC1        PC2         PC3          PC4
## [1,] -0.9756604 -1.1220012  0.43980366  0.154696581
## [2,] -1.9305379 -1.0624269 -2.01950027 -0.434175454
## [3,] -1.7454429  0.7384595 -0.05423025 -0.826264240
## [4,]  0.1399989 -1.1085423 -0.11342217 -0.180973554
## [5,] -2.4986128  1.5274267 -0.59254100 -0.338559240
## [6,] -1.4993407  0.9776297 -1.08400162  0.001450164
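The same kind of check works for the loadings and the standard deviations. This is a minimal sketch that assumes X is the centered and scaled USArrests matrix and that pcob is the result of prcomp(X); both are defined outside this section, so they are rebuilt here as assumptions. Loadings are compared in absolute value because the sign of each component is arbitrary.

```r
# Assumption: X and pcob mirror the objects used in these notes.
X <- scale(data.matrix(USArrests))
pcob <- prcomp(X)
sX <- svd(X)

# Loadings: prcomp's rotation matrix matches v (up to sign).
loadings_match <- all.equal(abs(unname(pcob$rotation)), abs(sX$v))

# Standard deviations: singular values divided by sqrt(n - 1).
sdev_from_svd <- sX$d / sqrt(nrow(X) - 1)
sdev_match <- all.equal(pcob$sdev, sdev_from_svd)

# The three pieces multiply back to X: X = u %*% diag(d) %*% t(v).
X_rebuilt <- sX$u %*% diag(sX$d) %*% t(sX$v)
rebuild_match <- all.equal(X_rebuilt, X, check.attributes = FALSE)

c(loadings = isTRUE(loadings_match),
  sdev = isTRUE(sdev_match),
  rebuild = isTRUE(rebuild_match))
```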

12.5.1 Matrix Completion

Sometimes you want to fill in NAs intelligently.

Technique

  • Start with mean imputation per column.
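  • Compute a PCA (a low-rank approximation) of the filled-in matrix and use it to re-impute the missing values.
  • Recompute the PCA on the updated matrix and repeat until the fit stops improving.
  • In the lab they actually call svd() (singular-value decomposition) directly rather than prcomp(), which calls svd() internally, to show more clearly what’s happening.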
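The steps above can be sketched in base R. This is an illustrative version, not the lab's exact code: the helper names low_rank_approx() and impute_svd(), the default rank of 1, and the tolerance are all assumptions made for the example. Unlike the softImpute call later in these notes, the data is scaled first because this loop does no scaling of its own.

```r
# Hypothetical helper: rank-`rank` approximation of a complete matrix.
low_rank_approx <- function(x, rank = 1) {
  s <- svd(x)
  idx <- seq_len(rank)
  # Scale the rows of t(v) by the singular values, then multiply by u.
  s$u[, idx, drop = FALSE] %*% (s$d[idx] * t(s$v[, idx, drop = FALSE]))
}

# Hypothetical helper: iterative SVD imputation.
impute_svd <- function(x_na, rank = 1, tol = 1e-7, max_iter = 100) {
  is_missing <- is.na(x_na)

  # Step 1: start with mean imputation per column.
  col_means <- colMeans(x_na, na.rm = TRUE)
  x_hat <- x_na
  x_hat[is_missing] <- col_means[col(x_na)[is_missing]]

  observed_ms <- mean(x_na[!is_missing]^2)
  err_old <- Inf
  for (iter in seq_len(max_iter)) {
    # Step 2: approximate the filled-in matrix and re-impute the gaps.
    x_app <- low_rank_approx(x_hat, rank)
    x_hat[is_missing] <- x_app[is_missing]

    # Step 3: stop once the error on the observed entries stops improving.
    err <- mean((x_na - x_app)[!is_missing]^2)
    if ((err_old - err) / observed_ms < tol) break
    err_old <- err
  }
  x_hat
}

# Usage on scaled USArrests with 20 entries removed at random.
set.seed(15)
x <- scale(data.matrix(USArrests))
x_na <- x
x_na[cbind(sample(50, 20), sample(1:4, 20, replace = TRUE))] <- NA
x_filled <- impute_svd(x_na, rank = 1)
cor(x_filled[is.na(x_na)], x[is.na(x_na)])
```

Only the missing cells are ever overwritten, so the observed values pass through unchanged.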

Set up

  • First we set up a matrix with missing values.
  • The code for this is in the book and not particularly interesting, but I’ve made the names suck less.
  • I also don’t scale here; {softImpute} provides biScale() for centering and scaling, and complete(..., unscale = TRUE) only undoes scaling that biScale() applied.
arrests <- data.matrix(USArrests)

n_omit <- 20
set.seed(15)
target_rows <- sample(seq(50), n_omit)
target_cols <- sample(1:4, n_omit, replace = TRUE)
targets <- cbind(target_rows, target_cols)
head(targets, 2)
##      target_rows target_cols
## [1,]          37           3
## [2,]          47           1
arrests_na <- arrests
arrests_na[targets] <- NA
head(arrests_na, 2)
##      state Murder Assault UrbanPop Rape
## [1,]     1     NA     236       58 21.2
## [2,]     2     10      NA       48 44.5
is_missing <- is.na(arrests_na)
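The assignment arrests_na[targets] <- NA relies on indexing a matrix with a two-column matrix of (row, column) pairs, which picks out individual cells rather than whole rows or columns. A tiny sketch on a made-up matrix (m and idx are illustrative names only):

```r
m <- matrix(1:12, nrow = 3)
idx <- cbind(c(1, 3), c(2, 4))  # cells (1, 2) and (3, 4)
m[idx] <- NA

# Logical mask of the blanked cells, as with is_missing above.
m_missing <- is.na(m)
which(m_missing, arr.ind = TRUE)
```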

The {softImpute} package does this for us, so let’s use it!

fit_svd <- softImpute::softImpute(
  arrests_na, 
  type = "svd",
  thresh = 1e-16,
  maxit = 3000
)
arrests_imputed <- softImpute::complete(arrests_na, fit_svd, unscale = TRUE)
cor(arrests_imputed[is_missing], arrests[is_missing])
## [1] 0.7249977