12.2 Principal component analysis

PCA learning objectives

  • Understand principal component analysis (PCA)
  • Make a visualization using PCA
  • Get an introduction to matrix decomposition

12.2.1 What are the steps to principal component analysis?

Principal component analysis (PCA) is an unsupervised dimensionality reduction technique. Its main steps are as follows (a short code sketch is given after the list):

  1. Normalize the variables: $z=\frac{x-\bar{x}}{sd(x)}$
  2. Select the components with the highest variance
  3. Build the reduced variables as linear combinations of the original ones
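
Below is a minimal sketch of these three steps using scikit-learn; the synthetic data matrix, the choice of two components, and the use of `StandardScaler` and `PCA` are assumptions made for illustration, not part of the original notes.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler
from sklearn.decomposition import PCA

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 5))            # hypothetical n x p data matrix (n=100, p=5)

# 1. Normalize the variables: z = (x - mean) / sd
Z = StandardScaler().fit_transform(X)

# 2.-3. Keep the components with the highest variance and build the reduced
#       variables as linear combinations of the original ones.
pca = PCA(n_components=2)
scores = pca.fit_transform(Z)            # n x M matrix of principal component scores
print(pca.explained_variance_ratio_)     # share of variance retained by each component
```

The scores (first column against second) can then be plotted to obtain the usual two-dimensional PCA visualization.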

Throughout we use the following indices, with $M<p$:

  • $i=1,\dots,n$ indexes the observations
  • $j=1,\dots,p$ indexes the original features
  • $m=1,\dots,M$ indexes the principal components

Our starting point is an $n\times p$ dataset $X$ made of a series of features $X_j$, with $j=1,\dots,p$:

$$X_{n,p}=\begin{pmatrix}x_{1,1} & x_{1,2} & \cdots & x_{1,p}\\ x_{2,1} & x_{2,2} & \cdots & x_{2,p}\\ \vdots & \vdots & \ddots & \vdots\\ x_{n,1} & x_{n,2} & \cdots & x_{n,p}\end{pmatrix}$$

Each row is one observation, containing one value for each of the features:

$$X_1,X_2,\dots,X_p$$

A linear combination of the $X_j$ with coefficients $\beta_{j1}$, $j=1,\dots,p$, is shown here just for the first observation ($i=1$):

$$\beta_{11}x_{i1}+\beta_{21}x_{i2}+\dots+\beta_{p1}x_{ip}$$

This is a linear combination of the original data.
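
As a toy illustration, the NumPy snippet below evaluates such a linear combination for a single observation; the numbers in `x_1` and the coefficient vector `beta` are made up for the example.

```python
import numpy as np

x_1 = np.array([1.2, -0.7, 3.1])     # hypothetical first observation (p = 3 features)
beta = np.array([0.5, 0.5, -0.2])    # hypothetical coefficients beta_11, ..., beta_p1

# beta_11 * x_11 + beta_21 * x_12 + ... + beta_p1 * x_1p
combo = float(np.dot(beta, x_1))
print(combo)                         # 0.5*1.2 + 0.5*(-0.7) - 0.2*3.1 = -0.37
```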

The first step, as noted above, is the normalization of the observations: $z=\frac{x-\bar{x}}{sd(x)}$
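
A minimal NumPy sketch of this standardization, assuming the columns of `X` are the variables (the data here are synthetic):

```python
import numpy as np

rng = np.random.default_rng(1)
X = rng.normal(loc=5.0, scale=2.0, size=(100, 4))   # hypothetical raw data, n x p

# z = (x - mean) / sd, applied column by column
Z = (X - X.mean(axis=0)) / X.std(axis=0, ddof=1)

print(Z.mean(axis=0).round(6))    # ~0 for every column
print(Z.std(axis=0, ddof=1))      # 1 for every column
```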

What we want is to visualize our data in a reduced-dimensional space while extracting as much of the information in the original data as possible.

In order to do this, we reduce the $p$ original features to a smaller number $M$, with $M<p$, chosen so that the reduced representation is still representative of the variance in our data.

Our new set of variables $Z_m$ is made of $M$ features, i.e. $m=1,\dots,M$:

$$Z_1,Z_2,\dots,Z_M$$

Each $Z_m$ is a new linear combination of the original features:

$$Z_m=\sum_{j=1}^p\phi_{jm}X_j=\phi_{1m}X_1+\phi_{2m}X_2+\dots+\phi_{pm}X_p$$
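
In matrix form this is simply $Z=X\Phi$. The NumPy sketch below computes the scores this way; the loading matrix `Phi` is taken from an SVD of the standardized data purely for illustration (the optimization problem that actually defines the loadings is introduced further down).

```python
import numpy as np

rng = np.random.default_rng(2)
X = rng.normal(size=(100, 4))
Z_std = (X - X.mean(axis=0)) / X.std(axis=0, ddof=1)    # standardized data

# One way to obtain a loading matrix Phi: the right singular vectors of the data.
_, _, Vt = np.linalg.svd(Z_std, full_matrices=False)
Phi = Vt.T                                              # p x p matrix of loadings phi_jm

scores = Z_std @ Phi                                    # Z_m = sum_j phi_jm X_j, all m at once
print(scores.shape)                                     # (n, p); keep the first M columns to reduce
```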

This also yields an approximation of the original features:

$$x_{ij}\approx\sum_{m=1}^M z_{im}\phi_{jm}$$

These combinations are computed on the normalized data $z=\frac{x-\bar{x}}{sd(x)}$, and the vector of coefficients $(\phi_{11},\phi_{21},\dots,\phi_{p1})^T$ is called the vector of loadings. In particular, when $M=\min(n-1,p)$, the approximation is exact: $x_{ij}=\sum_{m=1}^M z_{im}\phi_{jm}$.
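
Continuing in the same style, the reconstruction can be checked numerically; `M` and the synthetic data are again only illustrative.

```python
import numpy as np

rng = np.random.default_rng(3)
X = rng.normal(size=(50, 6))
Z_std = (X - X.mean(axis=0)) / X.std(axis=0, ddof=1)

_, _, Vt = np.linalg.svd(Z_std, full_matrices=False)
Phi = Vt.T                               # loadings phi_jm
scores = Z_std @ Phi                     # principal component scores z_im

M = 2
approx = scores[:, :M] @ Phi[:, :M].T    # x_ij ~ sum_{m<=M} z_im phi_jm
print(np.abs(Z_std - approx).max())      # approximation error with M = 2 components

# With M = min(n-1, p) = 6 components the reconstruction is exact (up to rounding).
print(np.allclose(Z_std, scores @ Phi.T))    # True
```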

We constrain these loadings so that their sum of squares is equal to one:

$$\sum_{j=1}^p\phi_{j1}^2=1$$

In terms of the Euclidean distance:

“The first M principal component score vectors and the first M principal component loading vectors provide the best M-dimensional approximation (in terms of Euclidean distance) to the ith observation $x_{ij}$.”

That is, the scores $z_{im}$ and loadings $\phi_{jm}$ are the values that minimize

$$\min_{\phi_{jm},\,z_{im}}\left\{\sum_{j=1}^p\sum_{i=1}^n\left(x_{ij}-\sum_{m=1}^M z_{im}\phi_{jm}\right)^2\right\}$$

The first principal component

Since we are only interested in the variance, and each variable has been centered to have mean zero, the total variance is defined as:

$$\sum_{j=1}^p Var(X_j)=\sum_{j=1}^p\frac{1}{n}\sum_{i=1}^n x_{ij}^2$$
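
A quick numerical check of this identity on illustrative standardized data (for standardized variables the total variance is simply $p$):

```python
import numpy as np

rng = np.random.default_rng(4)
X = rng.normal(size=(200, 5))
Z = (X - X.mean(axis=0)) / X.std(axis=0)    # centered and scaled with 1/n, matching the formula

total_var = (Z ** 2).sum() / Z.shape[0]     # sum_j (1/n) sum_i x_ij^2
print(total_var)                            # 5.0 = p for standardized variables
```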

The sample variance of the $m$th principal component scores is:

$$\frac{1}{n}\sum_{i=1}^nz_{im}^2=\frac{1}{n}\sum_{i=1}^n\left(\sum_{j=1}^p\phi_{jm}x_{ij}\right)^2$$

The first principal component loading vector is the one that maximizes this quantity for $m=1$:

$$\max_{\phi_{11},\phi_{21},\dots,\phi_{p1}}\left\{\frac{1}{n}\sum_{i=1}^nz_{i1}^2\right\}=\max_{\phi_{11},\phi_{21},\dots,\phi_{p1}}\left\{\frac{1}{n}\sum_{i=1}^n\left(\sum_{j=1}^p\phi_{j1}x_{ij}\right)^2\right\}\quad\text{subject to}\quad\sum_{j=1}^p\phi_{j1}^2=1$$

The calculation of this objective goes through the eigenvalue decomposition; an alternative is the singular value decomposition.
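
The sketch below solves the problem both ways, through the eigendecomposition of the covariance matrix and through the SVD of the standardized data; the correlated synthetic data are made up for the example.

```python
import numpy as np

rng = np.random.default_rng(5)
X = rng.normal(size=(300, 4)) @ rng.normal(size=(4, 4))    # correlated synthetic data
Z = (X - X.mean(axis=0)) / X.std(axis=0)

# Route 1: eigendecomposition of the covariance matrix
cov = Z.T @ Z / Z.shape[0]
eigvals, eigvecs = np.linalg.eigh(cov)
phi1_eig = eigvecs[:, -1]                  # eigenvector with the largest eigenvalue

# Route 2: singular value decomposition of the data matrix
_, _, Vt = np.linalg.svd(Z, full_matrices=False)
phi1_svd = Vt[0]

print(np.sum(phi1_eig ** 2))                               # constraint: sum of squared loadings = 1
print(np.allclose(np.abs(phi1_eig), np.abs(phi1_svd)))     # same direction, up to sign
print(np.var(Z @ phi1_eig), eigvals[-1])                   # variance of the first scores = largest eigenvalue
```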

How much of the information in a given data set is lost by projecting the observations onto the first few principal components?

How much of the variance in the data is not contained in the first few principal components?

To answer these questions we need to consider the proportion of variance explained (PVE).
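
A sketch of the PVE computation on the same kind of illustrative data: the variance of each score vector divided by the total variance.

```python
import numpy as np

rng = np.random.default_rng(6)
X = rng.normal(size=(300, 4)) @ rng.normal(size=(4, 4))    # correlated synthetic data
Z = (X - X.mean(axis=0)) / X.std(axis=0)

_, _, Vt = np.linalg.svd(Z, full_matrices=False)
scores = Z @ Vt.T                                          # all principal component scores

total_var = (Z ** 2).sum() / Z.shape[0]                    # total variance (= p here)
pve = (scores ** 2).sum(axis=0) / Z.shape[0] / total_var   # variance of each component / total

print(pve)              # proportion of variance explained by each component
print(pve.cumsum())     # cumulative PVE, reaching 1 with all p components
```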

By maximizing the variance of the first $M$ principal components, we minimize the mean squared error of the $M$-dimensional approximation, and vice versa.

In conclusion, principal component analysis can be framed either as minimizing the approximation error or as maximizing the variance; the two views are equivalent.