12.2 Principal component analysis
PCA learning objectives
- Understand principal component analysis (PCA)
- Make a visualization using PCA
- Introduction to matrix decomposition
12.2.1 What are the steps to principal component analysis?
Principal component analysis (PCA) is an unsupervised dimensionality reduction technique. Its main steps, sketched in code right after this list, are:
- normalize the variables (\(z=\frac{x-\bar x}{sd(x)}\))
- select the components with the highest variance
- express the reduced set of variables as linear combinations of the original features
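A minimal end-to-end sketch of these steps using scikit-learn; the toy data, the variable names, and the choice of two components are illustrative assumptions, not part of the original text.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler
from sklearn.decomposition import PCA

# Toy data: n = 100 observations, p = 5 features (illustrative only)
rng = np.random.default_rng(0)
X = rng.normal(size=(100, 5))

# Step 1: normalize each variable, z = (x - mean) / sd
Z = StandardScaler().fit_transform(X)

# Steps 2-3: keep the components with the highest variance and
# express them as linear combinations of the original features
pca = PCA(n_components=2)
scores = pca.fit_transform(Z)          # principal component scores z_im
loadings = pca.components_.T           # loading vectors phi_jm, shape (p, M)

print(scores.shape)                    # (100, 2)
print(pca.explained_variance_ratio_)   # proportion of variance explained
```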
Throughout, we use the following indices, with \(M < p\):
- \(i = 1,...,n\) for the observations
- \(j = 1,...,p\) for the original features
- \(m = 1,...,M\) for the principal components
This is our starting point: an \(n \times p\) dataset \(X\) made of a series of features \(X_j\), with \(j = 1,...,p\):
\[ X_{n,p} = \begin{pmatrix} x_{1,1} & x_{1,2} & \cdots & x_{1,p} \\ x_{2,1} & x_{2,2} & \cdots & x_{2,p} \\ \vdots & \vdots & \ddots & \vdots \\ x_{n,1} & x_{n,2} & \cdots & x_{n,p} \end{pmatrix} \]
Each row is one observation, containing one value for each of the features
\[X_{1},X_{2},...,X_{p}\]
A linear combination of the \(X_j\) with coefficients \(\beta_{j1}\), \(j = 1,...,p\), shown here for the first component and a generic observation \(i\), is:
\[\beta_{11}x_{i1}+\beta_{21}x_{i2}+...+\beta_{p1}x_{ip}\]
This is a linear combination of the original data.
The first step, as noted above, is the normalization of the observations: \[z=\frac{x-\bar x}{sd(x)}\]
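A sketch of this step in plain NumPy (the use of the sample standard deviation, `ddof=1`, is an assumption):

```python
import numpy as np

def standardize(X):
    """Column-wise z-scores: z = (x - mean(x)) / sd(x)."""
    return (X - X.mean(axis=0)) / X.std(axis=0, ddof=1)

# Z = standardize(X)  # every column now has mean 0 and standard deviation 1
```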
What we want is to visualize our data in a reduced-dimensional space while retaining as much of the information in the original data as possible.
To do this, we reduce the number of dimensions from \(p\) to a lower value \(M < p\), obtaining a representation that is still representative of the variance in our data.
Our new set of variables is made of \(M\) principal components \(Z_m\), with \(m = 1,...,M\):
\[Z_1, Z_2, ..., Z_M\]
Each component is a linear combination of the original features:
\[Z_m = \sum_{j=1}^p{\phi_{jm}X_j}=\phi_{1m}X_1+\phi_{2m}X_2+...+\phi_{pm}X_p\]
And so, an approximation of the original features:
\[x_{ij}\approx\sum_{m=1}^Mz_{im}\phi_{jm}\]
Working with the normalized data \(z=\frac{x-\bar x}{sd(x)}\), each component has a vector of coefficients called loadings; for the first component this is \((\phi_{11},\phi_{21},...,\phi_{p1})^T\). In particular, when \(M=\min(n-1,p)\) the approximation becomes exact: \(x_{ij}=\sum_{m=1}^Mz_{im}\phi_{jm}\).
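A sketch of this reconstruction with scikit-learn (the toy data and variable names are assumptions): the first \(M\) scores and loadings give an approximation of the standardized data, and with \(M=\min(n-1,p)\) the reconstruction is exact.

```python
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(1)
Z = rng.normal(size=(50, 4))
Z = (Z - Z.mean(axis=0)) / Z.std(axis=0, ddof=1)    # standardized data

M = 2
pca = PCA(n_components=M).fit(Z)
scores = pca.transform(Z)             # z_im, shape (n, M)
loadings = pca.components_.T          # phi_jm, shape (p, M)

# x_ij ~ sum_m z_im * phi_jm (plus the column means, here ~0)
Z_approx = scores @ loadings.T + pca.mean_
print(np.mean((Z - Z_approx) ** 2))   # mean squared approximation error with M components

# With M = min(n - 1, p) the reconstruction is exact
M_full = min(Z.shape[0] - 1, Z.shape[1])
pca_full = PCA(n_components=M_full).fit(Z)
Z_exact = pca_full.transform(Z) @ pca_full.components_ + pca_full.mean_
print(np.allclose(Z_exact, Z))        # True
```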
We constrain these loadings so that their sum of squares equals one:
\[\sum_{j=1}^p{\phi_{j1}}^2=1\]
This approximation is optimal in terms of Euclidean distance:
“The first M principal component score vectors and the first M principal component loading vectors provide the best M-dimensional approximation (in terms of Euclidean distance) to the ith observation \(x_{ij}\).”
\[\min\left \{\sum_{j=1}^p\sum_{i=1}^n\left(x_{ij}-\sum_{m=1}^Mz_{im}\phi_{jm}\right)^2\right \}\]
The first principal component
Since we are only interested in variance, the total variance (assuming the variables have been centered to have mean zero) is defined as:
\[\sum_{j=1}^pVar(X_j)=\sum_{j=1}^p\frac{1}{n}\sum_{i=1}^nx_{ij}^2\]
The sample variance explained by the \(m\)th principal component is:
\[\frac{1}{n}\sum_{i=1}^nz_{im}^2=\frac{1}{n}\sum_{i=1}^n\left(\sum_{j=1}^p\phi_{jm}x_{ij}\right)^2\]
The first principal component loadings are chosen to maximize this variance:
\[\max_{\phi_{11},\phi_{21},...,\phi_{p1}}\left \{\frac{1}{n}\sum_{i=1}^nz_{i1}^2\right \}\]
or, equivalently,
\[\max_{\phi_{11},\phi_{21},...,\phi_{p1}}\left \{\frac{1}{n}\sum_{i=1}^n\left(\sum_{j=1}^p\phi_{j1}x_{ij}\right)^2\right \}\]
\[\text{subject to } \sum_{j=1}^p\phi_{j1}^2=1\]
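A small numerical check of this constrained maximization (the toy data and the comparison against random unit-norm directions are illustrative assumptions): the first loading vector satisfies the unit-norm constraint, and no other direction yields a larger score variance.

```python
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(2)
X = rng.normal(size=(200, 3)) @ rng.normal(size=(3, 3))   # correlated toy data
X = (X - X.mean(axis=0)) / X.std(axis=0, ddof=1)

def score_variance(X, phi):
    """(1/n) * sum_i (sum_j phi_j * x_ij)^2 for a direction phi."""
    return np.mean((X @ phi) ** 2)

# First principal component loading vector phi_1
phi1 = PCA(n_components=1).fit(X).components_[0]
print(round(np.sum(phi1 ** 2), 6))          # 1.0: the unit-norm constraint holds

# No random unit-norm direction explains more variance than phi_1
random_dirs = rng.normal(size=(1000, X.shape[1]))
random_dirs /= np.linalg.norm(random_dirs, axis=1, keepdims=True)
best_random = max(score_variance(X, d) for d in random_dirs)
print(score_variance(X, phi1) >= best_random)   # True
```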
This optimization problem is solved via an eigenvalue decomposition of the covariance matrix; an equivalent alternative is the singular value decomposition (SVD) of the data matrix.
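A sketch of both routes in NumPy (assuming standardized toy data): the eigenvectors of the covariance matrix and the right singular vectors of the data matrix give the same loading vectors, up to sign.

```python
import numpy as np

rng = np.random.default_rng(3)
X = rng.normal(size=(100, 4))
X = (X - X.mean(axis=0)) / X.std(axis=0, ddof=1)
n = X.shape[0]

# Route 1: eigenvalue decomposition of the covariance matrix
eigvals, eigvecs = np.linalg.eigh(X.T @ X / n)
loadings_evd = eigvecs[:, ::-1]              # columns sorted by decreasing variance

# Route 2: singular value decomposition of the data matrix
U, S, Vt = np.linalg.svd(X, full_matrices=False)
loadings_svd = Vt.T                          # right singular vectors

# Same loadings up to a sign flip per component
print(np.allclose(np.abs(loadings_evd), np.abs(loadings_svd)))   # True

# The variances agree as well: eigenvalues == singular values^2 / n
print(np.allclose(np.sort(eigvals)[::-1], S ** 2 / n))           # True
```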
How much of the information in a given data set is lost by projecting the observations onto the first few principal components?
How much of the variance in the data is not contained in the first few principal components?
To answer these questions we need to consider the proportion of variance explained (PVE).
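For centered data, the PVE of the \(m\)th component is the variance of its scores divided by the total variance:
\[PVE_m=\frac{\sum_{i=1}^n z_{im}^2}{\sum_{j=1}^p\sum_{i=1}^n x_{ij}^2}\]
The sketch below (toy data assumed) computes it both by hand and through scikit-learn's `explained_variance_ratio_`.

```python
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(4)
X = rng.normal(size=(80, 5))
X = (X - X.mean(axis=0)) / X.std(axis=0, ddof=1)   # standardized data

pca = PCA().fit(X)
scores = pca.transform(X)

# PVE_m = sum_i z_im^2 / sum_{i,j} x_ij^2
pve = (scores ** 2).sum(axis=0) / (X ** 2).sum()
print(np.allclose(pve, pca.explained_variance_ratio_))   # True
print(pve.cumsum())   # cumulative PVE: how much the first few components retain
```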
By maximizing the variance of the first \(M\) principal components, we minimize the mean squared error of the \(M\)-dimensional approximation, and vice versa.
In conclusion, principal component analysis can be framed equivalently as minimizing the approximation error or maximizing the variance.
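Tying back to the visualization objective above, a minimal sketch that plots the first two principal component scores of the classic iris data (the dataset choice and the matplotlib styling are assumptions):

```python
import matplotlib.pyplot as plt
from sklearn.datasets import load_iris
from sklearn.preprocessing import StandardScaler
from sklearn.decomposition import PCA

# Standardize, then project onto the first two principal components
iris = load_iris()
Z = StandardScaler().fit_transform(iris.data)
scores = PCA(n_components=2).fit_transform(Z)

# Scatter plot of the scores, colored by species
plt.scatter(scores[:, 0], scores[:, 1], c=iris.target, cmap="viridis", s=20)
plt.xlabel("First principal component")
plt.ylabel("Second principal component")
plt.title("PCA projection of the iris data")
plt.show()
```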