Next we need to compute the covariance matrix for the four random variables in the normalised Fisher's Iris dataset (sepal width, sepal length, petal width and petal length), in order to infer how strongly the variables are related to one another.
You can compute all of the pairwise covariances to the right; the covariance matrix above is then populated with them.
We can quantify how 'spread out' data is. A natural way to do this is to look at the mean and measure how much each of the datapoints deviates from it. This forms the basis of how we compute the standard deviation σ and variance σ²
of a dataset with a single independent variable.
$$\sigma = \sqrt {\frac{\sum_{i=1}^n (X_i - \bar X)^2}{n}}$$
When dealing with sampled data we divide by n - 1 instead, in order to construct an unbiased estimator of the variance.
$$s = \sqrt {\frac{\sum_{i=1}^n (X_i - \bar X)^2}{n - 1}}$$
This unbiased estimator of the variance, s², is exactly the idea that carries over to the covariances which populate the covariance matrix we are about to construct.
Play around with the slider to see how the distribution of data on the right changes as you change the variance!
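Before moving on to covariance, here is a minimal sketch in NumPy of the two estimators above, using made-up numbers rather than the actual dataset; the ddof argument is how NumPy switches between dividing by n and by n - 1.

```python
import numpy as np

# Made-up measurements, purely for illustration.
x = np.array([5.1, 4.9, 4.7, 4.6, 5.0])

n = len(x)
mean = x.sum() / n

sigma = np.sqrt(((x - mean) ** 2).sum() / n)        # population form: divide by n
s     = np.sqrt(((x - mean) ** 2).sum() / (n - 1))  # sample form: divide by n - 1

# NumPy exposes the same choice through the ddof ("delta degrees of freedom") argument.
assert np.isclose(sigma, np.std(x, ddof=0))
assert np.isclose(s,     np.std(x, ddof=1))
```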
Covariance is a measure of the joint variability of two random variables, and so is closely analogous to variance. The sign of the covariance
between two jointly distributed random variables describes the direction of their linear relationship: if the covariance is positive then, as one variable
increases, so too does the other.
For two jointly distributed random variables X and Y, where E[X] and E[Y] denote the 'expectation' of these variables,
the covariance is defined as follows:
$$\operatorname{cov}(X, Y) = E[(X-E[X])(Y-E[Y])]$$
We can do some expectation algebra to reduce it to a more easily computable form.
$$E[(X - E[X])(Y - E[Y])]
\\= E[XY - XE[Y] - YE[X] + E[X]E[Y]]
\\= E[XY] - E[X]E[Y] - E[X]E[Y] + E[X]E[Y]
\\= E[XY] - E[X]E[Y]
$$
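As a quick numerical check of that identity, here is a short NumPy sketch using arbitrary simulated data (not the Iris measurements):

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.normal(size=1000)
y = 0.5 * x + rng.normal(size=1000)

# Definition: E[(X - E[X])(Y - E[Y])]
cov_definition = np.mean((x - x.mean()) * (y - y.mean()))
# Reduced form: E[XY] - E[X]E[Y]
cov_reduced = np.mean(x * y) - x.mean() * y.mean()

assert np.isclose(cov_definition, cov_reduced)
```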
So, when we compute the covariance matrix up above, what we are really doing is computing this for each and every pair
of random variables, e.g. petal length and sepal width, and populating the matrix appropriately.
We actually use some linear algebra to compute the covariance matrix KXY directly as follows:
$$
K_{XY} = cov(X, Y) = E[(X - \mu_X)(Y - \mu_Y)^T]
\\\mu_X = E[X]
\\\mu_Y = E[Y]
$$
E[X] and E[Y] contain the expected values of the components of X and Y respectively.
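A sketch of this step in NumPy, assuming the normalisation used here is z-scoring and loading the Iris data through scikit-learn's load_iris (the loading and normalisation in this article may differ in detail):

```python
import numpy as np
from sklearn.datasets import load_iris

# Assumed normalisation: z-score each of the 4 features of the 150 x 4 Iris matrix.
X = load_iris().data
X_norm = (X - X.mean(axis=0)) / X.std(axis=0)

# Population covariance matrix E[(X - mu)(X - mu)^T], with observations as rows.
n = X_norm.shape[0]
centred = X_norm - X_norm.mean(axis=0)
C = centred.T @ centred / n

# np.cov with bias=True uses the same 1/n convention; rowvar=False treats columns as variables.
assert np.allclose(C, np.cov(X_norm, rowvar=False, bias=True))
print(C.round(2))   # the 4 x 4 covariance matrix of the normalised data
```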
Intuitively speaking, an eigenvector is a vector whose direction remains unchanged when a linear transformation is applied to it; it is only scaled.
i.e. let C be our square covariance matrix, ν a vector and λ a scalar satisfying Cν = λν; then λ is an eigenvalue associated with the eigenvector ν of C.
We can therefore compute the eigenvalues by solving the characteristic equation below for λ:
$$
\det(C - \lambda I_n) = 0
$$
For each eigenvalue λ we can then compute its eigenvector(s) ν by Gaussian elimination.
$$
(C - \lambda I_n)\cdot \nu = 0
$$
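In practice we would not solve the characteristic polynomial by hand. A sketch using NumPy's symmetric eigensolver, under the same assumptions as the previous snippet (z-scored Iris data and its population covariance matrix C):

```python
import numpy as np
from sklearn.datasets import load_iris

# Same assumed setup as before: z-scored Iris data and its population covariance matrix.
X = load_iris().data
X_norm = (X - X.mean(axis=0)) / X.std(axis=0)
C = np.cov(X_norm, rowvar=False, bias=True)

# eigh is the appropriate routine since C is symmetric; columns of vecs are the eigenvectors.
vals, vecs = np.linalg.eigh(C)

# Each pair satisfies C @ v = lam * v ...
for lam, v in zip(vals, vecs.T):
    assert np.allclose(C @ v, lam * v)

# ... and each lam is (numerically) a root of det(C - lam * I) = 0.
for lam in vals:
    assert abs(np.linalg.det(C - lam * np.eye(4))) < 1e-9
```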
We started with the goal of reducing the dimensionality of our feature space,
i.e., projecting the feature space via PCA onto a smaller subspace, where the eigenvectors will form the axes of this new feature subspace.
In order to decide which eigenvectors we want to drop for our lower-dimensional subspace, we have to take a look at the corresponding eigenvalues of the eigenvectors.
The eigenvectors with the lowest eigenvalues bear the least information about the distribution of the data, so we rank the eigenvectors by their corresponding
eigenvalues, in descending order.
Ordered eigenvectors from Fisher's Iris dataset:
$$
\begin{bmatrix}
-0.52 \\
0.26 \\
-0.57 \\
-0.52 \\
\end{bmatrix},
\begin{bmatrix}
-0.37 \\
-0.93 \\
-0.07 \\
-0.37 \\
\end{bmatrix},
\begin{bmatrix}
0.72 \\
-0.24 \\
-0.63 \\
-0.72 \\
\end{bmatrix},...
$$
We then form our transformation matrix W from the top k eigenvectors, placed side by side as its columns; W maps our normalised original data onto the k-dimensional subspace.
So for k = 2,
$$
W_2 =
\begin{bmatrix}
-0.52 & -0.37 \\
0.26 & -0.93 \\
-0.57 & -0.07 \\
-0.52 & -0.37 \\
\end{bmatrix}
$$
So if our normalised data is represented as matrix X, the plot we see above (when the number of principal components is 2) is computed as follows:
$$
X\cdot W_2
$$
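A sketch of the ranking and projection steps, under the same assumptions as the earlier snippets (z-scored Iris data, eigenpairs from np.linalg.eigh):

```python
import numpy as np
from sklearn.datasets import load_iris

# Same assumed setup as the earlier sketches.
X = load_iris().data
X_norm = (X - X.mean(axis=0)) / X.std(axis=0)
vals, vecs = np.linalg.eigh(np.cov(X_norm, rowvar=False, bias=True))

order = np.argsort(vals)[::-1]      # rank eigenpairs, largest eigenvalue first
k = 2
W_k = vecs[:, order[:k]]            # 4 x 2: columns are the top-2 eigenvectors

projected = X_norm @ W_k            # 150 x 2: the scatter plotted above for k = 2
print(projected[:3].round(2))
```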
What we've done so far is build a transformation matrix V_k from the eigenvectors associated with the k largest eigenvalues of the eigendecomposition of
the covariance matrix (V_k has those eigenvectors as its rows, so V_k is the transpose of W_k), and use it to project the normalised 4-dimensional Fisher's Iris dataset, here written as a matrix X with the datapoints as its columns, onto k principal components.
i.e. we did the following
$$
PC = V_k \cdot X
$$
Now if we have PC and want to recover X, the natural thing to do is invert the projection:
$$
X = V_k^{-1} \cdot PC
$$
But as the rows of V_k are orthonormal (orthogonal because they are eigenvectors of a symmetric matrix, and of unit length because they are normalised), the transpose plays the role of the inverse, so this is equivalent to
$$
X = V_k^T \cdot PC
$$
However we aren't done yet. As we normalised the data earlier, we need to denormalise it now so it is comparable to our original data. If σ is the vector
containing the standard deviation of each attribute, and μ likewise the vector of means, the final reconstructed data is computed as follows (with the multiplication by σ and the addition of μ applied element-wise to each datapoint):
$$
V_k^T \cdot PC \cdot \sigma + \mu \approx X \cdot \sigma + \mu
$$
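A sketch of the reconstruction, under the same assumptions as the earlier snippets; note that with k = 2 the result only approximates the original measurements:

```python
import numpy as np
from sklearn.datasets import load_iris

# Same assumed setup; reconstruct from k = 2 principal components.
X = load_iris().data
mu, sigma = X.mean(axis=0), X.std(axis=0)
X_norm = (X - mu) / sigma

vals, vecs = np.linalg.eigh(np.cov(X_norm, rowvar=False, bias=True))
order = np.argsort(vals)[::-1]
k = 2
V_k = vecs[:, order[:k]].T           # k x 4: rows are the top-k eigenvectors

PC = V_k @ X_norm.T                  # project, with datapoints as columns as in the text
X_rec = (V_k.T @ PC).T * sigma + mu  # V_k^T . PC, then undo the normalisation

print(np.abs(X_rec - X).mean())      # small but non-zero: information was lost at k = 2
```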
Remember when we picked the first k eigenvectors to construct our transformation matrix? It's this choice of k which determines how much information we lose upon reconstruction!
Explained variance is a term we use to describe the proportion of the total variation in the data accounted for by a principal component.
The eigenvalues encode exactly this: each eigenvalue from the eigendecomposition of the covariance matrix is the variance of the data along its eigenvector's direction.
If the kth principal component P_k has an associated eigenvalue λ_k, and there are n principal components in total, then the explained variance of P_k is
$$
\frac{\lambda_k}{\sum_{i=1}^n \lambda_i}
$$
i.e. calculating PC1's explained variance in this way gives:
$$
\frac{2.91...}{2.91... + 0.147... + 0.921... + 0.021...} = 0.730...
$$
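A sketch of that calculation under the same assumptions as before; the leading value comes out at roughly 0.73, in line with the figure above:

```python
import numpy as np
from sklearn.datasets import load_iris

# Same assumed setup; explained variance of each principal component.
X = load_iris().data
X_norm = (X - X.mean(axis=0)) / X.std(axis=0)

vals = np.linalg.eigvalsh(np.cov(X_norm, rowvar=False, bias=True))
vals = np.sort(vals)[::-1]          # largest eigenvalue first

explained_variance = vals / vals.sum()
print(explained_variance.round(3))  # leading entry is roughly 0.73
```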