- Often a data set consists of many different variables.
- Principal Components Analysis (PCA) provides a way to focus on the most important aspects of the data.
- Just as the name says, PCA determines the Principal Components of the data set.
Updated April 22, 2020
One major use of PCA in genomics is to simplify complex SNP data sets.
Consider a simple data set of two markers, M1 (A/G) and M2 (C/T). We can make a graphical representation of these markers by assigning numeric values to each genotype at each marker.
| M1 | M2 |
|---|---|
| AA: 0 | CC: 0 |
| AG: 1 | CT: 1 |
| GG: 2 | TT: 2 |
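The encoding in the table can be sketched in a few lines of Python. This is only an illustration: the helper name `encode_genotype` and the example individuals are made up, and the code assumes each genotype is written as a two-letter string.

```python
def encode_genotype(genotype: str, counted_allele: str) -> int:
    """Count the copies of counted_allele in a two-letter genotype string."""
    return sum(1 for allele in genotype if allele == counted_allele)

# Made-up individuals (M1 genotype, M2 genotype), for illustration only.
individuals = [("AA", "CC"), ("AG", "CT"), ("GG", "TT"), ("AG", "CC")]

# Count G alleles at M1 and T alleles at M2, matching the table above.
encoded = [(encode_genotype(m1, "G"), encode_genotype(m2, "T"))
           for m1, m2 in individuals]
print(encoded)  # [(0, 0), (1, 1), (2, 2), (1, 0)]
```

Each individual becomes a point (M1 value, M2 value) that can be placed on a plot.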
We can plot each individual’s genotypes on a 2D scatter plot:
[Figure: 2D scatter plot of each individual's genotypes, M1 vs. M2]
note: points are “jittered” as a visual aid.
PCA identifies the vector through the data that contains the largest proportion of variance (i.e. the largest spread of data).
Where would you draw such a line here?
This vector represents the first principal component (PC1) and contains the largest variance in the data:
In this data set the second principal component contains no information.
Thus PCA has simplified a 2D data set to a single dimension.
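One common way to find the principal components is to take the eigenvectors of the covariance matrix of the centered data. The sketch below uses NumPy and made-up genotypes in which M1 and M2 are perfectly correlated, which is an assumption chosen to mimic the situation described here. It is not necessarily how the plots in these slides were produced.

```python
import numpy as np

# Made-up genotypes (rows = individuals, columns = M1, M2) in which
# the two markers are perfectly correlated.
X = np.array([[0, 0], [1, 1], [2, 2], [1, 1], [0, 0], [2, 2]], dtype=float)

Xc = X - X.mean(axis=0)                  # center each marker at zero
cov = np.cov(Xc, rowvar=False)           # 2 x 2 covariance matrix
eigvals, eigvecs = np.linalg.eigh(cov)   # eigh returns ascending eigenvalues

order = np.argsort(eigvals)[::-1]        # reorder by decreasing variance
eigvals, eigvecs = eigvals[order], eigvecs[:, order]

pc1 = eigvecs[:, 0]                      # direction of largest variance
print("PC1 direction:", pc1)             # the sign of the vector is arbitrary
print("variance along PC1 and PC2:", eigvals)
```

With these values all of the variance lies along the diagonal of the plot, so the variance along PC2 is zero up to rounding error.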
Consider a new marker, M3:
Where are the first and second principal components here?
We can rotate the data to align the plot with the principal components.
Now we have a single axis that represents the majority of the variation in the data, and a second axis that accounts for the remainder.
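The rotation amounts to projecting the centered data onto the principal component directions. A minimal sketch, assuming made-up M1 and M3 genotypes that are correlated but not perfectly so:

```python
import numpy as np

# Made-up genotypes (columns = M1, M3); correlated, but not perfectly.
X = np.array([[0, 0], [0, 1], [1, 1], [1, 2], [2, 2],
              [2, 1], [1, 0], [0, 0], [2, 2]], dtype=float)

Xc = X - X.mean(axis=0)
eigvals, eigvecs = np.linalg.eigh(np.cov(Xc, rowvar=False))
order = np.argsort(eigvals)[::-1]        # PC1 first, then PC2
eigvals, eigvecs = eigvals[order], eigvecs[:, order]

# Rotate: each row of `scores` holds one individual's PC1 and PC2 coordinates.
scores = Xc @ eigvecs

# After rotation the axes are uncorrelated, and the variance along each
# axis equals the corresponding eigenvalue.
rotated_cov = np.cov(scores, rowvar=False)
print(np.round(rotated_cov, 3))
```

The off-diagonal entries of `rotated_cov` are zero up to rounding, which is what aligning the plot with the principal components means numerically.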
What if there are 3 SNPs?
Now we have 3 dimensions.
In this view it appears that most of the variance is along a single vector.
demo live rotation of data cube
Changing rotation alters our interpretation of the data.
Now we see that we could draw two principal components, each capturing a fair share of the variance.
This rotation shows a third axis of variation.
What do these PCs represent?
| | PC1 | PC2 | PC3 |
|---|---|---|---|
| M1 | -0.71 | -0.03 | -0.70 |
| M2 | 0.03 | -1.00 | 0.01 |
| M3 | -0.70 | -0.01 | 0.71 |
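One way to read this table is to treat each column as a set of weights (loadings) on the three markers. The sketch below copies the values from the table and lists the markers that dominate each PC. The 0.5 cutoff is an arbitrary choice for illustration.

```python
import numpy as np

markers = ["M1", "M2", "M3"]
# Loadings copied from the table (rows = markers, columns = PC1, PC2, PC3).
loadings = np.array([[-0.71, -0.03, -0.70],
                     [ 0.03, -1.00,  0.01],
                     [-0.70, -0.01,  0.71]])

# The columns should be close to unit length and mutually orthogonal,
# so this product should be close to the identity matrix.
print(np.round(loadings.T @ loadings, 2))

# Markers whose absolute loading exceeds an arbitrary cutoff of 0.5.
dominant = {
    f"PC{j + 1}": [m for m, w in zip(markers, loadings[:, j]) if abs(w) > 0.5]
    for j in range(loadings.shape[1])
}
print(dominant)  # {'PC1': ['M1', 'M3'], 'PC2': ['M2'], 'PC3': ['M1', 'M3']}
```

Read this way, PC1 is roughly the combined signal of M1 and M3 (same sign), PC2 is almost entirely M2, and PC3 is the contrast between M1 and M3 (opposite signs). The overall sign of each column is arbitrary.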
How much variation is explained by each PC?
PC1 and PC2 capture almost all of the variance. We have converted our 3D data set into a 2D data set.
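The proportion of variance explained by each PC is its eigenvalue divided by the sum of all eigenvalues. A minimal sketch on simulated data follows. The genotypes are randomly generated under the assumption that M3 usually matches M1 while M2 is independent; they are not the data behind these slides.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 200

# Simulated genotypes coded 0/1/2 (an assumption for illustration):
# M3 copies M1 about 90% of the time; M2 is independent of both.
m1 = rng.integers(0, 3, size=n)
m2 = rng.integers(0, 3, size=n)
m3 = np.where(rng.random(n) < 0.9, m1, rng.integers(0, 3, size=n))
X = np.column_stack([m1, m2, m3]).astype(float)

Xc = X - X.mean(axis=0)
eigvals = np.linalg.eigvalsh(np.cov(Xc, rowvar=False))[::-1]  # descending

explained = eigvals / eigvals.sum()   # proportion of variance per PC
cumulative = np.cumsum(explained)     # running total across PCs
print("proportion explained:", np.round(explained, 3))
print("cumulative:", np.round(cumulative, 3))
```

When the cumulative proportion for the first two PCs is close to 1, dropping PC3 loses very little information.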
A related technique is multi-dimensional scaling (MDS).
MDS determines the optimal projection for displaying the data in 2D.
[Figure panels: poor rotation, good rotation]
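There are several variants of MDS, and the text does not say which one is meant. As one possibility, classical (Torgerson) MDS starts from pairwise distances between individuals and recovers 2D coordinates by an eigendecomposition. The function name `classical_mds` and the genotypes below are hypothetical.

```python
import numpy as np

def classical_mds(D: np.ndarray, k: int = 2) -> np.ndarray:
    """Embed points in k dimensions from a matrix of pairwise distances D."""
    n = D.shape[0]
    J = np.eye(n) - np.ones((n, n)) / n       # centering matrix
    B = -0.5 * J @ (D ** 2) @ J               # double-centered squared distances
    eigvals, eigvecs = np.linalg.eigh(B)
    order = np.argsort(eigvals)[::-1][:k]     # k largest eigenvalues
    # Clip tiny negative eigenvalues caused by rounding before the square root.
    return eigvecs[:, order] * np.sqrt(np.maximum(eigvals[order], 0.0))

# Made-up genotypes (columns = M1, M2, M3) in which M3 always equals M1.
X = np.array([[0, 0, 0], [1, 1, 1], [2, 2, 2],
              [0, 2, 0], [2, 0, 2], [1, 0, 1]], dtype=float)

# Euclidean distance between every pair of individuals.
D = np.linalg.norm(X[:, None, :] - X[None, :, :], axis=-1)

coords = classical_mds(D, k=2)                # one 2D point per individual
D_2d = np.linalg.norm(coords[:, None, :] - coords[None, :, :], axis=-1)
print(np.round(np.abs(D_2d - D).max(), 6))    # largest distance distortion
```

Because M3 duplicates M1 in this example, the points lie in a plane and the 2D layout reproduces the original distances. With Euclidean distances, classical MDS gives the same layout as PCA, up to rotation and reflection.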