In this course we’ve created graphs of the well-known iris flower dataset repeatedly, but we were always faced with a frustrating choice.
Even though we’ve often used all four data series in the dataset to fit models, we could only plot two data series on a plot, because plots are 2D.
By using 3D plots (really fake 3D because computer monitors are just 2D), we could plot three data series, and that’s the limit. Beyond that we could alter the size and colour of points to get further data onto graphs, and after that we’re stuck.
It would be nice if we had a technique that could represent multi-dimensional data in two dimensions while losing as little information as possible. In other words, we’d like to condense four or more data series to only two data series.
Another use for such a technique could be reducing datasets containing many data series to only a few data series, enabling us to fit models far more efficiently.
Principal Component Analysis (PCA) is exactly such a technique.
How Does PCA Work?
Take a look at this scatter plot.
Every point has an x- and a y-coordinate. The red point simply represents the centroid: the point that’s at the average (mean) value of x and the average value of y.
Two axes are almost wasted here. Suppose we draw a line through the centroid, angled to fit the data as closely as possible, so that the perpendicular distances from the points to the line are as small as possible overall.
Now we can mark, for each sample, the closest point on this line. Then instead of giving x- and y-coordinates for each sample, we just give the positions of the marked points along the line, measured from the centroid.
We’ve reduced the dimensionality of the data from two dimensions to one, and we have actually captured most of the information that the original two coordinate sets represented.
This is how PCA works. The new data series we’ve created is called principal component 1.
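The procedure above can be sketched in a few lines of plain Python. This is only an illustration under assumptions: the five points are made up (they are not the data in the scatter plot), and the angle of the line comes from a closed-form formula that only applies to two-dimensional data.

```python
import math

# Made-up sample points (hypothetical data lying roughly along a sloping line).
points = [(1.0, 1.2), (2.0, 1.9), (3.0, 3.2), (4.0, 3.8), (5.0, 5.1)]
n = len(points)

# The centroid: the mean of the x values and the mean of the y values.
cx = sum(x for x, _ in points) / n
cy = sum(y for _, y in points) / n

# Variances and covariance of the data (dividing by n - 1).
sxx = sum((x - cx) ** 2 for x, _ in points) / (n - 1)
syy = sum((y - cy) ** 2 for _, y in points) / (n - 1)
sxy = sum((x - cx) * (y - cy) for x, y in points) / (n - 1)

# Angle of the best-fitting line through the centroid (2D closed form).
theta = 0.5 * math.atan2(2 * sxy, sxx - syy)
ux, uy = math.cos(theta), math.sin(theta)  # unit vector along the line

# Position along the line of each sample's closest point: principal component 1.
pc1 = [(x - cx) * ux + (y - cy) * uy for x, y in points]
print(pc1)
```

Each sample is now described by one number instead of two. The values in pc1 are measured from the centroid, so they average out to zero.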
If we want to capture all of the information in the data, while continuing to use principal component 1 as our first axis, we can simply draw a second axis through the centroid, at right angles to the first, then repeat the procedure. This second axis, principal component 2, essentially tells us how far each sample is from the first axis, the one we used to create principal component 1.
Recall that variance is a measure of how widely data samples are spread out around their mean.
We can calculate a total variance for our data by measuring the variance of the x-coordinates in our original data, and adding it to the variance for the y-coordinates.
If we do the same for our new axes — our principal components — we find that the new total variance adds up to the same as the original total variance.
Total variance doesn’t change when you rotate the axes like this.
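We can check this numerically. The sketch below uses made-up points (an assumption, not the data from the plot) and a hypothetical helper, rotated_total_variance, that measures every sample along a pair of perpendicular axes through the centroid and adds up the two variances. Whatever angle we pick, the total should match the original.

```python
import math
from statistics import mean, variance

# Made-up sample points (hypothetical data).
points = [(1.0, 1.2), (2.0, 1.9), (3.0, 3.2), (4.0, 3.8), (5.0, 5.1)]
xs = [x for x, _ in points]
ys = [y for _, y in points]
cx, cy = mean(xs), mean(ys)  # the centroid


def rotated_total_variance(angle):
    """Total variance measured along two perpendicular axes at `angle` radians."""
    ux, uy = math.cos(angle), math.sin(angle)  # first axis
    vx, vy = -uy, ux                           # second axis, at right angles
    a = [(x - cx) * ux + (y - cy) * uy for x, y in points]
    b = [(x - cx) * vx + (y - cy) * vy for x, y in points]
    return variance(a) + variance(b)


original_total = variance(xs) + variance(ys)
totals = [rotated_total_variance(math.radians(d)) for d in (0, 30, 45, 77)]

print(original_total)
print(totals)
```

The individual variances along each axis change as the axes turn, but their sum does not. PCA simply picks the rotation that pushes as much of that fixed total as possible onto the first axis.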
One important question is: how much of the total variance does principal component 1 capture?
We can express this as a percentage of total variance. This gives us the explained variance ratio of principal component 1, and this tells us how useful PCA has been in reducing the dimensionality of our data.
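As a rough illustration, here is how that ratio could be computed by hand for two-dimensional data. The points are made up for the example, and the closed-form angle is a shortcut that only works in 2D, so treat this as a sketch rather than a general recipe.

```python
import math
from statistics import mean, variance

# Made-up sample points (hypothetical data).
points = [(1.0, 1.2), (2.0, 1.9), (3.0, 3.2), (4.0, 3.8), (5.0, 5.1)]
xs = [x for x, _ in points]
ys = [y for _, y in points]
n = len(points)
cx, cy = mean(xs), mean(ys)

sxx, syy = variance(xs), variance(ys)
sxy = sum((x - cx) * (y - cy) for x, y in points) / (n - 1)

# Direction of principal component 1 (2D closed form), and the axis at
# right angles to it for principal component 2.
theta = 0.5 * math.atan2(2 * sxy, sxx - syy)
ux, uy = math.cos(theta), math.sin(theta)
vx, vy = -uy, ux

pc1 = [(x - cx) * ux + (y - cy) * uy for x, y in points]
pc2 = [(x - cx) * vx + (y - cy) * vy for x, y in points]

total = variance(pc1) + variance(pc2)
ratio1 = variance(pc1) / total
ratio2 = variance(pc2) / total

print(f"PC1 explains {ratio1:.1%} of the total variance")
print(f"PC2 explains {ratio2:.1%} of the total variance")
```

A ratio close to 1 for principal component 1 means that a single axis describes the data almost as well as the original two did.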
We can easily perform the above procedure for more than two dimensions. When we create each new principal component, we simply ensure it is at right angles to all of the previous components and captures as much of the remaining variance as possible.
For example, with three-dimensional data, when we draw our second principal component through the centroid, we have a choice of angles for it. We choose the angle that captures as much of the variance as possible.
Three-dimensional data would require three principal components to completely capture all of the variance in the data, but if two components capture 99% of the variance, maybe we don’t really need the third component.
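To make the three-dimensional case concrete, here is a minimal sketch using NumPy. Several things here are assumptions made for the illustration: the data is randomly generated so that it is nearly flat in one direction, and the components are found by taking the eigenvectors of the covariance matrix, which is one standard way to compute principal components but not the only one.

```python
import numpy as np

# Hypothetical 3D data: two strong directions of spread plus slight noise.
rng = np.random.default_rng(0)
latent = rng.normal(size=(200, 2)) * [3.0, 1.0]
mixing = np.array([[1.0, 0.5, 0.2],
                   [-0.3, 1.0, 0.4]])
data = latent @ mixing + rng.normal(scale=0.05, size=(200, 3))

# Move the centroid to the origin, then compute the 3x3 covariance matrix.
centred = data - data.mean(axis=0)
cov = np.cov(centred, rowvar=False)

# Eigenvectors of the covariance matrix give the principal axes, and the
# eigenvalues give the variance captured along each axis.
# eigh returns eigenvalues in ascending order, so sort them largest first.
eigenvalues, eigenvectors = np.linalg.eigh(cov)
order = np.argsort(eigenvalues)[::-1]
eigenvalues = eigenvalues[order]
eigenvectors = eigenvectors[:, order]

ratios = eigenvalues / eigenvalues.sum()  # explained variance ratios
cumulative = np.cumsum(ratios)            # running total across components
print(ratios)
print(cumulative)

# Keep only the first two components: 3 data series reduced to 2.
scores_2d = centred @ eigenvectors[:, :2]
print(scores_2d.shape)
```

If the second entry of cumulative is already very close to 1, the third component is adding almost nothing and can reasonably be dropped.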
In the next post we’ll look at a practical application of PCA, using scikit-learn.