PCA for the Iris Flower Dataset

In this post we’ll analyse the Iris Flower Dataset using principal component analysis and agglomerative clustering.

We’ll use PCA for two purposes: to reduce the number of data series we feed to the agglomerative clustering model, and to reduce the dataset to two data series for plotting. Fewer inputs can make clustering more efficient, although with only four data series in this dataset it won’t make much difference.

Some notes on the program below:

When we create the PCA model, we set the number of principal components we want to 2, since that’s easiest for plotting.
We then run agglomerative clustering not on the original data, but on the components returned by PCA.
There is still a 1-to-1 correspondence between the new coordinates of the data along the principal component axes, and the original samples, so we can still obtain species information from the original data.
We’re using slicing to select the data we want from the principal components, for plotting. For example, components[:,0] selects all the rows and the first column from the component data.
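
To make the slicing and the 1-to-1 correspondence concrete, here is a minimal sketch that stands apart from the main program. The names pc1 and pc2 are ours, chosen only for illustration.

```python
from sklearn.datasets import load_iris
from sklearn.decomposition import PCA

iris = load_iris(as_frame=True)
X = iris['data']
y = iris['target']

components = PCA(n_components=2).fit_transform(X)

# One row per original sample, one column per principal component
print(components.shape)  # (150, 2)

# All rows, first column: each sample's coordinate along the first principal component
pc1 = components[:, 0]
# All rows, second column: the coordinate along the second principal component
pc2 = components[:, 1]

# Row i of components still describes sample i, so y.iloc[i] gives its species
i = 0
print(pc1[i], pc2[i], iris['target_names'][y.iloc[i]])
```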

from sklearn.cluster import AgglomerativeClustering
from sklearn.datasets import load_iris
from sklearn.metrics import normalized_mutual_info_score
from sklearn.decomposition import PCA
import matplotlib.pyplot as plt
import seaborn as sn


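# Load the dataset as pandas objects
iris = load_iris(as_frame=True)

df = iris['data']
df.columns = ['sepal length', 'sepal width', 'petal length', 'petal width']

X = df
y = iris['target']

# Project the four measurements onto the first two principal components
pca = PCA(n_components=2)
components = pca.fit_transform(X)

# Cluster the projected data, not the original measurements
ac = AgglomerativeClustering(n_clusters=3)
ac.fit(components)

y_predicted = ac.labels_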

fig = plt.figure()

fig.suptitle("Agglomerative Clustering on Principal Components of the Iris Flower Dataset")

ax = fig.add_subplot(121)
ax.set_xlabel("Principal Component 1")
ax.set_ylabel("Principal Component 2")
ax.set_title("True Clusters")
sn.scatterplot(x=components[:,0], y=components[:,1], hue=y, palette='pastel', ax=ax)

ax = fig.add_subplot(122)
ax.set_xlabel("Principal Component 1")
ax.set_ylabel("Principal Component 2")
ax.set_title("Predicted Clusters")
sn.scatterplot(x=components[:,0], y=components[:,1], hue=y_predicted, palette='pastel', ax=ax)

plt.show()

print("Mutual info score: ", normalized_mutual_info_score(y, y_predicted))

Mutual info score:  0.7776631579462301

Our normalised mutual information score here is almost 0.8, which is better than we obtained previously, without PCA. Visually, the clusters found by agglomerative clustering look really good.
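
If you want to check this comparison yourself, the sketch below runs the same clustering on both the original four measurements and the two principal components. The helper cluster_score is hypothetical, introduced only for this illustration. The score you get without PCA depends on the clustering settings, so it may not match the figure from the earlier post.

```python
from sklearn.cluster import AgglomerativeClustering
from sklearn.datasets import load_iris
from sklearn.decomposition import PCA
from sklearn.metrics import normalized_mutual_info_score

iris = load_iris(as_frame=True)
X = iris['data']
y = iris['target']


def cluster_score(data, labels_true, n_clusters=3):
    """Hypothetical helper: cluster the data and score it against the true labels."""
    labels = AgglomerativeClustering(n_clusters=n_clusters).fit_predict(data)
    return normalized_mutual_info_score(labels_true, labels)


# Same model settings in both cases; only the input data differs
score_raw = cluster_score(X, y)
score_pca = cluster_score(PCA(n_components=2).fit_transform(X), y)

print("Without PCA:", score_raw)
print("With PCA:   ", score_pca)
```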

Explained Variance

We can ask how much of the total variance in the data is captured by these two principal components.

To find out we can simply print explained_variance_ratio_:

print(pca.explained_variance_ratio_)

at the end of the program.

This gives us:

[0.92461872 0.05306648]

This tells us that the first principal component explains or captures about 92% of the variance present in the original data, and the second principal component explains a further 5.3%.

Altogether, these two components actually capture over 97% of the variance of the data. Most of what’s going on in the data can therefore be described by just two data series.
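
The running total can also be computed directly rather than added up by hand. The sketch below applies NumPy's cumsum to explained_variance_ratio_. Fitting PCA with all four components is our addition here, done only to show the full breakdown.

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.decomposition import PCA

X = load_iris(as_frame=True)['data']

# Keep all four components so we can see how the variance is shared out
pca_full = PCA(n_components=4).fit(X)

ratios = pca_full.explained_variance_ratio_
cumulative = np.cumsum(ratios)

for n, (ratio, total) in enumerate(zip(ratios, cumulative), start=1):
    print(f"{n} component(s): {ratio:.4f} individually, {total:.4f} cumulative")
```

The second cumulative value is the sum of the two ratios printed earlier, roughly 0.978, and the last one is 1.0 because four components account for all the variance in four data series.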
