In this post we’ll analyse the Iris Flower Dataset using principal component analysis and agglomerative clustering.
We’ll use PCA for two purposes: to reduce the number of data series we feed to our agglomerative clustering model, and to reduce the dataset to two data series for plotting. Fewer inputs can make clustering more efficient, although here we only have four data series in total, so it won’t make much difference.
Some notes on the program below:
- When we create the PCA model, we set the number of principal components we want to 2, since that’s easiest for plotting.
- We then run agglomerative clustering not on the original data, but on the components returned by PCA.
- There is still a 1-to-1 correspondence between the new coordinates of the data along the principal component axes, and the original samples, so we can still obtain species information from the original data.
- We’re using slicing to select the data we want from the principal components, for plotting. For example, components[:,0] selects all the rows and the first column from the component data.
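To make the last two notes concrete, here is a minimal sketch of the slicing and the row-by-row correspondence. The small array and the species labels are made-up stand-ins (not real Iris values), chosen only to show the shape of the data that PCA returns.

```python
import numpy as np

# Hypothetical stand-in for the array returned by pca.fit_transform(X):
# one row per sample, one column per principal component.
components = np.array([
    [-2.7, 0.3],
    [1.3, -0.7],
    [2.5, 0.4],
])

# Hypothetical species labels, in the same row order as the samples.
species = np.array([0, 1, 2])

pc1 = components[:, 0]  # all rows, first column
pc2 = components[:, 1]  # all rows, second column

# Row i of components still describes sample i, so labels line up by position.
for i, (a, b) in enumerate(zip(pc1, pc2)):
    print(f"sample {i}: PC1={a:+.1f}, PC2={b:+.1f}, species={species[i]}")
```

The full program follows the same pattern, with the real Iris measurements in place of the made-up numbers.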
from sklearn.cluster import AgglomerativeClustering
from sklearn.datasets import load_iris
from sklearn.metrics import normalized_mutual_info_score
from sklearn.decomposition import PCA
import matplotlib.pyplot as plt
import seaborn as sn

# Load the data as a DataFrame and give the columns shorter names.
iris = load_iris(as_frame=True)
df = iris['data']
df.columns = 'sepal length', 'sepal width', 'petal length', 'petal width'
X = df
y = iris['target']

# Reduce the four measurements to two principal components.
pca = PCA(2)
components = pca.fit_transform(X)

# Cluster the principal components rather than the original data.
ac = AgglomerativeClustering(n_clusters=3)
ac.fit(components)
y_predicted = ac.labels_

fig = plt.figure()
fig.suptitle("Nearest Neighbours after Agglomerative Clustering on Principal Components Iris Flower Dataset")

# Left panel: samples coloured by true species.
ax = fig.add_subplot(121)
ax.set_xlabel("Principal Component 1")
ax.set_ylabel("Principal Component 2")
ax.set_title("True Clusters")
sn.scatterplot(x=components[:,0], y=components[:,1], hue=y, palette='pastel', ax=ax)

# Right panel: samples coloured by predicted cluster.
ax = fig.add_subplot(122)
ax.set_xlabel("Principal Component 1")
ax.set_ylabel("Principal Component 2")
ax.set_title("Predicted Clusters")
sn.scatterplot(x=components[:,0], y=components[:,1], hue=y_predicted, palette='pastel', ax=ax)

plt.show()

print("Mutual info score: ", normalized_mutual_info_score(y, y_predicted))
Mutual info score: 0.7776631579462301
Our normalised mutual information score here is almost 0.8, which is better than we obtained previously, without PCA. Visually, the clusters found by agglomerative clustering look really good.
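To check the comparison for yourself, the sketch below computes the score with and without PCA side by side. It assumes the earlier, non-PCA run used AgglomerativeClustering with the same default settings and three clusters, which this post does not confirm, so treat the "without PCA" number as illustrative. It also shows that the score does not depend on how the clusters happen to be numbered, which is why we can compare cluster labels against species labels directly.

```python
from sklearn.cluster import AgglomerativeClustering
from sklearn.datasets import load_iris
from sklearn.decomposition import PCA
from sklearn.metrics import normalized_mutual_info_score

X, y = load_iris(return_X_y=True)

# Cluster the four original measurements directly
# (assumed setup for the earlier, non-PCA comparison).
labels_raw = AgglomerativeClustering(n_clusters=3).fit_predict(X)

# Cluster the two principal components instead, as in the program above.
components = PCA(2).fit_transform(X)
labels_pca = AgglomerativeClustering(n_clusters=3).fit_predict(components)

nmi_raw = normalized_mutual_info_score(y, labels_raw)
nmi_pca = normalized_mutual_info_score(y, labels_pca)
print("NMI without PCA:", nmi_raw)
print("NMI with PCA:   ", nmi_pca)

# Renumber the clusters (0 -> 1, 1 -> 2, 2 -> 0): the score is unchanged,
# because NMI only cares about which samples are grouped together.
relabelled = (labels_pca + 1) % 3
nmi_relabelled = normalized_mutual_info_score(y, relabelled)
print("NMI after renumbering clusters:", nmi_relabelled)
```

Exact values may vary slightly between scikit-learn versions, so no expected output is shown here.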
We can ask how much of the total variance in the data is captured by these two principal components.
To find out we can simply print explained_variance_ratio_ by adding:
print(pca.explained_variance_ratio_)
at the end of the program.
This gives us two values, approximately 0.92 and 0.053.
This tells us that the first principal component explains or captures about 92% of the variance present in the original data, and the second principal component explains a further 5.3% or so.
Altogether, these two components actually capture over 97% of the variance of the data. Most of what’s going on in the data can therefore be described by just two data series.
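As a rough sketch of how to read these numbers, the snippet below prints the per-component ratios together with their running total. It refits PCA on the Iris data so that it runs on its own; the variable name ratios is just an illustrative choice.

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.decomposition import PCA

X, _ = load_iris(return_X_y=True)

# Fit PCA with two components, as in the main program.
pca = PCA(2).fit(X)

# Fraction of the total variance explained by each component.
ratios = pca.explained_variance_ratio_
print("Per component:", ratios)

# Running total: how much is captured by the first k components together.
print("Cumulative:   ", np.cumsum(ratios))
print(f"Captured by both components: {ratios.sum():.1%}")
```

The ratios are fractions of 1, so they need multiplying by 100 (or formatting with a percent specifier, as in the last line) to be read as percentages.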