
Scikit-Learn Agglomerative Clustering

In the last post we saw how to create a dendrogram to analyse the clusters naturally present in our data. Now we’ll actually cluster the iris flower dataset using agglomerative clustering.

Note that, although it doesn’t make a huge difference with the iris flower dataset, we will usually need to normalise the data (e.g. with StandardScaler) before performing hierarchical clustering; otherwise it may not work well at all.

The reason it doesn’t make much difference with the iris flower dataset is that its features are already comparable: they are all measurements in centimetres and all lie on a similar scale.
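To see why scaling matters, here’s a minimal sketch with made-up data (not the iris measurements), where one feature is in grams and the other in metres. Before scaling, Euclidean distances are dominated entirely by the large-valued column; after StandardScaler, both features contribute comparably.

import numpy as np
from sklearn.preprocessing import StandardScaler

# Made-up data: column 0 is in grams (large values), column 1 is in
# metres (small values).
X_toy = np.array([[5000.0, 1.2],
                  [5010.0, 9.8],
                  [9000.0, 1.3]])

# Unscaled: the first two rows look almost identical, because the
# distance is dominated by the grams column.
print(np.linalg.norm(X_toy[0] - X_toy[1]))  # small (~13)
print(np.linalg.norm(X_toy[0] - X_toy[2]))  # huge (~4000)

# Scaled: both columns now contribute, so row 1's very different
# second feature is no longer drowned out.
X_toy_scaled = StandardScaler().fit_transform(X_toy)
print(np.linalg.norm(X_toy_scaled[0] - X_toy_scaled[1]))
print(np.linalg.norm(X_toy_scaled[0] - X_toy_scaled[2]))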

We’ll also print a normalised mutual information (NMI) score, which compares the predicted clusters against the true species labels, so we get an indication of how well the clustering has been performed.
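As a quick aside, NMI only cares about how the data is grouped, not which integer label each cluster happens to get. That’s exactly what we need here, since the cluster numbers AgglomerativeClustering assigns are arbitrary. A minimal illustration on toy labels (not the iris data):

from sklearn.metrics import normalized_mutual_info_score

truth     = [0, 0, 0, 1, 1, 1]
relabeled = [1, 1, 1, 0, 0, 0]  # same grouping, labels swapped
shuffled  = [0, 1, 0, 1, 0, 1]  # grouping cuts across the truth

print(normalized_mutual_info_score(truth, relabeled))  # 1.0
print(normalized_mutual_info_score(truth, shuffled))   # much lower

With that in mind, here’s the full example: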

import sklearn.datasets as ds
from sklearn.preprocessing import StandardScaler
from sklearn.cluster import AgglomerativeClustering
from scipy.cluster.hierarchy import linkage, dendrogram
import matplotlib.pyplot as plt
from sklearn.metrics import normalized_mutual_info_score
import seaborn as sn

iris = ds.load_iris(as_frame=True)

df = iris['data']
y = iris['target']

# Creates a dendrogram so we can figure out how many
# clusters might be useful. Ward linkage is specified explicitly
# so the dendrogram matches AgglomerativeClustering's default.
X = StandardScaler().fit_transform(df)
dendrogram(linkage(X, method='ward'), truncate_mode='level', p=3)

model = AgglomerativeClustering(n_clusters=3)
model.fit(X)
y_predicted = model.labels_

plot_x = 0
plot_y = 1
x_label = df.columns[plot_x]
y_label = df.columns[plot_y]

fig, axes = plt.subplots(ncols=2)

fig.suptitle("Agglomerative Clustering Iris Flower Dataset")

ax = axes[0]
ax.set_xlabel(x_label)
ax.set_ylabel(y_label)
ax.set_title("True Clusters")

sn.scatterplot(data=df, x=x_label, y=y_label, hue=y, palette='pastel', ax=ax)

ax = axes[1]
ax.set_xlabel(x_label)
ax.set_ylabel(y_label)
ax.set_title("Predicted Clusters")
sn.scatterplot(data=df, x=x_label, y=y_label, hue=y_predicted, palette='pastel', ax=ax)

plt.show()

print(normalized_mutual_info_score(y, y_predicted))
0.6754701853436886

As we can see from the scatter plots and the mutual information score, clustering has been performed reasonably well. As is clear from the dendrogram, it’s hard to split the iris data into three clusters; two clusters is much easier, because the data contains two iris species that are very similar in terms of sepal and petal lengths and widths.
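We can check this with a quick sketch, reusing X and y from the listing above: with n_clusters=2, the two similar species simply end up merged into one cluster.

# Sketch: reuses X and y from the listing above.
two_cluster_model = AgglomerativeClustering(n_clusters=2).fit(X)
print(normalized_mutual_info_score(y, two_cluster_model.labels_))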

Going back to three clusters, we can do slightly better if we set the metric to ‘cosine’ and the method to ‘complete’.

The corresponding parameters in the Scikit-learn model are called metric and linkage.
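One version caveat: in Scikit-learn releases before 1.2 the metric parameter was called affinity (it was later removed entirely). If you need to support older versions, a small compatibility sketch:

# Sketch: fall back to the old parameter name on scikit-learn < 1.2,
# where `metric` was still called `affinity`.
try:
    model = AgglomerativeClustering(n_clusters=3, metric='cosine',
                                    linkage='complete')
except TypeError:
    model = AgglomerativeClustering(n_clusters=3, affinity='cosine',
                                    linkage='complete')

Here’s the full example with cosine distance and complete linkage: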

import sklearn.datasets as ds
from sklearn.preprocessing import StandardScaler
from sklearn.cluster import AgglomerativeClustering
from scipy.cluster.hierarchy import linkage, dendrogram
import matplotlib.pyplot as plt
from sklearn.metrics import normalized_mutual_info_score
import seaborn as sn

iris = ds.load_iris(as_frame=True)

df = iris['data']
y = iris['target']

# Creates a dendrogram so we can figure out how many
# clusters might be useful.
X = StandardScaler().fit_transform(df)
dendrogram(linkage(X, metric='cosine', method='complete'), truncate_mode='level', p=3)

model = AgglomerativeClustering(n_clusters=3, metric='cosine', linkage='complete')
model.fit(X)
y_predicted = model.labels_

plot_x = 2
plot_y = 3
x_label = df.columns[plot_x]
y_label = df.columns[plot_y]

fig, axes = plt.subplots(ncols=2)

fig.suptitle("Agglomerative Clustering Iris Flower Dataset")

ax = axes[0]
ax.set_xlabel(x_label)
ax.set_ylabel(y_label)
ax.set_title("True Clusters")

sn.scatterplot(data=df, x=x_label, y=y_label, hue=y, palette='pastel', ax=ax)

ax = axes[1]
ax.set_xlabel(x_label)
ax.set_ylabel(y_label)
ax.set_title("Predicted Clusters")
sn.scatterplot(data=df, x=x_label, y=y_label, hue=y_predicted, palette='pastel', ax=ax)

plt.show()

print(normalized_mutual_info_score(y, y_predicted))
0.6851506723756673
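If you want to experiment further, here’s a small exploratory sketch (reusing X and y from above) that compares a few metric and linkage combinations by their NMI scores. Note that ‘ward’ linkage is left out because it only supports the Euclidean metric.

# Exploratory sketch: compare a few metric/linkage combinations.
for metric in ['euclidean', 'manhattan', 'cosine']:
    for link in ['complete', 'average', 'single']:
        labels = AgglomerativeClustering(n_clusters=3, metric=metric,
                                         linkage=link).fit_predict(X)
        score = normalized_mutual_info_score(y, labels)
        print(f'{metric:>10} {link:>9} {score:.3f}')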
