Clustering involves dividing your data samples into groups of similar samples, known as clusters. For example, in the grapes.csv dataset that we’ve seen previously, there are clearly two kinds of grapes.

They differ not only in colour, but also in weight, length and diameter. While there is some overlap, we might hope that a suitable clustering algorithm could divide these into two types of grape without even taking the colour into account.
One of the simplest and most powerful clustering techniques is KMeans clustering.
This involves the following steps (which are mostly performed for you by a suitable model implementation, such as the Scikit-learn KMeans model).
- Decide how many clusters you expect to obtain. With KMeans, this has to be specified in advance, although there are techniques for determining the optimum number of clusters automatically.
- Randomly assign each sample to a cluster.
- For each cluster, find the arithmetical average (mean) for each data feature. For example, if we are dealing with the weight, diameter and length of each grape and we have two clusters of grapes, we find the average weight, average diameter and average length for our two clusters.
We can think of these averages as points in a hyperspace, called centroids. So each cluster has a centroid, representing the average values for that cluster.
- Re-assign each sample to the cluster with the nearest centroid.
- Repeat this process until the data samples have settled into stable clusters (this loop is sketched in plain NumPy after the notes below).
- It’s important to normalise your data, so that features measured on larger scales don’t dominate the distance calculations.
- KMeans clustering works best on clusters that are roughly spherical.
- KMeans and clustering in general are forms of unsupervised learning. That is, we don’t present the model with some “predictor” data and the desired “target” result, expecting it to generalise, as if we were teaching words to a child. Instead, we present it only with the “predictor” data and ask it to find patterns in that data.
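To make those steps concrete, here is a minimal sketch of the loop in plain NumPy. This is only an illustration, not the Scikit-learn implementation: the random-partition initialisation is just one simple way of starting, and the length column is assumed from the description above rather than confirmed from the file.
import numpy as np
import pandas as pd
df = pd.read_csv('grapes.csv')
# 67 is an outlier
df.drop(67, inplace=True)
# weight, diameter and (assumed) length columns
X = df[['weight', 'length', 'diameter']].to_numpy(dtype=float)
# Normalise so that no single feature dominates the distance calculations
X = (X - X.mean(axis=0)) / X.std(axis=0)
k = 2
rng = np.random.default_rng(0)
# Randomly assign each sample to one of the k clusters
labels = rng.integers(0, k, size=len(X))
for _ in range(100):
    # The centroid of each cluster is the mean of each feature in that cluster
    centroids = np.array([X[labels == i].mean(axis=0) for i in range(k)])
    # Re-assign each sample to the cluster with the nearest centroid
    distances = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=2)
    new_labels = distances.argmin(axis=1)
    # Stop once the assignments no longer change
    if np.array_equal(new_labels, labels):
        break
    labels = new_labels
print(labels)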
First let’s create a pairplot with Seaborn so we can visualise possible clusters.
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sn
df = pd.read_csv('grapes.csv')
# 67 is an outlier
df.drop(67, inplace=True)
# Plot every pair of features against each other, coloured by grape colour
sn.pairplot(df, hue="color", palette="husl")
plt.show()

In this case you can see that although the green grapes and purple grapes do occupy different areas of these 2D plots, the clusters are not close to being spherical. DBSCAN clustering might be more suited to this task. We can’t expect very good results with KMeans, but we can try it and see how it gets on.
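As an aside, a DBSCAN attempt might look something like this sketch. We haven’t covered DBSCAN yet, and the eps and min_samples values below are untuned guesses rather than settings derived from this dataset, so treat it only as a pointer.
from sklearn.cluster import DBSCAN
from sklearn.preprocessing import StandardScaler
import pandas as pd
df = pd.read_csv('grapes.csv')
# 67 is an outlier
df.drop(67, inplace=True)
# Standardise the features, then cluster; eps and min_samples are guesses
X = StandardScaler().fit_transform(df[['diameter', 'weight']])
predicted_clusters = DBSCAN(eps=0.5, min_samples=5).fit_predict(X)
print(predicted_clusters)  # -1 marks points DBSCAN treats as noise
With that aside, here is the KMeans attempt: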
from sklearn.cluster import KMeans
from sklearn.metrics import normalized_mutual_info_score
from sklearn.preprocessing import StandardScaler
import pandas as pd
df = pd.read_csv('grapes.csv')
# 67 is an outlier
df.drop(67, inplace=True)
# Standardise the two features we'll cluster on
X = df[['diameter', 'weight']]
X = StandardScaler().fit_transform(X)
true_clusters = df['color']
# Fit KMeans with two clusters and get the predicted cluster for each sample
model = KMeans(n_clusters=2)
predicted_clusters = model.fit_predict(X)
print(true_clusters)
print(predicted_clusters)
score = normalized_mutual_info_score(true_clusters, predicted_clusters)
print(score)
0 purple
1 green
2 green
3 purple
4 purple
...
134 green
135 green
136 purple
137 green
138 green
Name: color, Length: 138, dtype: object
[1 0 0 1 1 1 0 0 0 1 1 1 0 0 0 0 0 0 1 0 0 0 0 0 0 0 1 1 0 0 0 0 1 1 0 0 0
0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 1 0 1 1 1 1 1 0 0 1 0 1 1 0 0 1 1 0 0 0 0 0
0 1 0 1 1 1 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0 1 0 0 0 0 1 0 0 1 1 0 1 1 0 0
1 0 0 0 0 0 0 1 1 1 1 0 0 0 0 1 0 1 1 0 1 0 0 0 1 0 0]
0.46981896845225896
Here I’ve printed the true clusters (which we happen to know already), the predicted clusters, and the normalised mutual information score.
The score is only 0.47.
You can see that the true clusters are labelled with “green” or “purple”, but the model produces only integers to denote different clusters. This presents us with a problem: how do we match the two and assess how well our model has performed? Does 0 represent purple, or green?
We can overcome this problem by using normalized_mutual_info_score. This gives a score of 0 if there’s no correlation between the two sets of labels, and 1.0 if there’s a perfect correlation. Let’s look at an example.
print(normalized_mutual_info_score([0, 1, 0, 1], [1, 0, 1, 0]))
print(normalized_mutual_info_score([0, 1, 0, 1], [0, 1, 0, 1]))
print(normalized_mutual_info_score([0, 1, 0, 1], [0, 0, 0, 0]))
1.0
1.0
0.0
In the first two examples we have perfect correlation. In the first example the numbers are not the same, but the pattern is the same, and that’s what matters. In the third example, there’s no correlation; knowing the numbers in the second list does not inform us about the numbers in the first list at all.
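Incidentally, if you do want to know which integer corresponds to which colour, one simple option (not something we rely on here) is a majority vote: cross-tabulate the true colours against the predicted cluster numbers and take the most common colour in each cluster. A minimal sketch, reusing true_clusters and predicted_clusters from the script above:
import pandas as pd
# Count how many grapes of each colour fall into each predicted cluster
table = pd.crosstab(true_clusters.to_numpy(), predicted_clusters,
                    rownames=['colour'], colnames=['cluster'])
print(table)
# Majority vote: the most common colour in each cluster
print(table.idxmax())
Note that the mapping can change from run to run, since KMeans numbers its clusters arbitrarily.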
Visualising Clusters
It can be interesting to plot the true clusters and predicted clusters side-by-side. Ideally for this we might use principal component analysis to reduce our three dimensions to two, but since we’ve not yet looked at that, we’ll plot weight and diameter.
from sklearn.cluster import KMeans
from sklearn.preprocessing import StandardScaler
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sn
df = pd.read_csv('grapes.csv')
# 67 is an outlier
df.drop(67, inplace=True)
X = df[['diameter', 'weight']]
X = StandardScaler().fit_transform(X)
true_clusters = df['color']
model = KMeans(n_clusters=2)
predicted_clusters = model.fit_predict(X)
# Side-by-side scatter plots: true colours on the left, predicted clusters on the right
fig, ax = plt.subplots(nrows=1, ncols=2)
ax[0].set_title("True")
ax[0].set_xlabel("Weight (g)")
ax[0].set_ylabel("Diameter (mm)")
ax[1].set_title("Predicted")
ax[1].set_xlabel("Weight (g)")
ax[1].set_ylabel("Diameter (mm)")
fig.suptitle("Grape Measurements for Purple and Green Grapes")
sn.scatterplot(df, x='weight', y='diameter', hue=true_clusters, ax=ax[0])
sn.scatterplot(df, x='weight', y='diameter', hue=predicted_clusters, ax=ax[1])
plt.show()

We can see that the algorithm has placed many green grapes in the same cluster as the purple grapes. These clusters are just not ideally shaped for KMeans: ideally they’d be spherical and well separated. Still, the result is not completely terrible either.
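Finally, we asked KMeans for two clusters because we already knew there were two varieties of grape. When the number of clusters isn’t known in advance, one common heuristic (mentioned only in passing above) is the elbow method: fit KMeans for a range of cluster counts, plot the inertia (the within-cluster sum of squared distances), and look for the point where it stops falling sharply. A rough sketch; the n_init and random_state values here are just for reproducibility, not settings from this post.
from sklearn.cluster import KMeans
from sklearn.preprocessing import StandardScaler
import matplotlib.pyplot as plt
import pandas as pd
df = pd.read_csv('grapes.csv')
# 67 is an outlier
df.drop(67, inplace=True)
X = StandardScaler().fit_transform(df[['diameter', 'weight']])
# Fit KMeans for 1 to 8 clusters and record the inertia for each
inertias = []
cluster_counts = range(1, 9)
for k in cluster_counts:
    model = KMeans(n_clusters=k, n_init=10, random_state=0)
    model.fit(X)
    inertias.append(model.inertia_)
plt.plot(cluster_counts, inertias, marker='o')
plt.xlabel("Number of clusters")
plt.ylabel("Inertia")
plt.show()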