Various methods exist for estimating the optimal number of clusters in your data. The “elbow method” is the one typically used with KMeans clustering.
Various possible numbers of clusters are tried, and for each number of clusters the “inertia” or SSE (Sum of Square Errors) is calculated.
This is the sum of the squares of the distances from each data sample to its centroid.
Often we find that this number drops rapidly as we approach the optimum number of clusters, then after that declines much more slowly, creating an “elbow” in the graph of number of clusters vs. inertia.
If I try this with my grapes.csv data (which has 2 clusters), the elbow isn’t especially pronounced, but you can see that 2 clusters looks like a reasonable choice.

from sklearn.cluster import KMeans
from sklearn.preprocessing import StandardScaler
import pandas as pd
import matplotlib.pyplot as plt

df = pd.read_csv('grapes.csv')
# Row 67 is an outlier
df.drop(67, inplace=True)
X = df[['diameter', 'length']]
X = StandardScaler().fit_transform(X)

# Fit KMeans for each candidate number of clusters and record the inertia (SSE)
inertia = []
clusters = range(1, 6)
for n in clusters:
    model = KMeans(n_clusters=n, n_init='auto')
    model.fit(X)
    inertia.append(model.inertia_)

fig, ax = plt.subplots()
ax.set_xlabel("Clusters")
ax.set_ylabel("Inertia")
ax.plot(clusters, inertia)
plt.show()

I’ve set the n_init model parameter here because current versions of scikit-learn otherwise issue a warning about its default changing. It controls how many times KMeans is run with different randomly generated initial centroids; the run with the lowest inertia is kept.
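To see what n_init does in practice, here’s a minimal sketch (reusing the same grapes.csv preprocessing as above) that compares a single random initialisation with ten. Because KMeans keeps whichever run has the lowest inertia, the multi-start model should never come out worse; the exact numbers will depend on your data and seed.

import pandas as pd
from sklearn.cluster import KMeans
from sklearn.preprocessing import StandardScaler

df = pd.read_csv('grapes.csv')
df.drop(67, inplace=True)  # row 67 is an outlier
X = StandardScaler().fit_transform(df[['diameter', 'length']])

# One random initialisation vs. ten: scikit-learn keeps the run with the lowest inertia
single = KMeans(n_clusters=2, n_init=1, random_state=0).fit(X)
multi = KMeans(n_clusters=2, n_init=10, random_state=0).fit(X)
print(single.inertia_, multi.inertia_)  # multi.inertia_ should be <= single.inertia_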
With the iris flower dataset, we can see there are 2-3 natural clusters. Two of the three iris species are actually quite hard to tell apart using only petal and sepal measurements, so there aren’t really three clearly defined clusters in the data.
You can see there is a pretty clear ‘elbow’ around 2 or 3 clusters, where inertia begins to decline much less steeply than previously.
from sklearn.cluster import KMeans
from sklearn.preprocessing import StandardScaler
import matplotlib.pyplot as plt
from sklearn.datasets import load_iris

iris = load_iris(as_frame=True)
df = iris['data']
X = df.iloc[:, 0:4]  # all four sepal/petal measurements
X = StandardScaler().fit_transform(X)

# Fit KMeans for each candidate number of clusters and record the inertia (SSE)
inertia = []
clusters = range(1, 10)
for n in clusters:
    model = KMeans(n_clusters=n, n_init='auto')
    model.fit(X)
    inertia.append(model.inertia_)

fig, ax = plt.subplots()
fig.suptitle("SSE/Inertia for Iris Flower Dataset")
ax.set_xlabel("Clusters")
ax.set_ylabel("Inertia")
ax.plot(clusters, inertia)
plt.show()
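As a quick sanity check on the claim that two of the species overlap, here’s a small sketch that clusters the scaled iris features into three groups and compares the cluster labels with the true species using scikit-learn’s normalized_mutual_info_score. A score noticeably below 1.0 is what we’d expect if two of the species can’t be cleanly separated.

from sklearn.cluster import KMeans
from sklearn.datasets import load_iris
from sklearn.metrics import normalized_mutual_info_score
from sklearn.preprocessing import StandardScaler

iris = load_iris(as_frame=True)
X = StandardScaler().fit_transform(iris['data'])

# Cluster into three groups and compare the cluster labels with the true species
labels = KMeans(n_clusters=3, n_init='auto').fit_predict(X)
print(normalized_mutual_info_score(iris['target'], labels))  # noticeably below 1.0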

Why Does the Elbow Method Work?
If you think about it, adding more centroids is bound to decrease the average distance from data samples to the nearest centroid, in the same way that adding more supermarkets to a region decreases the average distance people have to travel to a supermarket. But we only get a very steep decline while we are still approaching the optimal number of clusters. For example, if you have three villages separated by some distance, adding three supermarkets (one per village) rapidly decreases the distance people have to travel to a supermarket; after that, adding more supermarkets only produces a gradual decline in the average distance travelled.
Inertia uses the square of each distance rather than raw coordinate differences: squaring keeps every term non-negative, so deviations in opposite directions can’t cancel each other out, and it penalises samples that lie far from their centroid more heavily than those close by.
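To make the definition of inertia concrete, here’s a small sketch (reusing the scaled iris features) that recomputes it by hand as the sum of squared distances from each sample to its assigned centroid and compares it with the model’s inertia_ attribute; the two numbers should agree up to floating-point rounding.

import numpy as np
from sklearn.cluster import KMeans
from sklearn.datasets import load_iris
from sklearn.preprocessing import StandardScaler

X = StandardScaler().fit_transform(load_iris(as_frame=True)['data'])

model = KMeans(n_clusters=3, n_init='auto')
labels = model.fit_predict(X)

# Sum of squared distances from each sample to the centroid of its cluster
sse = np.sum((X - model.cluster_centers_[labels]) ** 2)
print(sse, model.inertia_)  # should match up to floating-point rounding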