Hierarchical Clustering: Dendrograms

Hierarchical clustering comes in two forms: agglomerative clustering, where we start with every sample in its own cluster and gradually merge clusters together, and divisive clustering, where all the samples start in a single cluster which we gradually split up.

Here we’ll examine agglomerative clustering.

This approach has a number of advantages.

For one thing, it allows us to create a dendrogram, which shows how clusters were gradually merged. If we do this on the iris data we can see that there are two big clusters in the data, one of which can then be split into two less-well-defined clusters.

The lengths of the vertical lines in the dendrogram indicate how well separated the clusters are: a long line means there was a large distance between the two clusters when they were merged.

We’ll use Scipy to create a dendrogram. The linkage function returns a table showing how clusters have gradually been merged into each other, and we can then display this using the dendrogram function.

import sklearn.datasets as ds
from sklearn.preprocessing import StandardScaler
from scipy.cluster.hierarchy import linkage, dendrogram
import matplotlib.pyplot as plt

iris = ds.load_iris(as_frame=True)

df = iris['data']

# Standardise the features so that no one feature dominates the distance measure
X = StandardScaler().fit_transform(df)

# linkage() builds the merge table; dendrogram() draws it
dendrogram(linkage(X), truncate_mode='level', p=3)
plt.show()
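To see what the table returned by linkage actually looks like, here's a minimal sketch on three made-up one-dimensional points (the data and variable names are just for illustration):

```python
import numpy as np
from scipy.cluster.hierarchy import linkage

# Three points on a line: 0.0 and 0.1 are close together, 1.0 is further away
pts = np.array([[0.0], [0.1], [1.0]])

# Each row of the table records one merge:
# [cluster_a, cluster_b, merge_distance, size_of_new_cluster].
# The original samples are numbered 0..n-1; each new cluster gets the next id (n, n+1, ...)
Z = linkage(pts, method='single')
print(Z)
```

Here points 0 and 1 merge first at distance 0.1 to form cluster 3, and that cluster then merges with point 2 at distance 0.9.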

I’ve truncated the plot using truncate_mode='level' with p set to 3, so that only the top three levels of merges are shown and we don’t see too much unnecessary detail in the dendrogram.

It’s worth at least skimming the documentation. Two important parameters to the linkage function are method and metric.

When we’re merging clusters, we need to measure distances between clusters in order to determine which clusters are the closest and should therefore be merged.

method determines how we do this: ‘single’ uses the distance between the two nearest points in the clusters; ‘average’ uses the average distance between all pairs of points across the two clusters; ‘complete’ uses the distance between the two furthest-apart points; while ‘ward’ takes a different approach entirely and merges the pair of clusters that least increases the total within-cluster variance.
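To get a feel for how the choice of method matters, here's a small sketch (the toy data is just for illustration): two tight pairs of points, clustered with each method. The height at which the final merge happens differs:

```python
import numpy as np
from scipy.cluster.hierarchy import linkage

# Two tight pairs of points, roughly 5 units apart
pts = np.array([[0.0, 0.0], [0.0, 1.0], [5.0, 0.0], [5.0, 1.0]])

for method in ['single', 'average', 'complete', 'ward']:
    Z = linkage(pts, method=method)
    # The last row's third column is the distance at which the
    # final two clusters were merged
    print(f"{method:>8}: final merge at distance {Z[-1, 2]:.3f}")
```

‘single’ reports the gap between the closest points (5.0), ‘complete’ the gap between the furthest points (slightly larger), and ‘ward’ a larger value still, reflecting the increase in variance the merge causes.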

The first three methods involve measuring distances between individual points, so we usually need a metric for doing this (‘ward’ is a special case: in Scipy it is defined only for Euclidean distance). The most obvious choice is ‘euclidean’ distance, which essentially applies Pythagoras’s theorem to measure distance in the way we’re most familiar with. However, other metrics are available. For instance, ‘cosine’ measures the angle between the lines drawn from the origin to each point.
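As a quick illustration of the difference (again with made-up points): under ‘euclidean’, a point ten times further from the origin in the same direction is far away, but under ‘cosine’ it is at distance zero, because the angle between the two directions is zero:

```python
import numpy as np
from scipy.spatial.distance import pdist

# [1, 0] and [10, 0] point in the same direction; [0, 1] is at a right angle
pts = np.array([[1.0, 0.0], [10.0, 0.0], [0.0, 1.0]])

# pdist returns the pairwise distances in the order (0,1), (0,2), (1,2)
print(pdist(pts, metric='euclidean'))
print(pdist(pts, metric='cosine'))  # cosine distance = 1 - cos(angle)
```

The same metric argument can be passed to linkage, e.g. linkage(pts, method='average', metric='cosine'), for any method other than ‘ward’.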
