
Nearest Neighbours: Finding Epsilon for DBSCAN

We’ve seen that with the DBSCAN clustering algorithm, we need to determine a parameter called ‘epsilon’. This is the radius of the circle drawn around each point to determine how many other points lie close enough to it to count as neighbours.

If we take a step back and ask what clustering involves, at the most basic level it involves the idea that, given a collection of data samples (or points in a hyperspace), some samples form clusters in which the samples are closer to each other than they are to samples in other clusters.

For instance, the stars in our galaxy are closer to each other than they are to stars in other galaxies.

We could use various definitions of distance to measure closeness, but the most common is probably the Euclidean distance.

Suppose we have the points (3, 4) and (1, 2). The Euclidean distance between these points is found like this: \sqrt{(3 - 1)^2 + (4 - 2)^2} = \sqrt{8} \approx 2.83

This formula gives us the ordinary everyday “distance” between two points in physical space, but it can also be used in abstract spaces with more than three dimensions.
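To make that concrete in code, here’s a minimal sketch using NumPy (the two points are just the ones from the example above):

import numpy as np

a = np.array([3, 4])
b = np.array([1, 2])

# Square root of the sum of squared differences
print(np.sqrt(np.sum((a - b) ** 2)))   # 2.828...
print(np.linalg.norm(a - b))           # the same thing via NumPy's norm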

A strategy for finding epsilon is then as follows.

First, compile, for each point, a list of the distances to its n nearest neighbours. For example, if we have 2D data, meaning we have two values for each sample, we could choose n to be four (just doubling the number of dimensions).

Then we ask, for each sample, what the distance is to each of its four nearest neighbours.

If we plan to use the DBSCAN algorithm, the question will arise of whether these four nearest neighbours are within the circle we draw around each point.

Let’s then look at the furthest away of these four nearest neighbours. If this lies within the circle of radius epsilon, the other three nearest neighbours will too. If it doesn’t, we haven’t got four other samples within the circle; at most we’ve got three.

We’ll collect all of these distances for every sample, sort them and plot them on a graph.

What we hope to find is that, as we move along the sorted distances, at some point the distance to the furthest of the four nearest neighbours increases dramatically, because for those samples we have to go to a different cluster altogether to find that fourth neighbour. Before that point the distance increases only slowly, because the fourth neighbour still lies within the same cluster. The distance at which this sudden jump happens is a good candidate for epsilon.

To actually implement this we’ll use Scikit-learn’s NearestNeighbors model.

Using NearestNeighbors

Take a look at the following Python program.

from sklearn.neighbors import NearestNeighbors
import pandas as pd

df = pd.DataFrame({'x': [1, -2, 3, 0], 'y': [3, 1, -2, 4]})
print(df)


# Find the 3 nearest neighbours of each point
nn = NearestNeighbors(n_neighbors=3)
nn.fit(df)

# Called with no arguments, kneighbors() leaves each point out of
# its own list of neighbours
distances, indices = nn.kneighbors()
print()
print(distances)
print()
print(indices)
Here’s the output:

   x  y
0  1  3
1 -2  1
2  3 -2
3  0  4

[[1.41421356 3.60555128 5.38516481]
 [3.60555128 3.60555128 5.83095189]
 [5.38516481 5.83095189 6.70820393]
 [1.41421356 3.60555128 6.70820393]]

[[3 1 2]
 [0 3 2]
 [0 1 3]
 [0 1 2]]

We’ve started by creating some 2D data (two data series, meaning we have two Pandas dataframe columns).

For each sample (each row, or data point), we then find the three samples that are nearest to it, using Scikit-learn NearestNeighbors.

There are only four data samples in total, so for each point we will actually find all three of the other points in this case. Each point has only three neighbours.

The output produced by the kneighbors method of NearestNeighbors contains the distances to those neighbouring points, and the indices of those points, in order of distance.

Consider the first sample (or point). It’s at (1,3). The array of indices contains [3 1 2] for this sample. That tells us that the nearest point to this is at index 3 in the data; the second nearest point is at index 1, and the third nearest point is at index 2.

At index 3 we find the point (0,4). The Euclidean distance from (1,3) to (0,4) is \sqrt{(1-0)^2 + (3-4)^2} = \sqrt{2}

This works out to be 1.414, which is the number listed in the first column of the first row of the distances output.
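If you’d like to cross-check the whole kneighbors output rather than just one entry, one quick way (a small sketch using scikit-learn’s euclidean_distances helper on the same four points) is to print the full pairwise distance matrix; each row of the distances output above is just the smallest entries of the corresponding row here, excluding the zero on the diagonal, in ascending order.

from sklearn.metrics.pairwise import euclidean_distances
import pandas as pd

df = pd.DataFrame({'x': [1, -2, 3, 0], 'y': [3, 1, -2, 4]})

# 4 x 4 matrix of distances between every pair of points
print(euclidean_distances(df))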

Plotting “Furthest Nearest Neighbour” for the Iris Flower Dataset

Now let’s calculate the distances to the n nearest neighbours for the Iris Flower dataset. We’ll set n to 8 (2 × 4), since there are four data series in this dataset. We’ll then sort the distances to the 8th nearest neighbour and plot them on a graph.

from sklearn.neighbors import NearestNeighbors
import pandas as pd
import numpy as np
from sklearn.datasets import load_iris
from sklearn.preprocessing import StandardScaler
import matplotlib.pyplot as plt

iris = load_iris(as_frame=True)

df = iris['data']

# Standardise each feature to zero mean and unit variance
X = StandardScaler().fit_transform(df)

# Find the 8 nearest neighbours of every sample
nn = NearestNeighbors(n_neighbors=8)
nn.fit(X)

distances, indices = nn.kneighbors()

# Sort each column independently; column 7 then holds the distances
# to the 8th nearest neighbour in ascending order
sorted_distances = np.sort(distances, axis=0)

fig = plt.figure(figsize=(16,9))
fig.suptitle("Iris Flower Data Furthest Nearest Neighbour Distance")
ax = fig.add_subplot()
ax.set_xlabel("Sample number")
ax.set_ylabel("Distance to furthest nearest neighbor")
ax.plot(sorted_distances[:,7])
# Dashed line roughly at the elbow, suggesting a value of epsilon for DBSCAN
ax.axhline(y=0.75, linestyle='dashed')
plt.show()

I’ve also added a dashed line around the epsilon value where the average distance to the furthest of the 8 nearest neighbours starts to increase dramatically. This suggests a possible value for epsilon for use with DBSCAN.

In the next post we’ll try using this value for DBSCAN and see how well it clusters the iris flower data.

Notice I’ve also normalised the data with StandardScaler, which might not make much difference for this dataset, but could make a huge difference with some datasets.
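As a rough preview of that next step, here’s a minimal sketch of how the value might be plugged in; eps=0.75 is just the figure read off the plot above, and min_samples=8 (matching the eight neighbours we looked at) is an assumption to be tested rather than a recommendation:

from sklearn.cluster import DBSCAN
from sklearn.datasets import load_iris
from sklearn.preprocessing import StandardScaler

iris = load_iris(as_frame=True)
X = StandardScaler().fit_transform(iris['data'])

# eps from the elbow in the k-distance plot; min_samples matches the
# number of neighbours used above (both are assumptions to verify)
labels = DBSCAN(eps=0.75, min_samples=8).fit_predict(X)
print(labels)   # -1 marks the points DBSCAN treats as noise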
