In the previous post we looked at finding a value for the DBSCAN epsilon parameter, by examining distances to nearest neighbours in our data.
It seemed that 0.75 might be a good value for epsilon, for the normalised data.
Now we’ll actually use DBSCAN and try it out.
    from sklearn.cluster import DBSCAN
    import pandas as pd
    import numpy as np
    from sklearn.datasets import load_iris
    from sklearn.preprocessing import StandardScaler
    import matplotlib.pyplot as plt
    import seaborn as sn

    # Load the iris data as a data frame and normalise each column
    iris = load_iris(as_frame=True)
    df = iris['data']
    X = StandardScaler().fit_transform(df)
    # Map the numeric targets to species names for the plot legend
    y = np.choose(iris['target'], iris['target_names'])

    # Cluster the normalised data
    model = DBSCAN(eps=0.75, min_samples=8)
    model.fit(X)
    y_predicted = model.labels_

    # Column indices of the data series to plot on each axis
    plot_x = 0
    plot_y = 2
    x_col = df.columns[plot_x]
    y_col = df.columns[plot_y]

    # Left plot: actual species; right plot: DBSCAN cluster labels
    fig, axes = plt.subplots(ncols=2)
    fig.suptitle("Iris Flower DBSCAN Clustering")
    axes[0].set_xlabel(x_col)
    axes[0].set_ylabel(y_col)
    axes[1].set_xlabel(x_col)
    axes[1].set_ylabel(y_col)
    sn.scatterplot(data=df, x=x_col, y=y_col, hue=y, palette='husl', ax=axes[0])
    sn.scatterplot(data=df, x=x_col, y=y_col, hue=y_predicted, palette='husl', ax=axes[1])
    plt.show()
DBSCAN has found only two clusters in the iris data with these parameters. Indeed, Iris virginica and Iris versicolor are very similar to each other.
We can try to tweak our parameters, but first, some notes about the program.
- DBSCAN assigns non-negative integers (0, 1, 2 and so on) to clusters, and -1 to samples that it considers noise (not part of any cluster). So here it has found 2 clusters, plus a few noise samples.
- I’ve set min_samples to 8 because our data has four columns, and twice the number of columns (data series) is often a good starting value for min_samples.
- I’ve obtained the column names from the data frame rather than writing them out, because that’s easier and less error-prone with long names. So plot_x is the column index of the data series we’re plotting on the x-axis, and plot_y is the same for the y-axis.
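To check the cluster and noise counts without reading them off the plot, we can count the labels directly. This is only a minimal sketch: summarise_labels is a helper name I've made up for illustration, and it assumes the same eps and min_samples values as the program above.

```python
import numpy as np
from sklearn.cluster import DBSCAN
from sklearn.datasets import load_iris
from sklearn.preprocessing import StandardScaler


def summarise_labels(labels):
    """Return (number of clusters, number of noise samples) for DBSCAN labels.

    Hypothetical helper, not part of scikit-learn.
    """
    labels = np.asarray(labels)
    # -1 marks noise samples; it is not a cluster id
    n_noise = int(np.sum(labels == -1))
    # Every distinct label other than -1 is one cluster
    n_clusters = len(set(labels.tolist()) - {-1})
    return n_clusters, n_noise


# Same normalisation and parameters as the main program
X = StandardScaler().fit_transform(load_iris().data)
labels = DBSCAN(eps=0.75, min_samples=8).fit(X).labels_

n_clusters, n_noise = summarise_labels(labels)
print(f"clusters: {n_clusters}, noise samples: {n_noise}")
```

The noise label has to be excluded before counting, otherwise -1 would be counted as an extra cluster.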
If we have too few clusters in our data, that suggests the epsilon parameter might be too large. The circles that DBSCAN conceptually draws around points are then too large, so points that really belong to different clusters get merged into the same cluster.
A value of 0.53 does slightly better, although it creates too many clusters.
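One way to explore this is to loop over a few epsilon values and see how the number of clusters and noise samples changes. The sketch below assumes min_samples stays at 8; the values 0.4 and 1.0 are arbitrary extra picks of mine for comparison, not recommendations.

```python
from sklearn.cluster import DBSCAN
from sklearn.datasets import load_iris
from sklearn.preprocessing import StandardScaler

X = StandardScaler().fit_transform(load_iris().data)

# eps -> (number of clusters, number of noise samples)
results = {}
for eps in (0.4, 0.53, 0.75, 1.0):
    labels = DBSCAN(eps=eps, min_samples=8).fit(X).labels_
    n_clusters = len(set(labels.tolist()) - {-1})  # ignore the noise label
    n_noise = int((labels == -1).sum())
    results[eps] = (n_clusters, n_noise)
    print(f"eps={eps}: {n_clusters} clusters, {n_noise} noise samples")
```

With min_samples fixed, a larger epsilon can only turn noise samples into cluster members, never the other way round, so the noise count should fall (or stay the same) as epsilon grows.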
Overall I think K-means clustering does better with the iris data, because it allows us to specify the number of clusters. Where DBSCAN really excels is with irregularly-shaped clusters (even very irregularly-shaped) that are well-separated.
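To illustrate that last point, here is a rough sketch using scikit-learn's synthetic "two moons" data instead of the iris data. The dataset, the eps of 0.2, the min_samples of 5 and the use of the adjusted Rand index as a score are all my own assumptions for this example, not something tuned or taken from the earlier posts.

```python
from sklearn.cluster import DBSCAN, KMeans
from sklearn.datasets import make_moons
from sklearn.metrics import adjusted_rand_score

# Two interleaved crescents: irregularly shaped but well separated
X_moons, y_moons = make_moons(n_samples=300, noise=0.05, random_state=0)

dbscan_labels = DBSCAN(eps=0.2, min_samples=5).fit(X_moons).labels_
kmeans_labels = KMeans(n_clusters=2, n_init=10, random_state=0).fit(X_moons).labels_

# Adjusted Rand index: 1.0 means the clustering matches the true grouping exactly
ari_dbscan = adjusted_rand_score(y_moons, dbscan_labels)
ari_kmeans = adjusted_rand_score(y_moons, kmeans_labels)
print(f"DBSCAN ARI:  {ari_dbscan:.2f}")
print(f"K-means ARI: {ari_kmeans:.2f}")
```

K-means effectively splits the plane with a straight line between two centroids, so it tends to cut across both crescents, while DBSCAN can follow each crescent along its dense curve.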