Clustering
-
Adding to Clusters: Nearest Neighbours
Some clustering techniques allow you to fit models to data, and you can then feed whatever data you like to the model and it will try to classify your samples. For instance, the Scikit-learn k-means model has a fit method that lets you fit the model to some data. The fit method calculates the necessary…
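As a rough sketch of that pattern (the sample data below is invented purely for illustration):

```python
# Minimal sketch: fit a k-means model, then classify new samples with predict().
import numpy as np
from sklearn.cluster import KMeans

X = np.array([[1.0, 2.0], [1.2, 1.8], [8.0, 9.0], [8.2, 9.1]])

model = KMeans(n_clusters=2, n_init=10, random_state=42)
model.fit(X)                       # fit() computes the cluster centres

new_samples = np.array([[1.1, 2.1], [7.9, 9.0]])
print(model.predict(new_samples))  # each new sample is assigned to its nearest centre
```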
-
Scikit-Learn Agglomerative Clustering
In the last post we saw how to create a dendrogram to analyse the clusters naturally present in our data. Now we’ll actually cluster the iris flower dataset using agglomerative clustering. Note that, although it doesn’t make a huge difference with the iris flower dataset, we will usually need to normalise the data (e.g. with…
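A minimal sketch of what that might look like, assuming the iris data from scikit-learn and ward linkage (the exact settings are assumptions, not necessarily the post's):

```python
# Sketch: agglomerative clustering of the iris measurements, normalised first.
from sklearn.datasets import load_iris
from sklearn.preprocessing import StandardScaler
from sklearn.cluster import AgglomerativeClustering

X = load_iris().data
X_scaled = StandardScaler().fit_transform(X)   # normalise each feature

agg = AgglomerativeClustering(n_clusters=3, linkage='ward')
labels = agg.fit_predict(X_scaled)
print(labels[:10])
```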
-
Hierarchical Clustering: Dendrograms
Hierarchical clustering involves either agglomerative clustering (where we start with every sample in its own cluster and then gradually merge clusters together) or divisive clustering (where the samples start in a single cluster, which we gradually split up). Here we’ll examine agglomerative clustering. There are a number of possible advantages to this approach. For one…
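A short sketch of building such a dendrogram with SciPy's hierarchy tools, assuming the iris data and ward linkage:

```python
# Sketch: dendrogram of the agglomerative merges for the iris data.
import matplotlib.pyplot as plt
from scipy.cluster.hierarchy import linkage, dendrogram
from sklearn.datasets import load_iris

X = load_iris().data
Z = linkage(X, method='ward')   # record of merges, built bottom-up

dendrogram(Z)
plt.xlabel('Sample index')
plt.ylabel('Merge distance')
plt.show()
```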
-
Clustering Irises with DBSCAN
In the previous post we looked at finding a value for the DBSCAN epsilon parameter, by examining distances to nearest neighbours in our data. It seemed that 0.75 might be a good value for epsilon, for the normalised data. Now we’ll actually use DBSCAN and try it out. DBSCAN has found only two clusters in…
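Roughly, that step might look like the sketch below (min_samples=5 is the scikit-learn default; the normalisation details are assumptions):

```python
# Sketch: DBSCAN on the normalised iris data with eps=0.75.
import numpy as np
from sklearn.datasets import load_iris
from sklearn.preprocessing import StandardScaler
from sklearn.cluster import DBSCAN

X = StandardScaler().fit_transform(load_iris().data)

db = DBSCAN(eps=0.75, min_samples=5)
labels = db.fit_predict(X)

# Label -1 marks noise points; the rest are cluster ids.
n_clusters = len(set(labels)) - (1 if -1 in labels else 0)
print('clusters found:', n_clusters)
print('noise points:', np.sum(labels == -1))
```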
-
Nearest Neighbours: Finding Epsilon for DBSCAN
We’ve seen that with the DBSCAN clustering algorithm, we need to determine a parameter called ‘epsilon’. This is the radius of a circle that’s drawn around a point to determine how many other points are close to it in a cluster. If we take a step back and ask what clustering involves, at the most…
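A sketch of that idea: plot each point's distance to its k-th nearest neighbour (k=5 here is an assumption), and a 'knee' in the sorted curve can suggest a value for epsilon.

```python
# Sketch: k-distance plot for choosing the DBSCAN epsilon, on normalised iris data.
import numpy as np
import matplotlib.pyplot as plt
from sklearn.datasets import load_iris
from sklearn.preprocessing import StandardScaler
from sklearn.neighbors import NearestNeighbors

X = StandardScaler().fit_transform(load_iris().data)

nn = NearestNeighbors(n_neighbors=5).fit(X)
distances, _ = nn.kneighbors(X)          # distances[:, 0] is each point to itself

k_distances = np.sort(distances[:, -1])  # distance to the 5th neighbour, sorted
plt.plot(k_distances)
plt.xlabel('Points sorted by distance')
plt.ylabel('Distance to 5th nearest neighbour')
plt.show()
```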
-
The DBSCAN Algorithm
DBSCAN is short for density-based spatial clustering of applications with noise, although I can’t help but suspect that the words here have been chosen partly to produce a cool-sounding acronym! It’s an algorithm for automatically clustering data, and unlike K-means clustering, you don’t have to specify the number of clusters in advance. DBSCAN will automatically find…
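As a small illustration of that property, on made-up blob data (the blob parameters and eps value here are assumptions):

```python
# Sketch: DBSCAN discovering the number of clusters by itself on synthetic blobs.
from sklearn.datasets import make_blobs
from sklearn.cluster import DBSCAN

X, _ = make_blobs(n_samples=300, centers=4, cluster_std=0.5, random_state=0)

labels = DBSCAN(eps=0.5, min_samples=5).fit_predict(X)
n_clusters = len(set(labels)) - (1 if -1 in labels else 0)  # -1 labels are noise
print('DBSCAN found', n_clusters, 'clusters')               # no n_clusters argument needed
```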
-
The Elbow Method
Various methods exist for trying to determine the optimal number of clusters in your data. The “elbow method” is typically used to choose the number of clusters for KMeans clustering. Various possible numbers of clusters are tried, and for each number of clusters the “inertia”, or SSE (Sum of Squared Errors), is…
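A sketch of the elbow method, here using the iris data as a stand-in:

```python
# Sketch: fit KMeans for a range of cluster counts and plot the inertia (SSE).
import matplotlib.pyplot as plt
from sklearn.datasets import load_iris
from sklearn.cluster import KMeans

X = load_iris().data

ks = range(1, 11)
inertias = [KMeans(n_clusters=k, n_init=10, random_state=0).fit(X).inertia_ for k in ks]

plt.plot(list(ks), inertias, marker='o')
plt.xlabel('Number of clusters')
plt.ylabel('Inertia (SSE)')
plt.show()   # the 'elbow' in the curve suggests a good number of clusters
```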
-
KMeans Clustering
Clustering involves dividing your data samples into clusters. For example, in the grapes.csv dataset that we’ve seen previously, there are clearly two kinds of grapes. They differ not only in colour, but also in weight, length and diameter. While there is some overlap, we might hope that a suitable clustering algorithm could divide these into…
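A sketch of what that might look like, assuming grapes.csv contains numeric 'weight', 'length' and 'diameter' columns (the column names are assumptions about the file):

```python
# Sketch: clustering the grapes.csv samples into two groups with KMeans.
import pandas as pd
from sklearn.cluster import KMeans

grapes = pd.read_csv('grapes.csv')
X = grapes[['weight', 'length', 'diameter']]   # numeric measurements only

kmeans = KMeans(n_clusters=2, n_init=10, random_state=0)
grapes['cluster'] = kmeans.fit_predict(X)
print(grapes['cluster'].value_counts())
```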