Clustering

  • Adding to Clusters: Nearest Neighbours

    Some clustering techniques allow you to fit a model to data; you can then feed new samples to the model and it will try to assign them to clusters. For instance, the Scikit-learn k-means model has a fit method that lets you fit the model to some data. The fit method calculates the necessary…

    Read more
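    A minimal sketch of that fit-then-predict pattern, assuming scikit-learn's KMeans; the training data here is a random placeholder, not the post's dataset.

    import numpy as np
    from sklearn.cluster import KMeans

    rng = np.random.default_rng(0)
    X = rng.normal(size=(100, 2))        # placeholder training data

    model = KMeans(n_clusters=3, n_init=10, random_state=0)
    model.fit(X)                         # fit learns the cluster centroids

    new_samples = np.array([[0.5, -0.2], [2.0, 2.0]])
    print(model.predict(new_samples))    # each new point gets the label of its nearest centroid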

  • Scikit-Learn Agglomerative Clustering

    In the last post we saw how to create a dendrogram to analyse the clusters naturally present in our data. Now we’ll actually cluster the iris flower dataset using agglomerative clustering. Note that, although it doesn’t make a huge difference with the iris flower dataset, we will usually need to normalise the data (e.g. with…

    Read more
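    A minimal sketch of agglomerative clustering on the iris data, including the normalisation step the post mentions (StandardScaler is an assumption here, since the post's exact choice of scaler is truncated).

    from sklearn.datasets import load_iris
    from sklearn.preprocessing import StandardScaler   # assumed scaler
    from sklearn.cluster import AgglomerativeClustering

    X = StandardScaler().fit_transform(load_iris().data)

    agg = AgglomerativeClustering(n_clusters=3)        # Ward linkage by default
    labels = agg.fit_predict(X)
    print(labels[:10])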

  • Hierarchical Clustering: Dendrograms

    Hierarchical clustering involves either agglomerative clustering (where we start with every sample in its own cluster and then gradually merge clusters together) or divisive clustering (the samples start in a single cluster, which we gradually split up). Here we’ll examine agglomerative clustering. There are a number of possible advantages to this approach. For one…

    Read more
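    A minimal sketch of building such a dendrogram, assuming SciPy's hierarchy module and the iris data used in this series.

    import matplotlib.pyplot as plt
    from scipy.cluster.hierarchy import linkage, dendrogram
    from sklearn.datasets import load_iris

    Z = linkage(load_iris().data, method='ward')   # agglomerative merge tree

    dendrogram(Z)
    plt.xlabel('sample index')
    plt.ylabel('merge distance')
    plt.show()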

  • Clustering Irises with DBSCAN

    In the previous post we looked at finding a value for the DBSCAN epsilon parameter by examining distances to nearest neighbours in our data. It seemed that 0.75 might be a good value for epsilon for the normalised data. Now we’ll actually try DBSCAN out. DBSCAN has found only two clusters in…

    Read more
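    A minimal sketch of that run, assuming standardised iris data and the eps value of 0.75 from the previous post; min_samples is left at scikit-learn's default of 5.

    import numpy as np
    from sklearn.datasets import load_iris
    from sklearn.preprocessing import StandardScaler
    from sklearn.cluster import DBSCAN

    X = StandardScaler().fit_transform(load_iris().data)

    labels = DBSCAN(eps=0.75, min_samples=5).fit_predict(X)
    print(np.unique(labels))    # -1 marks noise; the remaining values are cluster indices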

  • Nearest Neighbours: Finding Epsilon for DBSCAN

    We’ve seen that with the DBSCAN clustering algorithm, we need to determine a parameter called ‘epsilon’. This is the radius of a circle that’s drawn around a point to determine how many other points are close to it in a cluster. If we take a step back and ask what clustering involves, at the most…

    Read more
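    A minimal sketch of the nearest-neighbour distance analysis, assuming the standardised iris data; the ‘knee’ in the sorted k-distance curve suggests a value for epsilon.

    import numpy as np
    import matplotlib.pyplot as plt
    from sklearn.datasets import load_iris
    from sklearn.preprocessing import StandardScaler
    from sklearn.neighbors import NearestNeighbors

    X = StandardScaler().fit_transform(load_iris().data)

    k = 5
    nn = NearestNeighbors(n_neighbors=k + 1).fit(X)   # +1: each point's nearest neighbour is itself
    distances, _ = nn.kneighbors(X)

    plt.plot(np.sort(distances[:, -1]))               # sorted distance to each point's k-th neighbour
    plt.ylabel(f'distance to {k}th nearest neighbour')
    plt.show()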

  • The DBSCAN Algorithm

    DBSCAN is short for density-based spatial clustering of applications with noise, although I can’t help but suspect that the words here were chosen partly to fit the cool-sounding acronym! It’s an algorithm for automatically clustering data, and unlike k-means clustering, you don’t have to specify the number of clusters in advance. DBSCAN will automatically find…

    Read more
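    A minimal sketch of that property on synthetic data (make_blobs and the eps/min_samples values here are assumptions for illustration): DBSCAN is never told how many blobs there are.

    from sklearn.datasets import make_blobs
    from sklearn.cluster import DBSCAN

    X, _ = make_blobs(n_samples=300, centers=3, cluster_std=0.5, random_state=0)

    labels = DBSCAN(eps=0.5, min_samples=5).fit_predict(X)
    n_clusters = len(set(labels)) - (1 if -1 in labels else 0)   # -1 is the noise label
    print('clusters found:', n_clusters)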

  • The Elbow Method

    Various methods exist for trying to determine the optimal number of clusters in your data. The “elbow method” is typically used to choose the number of clusters for KMeans clustering. A range of possible cluster counts is tried, and for each count the “inertia” or SSE (Sum of Squared Errors) is…

    Read more
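    A minimal sketch of the elbow method, assuming synthetic blob data; inertia_ is KMeans’s sum of squared distances to the nearest centroid, and the ‘elbow’ in the curve suggests a good k.

    import matplotlib.pyplot as plt
    from sklearn.cluster import KMeans
    from sklearn.datasets import make_blobs

    X, _ = make_blobs(n_samples=300, centers=4, random_state=0)

    ks = range(1, 10)
    inertias = [KMeans(n_clusters=k, n_init=10, random_state=0).fit(X).inertia_ for k in ks]

    plt.plot(list(ks), inertias, marker='o')
    plt.xlabel('number of clusters k')
    plt.ylabel('inertia (SSE)')
    plt.show()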

  • KMeans Clustering

    Clustering involves dividing your data samples into groups of similar samples, or clusters. For example, in the grapes.csv dataset that we’ve seen previously, there are clearly two kinds of grapes. They differ not only in colour, but also in weight, length and diameter. While there is some overlap, we might hope that a suitable clustering algorithm could divide these into…

    Read more
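    A minimal sketch of clustering the grapes, assuming the grapes.csv file from the series; the numeric column names below are guesses based on the features the post mentions.

    import pandas as pd
    from sklearn.preprocessing import StandardScaler
    from sklearn.cluster import KMeans

    df = pd.read_csv('grapes.csv')
    # 'weight', 'length' and 'diameter' are assumed column names
    X = StandardScaler().fit_transform(df[['weight', 'length', 'diameter']])

    df['cluster'] = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(X)
    print(df['cluster'].value_counts())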
