DBSCAN is short for density-based spatial clustering of applications with noise, although I can’t help but suspect that words here have been chosen partly to fit the cool-sounding algorithm!
It’s an algorithm for automatically clustering data, and unlike K-means clustering, you don’t have to specify the number of clusters in advance. DBSCAN will automatically find them for you, and classify any remaining points (points not in clusters) as “noise”.
How Does DBSCAN Work?
We start by picking a point p. Draw a circle around that point (or, a sphere if we’re dealing with 3-dimensional data, or a hypersphere if we have more dimensions). The radius of the sphere we’ll call ε, the Greek letter epsilon.
We’ll have to choose a value for ε. In other words, we’ll have to choose how big to draw our circle.
Now the question is, how many points are in that circle? If it’s equal to or greater than some number n (which we also have to choose), we’ll say that our original point is a core point.
Typically we’ll choose n to be twice the number of dimensions in our data. If our data is two dimensional (it contains two data series, meaning we have two values for every data sample), then we might choose n to be 4.
All the points within the circle, we say are directly reachable from this core point p.
Now we can look at each of the points in the circle and ask if each of those points is itself a core point. In other words, if we take each of those points in turn and draw a circle around it, are there at least n points within that circle?
Often we can now connect many other points to our original point by hopping from core point to core point. While only the points within the circle around p are directly reachable from p, many other points are still reachable from p in this way.
We only consider a point to be in the same cluster as p if it’s reachable from p.
Since this explanation is a bit involved, let’s consider an analogy involving clusters of people in a crowd.
We start with a person, whom we’ll call Bob.
If Bob stretches out his arm, how many other people can he touch? If the answer is that he can touch four other people (the people in front of him, behind him, and to his left and right side, let’s say), we’ll say Bob is a core person.
Now we look at each of these four people in turn and ask whether each of those can also touch at least four other people. Maybe we’ll find that only one of them can manage that: Alice, who’s standing in front of Bob. Alice is therefore also a core person, and the others aren’t. One of the four people Alice can touch is Peter, but Peter can touch only two other people — Alice and someone else — so Peter is not a core person.
Alice is directly reachable from Bob, and Peter is reachable from Bob but not directly reachable. So Alice, Bob and Peter are all in the same cluster. Actually, all of the people that Bob can touch plus all of the people that Alice can touch are all in the same cluster.
For any two people to be considered to be in the same cluster, they have to be able to all touch each other in a long chain. Also, every person in the chain must be able to touch at least four other people — except for the very last person in the chain.