Dbscan | Unsupervised Learning | Machine Learning Tutorial

Introduction

DBSCAN (Density-Based Spatial Clustering of Applications with Noise) is a popular clustering algorithm used in machine learning and data mining. It groups together points that are close to each other based on a distance measurement and a minimum number of points. Unlike other clustering algorithms like K-Means, DBSCAN does not require the number of clusters to be specified beforehand and can identify outliers that do not belong to any cluster.

Concepts and Parameters

DBSCAN has two main parameters:

eps: The maximum distance between two points for them to be considered as in the same neighborhood.
min_samples: The minimum number of points required to form a dense region (core point).

Based on these parameters, DBSCAN classifies points into three categories:

Core points: Points that have at least min_samples points within eps distance.
Border points: Points that are within eps distance of a core point but do not have enough neighbors to be a core point themselves.
Noise points: Points that are neither core points nor border points.

Algorithm Steps

The DBSCAN algorithm works as follows:

Start with an arbitrary point that has not been visited.
Retrieve the neighborhood of this point using eps.
If the neighborhood contains at least min_samples points, mark this point as a core point and create a cluster. If not, mark this point as noise.
If a point is found to be a core point, iterate through its neighborhood points and repeat the process for each point that has not been visited.
Continue the process until all points have been visited.

Example Implementation

Let's see how DBSCAN can be implemented using Python and the scikit-learn library. We will use a simple dataset to demonstrate the clustering process.

Install scikit-learn if you haven't already:

pip install scikit-learn

Here is a complete example:

import numpy as np
import matplotlib.pyplot as plt
from sklearn.cluster import DBSCAN
from sklearn.datasets import make_blobs

# Generate sample data
centers = [[1, 1], [-1, -1], [1, -1]]
X, _ = make_blobs(n_samples=750, centers=centers, cluster_std=0.4, random_state=0)

# Compute DBSCAN
db = DBSCAN(eps=0.3, min_samples=10).fit(X)
core_samples_mask = np.zeros_like(db.labels_, dtype=bool)
core_samples_mask[db.core_sample_indices_] = True
labels = db.labels_

# Number of clusters in labels, ignoring noise if present.
n_clusters_ = len(set(labels)) - (1 if -1 in labels else 0)

print(f'Estimated number of clusters: {n_clusters_}')

# Plot result
unique_labels = set(labels)
colors = [plt.cm.Spectral(each) for each in np.linspace(0, 1, len(unique_labels))]
for k, col in zip(unique_labels, colors):
    if k == -1:
        # Black used for noise.
        col = [0, 0, 0, 1]

    class_member_mask = (labels == k)

    xy = X[class_member_mask & core_samples_mask]
    plt.plot(xy[:, 0], xy[:, 1], 'o', markerfacecolor=tuple(col),
             markeredgecolor='k', markersize=14)

    xy = X[class_member_mask & ~core_samples_mask]
    plt.plot(xy[:, 0], xy[:, 1], 'o', markerfacecolor=tuple(col),
             markeredgecolor='k', markersize=6)

plt.title(f'Estimated number of clusters: {n_clusters_}')
plt.show()

In this example:

We generate sample data using make_blobs with three centers.
We apply the DBSCAN algorithm with eps=0.3 and min_samples=10.
We plot the resulting clusters, with noise points shown in black.

Advantages and Disadvantages

DBSCAN has several advantages:

It does not require the number of clusters to be specified beforehand.
It can find arbitrarily shaped clusters.
It is robust to outliers.

However, DBSCAN also has some disadvantages:

It may struggle with clusters of varying density, as the same eps and min_samples may not apply to all clusters.
It can be computationally expensive for large datasets.

Conclusion

DBSCAN is a powerful and versatile clustering algorithm suitable for many types of data. It is particularly useful when the number of clusters is unknown and when dealing with noisy data. By understanding its parameters and how it works, you can effectively apply DBSCAN to your own datasets to discover meaningful patterns and structures.