Hierarchical Clustering Tutorial
Introduction
Hierarchical clustering is a type of unsupervised learning algorithm used to group similar objects into clusters. It builds a hierarchy of clusters either in a bottom-up approach (agglomerative) or a top-down approach (divisive). The end result is a tree-like structure called a dendrogram, which represents the nested grouping of objects and the order in which clusters are merged or split.
Types of Hierarchical Clustering
There are two main types of hierarchical clustering:
- Agglomerative Hierarchical Clustering: A bottom-up approach where each data point starts as its own cluster, and pairs of clusters are merged as one moves up the hierarchy (a toy sketch of this process follows the list).
- Divisive Hierarchical Clustering: This is a top-down approach where all data points start in one cluster, and splits are performed recursively as one moves down the hierarchy.
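To make the bottom-up idea concrete, here is a toy agglomerative sketch in pure NumPy using single linkage. The function name naive_agglomerative is invented for this illustration; real implementations, such as scipy's, use far more efficient algorithms.

import numpy as np

def naive_agglomerative(X, n_clusters):
    # Bottom-up: every point begins as its own cluster.
    clusters = [[i] for i in range(len(X))]
    while len(clusters) > n_clusters:
        best = (0, 1, np.inf)
        # Find the pair of clusters with the smallest single-linkage
        # distance (minimum distance between any two member points).
        for a in range(len(clusters)):
            for b in range(a + 1, len(clusters)):
                d = min(np.linalg.norm(X[i] - X[j])
                        for i in clusters[a] for j in clusters[b])
                if d < best[2]:
                    best = (a, b, d)
        a, b, _ = best
        # Merge the closest pair and repeat.
        clusters[a].extend(clusters[b])
        del clusters[b]
    return clusters

np.random.seed(0)
print(naive_agglomerative(np.random.rand(6, 2), 2))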
Distance Metrics
In hierarchical clustering, the similarity between observations, and hence between clusters, is measured with a distance metric. Commonly used distance metrics include (see the snippet after this list):
- Euclidean Distance: The straight-line distance between two points in Euclidean space.
- Manhattan Distance: The sum of the absolute differences between the coordinates of the points.
- Cosine Distance: One minus the cosine of the angle between two vectors. The cosine itself is a similarity, so it is converted into a distance before clustering.
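All three metrics are available in scipy.spatial.distance. A quick sketch, where the vectors u and v are arbitrary example values:

import numpy as np
from scipy.spatial.distance import euclidean, cityblock, cosine

u = np.array([1.0, 2.0, 3.0])
v = np.array([4.0, 0.0, 3.0])

print(euclidean(u, v))   # straight-line (Euclidean) distance
print(cityblock(u, v))   # Manhattan distance (sum of absolute differences)
print(1 - cosine(u, v))  # scipy's cosine() returns the distance, so 1 - it is the similarity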
Linkage Criteria
Linkage criteria determine how the distance between two clusters is computed from the distances between their member points. Common linkage criteria include (compared in the snippet after this list):
- Single Linkage: The minimum distance between points in the two clusters.
- Complete Linkage: The maximum distance between points in the two clusters.
- Average Linkage: The average distance between points in the two clusters.
- Ward's Method: Merges the pair of clusters that produces the smallest increase in total within-cluster variance.
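The choice of linkage changes the merge sequence and the merge distances. A small sketch comparing the four criteria on arbitrary random data:

import numpy as np
from scipy.cluster.hierarchy import linkage

np.random.seed(0)
X = np.random.rand(6, 2)

for method in ['single', 'complete', 'average', 'ward']:
    Z = linkage(X, method=method)
    # The third column of Z holds the distance at which each merge occurred;
    # the last row is the final merge that joins everything into one cluster.
    print(method, '-> final merge distance:', Z[-1, 2])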
Implementing Hierarchical Clustering in Python
Let's implement hierarchical clustering using Python's scipy and matplotlib libraries.
Example
First, install the required libraries:
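pip install numpy scipy matplotlib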
Next, let's write the code:
import numpy as np
import matplotlib.pyplot as plt
from scipy.cluster.hierarchy import dendrogram, linkage
# Generate sample data
np.random.seed(42)
X = np.random.rand(10, 2)
# Perform agglomerative clustering with Ward's method.
# Each row of the linkage matrix Z records one merge: the indices of the
# two clusters merged, the distance between them, and the new cluster's size.
Z = linkage(X, method='ward')
# Plot dendrogram
plt.figure(figsize=(10, 7))
dendrogram(Z)
plt.title('Hierarchical Clustering Dendrogram')
plt.xlabel('Sample index')
plt.ylabel('Distance')
plt.show()
Dendrogram Interpretation
A dendrogram is a tree-like diagram that records the sequence of merges or splits. Here's how to interpret it (a snippet for cutting the tree follows this list):
- The y-axis represents the distance or dissimilarity between clusters.
- Each merge is represented by a horizontal line.
- The height of the horizontal line represents the distance at which the clusters were merged.
- The closer the merge is to the bottom, the more similar the clusters are.
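Reading heights off the dendrogram corresponds to cutting the tree at a distance threshold. A minimal sketch using scipy's fcluster, where the threshold 0.5 is an arbitrary illustration rather than a recommended value:

import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster

np.random.seed(42)
X = np.random.rand(10, 2)
Z = linkage(X, method='ward')

# Cut the tree at height 0.5: merges below the threshold are kept,
# so each point gets the label of the flat cluster it falls into.
labels = fcluster(Z, t=0.5, criterion='distance')
print(labels)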
Choosing the Optimal Number of Clusters
To choose the number of clusters, you can cut the dendrogram at the largest vertical gap, that is, the longest vertical distance not crossed by any horizontal merge line; the number of vertical lines the cut intersects gives the number of clusters. This heuristic is analogous to the elbow method used with k-means, although it works on merge distances rather than a plot of within-cluster error.
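One simple way to automate this heuristic is to look at the gaps between successive merge distances in the linkage matrix and cut inside the largest gap. A sketch, reusing the same random data as the example above:

import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster

np.random.seed(42)
X = np.random.rand(10, 2)
Z = linkage(X, method='ward')

# Merge distances appear in increasing order in the third column of Z.
heights = Z[:, 2]
gaps = np.diff(heights)
# Cut midway through the largest gap between successive merges.
cut = heights[np.argmax(gaps)] + gaps.max() / 2
labels = fcluster(Z, t=cut, criterion='distance')
print('number of clusters:', labels.max())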
Conclusion
Hierarchical clustering is a powerful and versatile method for grouping similar objects into clusters. By understanding the different types, distance metrics, and linkage criteria, you can effectively apply hierarchical clustering to various datasets. Visualizing the results with a dendrogram can provide valuable insights into the structure and relationships within your data.