K-Means Clustering Tutorial
Introduction
K-Means Clustering is an unsupervised machine learning algorithm that partitions a dataset into K distinct, non-overlapping clusters. Its goal is to minimize the within-cluster variance, that is, the sum of squared distances between each data point and the centroid of the cluster it is assigned to.
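The quantity being minimized can be written compactly. The symbols used here (J for the objective, C_k for the set of points in cluster k, and mu_k for its centroid) are notation introduced for this illustration:

```latex
J = \sum_{k=1}^{K} \sum_{x_i \in C_k} \lVert x_i - \mu_k \rVert^2,
\qquad
\mu_k = \frac{1}{|C_k|} \sum_{x_i \in C_k} x_i
```

A smaller J means the points sit closer to their own centroids. scikit-learn reports this value as the inertia_ attribute of a fitted KMeans model.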
How K-Means Clustering Works
The K-Means algorithm follows these steps:
- Specify the number of clusters K.
- Initialize the centroids by randomly selecting K data points from the dataset without replacement (for example, shuffle the dataset and take the first K points).
- Repeat until convergence (the centroids no longer change) or until a maximum number of iterations is reached:
  - Assign each data point to the nearest centroid.
  - Compute the new centroids as the mean of the data points assigned to each cluster.
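The steps above can be sketched in a few lines of NumPy. This is a minimal illustration rather than a production implementation: the function name kmeans_sketch, its parameters (max_iters, tol, seed), and the choice to leave an empty cluster's centroid unchanged are all assumptions made for this example.

```python
import numpy as np


def kmeans_sketch(X, k, max_iters=100, tol=1e-6, seed=0):
    """Illustrative K-Means loop; the name and parameters are hypothetical."""
    rng = np.random.default_rng(seed)
    # Initialization: pick K distinct data points as the starting centroids.
    centroids = X[rng.choice(len(X), size=k, replace=False)].astype(float)

    for _ in range(max_iters):
        # Assignment step: squared Euclidean distance from every point
        # to every centroid, then pick the closest centroid per point.
        distances = ((X[:, None, :] - centroids[None, :, :]) ** 2).sum(axis=2)
        labels = distances.argmin(axis=1)

        # Update step: move each centroid to the mean of its assigned points.
        # Assumption: an empty cluster simply keeps its previous centroid.
        new_centroids = centroids.copy()
        for j in range(k):
            members = X[labels == j]
            if len(members) > 0:
                new_centroids[j] = members.mean(axis=0)

        # Convergence check: stop once the centroids have stopped moving.
        shift = np.linalg.norm(new_centroids - centroids)
        centroids = new_centroids
        if shift < tol:
            break

    # Recompute the assignments so they match the final centroids.
    distances = ((X[:, None, :] - centroids[None, :, :]) ** 2).sum(axis=2)
    labels = distances.argmin(axis=1)
    return centroids, labels


# Two well-separated groups of three points each
X = np.array([[0.0, 0.0], [0.1, 0.2], [0.2, 0.1],
              [5.0, 5.0], [5.1, 5.2], [5.2, 5.1]])
centroids, labels = kmeans_sketch(X, k=2)
print(labels)
print(centroids)
```

Which group receives label 0 and which receives label 1 depends on the random initialization, so only the grouping itself is meaningful, not the label values.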
Choosing the Number of Clusters (K)
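Choosing the right number of clusters is crucial. Two common approaches are the Elbow Method, which plots the within-cluster sum of squares against K and looks for the point where increasing K further yields only small improvements, and Silhouette Analysis, which scores how close each point is to its own cluster compared with the nearest neighboring cluster.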
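As a rough sketch of both methods, the loop below fits K-Means for several candidate values of K and records the inertia (for the Elbow Method) and the mean silhouette score. The synthetic dataset, the candidate range of 2 to 8, and the variable names are assumptions chosen for illustration.

```python
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
from sklearn.metrics import silhouette_score

# Synthetic data for illustration only
X, _ = make_blobs(n_samples=300, centers=4, cluster_std=0.60, random_state=0)

inertias = {}
silhouettes = {}
for k in range(2, 9):
    model = KMeans(n_clusters=k, n_init=10, random_state=0).fit(X)
    # Elbow Method: within-cluster sum of squares for this K
    inertias[k] = model.inertia_
    # Silhouette Analysis: mean silhouette coefficient, between -1 and 1
    silhouettes[k] = silhouette_score(X, model.labels_)
    print(f"K={k}: inertia={inertias[k]:.1f}, silhouette={silhouettes[k]:.3f}")

# One simple heuristic: take the K with the highest mean silhouette score
best_k = max(silhouettes, key=silhouettes.get)
print("K with the highest silhouette score:", best_k)
```

Inertia always tends to fall as K grows, so the Elbow Method looks for where the drop levels off rather than for a minimum. The silhouette score, by contrast, can be compared directly across values of K.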
Example: K-Means Clustering in Python
Below is an example of how to implement K-Means Clustering in Python using the scikit-learn library.
First, make sure you have the required libraries:
pip install numpy pandas matplotlib scikit-learn
Example Code:
import matplotlib.pyplot as plt
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs

# Generate synthetic data: 300 points scattered around 4 centers
X, _ = make_blobs(n_samples=300, centers=4, cluster_std=0.60, random_state=0)

# Plot the raw data
plt.scatter(X[:, 0], X[:, 1], s=50)
plt.show()

# Apply K-Means; n_init and random_state are set explicitly so results are reproducible
kmeans = KMeans(n_clusters=4, n_init=10, random_state=0)
y_kmeans = kmeans.fit_predict(X)

# Plot the points colored by cluster label, with the centroids in red
plt.scatter(X[:, 0], X[:, 1], c=y_kmeans, s=50, cmap='viridis')
centers = kmeans.cluster_centers_
plt.scatter(centers[:, 0], centers[:, 1], c='red', s=200, alpha=0.75)
plt.show()
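Beyond the plot, a fitted model can also be inspected numerically. The following sketch repeats the fit on the same synthetic data and prints a few attributes; the two new_points are made-up coordinates used only to show how predict assigns unseen data to the nearest centroid.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs

X, _ = make_blobs(n_samples=300, centers=4, cluster_std=0.60, random_state=0)
kmeans = KMeans(n_clusters=4, n_init=10, random_state=0).fit(X)

# Within-cluster sum of squared distances for the final clustering
print("Inertia:", kmeans.inertia_)
# Number of iterations used by the best of the n_init runs
print("Iterations:", kmeans.n_iter_)
# How many points ended up in each cluster
cluster_sizes = np.bincount(kmeans.labels_)
print("Cluster sizes:", cluster_sizes)

# Assign new, unseen points (hypothetical values) to the nearest centroid
new_points = np.array([[0.0, 4.0], [-1.5, 3.0]])
new_labels = kmeans.predict(new_points)
print("Labels for new points:", new_labels)
```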
Advantages and Disadvantages
Advantages:
- Simple and easy to implement.
- Scales well to a large number of samples, since the cost of each iteration grows linearly with the number of data points.
- Works well with spherical clusters.
Disadvantages:
- Requires the number of clusters K to be specified beforehand.
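- May converge to a local optimum, so the result depends on the initial centroids; running the algorithm several times with different initializations reduces this risk.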
- Not suitable for clusters with non-convex shapes.
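The limitation with non-convex shapes can be seen on the classic two-moons dataset. The sketch below is an illustration under assumed settings (300 samples, noise of 0.05, the adjusted Rand index as the agreement measure), none of which come from the original text.

```python
from sklearn.cluster import KMeans
from sklearn.datasets import make_moons
from sklearn.metrics import adjusted_rand_score

# Two interleaving half-circles: each true cluster is non-convex
X_moons, y_true = make_moons(n_samples=300, noise=0.05, random_state=0)

labels = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(X_moons)

# Adjusted Rand index: 1.0 means perfect agreement with the true grouping,
# values near 0 mean agreement no better than chance
ari = adjusted_rand_score(y_true, labels)
print(f"Adjusted Rand index on two moons: {ari:.2f}")
```

Because K-Means assigns each point to its nearest centroid, the boundary between two clusters is always a straight line, so it cuts across both moons instead of following their curved shapes.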
Conclusion
K-Means Clustering is a powerful algorithm for clustering data into distinct groups. While it has some limitations, it is widely used due to its simplicity and efficiency. Understanding how to implement and evaluate K-Means Clustering is a valuable skill in the field of machine learning.