K-Means Clustering | Clustering | Machine Learning Tutorial

Introduction

K-Means Clustering is an unsupervised machine learning algorithm that partitions a dataset into K distinct, non-overlapping clusters. Its goal is to minimize the within-cluster variance, that is, the sum of squared distances between each point and the centroid of its cluster.
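
The objective described above can be written compactly. This is only a sketch in notation introduced here for illustration (C_k for the set of points in cluster k, \mu_k for its centroid, J for the total cost); none of these symbols come from the original text.

```latex
J = \sum_{k=1}^{K} \sum_{x_i \in C_k} \lVert x_i - \mu_k \rVert^2,
\qquad
\mu_k = \frac{1}{|C_k|} \sum_{x_i \in C_k} x_i
```

K-Means searches for the cluster assignments and centroids that make J as small as it can.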

How K-Means Clustering Works

The K-Means algorithm follows these steps:

Specify the number of clusters K.
Initialize the centroids by randomly selecting K distinct data points (sampling without replacement).
Repeat until the assignments stop changing (convergence) or a maximum number of iterations is reached:
1. Assign each data point to its nearest centroid, typically measured by Euclidean distance.
2. Recompute each centroid as the mean of the data points assigned to its cluster.
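
The steps above can be sketched from scratch with NumPy. This is a minimal illustration, not a production implementation: the function names `nearest_centroid` and `kmeans`, the `max_iters` and `seed` parameters, the choice to leave an empty cluster's centroid unchanged, and the toy data are all assumptions made for this example.

```python
import numpy as np


def nearest_centroid(X, centroids):
    """Return, for every row of X, the index of the closest centroid."""
    # Pairwise Euclidean distances, shape (n_samples, k)
    distances = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=2)
    return distances.argmin(axis=1)


def kmeans(X, k, max_iters=100, seed=0):
    rng = np.random.default_rng(seed)
    # Initialization: pick k distinct data points as the starting centroids
    centroids = X[rng.choice(len(X), size=k, replace=False)]

    for _ in range(max_iters):
        # Step 1: assign each point to its nearest centroid
        labels = nearest_centroid(X, centroids)
        # Step 2: move each centroid to the mean of its assigned points
        # (assumption: an empty cluster keeps its previous centroid)
        new_centroids = np.array([
            X[labels == j].mean(axis=0) if np.any(labels == j) else centroids[j]
            for j in range(k)
        ])
        converged = np.allclose(new_centroids, centroids)
        centroids = new_centroids
        if converged:
            break

    # Final assignment against the centroids that are returned
    return centroids, nearest_centroid(X, centroids)


# Toy data: two tight, well-separated groups of three points each
X_toy = np.array([
    [1.0, 1.0], [1.2, 0.8], [0.9, 1.1],
    [8.0, 8.0], [8.2, 7.9], [7.9, 8.1],
])
toy_centroids, toy_labels = kmeans(X_toy, k=2)
print(toy_labels)
print(toy_centroids)
```

The cluster indices themselves are arbitrary: which group is called 0 and which is called 1 depends on the random initialization.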

Choosing the Number of Clusters (K)

Choosing the right number of clusters is crucial and can be done using methods such as the Elbow Method or Silhouette Analysis.

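Elbow Method: Plot the sum of squared distances from each point to its assigned cluster center for different values of K. This sum always tends to decrease as K grows, so the optimal K is usually taken to be the "elbow" of the plot, the point beyond which adding more clusters yields only small improvements.

Silhouette Analysis: Compute the average silhouette score, which compares how close each point is to its own cluster with how close it is to the nearest other cluster, for different values of K. Scores range from -1 to 1, and higher values indicate better-separated clusters.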
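
As a rough sketch of how both methods might be applied with scikit-learn, the snippet below fits K-Means for several values of K and records the sum of squared distances (exposed by scikit-learn as `inertia_`) and the average silhouette score. The synthetic data, the range of K values, and the `n_init` and `random_state` settings are assumptions chosen for illustration.

```python
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
from sklearn.metrics import silhouette_score

# Synthetic data generated around 4 centers (illustrative assumption)
X, _ = make_blobs(n_samples=300, centers=4, cluster_std=0.60, random_state=0)

k_values = list(range(2, 9))  # silhouette score needs at least 2 clusters
inertias = []
silhouettes = []

for k in k_values:
    model = KMeans(n_clusters=k, n_init=10, random_state=0).fit(X)
    # Sum of squared distances of points to their closest cluster center
    inertias.append(model.inertia_)
    # Mean silhouette coefficient over all samples
    silhouettes.append(silhouette_score(X, model.labels_))

for k, inertia, sil in zip(k_values, inertias, silhouettes):
    print(f"K={k}: inertia={inertia:.1f}, silhouette={sil:.3f}")
```

Plotting `inertias` against `k_values` (for example with `plt.plot`) gives the elbow curve; for silhouette analysis, one would typically pick the K with the highest score.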

Example: K-Means Clustering in Python

Below is an example of how to implement K-Means Clustering in Python using the scikit-learn library.

First, make sure you have the required libraries:

pip install numpy pandas matplotlib scikit-learn

Example Code:

import matplotlib.pyplot as plt
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs

# Generate synthetic data: 300 points drawn around 4 centers
X, _ = make_blobs(n_samples=300, centers=4, cluster_std=0.60, random_state=0)

# Plot the raw, unlabeled data
plt.scatter(X[:, 0], X[:, 1], s=50)
plt.show()

# Apply K-Means; n_init and random_state are set explicitly so the
# result is reproducible from run to run
kmeans = KMeans(n_clusters=4, n_init=10, random_state=0)
y_kmeans = kmeans.fit_predict(X)  # fit the model and return each point's cluster index

# Plot the points colored by cluster, with the learned centroids in red
plt.scatter(X[:, 0], X[:, 1], c=y_kmeans, s=50, cmap='viridis')
centers = kmeans.cluster_centers_
plt.scatter(centers[:, 0], centers[:, 1], c='red', s=200, alpha=0.75)
plt.show()

Advantages and Disadvantages

Advantages:

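Simple and easy to implement.
Scales well to large numbers of samples, since each iteration only computes the distances from every point to the K centroids.
Works well when clusters are roughly spherical and of similar size.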

Disadvantages:

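Requires the number of clusters K to be specified beforehand.
May converge to a local optimum that depends on the initial centroids; running the algorithm from several initializations and keeping the best result reduces this risk.
Not suitable for clusters with non-convex shapes, because each point is simply assigned to its nearest centroid.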
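
The last two disadvantages can be probed with a small experiment. The sketch below is illustrative only: the datasets (`make_blobs`, `make_moons`), the number of seeds, and the use of the adjusted Rand index to compare against the known moon labels are assumptions made for this example. The exact numbers will vary with the data and the scikit-learn version.

```python
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs, make_moons
from sklearn.metrics import adjusted_rand_score

# Local optima: run K-Means once per seed with purely random initialization.
# Different starting centroids may end at different final inertia values.
X_blobs, _ = make_blobs(n_samples=300, centers=4, cluster_std=0.60, random_state=0)
single_run_inertias = [
    KMeans(n_clusters=4, init="random", n_init=1, random_state=seed)
    .fit(X_blobs)
    .inertia_
    for seed in range(20)
]
print(f"best inertia:  {min(single_run_inertias):.1f}")
print(f"worst inertia: {max(single_run_inertias):.1f}")

# Non-convex shapes: two interleaved half-moons cannot be separated
# by assigning each point to its nearest centroid.
X_moons, y_moons = make_moons(n_samples=300, noise=0.05, random_state=0)
moon_labels = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(X_moons)
ari = adjusted_rand_score(y_moons, moon_labels)
print(f"adjusted Rand index on moons: {ari:.2f}")  # 1.0 would be a perfect match
```

Setting `n_init` to a value greater than 1 makes scikit-learn repeat the run from several initializations and keep the one with the lowest inertia, which is the usual mitigation for the local-optimum problem.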

Conclusion

K-Means Clustering is a powerful algorithm for clustering data into distinct groups. While it has some limitations, it is widely used due to its simplicity and efficiency. Understanding how to implement and evaluate K-Means Clustering is a valuable skill in the field of machine learning.