Unsupervised Learning Tutorial

Introduction to Unsupervised Learning

Unsupervised learning is a branch of machine learning that works with data that has no labeled responses. The goal is to infer the natural structure present within a set of data points. This contrasts with supervised learning, where the goal is to learn a mapping from inputs to outputs based on example input-output pairs.

Types of Unsupervised Learning

There are several types of unsupervised learning techniques:

  • Clustering: Grouping a set of objects in such a way that objects in the same group (called a cluster) are more similar to each other than to those in other groups.
  • Dimensionality Reduction: Reducing the number of random variables under consideration by obtaining a set of principal variables.
  • Anomaly Detection: Identifying rare items, events, or observations which raise suspicions by differing significantly from the majority of the data.

Clustering with K-Means

K-Means clustering aims to partition n observations into k clusters in which each observation belongs to the cluster with the nearest mean.
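
Formally, given cluster assignments S = {S_1, ..., S_k} with cluster means mu_i, K-Means minimizes the within-cluster sum of squared distances:

\min_{S} \sum_{i=1}^{k} \sum_{x \in S_i} \lVert x - \mu_i \rVert^2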

Example: K-Means Clustering

Let's consider a simple example of K-Means clustering using Python's scikit-learn library.

import numpy as np
from sklearn.cluster import KMeans
import matplotlib.pyplot as plt

# Generate synthetic data
X = np.array([[1, 2], [1, 4], [1, 0], [4, 2], [4, 4], [4, 0]])

# Create a KMeans instance with 2 clusters
# (n_init=10 makes the number of centroid initializations explicit across scikit-learn versions)
kmeans = KMeans(n_clusters=2, n_init=10, random_state=0).fit(X)

# Predict the cluster for each data point
predictions = kmeans.predict(X)

# Visualize the clusters
plt.scatter(X[:, 0], X[:, 1], c=predictions, cmap='viridis')
plt.scatter(kmeans.cluster_centers_[:, 0], kmeans.cluster_centers_[:, 1], s=300, c='red', label='Centroids')
plt.title('K-Means Clustering')
plt.legend()
plt.show()

The above code will produce a scatter plot visualizing the clustered data points and the centroids of each cluster.
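
Beyond the plot, the fitted estimator exposes the learned structure directly. The following is a minimal sketch reusing the same synthetic data; cluster_centers_, inertia_, and predict are standard scikit-learn attributes and methods.

import numpy as np
from sklearn.cluster import KMeans

X = np.array([[1, 2], [1, 4], [1, 0], [4, 2], [4, 4], [4, 0]])
kmeans = KMeans(n_clusters=2, n_init=10, random_state=0).fit(X)

# Learned centroids, one row per cluster
print(kmeans.cluster_centers_)

# Inertia: sum of squared distances from each sample to its nearest centroid
print(kmeans.inertia_)

# Assign new, unseen points to the learned clusters
print(kmeans.predict(np.array([[0, 0], [5, 5]])))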

Dimensionality Reduction with PCA

Principal Component Analysis (PCA) is a technique for reducing the dimensionality of high-dimensional datasets, increasing interpretability while minimizing information loss. It does so by creating new, uncorrelated variables (the principal components) that successively maximize variance.

Example: PCA

Let's see an example of applying PCA using scikit-learn.

import numpy as np
from sklearn.decomposition import PCA
import matplotlib.pyplot as plt

# Generate synthetic data
np.random.seed(0)
X = np.random.randn(100, 3)

# Apply PCA to reduce to 2 dimensions
pca = PCA(n_components=2)
X_r = pca.fit_transform(X)

# Visualize the reduced data
plt.scatter(X_r[:, 0], X_r[:, 1])
plt.title('PCA Dimensionality Reduction')
plt.xlabel('Principal Component 1')
plt.ylabel('Principal Component 2')
plt.show()

This code will produce a scatter plot showing the data points reduced to two principal components.
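
To quantify the "successively maximize variance" property described above, you can inspect how much of the total variance each component captures. A minimal sketch, reusing the same synthetic data; explained_variance_ratio_ and components_ are standard PCA attributes.

import numpy as np
from sklearn.decomposition import PCA

np.random.seed(0)
X = np.random.randn(100, 3)

pca = PCA(n_components=2).fit(X)

# Fraction of the total variance captured by each principal component
print(pca.explained_variance_ratio_)

# Each row is a principal axis expressed in the original 3-D feature space
print(pca.components_)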

Anomaly Detection with Isolation Forest

Isolation Forest is an algorithm for anomaly detection that isolates observations by randomly selecting a feature and then randomly selecting a split value between the maximum and minimum values of the selected feature. Because anomalies are few and different, they tend to be isolated in fewer splits, so points with short average path lengths across the trees are flagged as anomalous.

Example: Isolation Forest

The following example demonstrates anomaly detection using the Isolation Forest algorithm.

import numpy as np
from sklearn.ensemble import IsolationForest
import matplotlib.pyplot as plt

# Generate synthetic data (seeded so the example is reproducible)
np.random.seed(0)
X = np.random.rand(100, 2)

# Introduce some outliers
X = np.vstack([X, [3, 3], [3, 4], [4, 4]])

# Apply Isolation Forest
clf = IsolationForest(random_state=0).fit(X)
predictions = clf.predict(X)

# Visualize the anomalies
plt.scatter(X[:, 0], X[:, 1], c=predictions, cmap='coolwarm')
plt.title('Isolation Forest Anomaly Detection')
plt.show()

This code will produce a scatter plot showing the data points, with anomalies highlighted in a different color.
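
Since predict labels inliers as 1 and anomalies as -1, the flagged points can be counted or extracted directly. The following sketch reuses the same data; decision_function and the contamination parameter are standard scikit-learn API, and contamination=0.03 is an assumed expected outlier fraction chosen to match the three injected outliers.

import numpy as np
from sklearn.ensemble import IsolationForest

np.random.seed(0)
X = np.vstack([np.random.rand(100, 2), [[3, 3], [3, 4], [4, 4]]])

# contamination=0.03 is an assumption: roughly 3 outliers out of 103 points
clf = IsolationForest(contamination=0.03, random_state=0).fit(X)
predictions = clf.predict(X)  # 1 = inlier, -1 = anomaly

print("Number of anomalies:", np.sum(predictions == -1))
print("Anomalous points:\n", X[predictions == -1])

# decision_function: lower (more negative) scores indicate stronger anomalies
scores = clf.decision_function(X)
print("Three lowest scores:", np.sort(scores)[:3])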

Conclusion

Unsupervised learning is a powerful tool for discovering patterns and structures in data. Techniques like clustering, dimensionality reduction, and anomaly detection are widely used in various fields such as data mining, computer vision, and bioinformatics. Understanding these techniques and their applications is crucial for any data scientist or machine learning practitioner.