
Outlier Detection

Introduction

Outliers are data points that differ significantly from other observations. They can occur due to variability in the data or due to experimental errors. Detecting outliers is crucial as they can significantly affect the results of data analysis and modeling.

Why Detect Outliers?

Outliers can skew and mislead the training process of machine learning models, leading to poor performance. They can also indicate data errors or novel phenomena, making it important to identify and analyze them.
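As a minimal illustration of this skew, compare the mean of the small dataset used throughout this tutorial with and without its extreme value:

```python
import numpy as np

data = [10, 12, 12, 13, 12, 11, 12, 13, 100]

# The single extreme value drags the mean above every typical value
mean_with = np.mean(data)          # ≈ 21.67
mean_without = np.mean(data[:-1])  # 11.875
print(mean_with, mean_without)
```

One point shifts the mean from about 11.9 to about 21.7, well outside the range of the remaining observations.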

Types of Outliers

Outliers can be classified into several types:

  • Global Outliers: Individual data points that deviate significantly from the rest of the dataset, regardless of context.
  • Contextual Outliers: Data points that are anomalous only in a specific context or condition (e.g., a temperature of 30 °C is typical in summer but anomalous in winter).
  • Collective Outliers: A group of data points that deviate together from the overall pattern, even if no single point is extreme on its own.
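
As a rough sketch of the contextual case, a value can be flagged against the statistics of its own group rather than the whole dataset. The season labels, values, and threshold below are illustrative assumptions, not part of any standard API:

```python
import numpy as np

# Toy daily temperatures with a context label (season): 30 degrees is
# normal in summer but anomalous in winter.
temps   = np.array([28.0, 30.0, 29.0, 31.0, 2.0, 1.0, 30.0, 0.0])
seasons = np.array(["summer", "summer", "summer", "summer",
                    "winter", "winter", "winter", "winter"])

flagged = {}
for season in np.unique(seasons):
    group = temps[seasons == season]
    z = (group - group.mean()) / group.std()
    # With groups this small, |z| can never reach 3, so use 1.5
    flagged[season] = group[np.abs(z) > 1.5]

print(flagged)  # 30.0 is flagged only in the winter context
```

Globally, 30 °C is unremarkable here; it only stands out relative to the winter group.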

Methods for Outlier Detection

There are various methods to detect outliers, such as:

  • Statistical Methods (e.g., Z-score, IQR)
  • Proximity-Based Methods (e.g., K-Nearest Neighbors)
  • Machine Learning Methods (e.g., Isolation Forest, One-Class SVM)

Statistical Methods

Z-score

The Z-score method assumes that the data follow an approximately normal distribution. A Z-score indicates how many standard deviations a data point lies from the mean, and points whose absolute Z-score exceeds a chosen threshold (commonly 3) are considered outliers. Note that in small samples a threshold of 3 may be unreachable: using the population standard deviation, the largest attainable |z| in a sample of n points is √(n − 1), which is about 2.83 for the 9-point dataset below.

Example:

import numpy as np
from scipy import stats

data = [10, 12, 12, 13, 12, 11, 12, 13, 100]
z_scores = stats.zscore(data)
# With only 9 points the largest attainable |z| is sqrt(8) ≈ 2.83,
# so a threshold of 3 can never fire; use 2 instead.
outliers = np.where(np.abs(z_scores) > 2)
print(outliers)
Output: (array([8]),)

Interquartile Range (IQR)

The IQR method is based on the spread of the middle 50% of the data. The IQR is the difference between the third quartile (Q3) and the first quartile (Q1); data points below Q1 − 1.5 × IQR or above Q3 + 1.5 × IQR are considered outliers.

Example:

import numpy as np

data = [10, 12, 12, 13, 12, 11, 12, 13, 100]
Q1 = np.percentile(data, 25)   # 12.0
Q3 = np.percentile(data, 75)   # 13.0
IQR = Q3 - Q1                  # 1.0
lower_bound = Q1 - 1.5 * IQR   # 10.5
upper_bound = Q3 + 1.5 * IQR   # 14.5
outliers = [x for x in data if x < lower_bound or x > upper_bound]
print(outliers)
Output: [10, 100]

Because this sample is so tightly clustered, the IQR is only 1, and the borderline value 10 falls just below the lower fence and is flagged along with 100.

Proximity-Based Methods

K-Nearest Neighbors (KNN)

In proximity-based detection, the distance of a point to its k nearest neighbors is compared with the distances typical for its neighborhood; a point that lies much farther from its neighbors than they lie from each other is considered an outlier. A widely used KNN-based detector is the Local Outlier Factor (LOF), which compares the local density around a point with the local densities around its neighbors.

Example:

from sklearn.neighbors import LocalOutlierFactor
import numpy as np

data = np.array([[10], [12], [12], [13], [12], [11], [12], [13], [100]])
# The duplicated values make density estimates unstable for very small
# neighborhoods, so use a neighborhood that spans the whole cluster.
clf = LocalOutlierFactor(n_neighbors=5, contamination=0.1)
y_pred = clf.fit_predict(data)   # -1 marks outliers, 1 marks inliers
outliers = data[y_pred == -1]
print(outliers)
Output: [[100]]

Machine Learning Methods

Isolation Forest

Isolation Forest is an ensemble method specifically designed for anomaly detection. It isolates observations by randomly selecting a feature and then randomly selecting a split value between the minimum and maximum of that feature. Anomalies require fewer random splits to isolate on average, so points with short average path lengths across the trees receive high anomaly scores.

Example:

from sklearn.ensemble import IsolationForest
import numpy as np

data = np.array([[10], [12], [12], [13], [12], [11], [12], [13], [100]])
# random_state makes the randomized trees reproducible
clf = IsolationForest(contamination=0.1, random_state=42)
y_pred = clf.fit_predict(data)   # -1 marks outliers, 1 marks inliers
outliers = data[y_pred == -1]
print(outliers)
Output: [[100]]

One-Class SVM

One-Class SVM is an unsupervised algorithm that learns a decision boundary enclosing the training data; points falling outside this boundary are treated as anomalies. Its nu parameter is an upper bound on the fraction of training points allowed to fall outside the boundary.

Example:

from sklearn.svm import OneClassSVM
import numpy as np

data = np.array([[10], [12], [12], [13], [12], [11], [12], [13], [100]])
# nu upper-bounds the fraction of training points treated as outliers;
# on 9 samples, nu=0.1 permits fewer than one point outside the
# boundary, so a larger value is needed to flag anything.
clf = OneClassSVM(nu=0.25)
y_pred = clf.fit_predict(data)
outliers = data[y_pred == -1]
print(outliers)
Output: [[100]]

Conclusion

Outlier detection is a crucial step in data preprocessing. By identifying and handling outliers, you can improve the quality of your data and the performance of your models. This tutorial covered various methods for detecting outliers, including statistical methods, proximity-based methods, and machine learning methods. Each method has its strengths and weaknesses, and the choice of method depends on the specific characteristics of your data.
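
Once outliers are identified, one common way to handle them is to cap values at the IQR fences rather than drop the rows (often called winsorizing or clipping). A minimal sketch, reusing the IQR rule from earlier:

```python
import numpy as np

data = np.array([10, 12, 12, 13, 12, 11, 12, 13, 100])
Q1, Q3 = np.percentile(data, [25, 75])
IQR = Q3 - Q1
# Clip extreme values to the IQR fences instead of removing them
capped = np.clip(data, Q1 - 1.5 * IQR, Q3 + 1.5 * IQR)
print(capped)  # 100 becomes 14.5; the borderline 10 becomes 10.5
```

Capping preserves the sample size, which matters when every row carries other useful features; dropping is preferable when the outlier is a plain data-entry error.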