Feature Scaling Tutorial
Introduction to Feature Scaling
Feature scaling is a crucial step in preprocessing data before applying machine learning algorithms. It transforms feature values onto a comparable scale, typically to improve the performance and training stability of a model. Many machine learning algorithms perform better or converge faster when features are on a relatively similar scale and/or close to normally distributed.
Why is Feature Scaling Important?
Feature scaling is essential for several reasons:
- Algorithms Sensitive to Scale: Distance-based algorithms such as Support Vector Machines (SVM) and K-Nearest Neighbors (KNN) are sensitive to the scale of the input features, because features with larger numeric ranges dominate the distance calculations (see the sketch after this list).
- Gradient Descent Convergence: Gradient descent converges faster when features are on a similar scale, since the loss surface is less elongated.
- Improved Accuracy: Features on inconsistent scales can dominate the objective and lead to poor model accuracy.
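To make the first point concrete, here is a minimal sketch (using made-up age and income values, not data from this tutorial) comparing Euclidean distances before and after standardization. On the raw data, the income column, with its much larger range, dominates the distances almost entirely.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Hypothetical example: two features on very different scales,
# e.g. age (tens) and annual income (tens of thousands).
X = np.array([
    [25, 50_000],
    [26, 90_000],
    [60, 52_000],
])

# Euclidean distance between sample 0 and the other two samples
# before scaling: the income column dominates the result.
d_raw = np.linalg.norm(X[0] - X[1:], axis=1)

# After standardization, both features contribute comparably.
X_scaled = StandardScaler().fit_transform(X)
d_scaled = np.linalg.norm(X_scaled[0] - X_scaled[1:], axis=1)

print("raw distances:   ", d_raw)
print("scaled distances:", d_scaled)
```

Before scaling, the sample that differs mainly in income appears roughly twenty times farther away than the sample that differs greatly in age; after standardization the two distances are comparable.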
Types of Feature Scaling
There are several methods to scale features, including:
- Min-Max Scaling: Scales the data to a fixed range, usually 0 to 1.
- Standardization (Z-score normalization): Centers the data to have a mean of 0 and a standard deviation of 1.
- Robust Scaling: Uses the median and interquartile range for scaling and is robust to outliers.
- MaxAbs Scaling: Scales each feature by its maximum absolute value.
Min-Max Scaling
Min-Max Scaling transforms features by scaling them to a given range, usually 0 to 1. The formula is:

X_scaled = (X - X_min) / (X_max - X_min)
Example:
Let's say we have a feature with values ranging from 10 to 50. We can scale it to the range 0 to 1:
```python
import numpy as np
from sklearn.preprocessing import MinMaxScaler

data = np.array([[10], [20], [30], [40], [50]])
scaler = MinMaxScaler()
scaled_data = scaler.fit_transform(data)
print(scaled_data)
```

Output:

```
[[0.  ]
 [0.25]
 [0.5 ]
 [0.75]
 [1.  ]]
```
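If you want to see the formula applied directly, the short sketch below reproduces the MinMaxScaler result with plain NumPy on the same data. It is just an illustration of the formula, not a replacement for the scikit-learn API.

```python
import numpy as np

data = np.array([[10], [20], [30], [40], [50]], dtype=float)

# Apply the min-max formula column-wise: (X - X_min) / (X_max - X_min)
data_min = data.min(axis=0)
data_max = data.max(axis=0)
scaled_manual = (data - data_min) / (data_max - data_min)

print(scaled_manual)  # matches the MinMaxScaler output above
```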
Standardization (Z-score normalization)
Standardization scales the data to have a mean of 0 and a standard deviation of 1. The formula is:

X_scaled = (X - μ) / σ

where μ is the mean and σ is the standard deviation of the feature.
Example:
Let's standardize a feature with the values 1 through 5:
```python
import numpy as np
from sklearn.preprocessing import StandardScaler

data = np.array([[1], [2], [3], [4], [5]])
scaler = StandardScaler()
standardized_data = scaler.fit_transform(data)
print(standardized_data)
```

Output:

```
[[-1.41421356]
 [-0.70710678]
 [ 0.        ]
 [ 0.70710678]
 [ 1.41421356]]
```
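As a quick sanity check, the standardized column should now have a mean of 0 and a standard deviation of 1. The small sketch below (continuing the example above) verifies this; note that StandardScaler uses the population standard deviation, i.e. ddof=0.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

data = np.array([[1], [2], [3], [4], [5]], dtype=float)
standardized_data = StandardScaler().fit_transform(data)

# Mean should be (numerically) 0 and the population std should be 1.
print(standardized_data.mean(axis=0))         # [0.]
print(standardized_data.std(axis=0, ddof=0))  # [1.]
```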
Robust Scaling
Robust Scaling uses statistics that are robust to outliers: it subtracts the median and divides by the interquartile range (IQR). The formula is:

X_scaled = (X - median) / IQR

where IQR = Q3 - Q1, the 75th minus the 25th percentile.
Example:
Let's scale a feature with outliers using Robust Scaling:
```python
import numpy as np
from sklearn.preprocessing import RobustScaler

data = np.array([[1], [2], [3], [4], [100]])
scaler = RobustScaler()
robust_scaled_data = scaler.fit_transform(data)
print(robust_scaled_data)
```

Output:

```
[[-1. ]
 [-0.5]
 [ 0. ]
 [ 0.5]
 [48.5]]
```

Here the median is 3 and the IQR is 2, so the four ordinary points stay evenly spaced while the outlier 100 is mapped far away without distorting them.
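To see why using the median and IQR matters, the sketch below contrasts RobustScaler with StandardScaler on the same data. The outlier 100 inflates the mean and standard deviation, squeezing the four ordinary points into a narrow band, while the robust version keeps them evenly spaced.

```python
import numpy as np
from sklearn.preprocessing import RobustScaler, StandardScaler

data = np.array([[1], [2], [3], [4], [100]], dtype=float)

# Standardization is pulled around by the outlier: the first four
# points all land in a narrow band near -0.5.
print(StandardScaler().fit_transform(data).ravel())

# Robust scaling keeps the non-outlier points evenly spaced.
print(RobustScaler().fit_transform(data).ravel())
```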
MaxAbs Scaling
MaxAbs Scaling scales each feature by its maximum absolute value, mapping the data into the range -1 to 1 without shifting it. The formula is:

X_scaled = X / max(|X|)
Example:
Let's scale a feature using MaxAbs Scaling:
```python
import numpy as np
from sklearn.preprocessing import MaxAbsScaler

data = np.array([[1], [-2], [3], [-4], [5]])
scaler = MaxAbsScaler()
maxabs_scaled_data = scaler.fit_transform(data)
print(maxabs_scaled_data)
```

Output:

```
[[ 0.2]
 [-0.4]
 [ 0.6]
 [-0.8]
 [ 1. ]]
```
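Because MaxAbs Scaling only divides by a constant and never shifts the data, zero entries remain zero, which is why it is often used with sparse inputs. The sketch below (a minimal example using a SciPy sparse matrix) illustrates this.

```python
import numpy as np
from scipy.sparse import csr_matrix
from sklearn.preprocessing import MaxAbsScaler

# A sparse column with many zeros; only the non-zero values are stored.
sparse_data = csr_matrix(np.array([[0.0], [5.0], [0.0], [-10.0], [0.0]]))

scaled = MaxAbsScaler().fit_transform(sparse_data)
print(scaled.toarray().ravel())  # zeros are preserved: [ 0.   0.5  0.  -1.   0. ]
```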
Conclusion
Feature scaling is a crucial preprocessing step in machine learning. It helps ensure that all features contribute comparably to distance calculations and gradient updates, which can lead to better model performance and faster convergence. Choose the scaling method that matches the nature of your data (for example, the presence of outliers or sparsity) and the algorithm you are using.
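Whichever scaler you choose, a common pattern is to fit it on the training data only and reuse the learned parameters on the test data, so that no information from the test set leaks into preprocessing. The sketch below illustrates this with randomly generated placeholder data.

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

# Hypothetical data: 100 samples, 3 features on very different scales.
rng = np.random.default_rng(0)
X = rng.normal(size=(100, 3)) * [1, 100, 10_000]
X_train, X_test = train_test_split(X, test_size=0.2, random_state=0)

scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)  # learn mean/std from training data only
X_test_scaled = scaler.transform(X_test)        # reuse the same parameters on test data
```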