Model Tuning Tutorial
Introduction
Model tuning is a crucial step in the machine learning pipeline: it optimizes a model's performance by adjusting its hyperparameters. This tutorial walks you through the process from start to finish, with explanations and examples.
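All of the examples below assume you already have training and test splits named X_train, X_test, y_train, and y_test. As a minimal sketch of that setup, using a built-in Scikit-learn dataset chosen purely for illustration:

from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split

# Load a small built-in dataset (an illustrative choice, not prescribed by this tutorial)
X, y = load_breast_cancer(return_X_y=True)

# Hold out 20% of the data as a test set
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)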
Understanding Hyperparameters
Hyperparameters are configuration values that are set before training begins rather than learned from the data. Examples include the learning rate, the number of trees in a random forest, and the number of layers in a neural network. Tuning these hyperparameters can significantly impact a model's performance.
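To make the distinction concrete, here is a small sketch (LogisticRegression is an arbitrary example here): hyperparameters are passed to the constructor before training, while learned parameters only exist after fit has run.

from sklearn.linear_model import LogisticRegression

# Hyperparameters: chosen before training and passed to the constructor
model = LogisticRegression(C=0.5, max_iter=1000)

# Learned parameters: estimated from the data during training
model.fit(X_train, y_train)
print(model.coef_)  # coefficients are learned from the data, not set by hand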
Grid Search
Grid search is a common technique for hyperparameter tuning: it exhaustively evaluates every combination in a specified grid of hyperparameter values. Here's an example using Scikit-learn's GridSearchCV:
from sklearn.model_selection import GridSearchCV
from sklearn.ensemble import RandomForestClassifier

# Define the model
model = RandomForestClassifier()

# Define the grid of hyperparameters
# ('auto' was removed in recent scikit-learn versions; for classifiers
# it was equivalent to 'sqrt')
param_grid = {
    'n_estimators': [100, 200, 300],
    'max_features': ['sqrt', 'log2', None],
    'max_depth': [None, 10, 20, 30]
}

# Initialize GridSearchCV (n_jobs=-1 parallelizes the fits across all CPU cores)
grid_search = GridSearchCV(estimator=model, param_grid=param_grid,
                           cv=5, scoring='accuracy', n_jobs=-1)

# Fit GridSearchCV
grid_search.fit(X_train, y_train)

# Print best parameters and score
print("Best Parameters:", grid_search.best_params_)
print("Best Score:", grid_search.best_score_)
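Note that the cost of grid search grows multiplicatively: the grid above has 3 × 3 × 4 = 36 combinations, and with 5-fold cross-validation that means 180 model fits. Passing n_jobs=-1, as in the example, spreads those fits across all available CPU cores.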
Random Search
Random search is another technique for hyperparameter tuning: instead of trying every combination, it randomly samples the hyperparameter space. This can be far more efficient than grid search, especially when there are many hyperparameters. Here's an example using Scikit-learn's RandomizedSearchCV:
import numpy as np
from sklearn.model_selection import RandomizedSearchCV
from sklearn.ensemble import RandomForestClassifier

# Define the model
model = RandomForestClassifier()

# Define the values to sample from
param_dist = {
    'n_estimators': [int(x) for x in np.linspace(start=100, stop=1000, num=10)],
    'max_features': ['sqrt', 'log2', None],
    'max_depth': [int(x) for x in np.linspace(10, 110, num=11)]
}

# Initialize RandomizedSearchCV (n_iter controls how many settings are sampled)
random_search = RandomizedSearchCV(estimator=model, param_distributions=param_dist,
                                   n_iter=100, cv=5, scoring='accuracy',
                                   random_state=42, n_jobs=-1)

# Fit RandomizedSearchCV
random_search.fit(X_train, y_train)

# Print best parameters and score
print("Best Parameters:", random_search.best_params_)
print("Best Score:", random_search.best_score_)
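Sampling from fixed lists, as above, limits the search to a predefined set of values. RandomizedSearchCV also accepts scipy.stats distributions, which let each draw land anywhere in a range; a brief sketch:

from scipy.stats import randint

# Distributions are sampled on every draw; lists are sampled uniformly
param_dist = {
    'n_estimators': randint(100, 1000),     # any integer in [100, 1000)
    'max_depth': randint(10, 110),          # any integer in [10, 110)
    'max_features': ['sqrt', 'log2', None]
}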
Bayesian Optimization
Bayesian optimization is a more advanced technique for hyperparameter tuning. It builds a probabilistic model of the objective function and uses it to select the most promising hyperparameters to evaluate next. Libraries such as hyperopt and scikit-optimize implement it. Here's an example using hyperopt:
from hyperopt import fmin, tpe, hp, Trials, space_eval
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score

# Define the objective function (fmin minimizes, so negate the accuracy)
def objective(params):
    model = RandomForestClassifier(**params)
    score = cross_val_score(model, X_train, y_train,
                            scoring='accuracy', cv=5).mean()
    return -score

# Define the search space
space = {
    'n_estimators': hp.choice('n_estimators', [100, 200, 300, 400, 500]),
    'max_features': hp.choice('max_features', ['sqrt', 'log2', None]),
    'max_depth': hp.choice('max_depth', [None, 10, 20, 30, 40, 50])
}

# Initialize a trials object to record the evaluation history
trials = Trials()

# Run the optimization
best = fmin(fn=objective, space=space, algo=tpe.suggest,
            max_evals=100, trials=trials)

# fmin returns list indices for hp.choice parameters;
# space_eval maps them back to the actual values
print("Best Parameters:", space_eval(space, best))
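A caveat on the search space: hp.choice treats its options as unordered categories, so the optimizer cannot exploit the natural ordering of values like n_estimators. For ordered numeric ranges, hyperopt also provides distributions such as hp.quniform, which returns floats that you would need to cast to int inside the objective.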
Cross-Validation
Cross-validation is an important technique in model tuning that helps assess how a model's results will generalize to an independent dataset. It involves partitioning the data into subsets, training the model on some subsets, and validating it on the remaining ones. Here's an example using Scikit-learn's cross_val_score:
import numpy as np
from sklearn.model_selection import cross_val_score
from sklearn.ensemble import RandomForestClassifier

# Define the model with a candidate set of hyperparameters
model = RandomForestClassifier(n_estimators=200, max_features='sqrt', max_depth=20)

# Perform 5-fold cross-validation
cv_scores = cross_val_score(model, X_train, y_train, cv=5, scoring='accuracy')

# Print cross-validation scores
print("CV Scores:", cv_scores)
print("Mean CV Score:", np.mean(cv_scores))
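When cv is an integer and the estimator is a classifier, Scikit-learn stratifies the folds by default. If you want explicit control over shuffling and reproducibility, you can pass a splitter object instead; for example:

from sklearn.model_selection import StratifiedKFold

# Explicit splitter: stratified 5-fold CV with shuffling for reproducibility
cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)
cv_scores = cross_val_score(model, X_train, y_train, cv=cv, scoring='accuracy')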
Final Model Training
After tuning, retrain your model on the entire training dataset using the best hyperparameters, so that it benefits from all available training data. (GridSearchCV and RandomizedSearchCV do this automatically when refit=True, their default, and expose the result as best_estimator_.) Here's an example of doing it by hand:
from sklearn.ensemble import RandomForestClassifier

# Define the best model with the tuned hyperparameters
best_model = RandomForestClassifier(n_estimators=200, max_features='sqrt', max_depth=20)

# Fit the model on the entire training dataset
best_model.fit(X_train, y_train)

# Evaluate the model on the held-out test dataset
test_score = best_model.score(X_test, y_test)
print("Test Score:", test_score)
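Keep the test set out of the tuning loop entirely: cross-validation scores should guide the search, and the held-out test set should be evaluated only once, at the very end, so the final score remains an unbiased estimate of how the model will generalize.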