
Python Advanced - Machine Learning with LightGBM

Utilizing LightGBM for machine learning tasks in Python

LightGBM (Light Gradient Boosting Machine) is a fast, highly efficient gradient boosting framework for machine learning. It is designed for distributed, efficient training and offers faster training speed, lower memory usage, competitive or better accuracy, and support for large-scale data. This tutorial explores how to use LightGBM for machine learning tasks in Python.

Key Points:

  • LightGBM is a highly efficient and fast gradient boosting framework.
  • It offers faster training speed and lower memory usage.
  • LightGBM is designed to handle large-scale data and achieve better accuracy.

Installing LightGBM

To use LightGBM, you need to install it using pip:


pip install lightgbm
            
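A quick way to confirm the installation (not part of the install itself) is to import the package and print its version:

import lightgbm as lgb

# If this prints a version string, the installation succeeded
print(lgb.__version__)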

Loading and Preparing Data

Here is an example of loading and preparing data using Pandas:


import pandas as pd
from sklearn.model_selection import train_test_split

# Load the dataset
data = pd.read_csv('path/to/your/dataset.csv')

# Split the data into features and target
X = data.drop('target_column', axis=1)
y = data['target_column']

# Split the data into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
            
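LightGBM can also consume categorical features natively. One common approach, shown below as a minimal sketch (it assumes your string columns are genuinely categorical), is to cast object columns to pandas' category dtype before training:

# Cast string columns to the pandas 'category' dtype;
# both lgb.Dataset and the sklearn wrapper recognize these as categorical features
for col in X.select_dtypes(include='object').columns:
    X[col] = X[col].astype('category')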

Creating and Training a LightGBM Model

Here is an example of creating and training a LightGBM model:


import lightgbm as lgb

# Create a LightGBM dataset
train_data = lgb.Dataset(X_train, label=y_train)
test_data = lgb.Dataset(X_test, label=y_test, reference=train_data)

# Define the model parameters
params = {
    'boosting_type': 'gbdt',
    'objective': 'binary',
    'metric': 'binary_logloss',
    'num_leaves': 31,
    'learning_rate': 0.05,
    'feature_fraction': 0.9
}

# Train the model
bst = lgb.train(
    params,
    train_data,
    num_boost_round=100,
    valid_sets=[test_data],
    # early_stopping_rounds was removed from lgb.train in LightGBM 4.0;
    # early stopping is now configured through callbacks
    callbacks=[lgb.early_stopping(stopping_rounds=10)]
)
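If you prefer the scikit-learn interface, LightGBM ships an equivalent wrapper. The sketch below mirrors the parameter dict above (n_estimators plays the role of num_boost_round, and clf is just a name chosen here):

# sklearn-style estimator with the same hyperparameters as the params dict
clf = lgb.LGBMClassifier(
    boosting_type='gbdt',
    objective='binary',
    num_leaves=31,
    learning_rate=0.05,
    feature_fraction=0.9,
    n_estimators=100
)

# eval_set plus the early_stopping callback reproduces the validation setup above
clf.fit(
    X_train, y_train,
    eval_set=[(X_test, y_test)],
    callbacks=[lgb.early_stopping(stopping_rounds=10)]
)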

Making Predictions

Here is an example of making predictions with the trained LightGBM model:


# Make predictions on the test set
y_pred = bst.predict(X_test, num_iteration=bst.best_iteration)

# Convert probabilities to binary outcomes
y_pred_binary = [1 if x > 0.5 else 0 for x in y_pred]
            
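Since predict returns a NumPy array of probabilities, the thresholding step can also be vectorized instead of looped; the result is identical:

# Vectorized equivalent of the list comprehension above
y_pred_binary = (y_pred > 0.5).astype(int)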

Evaluating the Model

Here is an example of evaluating the model:


from sklearn.metrics import accuracy_score, classification_report

# Calculate the accuracy
accuracy = accuracy_score(y_test, y_pred_binary)
print(f"Accuracy: {accuracy}")

# Print the classification report
report = classification_report(y_test, y_pred_binary)
print(report)
            
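Because the model outputs probabilities, it is also worth reporting a threshold-independent metric such as ROC AUC, computed from the raw predictions rather than the binarized ones:

from sklearn.metrics import roc_auc_score

# ROC AUC uses the predicted probabilities, not the 0/1 labels
auc = roc_auc_score(y_test, y_pred)
print(f"ROC AUC: {auc:.4f}")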

Feature Importance

Here is an example of visualizing feature importance:


import matplotlib.pyplot as plt

# Get feature importance
importance = bst.feature_importance()
feature_names = X_train.columns

# Create a DataFrame for visualization
feature_importance = pd.DataFrame({'feature': feature_names, 'importance': importance}).sort_values(by='importance', ascending=False)

# Plot feature importance
plt.figure(figsize=(10, 6))
plt.barh(feature_importance['feature'], feature_importance['importance'])
plt.xlabel('Feature Importance')
plt.ylabel('Feature')
plt.title('Feature Importance')
plt.gca().invert_yaxis()
plt.show()
            
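LightGBM also bundles a plotting helper that produces a comparable chart in one call. Note that feature_importance() above counts splits by default; passing importance_type='gain' ranks features by total gain instead:

# Built-in helper; 'gain' often gives a more informative ranking than the default 'split'
lgb.plot_importance(bst, importance_type='gain', max_num_features=20)
plt.show()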

Hyperparameter Tuning

Here is an example of performing hyperparameter tuning with GridSearchCV:


from sklearn.model_selection import GridSearchCV

# Define the parameter grid
param_grid = {
    'num_leaves': [31, 50],
    'learning_rate': [0.05, 0.1],
    'feature_fraction': [0.8, 0.9]
}

# Initialize the model
model = lgb.LGBMClassifier(boosting_type='gbdt', objective='binary', metric='binary_logloss')

# Initialize GridSearchCV
grid_search = GridSearchCV(estimator=model, param_grid=param_grid, cv=5, scoring='accuracy')

# Fit the model
grid_search.fit(X_train, y_train)

# Get the best parameters
best_params = grid_search.best_params_
print(f"Best Parameters: {best_params}")

# Predict using the best model
best_model = grid_search.best_estimator_
y_pred_best = best_model.predict(X_test)

# Calculate the accuracy
accuracy_best = accuracy_score(y_test, y_pred_best)
print(f"Accuracy with best model: {accuracy_best}")
            
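For tuning num_boost_round itself, LightGBM's built-in cross-validation is often quicker than a grid search. The sketch below runs 5-fold CV with early stopping on the Dataset created earlier; the exact metric key names in the returned dict vary between LightGBM versions, so the result is read generically here:

# 5-fold cross-validation with early stopping
cv_results = lgb.cv(
    params,
    train_data,
    num_boost_round=200,
    nfold=5,
    callbacks=[lgb.early_stopping(stopping_rounds=10)]
)

# Each entry maps a metric name to one value per boosting round that survived
metric_key = next(iter(cv_results))
print(f"Best number of rounds: {len(cv_results[metric_key])}")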

Saving and Loading the Model

Here is an example of saving and loading a LightGBM model:


# Save the model
bst.save_model('lightgbm_model.txt')

# Load the model
bst_loaded = lgb.Booster(model_file='lightgbm_model.txt')

# Verify the loaded model; save_model keeps only up to the best iteration
# by default, so the loaded booster can predict directly
y_pred_loaded = bst_loaded.predict(X_test)
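If you train through the scikit-learn wrapper instead (the clf estimator from the earlier sketch), standard Python persistence tools such as joblib work as well:

import joblib

# Persist the whole sklearn-style estimator, including the fitted booster
joblib.dump(clf, 'lightgbm_model.pkl')
clf_loaded = joblib.load('lightgbm_model.pkl')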

Using LightGBM with Dask

Here is an example of using LightGBM with Dask for distributed computing:


import dask.dataframe as dd
from dask.distributed import Client

# Start a local Dask client
client = Client()

# Convert the pandas objects to Dask collections with matching partitioning;
# the Dask estimators expect Dask inputs for both training and prediction
X_train_dask = dd.from_pandas(X_train, npartitions=4)
y_train_dask = dd.from_pandas(y_train, npartitions=4)
X_test_dask = dd.from_pandas(X_test, npartitions=4)

# Train the model with Dask
dask_model = lgb.DaskLGBMClassifier(boosting_type='gbdt', objective='binary', metric='binary_logloss')
dask_model.fit(X_train_dask, y_train_dask)

# Predict using the Dask model; compute() materializes the lazy result
y_pred_dask = dask_model.predict(X_test_dask).compute()

# Shut down the client when finished
client.close()

Summary

In this tutorial, you learned how to use LightGBM for machine learning tasks in Python. LightGBM is a fast, highly efficient gradient boosting framework designed to handle large-scale data. You saw how to load and prepare data, create and train a model, make predictions, evaluate results, visualize feature importance, tune hyperparameters, save and load models, and scale training with Dask, all of which can help you apply LightGBM to a wide range of machine learning applications.