Python Advanced - High Performance Computing with CUDA

Accelerating computations using CUDA for Python

CUDA (Compute Unified Device Architecture) is a parallel computing platform and application programming interface (API) model created by NVIDIA. It allows developers to use NVIDIA GPUs for general-purpose processing, an approach known as GPGPU (General-Purpose computing on Graphics Processing Units). This tutorial explores how to accelerate computations in Python using CUDA.

Key Points:

  • CUDA is a parallel computing platform and API model by NVIDIA.
  • It allows for general-purpose processing on NVIDIA GPUs.
  • CUDA can significantly accelerate computations by leveraging the parallel processing power of GPUs.

Installing CUDA and PyCUDA

To use CUDA with Python, you need to install the CUDA Toolkit and PyCUDA. To install the toolkit, follow the official NVIDIA CUDA installation guide: https://docs.nvidia.com/cuda/

To install PyCUDA, use pip:


pip install pycuda
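
After installation, you can verify that PyCUDA detects your GPU. A minimal sanity check (output depends on your hardware):

import pycuda.driver as cuda

cuda.init()  # initialize the CUDA driver API
print(f"Found {cuda.Device.count()} CUDA device(s)")
for i in range(cuda.Device.count()):
    dev = cuda.Device(i)
    major, minor = dev.compute_capability()
    print(f"Device {i}: {dev.name()} (compute capability {major}.{minor})")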

Writing a Simple CUDA Kernel

Here is a simple CUDA kernel, written and launched with PyCUDA, that adds two integer arrays element by element:


import pycuda.autoinit
import pycuda.driver as cuda
import numpy as np
from pycuda.compiler import SourceModule

# Define the CUDA kernel
mod = SourceModule("""
__global__ void add(int *a, int *b, int *c, int n)
{
    int idx = threadIdx.x + blockIdx.x * blockDim.x;
    if (idx < n)  // guard threads that fall outside the array
        c[idx] = a[idx] + b[idx];
}
""")

# Initialize data
a = np.random.randint(0, 10, size=10).astype(np.int32)
b = np.random.randint(0, 10, size=10).astype(np.int32)
c = np.zeros_like(a)

# Allocate memory on the GPU
a_gpu = cuda.mem_alloc(a.nbytes)
b_gpu = cuda.mem_alloc(b.nbytes)
c_gpu = cuda.mem_alloc(c.nbytes)

# Copy data to the GPU
cuda.memcpy_htod(a_gpu, a)
cuda.memcpy_htod(b_gpu, b)

# Get the kernel function
add = mod.get_function("add")

# Execute the kernel (one block of 10 threads covers all 10 elements)
add(a_gpu, b_gpu, c_gpu, np.int32(a.size), block=(10, 1, 1), grid=(1, 1))

# Copy the result back to the CPU
cuda.memcpy_dtoh(c, c_gpu)

print(f"a: {a}")
print(f"b: {b}")
print(f"c: {c}")

Matrix Multiplication with CUDA

Here is an example of multiplying two square matrices using CUDA and PyCUDA (it reuses the imports from the previous example):


# Define the CUDA kernel
mod = SourceModule("""
__global__ void matmul(float *a, float *b, float *c, int N)
{
    int row = blockIdx.y * blockDim.y + threadIdx.y;
    int col = blockIdx.x * blockDim.x + threadIdx.x;

    if (row < N && col < N)  // guard against threads outside the matrix
    {
        float value = 0;
        for (int k = 0; k < N; ++k)
        {
            value += a[row * N + k] * b[k * N + col];
        }
        c[row * N + col] = value;
    }
}
""")

# Initialize data
N = 32
a = np.random.randn(N, N).astype(np.float32)
b = np.random.randn(N, N).astype(np.float32)
c = np.zeros((N, N), dtype=np.float32)

# Allocate memory on the GPU
a_gpu = cuda.mem_alloc(a.nbytes)
b_gpu = cuda.mem_alloc(b.nbytes)
c_gpu = cuda.mem_alloc(c.nbytes)

# Copy data to the GPU
cuda.memcpy_htod(a_gpu, a)
cuda.memcpy_htod(b_gpu, b)

# Get the kernel function
matmul = mod.get_function("matmul")

# Execute the kernel; ceiling division makes the grid cover all N rows
# and columns even when N is not a multiple of the block size
block_size = (16, 16, 1)
grid_size = ((N + block_size[0] - 1) // block_size[0],
             (N + block_size[1] - 1) // block_size[1])
matmul(a_gpu, b_gpu, c_gpu, np.int32(N), block=block_size, grid=grid_size)

# Copy the result back to the CPU
cuda.memcpy_dtoh(c, c_gpu)

print("Matrix multiplication result:")
print(c)
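
As an alternative to manual mem_alloc and memcpy calls, PyCUDA's gpuarray module wraps device memory in a NumPy-like array. A brief sketch reusing a and b from above:

import pycuda.gpuarray as gpuarray

# to_gpu allocates device memory and copies the host array in one step
a_dev = gpuarray.to_gpu(a)
b_dev = gpuarray.to_gpu(b)

# Elementwise arithmetic runs on the GPU; .get() copies back to the host
print((a_dev + b_dev).get())

A gpuarray's gpudata attribute can also be passed to a hand-written kernel such as matmul in place of a raw allocation.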

Performance Comparison

Here is an example comparing the runtime of the same matrix multiplication on the CPU (via NumPy) and on the GPU. Note that for a matrix as small as N = 32, the GPU often will not win: kernel launch and memory transfer overheads dominate, and the GPU's advantage grows with problem size:


import time

# CPU computation
start_time = time.time()
c_cpu = np.dot(a, b)
end_time = time.time()
print(f"CPU computation time: {end_time - start_time:.6f} seconds")

# GPU computation (the blocking copy back ensures the kernel has finished)
start_time = time.time()
matmul(a_gpu, b_gpu, c_gpu, np.int32(N), block=block_size, grid=grid_size)
cuda.memcpy_dtoh(c, c_gpu)
end_time = time.time()
print(f"GPU computation time: {end_time - start_time:.6f} seconds")

# Verify the results (small float32 round-off differences are expected)
print(f"Difference: {np.max(np.abs(c_cpu - c))}")

Advanced Topics: Memory Management

Pinned (page-locked) host memory speeds up transfers between the host and the GPU, and it is required for the asynchronous copies shown after this example. Here is how to allocate and use it with PyCUDA:


# Allocate pinned (page-locked) host buffers matching a, b, and c
a_pinned = cuda.pagelocked_empty_like(a)
b_pinned = cuda.pagelocked_empty_like(b)
c_pinned = cuda.pagelocked_empty_like(c)

# Copy data to pinned memory
np.copyto(a_pinned, a)
np.copyto(b_pinned, b)

# Allocate GPU memory
a_gpu = cuda.mem_alloc(a_pinned.nbytes)
b_gpu = cuda.mem_alloc(b_pinned.nbytes)
c_gpu = cuda.mem_alloc(c_pinned.nbytes)

# Copy data to the GPU
cuda.memcpy_htod(a_gpu, a_pinned)
cuda.memcpy_htod(b_gpu, b_pinned)

# Execute the kernel
matmul(a_gpu, b_gpu, c_gpu, np.int32(N), block=block_size, grid=grid_size)

# Copy the result back to the CPU
cuda.memcpy_dtoh(c_pinned, c_gpu)

print("Pinned memory result:")
print(c_pinned)
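
Pinned memory is what makes asynchronous transfers possible: copies issued on a CUDA stream return immediately and can overlap with other work. A sketch building on the pinned buffers above:

# Queue the copies and the kernel on one stream; these calls are asynchronous
stream = cuda.Stream()

cuda.memcpy_htod_async(a_gpu, a_pinned, stream)
cuda.memcpy_htod_async(b_gpu, b_pinned, stream)

# The kernel is enqueued on the same stream, so it runs after the copies
matmul(a_gpu, b_gpu, c_gpu, np.int32(N),
       block=block_size, grid=grid_size, stream=stream)

cuda.memcpy_dtoh_async(c_pinned, c_gpu, stream)
stream.synchronize()  # wait for all queued work to finish

print(c_pinned)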

Summary

In this tutorial, you learned how to accelerate Python computations with CUDA, NVIDIA's parallel computing platform and API for general-purpose processing on its GPUs. Knowing how to install CUDA and PyCUDA, write and launch CUDA kernels, multiply matrices on the GPU, compare CPU and GPU performance, and manage pinned memory gives you the building blocks for high-performance computing in Python.