How to Fine-Tune LLMs
A step-by-step guide to adapting large language models to your domain with code samples, best practices, and troubleshooting tips.
1. Introduction to Fine-Tuning
Fine-tuning takes a pre-trained large language model and continues training it on your specific dataset, aligning it to your task, style, or domain.
- Why It Matters: Achieve higher accuracy and consistency than prompt-only methods.
- When to Use: Specialty domains like legal, medical, or internal policies.
2. Preparing Your Dataset
2.1 Data Collection
- Identify representative examples: FAQs, support tickets, code snippets.
- Include edge cases and failure scenarios to improve robustness.
- Balance positive and negative examples for classification tasks.
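For classification-style datasets, a quick label count catches imbalance before training. A minimal sketch, assuming each JSONL record carries a label field (the field name is illustrative):
import json
from collections import Counter

# Count examples per label in the training file
with open('train.jsonl') as f:
    counts = Counter(json.loads(line)['label'] for line in f)
print(counts)  # a heavily skewed distribution suggests rebalancing or augmentation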
2.2 Data Formatting
Use JSONL with clear prompt and completion fields:
{"prompt": "Summarize the following meeting notes:\n...", "completion": "Action items: ..."}
Tip: Tokenize and inspect lengths to avoid truncation.
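A minimal sketch for that tip, assuming the training file is named train.jsonl as in the later examples: it tokenizes each prompt/completion pair and reports where lengths fall relative to the model's context window.
from transformers import AutoTokenizer
from datasets import load_dataset
import numpy as np

tokenizer = AutoTokenizer.from_pretrained('gpt2-medium')
data = load_dataset('json', data_files='train.jsonl')['train']

# Token count per example, prompt and completion combined
lengths = [len(tokenizer(ex['prompt'] + ex['completion'])['input_ids']) for ex in data]
print('max:', max(lengths), '| 95th percentile:', np.percentile(lengths, 95))
# Anything beyond the context window (1024 tokens for gpt2-medium) will be truncated.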
3. Choosing a Fine-Tuning Method
3.1 Full-Parameter Tuning
Updates all model weights for maximum flexibility.
Tradeoff: High GPU/TPU cost and memory usage.
3.2 Parameter-Efficient Methods
- LoRA (Low-Rank Adapters): Insert small trainable matrices. Minimal overhead.
- Prefix Tuning: Learn soft prompt tokens. No model weight changes.
- Adapters: Lightweight modules between layers. Easy to switch.
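As a quick illustration of how little these methods actually train, here is a sketch of prefix tuning with the peft library (LoRA gets a full walkthrough in Section 6); the choice of 20 virtual tokens is illustrative:
from transformers import AutoModelForCausalLM
from peft import PrefixTuningConfig, TaskType, get_peft_model

base = AutoModelForCausalLM.from_pretrained('gpt2-medium')

# Learn soft prompt vectors per layer; the base weights stay frozen
prefix_config = PrefixTuningConfig(task_type=TaskType.CAUSAL_LM, num_virtual_tokens=20)
model = get_peft_model(base, prefix_config)
model.print_trainable_parameters()  # typically well under 1% of the base model's parameters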
4. Environment Setup
- Python: ≥3.8
- Libraries: transformers, datasets, accelerate, peft
- Hardware: NVIDIA GPUs with ≥16 GB VRAM or TPU v3.
- Version Control: Track data and code in Git.
Install:
pip install transformers datasets accelerate peft
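A quick sanity check, assuming PyTorch is already installed as the backend, confirms the libraries import and a suitable GPU is visible:
import torch, transformers, datasets, peft

print('transformers:', transformers.__version__)
print('datasets:', datasets.__version__)
print('peft:', peft.__version__)
print('CUDA available:', torch.cuda.is_available())
if torch.cuda.is_available():
    print('VRAM (GB):', torch.cuda.get_device_properties(0).total_memory / 1e9)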
5. Full-Parameter Fine-Tuning Example
from transformers import AutoModelForCausalLM, AutoTokenizer, Trainer, TrainingArguments
from datasets import load_dataset
# 1. Load model & tokenizer
model_name = 'gpt2-medium'
model = AutoModelForCausalLM.from_pretrained(model_name)
tokenizer = AutoTokenizer.from_pretrained(model_name)
tokenizer.pad_token = tokenizer.eos_token  # GPT-2 has no pad token by default
dataset = load_dataset('json', data_files='train.jsonl')
def preprocess(batch):
    # Tokenize prompts and completions to a fixed length so they can be batched
    inputs = tokenizer(batch['prompt'], truncation=True, padding='max_length')
    targets = tokenizer(batch['completion'], truncation=True, padding='max_length')
    # Note: padded label positions are not masked here; set them to -100 to exclude them from the loss
    inputs['labels'] = targets['input_ids']
    return inputs
dataset = dataset.map(preprocess, batched=True)
# 2. Training config
training_args = TrainingArguments(
    output_dir='out_full', num_train_epochs=3,
    per_device_train_batch_size=2, gradient_accumulation_steps=4,
    fp16=True, logging_steps=100, save_total_limit=2
)
# 3. Train
trainer = Trainer(model=model, args=training_args, train_dataset=dataset['train'])
trainer.train()
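Before formal evaluation, a quick generation pass is a useful smoke test; the prompt below is just a placeholder:
# Spot-check the fine-tuned model on a sample prompt
inputs = tokenizer('Summarize the following meeting notes:\n...', return_tensors='pt').to(model.device)
outputs = model.generate(**inputs, max_new_tokens=100)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))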
6. LoRA Fine-Tuning Example
from transformers import AutoModelForCausalLM, AutoTokenizer, Trainer, TrainingArguments
from peft import LoraConfig, get_peft_model
# 1. Base model
base_model = AutoModelForCausalLM.from_pretrained('gpt2-medium')
# 2. LoRA config
lora_config = LoraConfig(
    r=4, lora_alpha=16, target_modules=['c_attn'],
    lora_dropout=0.05, task_type='CAUSAL_LM'
)
model = get_peft_model(base_model, lora_config)
tokenizer = AutoTokenizer.from_pretrained('gpt2-medium')
# 3. Reuse training_args and the tokenized dataset from Section 5
# 4. Trainer & train
trainer = Trainer(model=model, args=training_args, train_dataset=dataset['train'])
trainer.train()
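To persist the result, save_pretrained on a PEFT model writes only the small adapter weights; merge_and_unload folds them back into the base model for adapter-free deployment. The directory names are placeholders:
# Save just the LoRA adapter (a few MB) rather than the full model
model.save_pretrained('out_lora_adapter')

# Optionally merge the adapter into the base weights for standalone inference
merged = model.merge_and_unload()
merged.save_pretrained('out_lora_merged')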
7. Hyperparameter Tuning
- Learning Rate: Start at 1e-4 for full, 1e-3 for LoRA.
- Batch Size: Maximize GPU usage without OOM.
- Epochs: 2–5 depending on dataset size.
- Warmup Steps: 5–10% of total steps to stabilize training.
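These knobs map directly onto TrainingArguments; a sketch with illustrative values for a LoRA run (warmup_ratio expresses the 5–10% rule without counting steps by hand):
from transformers import TrainingArguments

training_args = TrainingArguments(
    output_dir='out_lora',
    learning_rate=1e-3,             # ~1e-4 for full-parameter tuning
    per_device_train_batch_size=4,  # raise until just below the OOM limit
    gradient_accumulation_steps=2,
    num_train_epochs=3,
    warmup_ratio=0.05,              # 5% of total steps spent warming up
    fp16=True,
)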
8. Evaluation & Metrics
- Use a held-out validation set (e.g., 10%).
- Automatic Metrics: Perplexity, BLEU, ROUGE, EM accuracy.
- Human Review: Randomly sample outputs for quality checks.
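For causal LM fine-tuning, perplexity falls straight out of the evaluation loss. A minimal sketch, reusing the model, training_args, and tokenized dataset from Section 5 and holding out 10% for validation:
import math
from transformers import Trainer

# Split off a 10% validation set and evaluate the trained model on it
split = dataset['train'].train_test_split(test_size=0.1, seed=42)
trainer = Trainer(model=model, args=training_args,
                  train_dataset=split['train'], eval_dataset=split['test'])
metrics = trainer.evaluate()
print('perplexity:', math.exp(metrics['eval_loss']))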
9. Deployment & Monitoring
- Export: Save the model with model.save_pretrained().
- Inference: Use transformers.pipeline or FastAPI.
- Logging: Track invocation latency and errors.
- Drift Detection: Periodically evaluate on fresh data.
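A minimal export-and-serve sketch using the pipeline API, assuming the full-parameter model from Section 5 (merge a LoRA adapter first, as shown after Section 6); the directory and prompt are placeholders, and a FastAPI wrapper is omitted for brevity:
from transformers import pipeline

# Export the fine-tuned model and tokenizer together
model.save_pretrained('deploy/model')
tokenizer.save_pretrained('deploy/model')

# Load for inference
generator = pipeline('text-generation', model='deploy/model')
print(generator('Summarize the following meeting notes:\n...', max_new_tokens=100)[0]['generated_text'])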
10. Troubleshooting Tips
- OOM Errors: Reduce batch size or use gradient accumulation (see the sketch after this list).
- Unstable Loss: Lower learning rate or increase warmup.
- Poor Quality: Add more diverse examples or augment data.
- Adapter Issues: Check target_modules and dropout settings.
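For the OOM case, the usual mitigations combine as below; gradient checkpointing trades compute for memory, and the batch size and accumulation values are illustrative:
from transformers import TrainingArguments

# Recompute activations during the backward pass instead of storing them
model.gradient_checkpointing_enable()

training_args = TrainingArguments(
    output_dir='out_full',
    per_device_train_batch_size=1,   # smallest per-step memory footprint
    gradient_accumulation_steps=16,  # keeps the effective batch size at 16
    fp16=True,                       # halves activation memory on supported GPUs
)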
11. Example Use Cases
11.1 Customer Support
Automate responses using past tickets as training data, reducing response time by 40%.
11.2 Code Completion
Fine-tune on internal repos to suggest idiomatic functions and standards.
11.3 Medical Summaries
Summarize patient notes with 95% accuracy on key medical entities.