How to Fine-Tune LLMs
A step-by-step guide to adapting large language models to your domain with code samples, best practices, and troubleshooting tips.
1. Introduction to Fine-Tuning
Fine-tuning takes a pre-trained large language model and continues training it on your specific dataset, aligning it to your task, style, or domain.
- Why It Matters: Achieve higher accuracy and consistency than prompt-only methods.
- When to Use: Specialty domains like legal, medical, or internal policies.
2. Preparing Your Dataset
2.1 Data Collection
- Identify representative examples: FAQs, support tickets, code snippets.
- Include edge cases and failure scenarios to improve robustness.
- Balance positive and negative examples for classification tasks.
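For classification-style datasets, a quick label count catches imbalance before training. A minimal sketch, assuming each JSONL record carries a label field (the field name is illustrative):
import json
from collections import Counter

# Count examples per label in the training file
with open('train.jsonl') as f:
    counts = Counter(json.loads(line)['label'] for line in f)
print(counts)  # a heavily skewed distribution suggests rebalancing or augmentation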
2.2 Data Formatting
Use JSONL with clear prompt and completion fields:
{"prompt": "Summarize the following meeting notes:\n...", "completion": "Action items: ..."}
Tip: Tokenize and inspect lengths to avoid truncation.
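A minimal sketch for that tip, assuming the training file is named train.jsonl as in the later examples: it tokenizes each prompt/completion pair and reports where lengths fall relative to the model's context window.
from transformers import AutoTokenizer
from datasets import load_dataset
import numpy as np

tokenizer = AutoTokenizer.from_pretrained('gpt2-medium')
data = load_dataset('json', data_files='train.jsonl')['train']

# Token count per example, prompt and completion combined
lengths = [len(tokenizer(ex['prompt'] + ex['completion'])['input_ids']) for ex in data]
print('max:', max(lengths), '| 95th percentile:', np.percentile(lengths, 95))
# Anything beyond the context window (1024 tokens for gpt2-medium) will be truncated.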
3. Choosing a Fine-Tuning Method
3.1 Full-Parameter Tuning
Updates all model weights for maximum flexibility.
Tradeoff: High GPU/TPU cost and memory usage.
3.2 Parameter-Efficient Methods
- LoRA (Low-Rank Adapters): Insert small trainable matrices. Minimal overhead.
- Prefix Tuning: Learn soft prompt tokens. No model weight changes.
- Adapters: Lightweight modules between layers. Easy to switch.
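As a quick illustration of how little these methods actually train, here is a sketch of prefix tuning with the peft library (LoRA gets a full walkthrough in Section 6); the choice of 20 virtual tokens is illustrative:
from transformers import AutoModelForCausalLM
from peft import PrefixTuningConfig, TaskType, get_peft_model

base = AutoModelForCausalLM.from_pretrained('gpt2-medium')

# Learn soft prompt vectors per layer; the base weights stay frozen
prefix_config = PrefixTuningConfig(task_type=TaskType.CAUSAL_LM, num_virtual_tokens=20)
model = get_peft_model(base, prefix_config)
model.print_trainable_parameters()  # typically well under 1% of the base model's parameters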
4. Environment Setup
- Python: ≥3.8
- Libraries: transformers, datasets, accelerate, peft
- Hardware: NVIDIA GPUs with ≥16 GB VRAM or TPU v3.
- Version Control: Track data and code in Git.
Install:
pip install transformers datasets accelerate peft
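A quick sanity check, assuming PyTorch is already installed as the backend, confirms the libraries import and a suitable GPU is visible:
import torch, transformers, datasets, peft

print('transformers:', transformers.__version__)
print('datasets:', datasets.__version__)
print('peft:', peft.__version__)
print('CUDA available:', torch.cuda.is_available())
if torch.cuda.is_available():
    print('VRAM (GB):', torch.cuda.get_device_properties(0).total_memory / 1e9)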
5. Full-Parameter Fine-Tuning Example
from transformers import AutoModelForCausalLM, AutoTokenizer, Trainer, TrainingArguments
from datasets import load_dataset
# 1. Load model & tokenizer
model_name = 'gpt2-medium'
model = AutoModelForCausalLM.from_pretrained(model_name)
tokenizer = AutoTokenizer.from_pretrained(model_name)
tokenizer.pad_token = tokenizer.eos_token  # GPT-2 has no pad token by default
dataset = load_dataset('json', data_files='train.jsonl')
def preprocess(batch):
    # Tokenize prompts and completions to a fixed length so they can be batched
    inputs = tokenizer(batch['prompt'], truncation=True, padding='max_length')
    targets = tokenizer(batch['completion'], truncation=True, padding='max_length')
    # Note: padded label positions are not masked here; set them to -100 to exclude them from the loss
    inputs['labels'] = targets['input_ids']
    return inputs
dataset = dataset.map(preprocess, batched=True)
# 2. Training config
training_args = TrainingArguments(
    output_dir='out_full', num_train_epochs=3,
    per_device_train_batch_size=2, gradient_accumulation_steps=4,
    fp16=True, logging_steps=100, save_total_limit=2
)
# 3. Train
trainer = Trainer(model=model, args=training_args, train_dataset=dataset['train'])
trainer.train()
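Before formal evaluation, a quick generation pass is a useful smoke test; the prompt below is just a placeholder:
# Spot-check the fine-tuned model on a sample prompt
inputs = tokenizer('Summarize the following meeting notes:\n...', return_tensors='pt').to(model.device)
outputs = model.generate(**inputs, max_new_tokens=100)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))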
6. LoRA Fine-Tuning Example
from transformers import AutoModelForCausalLM, AutoTokenizer, Trainer, TrainingArguments
from peft import LoraConfig, get_peft_model
# 1. Base model
base_model = AutoModelForCausalLM.from_pretrained('gpt2-medium')
# 2. LoRA config
lora_config = LoraConfig(
    r=4, lora_alpha=16, target_modules=['c_attn'],
    lora_dropout=0.05, task_type='CAUSAL_LM'
)
model = get_peft_model(base_model, lora_config)
tokenizer = AutoTokenizer.from_pretrained('gpt2-medium')
# 3. Reuse training_args and the tokenized dataset from Section 5
# 4. Trainer & train
trainer = Trainer(model=model, args=training_args, train_dataset=dataset['train'])
trainer.train()
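To persist the result, save_pretrained on a PEFT model writes only the small adapter weights; merge_and_unload folds them back into the base model for adapter-free deployment. The directory names are placeholders:
# Save just the LoRA adapter (a few MB) rather than the full model
model.save_pretrained('out_lora_adapter')

# Optionally merge the adapter into the base weights for standalone inference
merged = model.merge_and_unload()
merged.save_pretrained('out_lora_merged')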
7. Hyperparameter Tuning
- Learning Rate: Start at 1e-4 for full, 1e-3 for LoRA.
- Batch Size: Maximize GPU usage without OOM.
- Epochs: 2–5 depending on dataset size.
- Warmup Steps: 5–10% of total steps to stabilize training.
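These knobs map directly onto TrainingArguments; a sketch with illustrative values for a LoRA run (warmup_ratio expresses the 5–10% rule without counting steps by hand):
from transformers import TrainingArguments

training_args = TrainingArguments(
    output_dir='out_lora',
    learning_rate=1e-3,             # ~1e-4 for full-parameter tuning
    per_device_train_batch_size=4,  # raise until just below the OOM limit
    gradient_accumulation_steps=2,
    num_train_epochs=3,
    warmup_ratio=0.05,              # 5% of total steps spent warming up
    fp16=True,
)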
8. Evaluation & Metrics
- Use a held-out validation set (e.g., 10%).
- Automatic Metrics: Perplexity, BLEU, ROUGE, EM accuracy.
- Human Review: Randomly sample outputs for quality checks.
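For causal LM fine-tuning, perplexity falls straight out of the evaluation loss. A minimal sketch, reusing the model, training_args, and tokenized dataset from Section 5 and holding out 10% for validation:
import math
from transformers import Trainer

# Split off a 10% validation set and evaluate the trained model on it
split = dataset['train'].train_test_split(test_size=0.1, seed=42)
trainer = Trainer(model=model, args=training_args,
                  train_dataset=split['train'], eval_dataset=split['test'])
metrics = trainer.evaluate()
print('perplexity:', math.exp(metrics['eval_loss']))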
9. Deployment & Monitoring
- Export: Save the model with model.save_pretrained().
- Inference: Use transformers.pipeline or FastAPI.
- Logging: Track invocation latency and errors.
- Drift Detection: Periodically evaluate on fresh data.
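A minimal export-and-serve sketch using the pipeline API, assuming the full-parameter model from Section 5 (merge a LoRA adapter first, as shown after Section 6); the directory and prompt are placeholders, and a FastAPI wrapper is omitted for brevity:
from transformers import pipeline

# Export the fine-tuned model and tokenizer together
model.save_pretrained('deploy/model')
tokenizer.save_pretrained('deploy/model')

# Load for inference
generator = pipeline('text-generation', model='deploy/model')
print(generator('Summarize the following meeting notes:\n...', max_new_tokens=100)[0]['generated_text'])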
10. Troubleshooting Tips
- OOM Errors: Reduce batch size or use gradient accumulation (see the sketch after this list).
- Unstable Loss: Lower learning rate or increase warmup.
- Poor Quality: Add more diverse examples or augment data.
- Adapter Issues: Check target_modules and dropout settings.
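For the OOM case, the usual mitigations combine as below; gradient checkpointing trades compute for memory, and the batch size and accumulation values are illustrative:
from transformers import TrainingArguments

# Recompute activations during the backward pass instead of storing them
model.gradient_checkpointing_enable()

training_args = TrainingArguments(
    output_dir='out_full',
    per_device_train_batch_size=1,   # smallest per-step memory footprint
    gradient_accumulation_steps=16,  # keeps the effective batch size at 16
    fp16=True,                       # halves activation memory on supported GPUs
)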
11. Example Use Cases
11.1 Customer Support
Automate responses using past tickets as training data, reducing response time by 40%.
11.2 Code Completion
Fine-tune on internal repos to suggest idiomatic functions and standards.
11.3 Medical Summaries
Summarize patient notes with 95% accuracy on key medical entities.