AI Concepts

What Is Fine-Tuning

Fine-tuning is the process of continuing to train a pre-trained foundation model on a task-specific dataset so the model's weights are adjusted to produce outputs that match the target behaviour more reliably. This is not prompt engineering — you are actually changing the model. The weights update. The changes persist across every subsequent call without needing to be re-specified in the prompt.

Foundation models like GPT-4o, Claude 3.5 Sonnet, and Gemini 1.5 Pro are generalised. They are trained to do everything competently. Fine-tuning specialises them. When you need specific tone, specific output format, or specific domain knowledge to be reliably embedded in model behaviour — and when prompt engineering alone cannot get you there consistently — fine-tuning is the right tool.

When to Fine-Tune vs When Not To

This decision is worth getting right. Fine-tuning costs money, takes time, and creates a model artefact you have to version and maintain. Prompt engineering is free and instant. Start there.

Fine-tune when:

  • Style, tone, or format consistency is the primary problem. If you need every output to follow a specific structure — JSON with defined fields, a particular writing register, responses always under 100 words — fine-tuning bakes this in more reliably than system prompts.
  • You have a narrow task with stable patterns. Classifying customer support tickets into 12 categories. Extracting structured data from insurance forms. Generating product descriptions in a specific brand voice. Tasks where the input/output pattern is well-defined and does not change frequently.
  • You have 1,000 or more high-quality labelled examples. Below this threshold, the fine-tuned model is likely to underperform a well-prompted base model.
  • You need to reduce inference cost by using a smaller model. Fine-tuning GPT-4o mini to match GPT-4o's performance on a specific task lets you serve that task at a fraction of the cost. This is one of the most compelling production use cases for fine-tuning.

Do not fine-tune when:

  • The primary problem is factuality or access to current information. Fine-tuning does not update the model's knowledge effectively — it adjusts behaviour, not the knowledge base. Use RAG instead.
  • Information changes frequently. A fine-tuned model reflects the training data at the time of training. If your product catalogue, pricing, or policies change often, the fine-tuned model becomes stale.
  • You have fewer than 500–1,000 high-quality examples. Small datasets lead to overfitting and degraded general capability.
  • You have not yet exhausted prompt engineering. Few-shot prompts, system message refinement, and structured output formatting solve many consistency problems without the cost and complexity of fine-tuning.

What You Need Before Starting

A task with clear, measurable success criteria. If you cannot define what "correct" looks like precisely enough to label training examples, you are not ready to fine-tune. Ambiguous labelling produces a model that learns ambiguous patterns.

A high-quality dataset. The minimum viable dataset for fine-tuning depends on the task complexity, but 1,000 input/output pairs is a reasonable floor. For robust results on complex tasks, 5,000–10,000 examples is more appropriate. Quality matters more than quantity — 500 excellent examples outperform 5,000 noisy ones.

A held-out evaluation set. Reserve 10–20% of your data for evaluation before you start training. This is the set you use to measure whether the fine-tuned model actually outperforms the base model on your task. Without it, you have no objective measure of improvement.
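The split itself is a few lines of code. A minimal sketch, assuming a 15% holdout (inside the 10–20% range above) and a fixed seed for reproducibility:

```python
import random

def split_dataset(examples, eval_fraction=0.15, seed=42):
    """Shuffle and split examples into (train, eval) before any training starts.

    eval_fraction=0.15 sits inside the 10-20% range recommended above.
    """
    rng = random.Random(seed)  # fixed seed so the split is reproducible
    shuffled = list(examples)
    rng.shuffle(shuffled)
    n_eval = max(1, int(len(shuffled) * eval_fraction))
    return shuffled[n_eval:], shuffled[:n_eval]

# Placeholder records for illustration
data = [{"input": f"ticket {i}", "output": "billing"} for i in range(1000)]
train, holdout = split_dataset(data)
print(len(train), len(holdout))  # 850 150
```

Do the split once, store both files, and never let holdout examples leak into the training set across dataset versions.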

Budget clarity. Fine-tuning GPT-4o mini via the OpenAI API costs approximately $25 per million tokens for training input at current pricing. A 1,000-example dataset with average example length of 500 tokens costs roughly $12–25 in training compute.
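The arithmetic behind that estimate is simple: billed training tokens scale with both dataset size and the number of epochs. A sketch using the document's figures (the per-million-token rate is the assumption quoted above, not a live price list — check current provider pricing before budgeting):

```python
def training_cost_usd(n_examples, avg_tokens_per_example,
                      price_per_million=25.0, epochs=1):
    """Estimate fine-tuning training cost; billed tokens scale with epochs."""
    total_tokens = n_examples * avg_tokens_per_example * epochs
    return total_tokens / 1_000_000 * price_per_million

print(training_cost_usd(1000, 500, epochs=1))  # 12.5
print(training_cost_usd(1000, 500, epochs=2))  # 25.0
```

The $12–25 range in the text corresponds to one versus two training epochs over the same 500,000-token dataset.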

Fine-Tuning Methods

Full Fine-Tuning

Update all model weights. Maximum flexibility and performance ceiling. Practical only for smaller models (7B parameters and below) or teams with significant GPU infrastructure. Not practical for teams fine-tuning via API.

LoRA (Low-Rank Adaptation)

Add small trainable weight matrices to specific layers of the model without modifying the base weights. During training, only these low-rank matrices update. The result is a small adapter file that, when combined with the base model, produces the fine-tuned behaviour. LoRA reduces compute requirements by an order of magnitude compared to full fine-tuning. It is the dominant practical method for teams fine-tuning open-source models. Available via the Hugging Face PEFT library. The original paper is at arxiv.org/abs/2106.09685.
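The order-of-magnitude saving follows directly from the shapes involved: fully updating a d × k weight matrix trains d·k parameters, while a rank-r LoRA adapter (B: d × r, A: r × k) trains only r·(d + k). A back-of-envelope sketch with dimensions chosen purely for illustration:

```python
def lora_param_counts(d, k, r):
    """Trainable parameters: full update of a d x k matrix
    vs a rank-r LoRA adapter (B: d x r plus A: r x k)."""
    full = d * k
    lora = r * (d + k)
    return full, lora

# A 4096 x 4096 attention projection at rank 8
full, lora = lora_param_counts(d=4096, k=4096, r=8)
print(full, lora, full // lora)  # 16777216 65536 256
```

At rank 8 the adapter trains 256× fewer parameters for that matrix, which is why LoRA adapter files are megabytes rather than gigabytes.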

RLHF (Reinforcement Learning from Human Feedback)

Train a reward model on human preference data, then use it to guide further training. This is how ChatGPT, Claude, and Gemini are aligned with human preferences. It requires significant infrastructure, human annotation pipelines, and ML research capability. Not practical for most product teams.

API Fine-Tuning

OpenAI, Google (via Vertex AI), and others offer fine-tuning through their APIs. You provide training data in JSONL format, trigger a training job, and receive a model ID for the fine-tuned model. No compute management required. This is the right approach for most product teams. Docs: platform.openai.com/docs/guides/fine-tuning.
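For chat models, the JSONL format is one JSON object per line containing a messages array of system/user/assistant turns. A minimal data-prep sketch (the example content is a placeholder):

```python
import json

def write_jsonl(examples, path):
    """Write labelled examples in the chat-format JSONL that OpenAI's
    fine-tuning endpoint expects: one JSON object per line."""
    with open(path, "w", encoding="utf-8") as f:
        for ex in examples:
            record = {
                "messages": [
                    {"role": "system", "content": ex["system"]},
                    {"role": "user", "content": ex["input"]},
                    {"role": "assistant", "content": ex["output"]},
                ]
            }
            f.write(json.dumps(record, ensure_ascii=False) + "\n")

examples = [
    {"system": "Classify the support ticket.",
     "input": "I was charged twice this month.",
     "output": "billing"},
]
write_jsonl(examples, "train.jsonl")
```

Once the file is uploaded, starting the job is a single API call (`client.fine_tuning.jobs.create` in the official `openai` Python SDK) and the completed job returns the fine-tuned model ID you then pass as the `model` parameter at inference time.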

Evaluating the Result

Compare the fine-tuned model against the base model using your held-out evaluation set. Measure the metrics that match your task: F1 score for classification tasks, format adherence rate for structured output tasks, human preference rating for generation quality tasks.
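For a structured-output task, format adherence rate can be as simple as the fraction of outputs that parse as JSON and contain every required field. A minimal sketch — the field names are a hypothetical schema, not from any particular API:

```python
import json

REQUIRED_FIELDS = {"category", "priority"}  # hypothetical output schema

def format_adherence_rate(outputs):
    """Fraction of model outputs that are valid JSON objects
    containing all required fields."""
    ok = 0
    for text in outputs:
        try:
            parsed = json.loads(text)
        except json.JSONDecodeError:
            continue
        if isinstance(parsed, dict) and REQUIRED_FIELDS <= parsed.keys():
            ok += 1
    return ok / len(outputs) if outputs else 0.0

outputs = [
    '{"category": "billing", "priority": "high"}',  # valid
    '{"category": "billing"}',                      # missing a field
    'Sure! Here is the JSON: {...}',                # not JSON at all
]
print(format_adherence_rate(outputs))  # 0.3333333333333333
```

Run the same function over base-model and fine-tuned outputs on the held-out set; the difference between the two rates is the measurable improvement.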

Watch for catastrophic forgetting — the fine-tuned model underperforming on general tasks it previously handled well. This is a real risk when fine-tuning on narrow datasets. Test the fine-tuned model on tasks outside the training domain to confirm it has not degraded across the board.

If the fine-tuned model does not outperform a well-prompted base model on your eval set, the problem is usually the training data — either too few examples, inconsistent labelling, or examples that do not accurately represent the production input distribution.

Production Considerations

A fine-tuned model is a version-controlled artefact. Treat it like any other production dependency:

  • A model registry entry with the training dataset version, training date, and evaluation results.
  • Rollback capability — keep the previous fine-tuned model available so you can revert if a new version degrades.
  • Performance monitoring post-deployment. Fine-tuned models drift as the input distribution shifts. Track your task-specific metrics in production, not just at evaluation time.
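The registry entry need not be elaborate to be useful. A minimal record as a sketch — every field name, ID, and number here is illustrative, not from a real system:

```python
# Illustrative registry record: field names, IDs, and metric values
# are placeholders, not real artefacts.
registry_entry = {
    "model_id": "ft:gpt-4o-mini:acme:ticket-classifier:v3",
    "base_model": "gpt-4o-mini",
    "dataset_version": "tickets-2024-06-01",
    "trained_at": "2024-06-03",
    "eval": {"f1": 0.91, "format_adherence": 0.99},
    "previous_model_id": "ft:gpt-4o-mini:acme:ticket-classifier:v2",  # rollback target
}
```

The `previous_model_id` field is what makes rollback a one-line config change rather than an emergency retraining job.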

Common Failure Modes

Training on noisy data: Inconsistent labels teach the model inconsistent patterns. Invest in data quality review — have multiple human reviewers label examples for agreement checking before using them for training.
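The agreement check can start as raw percent agreement between two reviewers, with Cohen's kappa to correct for chance agreement. A pure-Python sketch (label values are placeholders):

```python
from collections import Counter

def cohens_kappa(labels_a, labels_b):
    """Chance-corrected agreement between two annotators on the same items.

    Undefined (division by zero) in the degenerate case where expected
    agreement is 1.0, i.e. both annotators always use a single label.
    """
    n = len(labels_a)
    observed = sum(a == b for a, b in zip(labels_a, labels_b)) / n
    freq_a, freq_b = Counter(labels_a), Counter(labels_b)
    expected = sum(
        (freq_a[label] / n) * (freq_b[label] / n)
        for label in freq_a.keys() | freq_b.keys()
    )
    return (observed - expected) / (1 - expected)

a = ["billing", "billing", "refund", "refund", "billing"]
b = ["billing", "refund", "refund", "refund", "billing"]
print(round(cohens_kappa(a, b), 3))  # 0.615
```

A common rule of thumb is to investigate the labelling guidelines whenever kappa falls below roughly 0.8; low agreement between humans guarantees noisy training signal for the model.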

Evaluating only on the training distribution: If your eval set was drawn from the same source as your training set, you are measuring memorisation, not generalisation. Use a held-out eval set that reflects real production inputs.

Fine-tuning before prompt engineering: Many teams fine-tune to solve problems that a well-written system prompt would solve for free. Always attempt structured prompting first. If you cannot achieve 80–85% of your quality target with prompting, then fine-tuning is justified.

Talk to an AI Implementation Expert

If you want help deciding whether fine-tuning is the right tool for your use case or designing a training dataset, book a working session.

Book a call: https://calendly.com/ai-creation-labs/30-minute-chatgpt-leads-discovery-call

During the call we can cover:

  • fine-tuning vs RAG vs prompting decision for your specific task
  • dataset requirements and labelling strategy
  • model selection and cost modelling
  • evaluation framework before and after training
