LLM Fine-tuning · Suman Bhadra Notes

What it is

Fine-tuning means taking a pretrained base LLM that already knows general language and training it a bit more on your own examples, so it specializes — picking up your tone, your output format, or your domain. You are not building a model from scratch; you are nudging an existing one in a direction you care about.

Think of it like hiring a smart generalist graduate. They already read, write, and reason well — that's the years of pretraining. You don't re-teach them language; you give them a few weeks of on-the-job training so they learn your company's way of doing things: the house style, the templates, the jargon, the edge cases. Fine-tuning is that on-the-job training, applied to the model's weights.

Recall the three stages a chat LLM goes through (see What is an LLM): pretraining on trillions of tokens, supervised fine-tuning (SFT) on instruction pairs, and preference tuning (RLHF / DPO). What people usually mean by "fine-tuning your own model" is the same machinery as that SFT stage — continued training on a curated set of examples — but starting from an already-capable base or chat model and steering it toward your specific task.

In one sentence

Fine-tuning is continued training of a pretrained LLM on your own input→output examples, baking a specific behavior, style, or skill directly into its weights.

Fine-tuning vs prompting vs RAG

Before reaching for fine-tuning, it helps to see it next to the two cheaper ways of steering an LLM. All three change what the model produces — but only one of them actually touches the weights.

Prompting instructions in context — no training

Just tell the model what you want, in the prompt: a system message, a few examples, a format spec. Nothing is trained, nothing is stored. It's the cheapest and most instant lever, and the right first thing to try. The catch: everything lives in the context window and is re-sent on every call.

RAG inject external knowledge at query time

Retrieve relevant documents from a knowledge base and paste them into the prompt before the model answers. Best when the model needs facts that change — prices, policies, this week's docs. You update a database, not a model. See How RAG Works.

Fine-tuning bake behavior into the weights

Actually train the model on your examples so the desired behavior is encoded in its parameters. Best for a consistent format, tone, or skill you want by default — without spelling it out every time. Slower and costlier to set up, but the behavior is then "free" at inference.

Rule of thumb

RAG for knowledge, fine-tuning for behavior — and always try prompting first. If a good system prompt and a couple of examples already get you there, you're done. Reach for RAG when the answer depends on facts the model doesn't reliably know, and for fine-tuning when you need a consistent way of responding that prompting can't pin down. The three also stack: you can fine-tune a model and give it RAG.

How it works (the recipe)

Fine-tuning is ordinary supervised learning, applied to a model that already knows a lot. The objective never changes — it's still next-token prediction — only the data and the starting point do.

1 — Gather data input → desired-output pairs

Collect a labelled dataset of examples in exactly the shape you want at inference: a prompt and the ideal response. A few hundred to a few thousand high-quality pairs often beats tens of thousands of noisy ones.

2 — Start from the base load pretrained weights

Initialise from the base model's existing weights — never from scratch. All the general language ability is already there; you're only adjusting it. This reuse of pretrained knowledge is transfer learning.

3 — Train gradient descent, a few epochs

Run the usual loop on your data: forward pass → compute the loss against the desired output → backpropagate → update the weights. Use a small learning rate and only a few epochs — you're tweaking, not rebuilding.

4 — Evaluate check, don't overcook

Hold out a validation set and watch for overfitting (great on training data, worse in the wild) and catastrophic forgetting (the model gets better at your task but worse at everything else). Stop as soon as it stops improving.

Why a small learning rate matters

The base model sits at a carefully balanced set of weights. A large learning rate would shove it far from that point and wipe out general ability; a small one nudges it just enough to absorb your examples while keeping everything else intact.

Full fine-tuning vs PEFT (LoRA)

This is the key practical decision. The naive way is to update every weight in the model. The modern, far cheaper way is to freeze the model and train a tiny set of extra parameters alongside it. The second approach — PEFT, parameter-efficient fine-tuning — is what makes fine-tuning affordable, and LoRA is its most popular form.

Full fine-tuning

Updates all of the model's parameters — billions of them
Expensive: needs enough GPU memory to hold the model, its gradients, and optimizer states all at once
Produces a full-size copy of the model per task — huge storage if you have many
Higher risk of catastrophic forgetting, since every weight can drift

LoRA / PEFT

Freezes the base model; trains only small low-rank adapter matrices added alongside the weights
Trains roughly 0.1–1% of the parameters — fits on far smaller hardware
The adapter is tiny (megabytes) and swappable — keep many task adapters over one shared base
The frozen base can't drift, so general ability is largely preserved

How LoRA works

For a weight matrix W, instead of updating W directly, LoRA learns a small update and adds it: the effective weight becomes W + B·A, where A and B are low-rank matrices (a thin r×d and a thin d×r, with rank r tiny — often 8 or 16). Their product B·A has the same shape as W, but together A and B hold a fraction of the parameters. W stays frozen; only A and B are trained. QLoRA goes further still — it quantizes the frozen base to 4-bit to slash memory, then trains LoRA adapters on top — letting you fine-tune very large models on a single consumer GPU.

LoRA savings calculator

The "trains ~1% of the params" claim is easy to say and hard to feel. Play with it below. Pick the rank r and the base model size, and watch how few parameters LoRA actually trains — and how dramatically the two thin adapter matrices shrink the work compared with touching the whole frozen weight matrix.

rank r base size

frozen weight matrix W (d×d) adapter B (d×r) adapter A (r×d)

Numbers are approximate and illustrative: they assume a representative per-layer dimension d = 4096 and apply LoRA to every weight matrix in the model. Real setups vary, but the order of magnitude — hundreds to thousands of times fewer trainable parameters — is the point.

Watch fine-tuning adjust the model

Each cell in the grid below is one weight in the model. The animation runs both strategies in turn: first full fine-tuning lights up every cell (every parameter is updated); then it resets and runs LoRA, keeping the base frozen and training only a small pair of low-rank adapter blocks beside it. The last step puts the parameter counts side by side. Watch the colours — grey means frozen, orange means updated, blue means the trainable adapter.

Pitfalls

Fine-tuning is powerful but easy to misuse. Most failed fine-tuning projects trip over one of these, and many never needed fine-tuning at all.

Watch out for

Quality data is everything — fine-tuning faithfully copies whatever you show it. Garbage in, garbage out: inconsistent, mislabelled, or biased examples produce an inconsistent, biased model.
Catastrophic forgetting — push too hard and the model gets sharper at your task while losing general skills it used to have. PEFT and a small learning rate reduce this, but don't ignore it.
Overfitting on small datasets — with only a few examples the model can memorize them verbatim and fail to generalize. Hold out a validation set and stop early.
Cost and operational overhead — data collection, training runs, evaluation, hosting, and versioning the result. It's a real engineering commitment, not a one-click toggle.
You usually don't need it — try prompting first, then RAG. Reach for fine-tuning only once you've confirmed those genuinely can't get you there.

When to fine-tune — and when not to

Pull it all together with a single decision. Fine-tuning earns its cost when you need a baked-in behavior; it's the wrong tool when the real need is fresh knowledge or you simply haven't tried the cheaper levers yet.

Fine-tune when

You need a consistent style or output format by default, every time
You're teaching a narrow, specialized skill the base model handles poorly
You want lower latency or cost than stuffing a huge prompt into every call
You need an offline or smaller model to punch above its weight on one task

Don't fine-tune when

The knowledge changes often — use RAG so you update a database, not a model
You have little data — too few examples invite overfitting and weak generalization
Prompting already works — don't pay training and hosting costs for nothing
You need the model to cite or ground its answers in specific sources — that's RAG's job