Quantization & LoRA
Making giant models fit and finetune
A modern model has billions of weights. Two problems follow: it's huge to store and run, and it's expensive to fine-tune. Two complementary tricks solve them. Quantization stores each weight in fewer bits to shrink the model. LoRA freezes the model and trains a tiny add-on instead of all the weights.
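Quantization: store weights as int8 or int4 instead of 32-bit floats. That is 4× smaller at int8 and 8× smaller at int4, usually with only a small loss in quality.
LoRA: keep the big weights frozen; learn two small low-rank matrices that nudge them for your task.
Both together: fine-tune a quantized model with LoRA (the approach known as QLoRA), which makes it possible to train a very large model on a single GPU.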
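To make the size savings concrete, here is a minimal back-of-the-envelope sketch in plain Python. The 7-billion-parameter count is a hypothetical figure chosen for illustration. The estimate covers the weights only, so it ignores activations, optimizer state, and the small per-block scale factors that real quantized formats also store.

```python
def weight_memory_gib(n_params: int, bits_per_weight: int) -> float:
    """Memory needed to store the weights alone, in GiB."""
    return n_params * bits_per_weight / 8 / 2**30

# Hypothetical model size, chosen only for illustration.
N_PARAMS = 7_000_000_000

for bits in (32, 8, 4):
    gib = weight_memory_gib(N_PARAMS, bits)
    print(f"{bits:>2}-bit weights: {gib:6.2f} GiB")
```

Going from 32 bits to 8 bits divides the footprint by 4, and going to 4 bits divides it by 8, whatever the parameter count.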
Coarsen the bits, then bolt on an adapter
First, full-precision weights snap to a handful of coarse levels: nearly the same values, far less memory. Then LoRA freezes the big weight matrix and adds a small trainable detour alongside it.
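The "snap to coarse levels" step can be sketched with symmetric absmax quantization, one common scheme and not necessarily the one any particular library uses. The function names and weight values below are made up for illustration, and a single scale is shared by all weights, whereas practical schemes usually keep one scale per block or per channel.

```python
def quantize(weights, bits=8):
    """Symmetric absmax quantization: map floats to signed integer codes."""
    qmax = 2 ** (bits - 1) - 1  # 127 for int8, 7 for int4
    max_abs = max(abs(w) for w in weights)
    scale = max_abs / qmax if max_abs > 0 else 1.0
    # Round to the nearest level and clamp into the representable range.
    codes = [max(-qmax, min(qmax, round(w / scale))) for w in weights]
    return codes, scale

def dequantize(codes, scale):
    """Recover approximate float weights from integer codes."""
    return [c * scale for c in codes]

# Made-up weights, for illustration only.
weights = [0.42, -1.30, 0.07, 0.88, -0.55, 1.21]

for bits in (8, 4):
    codes, scale = quantize(weights, bits)
    restored = dequantize(codes, scale)
    worst = max(abs(w - r) for w, r in zip(weights, restored))
    print(f"int{bits}: codes={codes} max error={worst:.4f}")
```

Each restored weight lands within half a step (`scale / 2`) of the original. With fewer bits the steps are wider, which is why int4 loses more precision than int8.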
When to reach for each
Quantization:
- Run a big model on smaller/cheaper hardware
- Often faster inference, lower memory
- Nearly lossless at int8; int4 works but needs more care
LoRA:
- Cheaply specialize a base model to your data
- Adapters are tiny (MBs) and swappable
- The base stays shared across many tasks
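A quick count shows why adapters are so small. For a frozen weight matrix of shape `d_out × d_in`, LoRA trains two matrices of shape `d_out × r` and `r × d_in`, where the rank `r` is small. The layer shape and rank below are hypothetical values chosen for illustration.

```python
def full_params(d_out: int, d_in: int) -> int:
    """Number of weights in the full (frozen) matrix."""
    return d_out * d_in

def lora_params(d_out: int, d_in: int, rank: int) -> int:
    """Number of trainable weights in the two low-rank matrices."""
    return rank * (d_in + d_out)

# Hypothetical layer shape and rank, for illustration only.
D_OUT, D_IN, RANK = 4096, 4096, 8

full = full_params(D_OUT, D_IN)
lora = lora_params(D_OUT, D_IN, RANK)
print(f"full matrix : {full:,} weights")
print(f"LoRA adapter: {lora:,} weights ({100 * lora / full:.2f}% of full)")
print(f"adapter size at 16 bits per weight: {lora * 2 / 2**10:.0f} KiB")
```

Under these assumptions the adapter for one layer is well under 1% of the matrix it modifies. Summed over every adapted layer of a model, that is how the total can stay in the megabyte range.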
LoRA is a parameter-efficient fine-tuning (PEFT) method. Because the base weights never change, you can keep one frozen model and hot-swap small adapters for different tasks (chat, code, a specific writing style) without storing a full copy of the model for each one.
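The hot-swapping idea can be sketched in plain Python as one shared base matrix `W` plus a dictionary of adapters. The adapter names, shapes, and values are hypothetical, and the scaling factor that LoRA implementations usually apply to the detour is left out for brevity.

```python
def matvec(matrix, vector):
    """Multiply a matrix (list of rows) by a vector."""
    return [sum(m * v for m, v in zip(row, vector)) for row in matrix]

def lora_forward(W, adapter, x):
    """Compute W x + B (A x); the base matrix W is never modified."""
    base = matvec(W, x)
    detour = matvec(adapter["B"], matvec(adapter["A"], x))
    return [b + d for b, d in zip(base, detour)]

# Frozen base weights (3 outputs, 4 inputs); made-up values.
W = [[1.0, 0.0, 0.0, 0.0],
     [0.0, 1.0, 0.0, 0.0],
     [0.0, 0.0, 1.0, 0.0]]

# Rank-1 adapters: A is 1x4, B is 3x1. Names and values are hypothetical.
adapters = {
    "chat": {"A": [[1.0, 0.0, 0.0, 0.0]], "B": [[0.5], [0.0], [0.0]]},
    "code": {"A": [[0.0, 0.0, 0.0, 1.0]], "B": [[0.0], [0.0], [2.0]]},
}

x = [1.0, 2.0, 3.0, 4.0]
for name, adapter in adapters.items():
    print(name, lora_forward(W, adapter, x))
# chat [1.5, 2.0, 3.0]
# code [1.0, 2.0, 11.0]
```

Switching tasks only changes which small `A` and `B` pair is looked up. The large matrix `W` is loaded once and shared by every adapter.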