Quantization & LoRA

Making giant models fit and fine-tune

A modern model has billions of weights. Two problems follow: it's huge to store and run, and it's expensive to fine-tune. Two complementary tricks solve them. Quantization stores each weight in fewer bits to shrink the model. LoRA freezes the model and trains a tiny add-on instead of all the weights.

Quantization: fewer bits

Store weights as int8 or int4 instead of 32-bit floats: roughly 4× smaller at int8 and 8× smaller at int4, usually with only a small loss in quality.
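To make the size and accuracy trade-off concrete, here is a minimal sketch of symmetric per-tensor int8 quantization in NumPy. The function names `quantize_int8` and `dequantize`, the random weights, and the single shared scale factor are assumptions for illustration; production schemes typically use per-channel or per-block scales.

```python
import numpy as np

def quantize_int8(w):
    """Symmetric per-tensor quantization: map floats onto integers in [-127, 127]."""
    max_abs = float(np.max(np.abs(w)))
    scale = max_abs / 127.0 if max_abs > 0 else 1.0
    q = np.clip(np.round(w / scale), -127, 127).astype(np.int8)
    return q, scale

def dequantize(q, scale):
    """Recover approximate float32 weights from the stored integers."""
    return q.astype(np.float32) * scale

rng = np.random.default_rng(0)
# Made-up weights standing in for one layer of a model (assumption).
weights = rng.normal(0.0, 0.02, size=(256, 256)).astype(np.float32)

q_weights, scale = quantize_int8(weights)
restored = dequantize(q_weights, scale)

size_ratio = weights.nbytes / q_weights.nbytes          # 4 bytes vs 1 byte per weight
max_error = float(np.max(np.abs(weights - restored)))   # bounded by about scale / 2
print(f"size ratio: {size_ratio:.0f}x, max abs error: {max_error:.6f}, scale: {scale:.6f}")
```

Each weight lands on the nearest of 255 evenly spaced levels, so the rounding error per weight is at most about half the spacing between levels. The stored scale factor is the only extra data needed to map the integers back to floats.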

LoRA: tiny adapter

Keep the big weights frozen; learn two small low-rank matrices whose product is added to them, nudging the model toward your task.
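A small NumPy sketch may help show what "two small low-rank matrices" means in practice. The layer size, the rank of 8, the `scaling` parameter, and the name `lora_forward` are illustrative assumptions. Initializing `B` to zeros is a common choice, so that the adapter starts out changing nothing.

```python
import numpy as np

rng = np.random.default_rng(0)
d_out, d_in, rank = 512, 512, 8       # hypothetical layer size and adapter rank

W = rng.normal(0.0, 0.02, size=(d_out, d_in))   # frozen base weights, never updated
A = rng.normal(0.0, 0.01, size=(rank, d_in))    # trainable, small random init
B = np.zeros((d_out, rank))                     # trainable, zero init: adapter starts as a no-op

def lora_forward(x, W, A, B, scaling=1.0):
    """Base output plus the low-rank correction: x W^T + scaling * (x A^T) B^T."""
    return x @ W.T + scaling * (x @ A.T) @ B.T

x = rng.normal(size=(4, d_in))
base_out = x @ W.T
adapted_out = lora_forward(x, W, A, B)          # equals base_out while B is all zeros

full_params = W.size               # 512 * 512 = 262144
lora_params = A.size + B.size      # 8 * 512 + 512 * 8 = 8192
print(f"trainable: {lora_params} of {full_params} ({100 * lora_params / full_params:.1f}%)")
```

The correction is mathematically the same as using the weight matrix `W + B @ A`, but only `A` and `B` ever receive updates, which is why the trainable parameter count is a small fraction of the full matrix.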

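QLoRA: both at once

Fine-tune a quantized model with LoRA: the frozen base sits in memory at low precision while only the small adapter is trained, which can bring a very large model within reach of a single GPU.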
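The toy sketch below combines the two ideas: the base weights are stored as frozen 4-bit integer codes and dequantized on the fly, while gradient steps touch only the LoRA matrices. It is a simplification under stated assumptions: it uses uniform symmetric 4-bit levels rather than the NF4 data type of the actual QLoRA method, keeps each code in an int8 container instead of packing two per byte, and trains on made-up regression targets with hand-written gradients. All sizes and names are hypothetical.

```python
import numpy as np

rng = np.random.default_rng(0)
d_in, d_out, rank, n = 64, 64, 4, 32   # hypothetical toy sizes

# 1) Quantize the base weights once; keep only integer codes plus one scale.
W_full = rng.normal(0.0, 0.05, size=(d_out, d_in)).astype(np.float32)
scale = float(np.max(np.abs(W_full))) / 7.0             # 4-bit symmetric range: -7..7
W_codes = np.clip(np.round(W_full / scale), -7, 7).astype(np.int8)  # frozen from here on
codes_before = W_codes.copy()

# 2) The LoRA matrices stay in floating point and are the only trainable parameters.
A = rng.normal(0.0, 0.01, size=(rank, d_in))
B = np.zeros((d_out, rank))

def forward(x):
    W_deq = W_codes.astype(np.float64) * scale          # dequantize on the fly
    return x @ W_deq.T + (x @ A.T) @ B.T

x = rng.normal(size=(n, d_in))
target = rng.normal(size=(n, d_out))                    # made-up regression targets

def loss_fn(y):
    return 0.5 * float(np.sum((y - target) ** 2)) / n

initial_loss = loss_fn(forward(x))
lr = 0.05
for _ in range(50):
    grad_y = (forward(x) - target) / n                  # dLoss/dy, shape (n, d_out)
    hidden = x @ A.T                                    # shape (n, rank)
    grad_B = grad_y.T @ hidden                          # shape (d_out, rank)
    grad_A = (grad_y @ B).T @ x                         # shape (rank, d_in)
    A -= lr * grad_A                                    # only the adapter moves
    B -= lr * grad_B
final_loss = loss_fn(forward(x))

print(f"loss before: {initial_loss:.4f}, after: {final_loss:.4f}")
```

The integer codes are never written to during training, so the memory cost of the base stays at its quantized size; optimizer state and gradients are needed only for `A` and `B`.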

Coarsen the bits, then bolt on an adapter

Quantization snaps full-precision weights to a handful of coarse levels: approximately the same values, far less memory. LoRA then freezes the big weight matrix and adds a small trainable detour alongside it.

When to reach for each

Quantization
  • Run a big model on smaller/cheaper hardware
  • Faster inference, lower memory
  • Close to lossless at int8; int4 works, but needs more care
LoRA
  • Cheaply specialize a base model to your data
  • Adapters are tiny (MBs) and swappable
  • The base stays shared across many tasks

Adapter, not surgery

LoRA is a parameter-efficient fine-tuning (PEFT) method. Because the base weights never change, you can keep one frozen model and hot-swap small adapters for different tasks (chat, code, a specific writing style) without storing a full copy of the model for each one.
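Hot-swapping can be sketched as one shared base matrix plus a dictionary of adapters selected by task name. The task names, sizes, and the `make_adapter` and `run` helpers are hypothetical, and the adapters here are random stand-ins rather than trained ones.

```python
import numpy as np

rng = np.random.default_rng(0)
d, rank = 256, 4                                  # hypothetical sizes

base_W = rng.normal(0.0, 0.02, size=(d, d))       # one shared, frozen base matrix

def make_adapter():
    """Random stand-in for a trained adapter; real ones come out of fine-tuning."""
    return {"A": rng.normal(0.0, 0.01, size=(rank, d)),
            "B": rng.normal(0.0, 0.01, size=(d, rank))}

adapters = {"chat": make_adapter(), "code": make_adapter(), "style": make_adapter()}

def run(x, task):
    """Apply the shared base plus whichever adapter the task selects."""
    adapter = adapters[task]
    return x @ base_W.T + (x @ adapter["A"].T) @ adapter["B"].T

x = rng.normal(size=(1, d))
chat_out = run(x, "chat")
code_out = run(x, "code")                         # same base, different behavior

adapter_params = 2 * rank * d                     # 2048 per task
base_params = d * d                               # 65536, stored once
shared_total = base_params + len(adapters) * adapter_params   # 71680
copies_total = len(adapters) * base_params                    # 196608 for three full copies
print(f"shared base + adapters: {shared_total}, full copy per task: {copies_total}")
```

Switching tasks is a dictionary lookup rather than a model reload, and the storage saving grows with every extra task because each one adds only `adapter_params` values.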