Evaluating LLMs — BLEU, ROUGE, LLM-as-Judge
How do you grade an open-ended answer?
For classification, accuracy and F1 work because there's one right label. But for generated text, there are countless good answers — "The cat sat" and "A cat is sitting" are both fine. Evaluation gets genuinely hard.
Three families of approaches: overlap metrics (BLEU, ROUGE) compare against reference texts, LLM-as-judge asks a strong model to rate quality, and human evaluation remains the gold standard.
Compare the approaches
BLEU counts how many of the model's n-grams match the reference (precision), ROUGE counts how many of the reference's n-grams the model covered (recall), and an LLM judge scores quality directly.
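To make the precision/recall distinction concrete, here is a minimal sketch of the n-gram overlap at the heart of both metrics. It is not full BLEU or ROUGE: real BLEU combines several n-gram orders with a brevity penalty, and ROUGE has several variants. The function names `ngrams` and `overlap_scores` are made up for this illustration, and tokenization is a plain whitespace split.

```python
from collections import Counter


def ngrams(tokens, n):
    """Count every contiguous n-gram in a token list."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))


def overlap_scores(candidate, reference, n=1):
    """Return (precision, recall) of n-gram overlap between two strings.

    Simplified illustration only: lowercased whitespace tokens, one n-gram
    order, no brevity penalty, no stemming.
    """
    cand = ngrams(candidate.lower().split(), n)
    ref = ngrams(reference.lower().split(), n)
    # Counter intersection keeps the minimum count of each shared n-gram,
    # so a repeated word cannot be credited more often than it occurs.
    matches = sum((cand & ref).values())
    precision = matches / max(sum(cand.values()), 1)  # BLEU-style view
    recall = matches / max(sum(ref.values()), 1)      # ROUGE-style view
    return precision, recall


# Only "cat" is shared: precision is 1/3, recall is 1/4.
p, r = overlap_scores("The cat sat", "A cat is sitting")
print(p, r)
```

Both sentences are acceptable descriptions of the same scene, yet they share a single word, which is why overlap scores alone can undersell a valid paraphrase.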
The metrics
BLEU (precision): Of the n-grams the model produced, how many appear in the reference? Built for translation. Penalizes wrong or extra words.
ROUGE (recall): Of the reference's n-grams, how many did the model cover? Built for summarization. Penalizes missing content.
LLM-as-judge: Prompt a strong LLM to score or compare answers on helpfulness, correctness, and style. Scales human-like judgment cheaply.
Other metrics: Perplexity (how surprised the model is by held-out text), BERTScore (embedding similarity rather than exact word match), and benchmark suites like MMLU for knowledge.
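Perplexity can be sketched in a few lines, assuming you already have the natural-log probability the model assigned to each token of a held-out text. The `perplexity` helper and the sample values are hypothetical; how you obtain per-token log-probabilities depends on your model or API.

```python
import math


def perplexity(token_logprobs):
    """Perplexity = exp of the negative mean log-probability per token.

    Assumes natural-log probabilities, one per token of the held-out text.
    """
    if not token_logprobs:
        raise ValueError("need at least one token log-probability")
    return math.exp(-sum(token_logprobs) / len(token_logprobs))


# Illustrative values, not real model output: a model that gives every
# token probability 0.25 is as "surprised" as a uniform guess among four
# options, so its perplexity is about 4.
uniform_guess = [math.log(0.25)] * 4
print(perplexity(uniform_guess))
```

Lower is better: a model that assigns higher probability to the held-out text is less surprised by it. Perplexity values are only comparable between models that share a tokenizer.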
The blind spots
Overlap metrics (BLEU, ROUGE):
- Reward surface overlap, not meaning
- Penalize valid paraphrases
- Need reference texts; weak for open-ended chat
LLM-as-judge:
- Biases: tends to prefer longer answers, its own style, and the first option shown
- Can be inconsistent or gamed
- The judge can be wrong, so verify on hard cases
Triangulate: cheap automatic metrics for fast iteration, LLM-as-judge for nuanced comparisons (with bias controls like swapping option order), and human review for the final call. No single number captures "good text".
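One way to apply the order-swapping control is to ask the judge twice with the answers in opposite positions and accept a verdict only when both runs agree. The sketch below assumes a hypothetical `judge` callable that takes a question and two answers and returns "first", "second", or "tie"; in practice it would wrap a call to whatever model you use. The stub judges exist only to demonstrate the behavior.

```python
from typing import Callable

# Hypothetical judge interface: (question, first_answer, second_answer)
# -> "first", "second", or "tie". A real judge would call an LLM here.
Judge = Callable[[str, str, str], str]


def compare_with_swap(judge: Judge, question: str, answer_a: str, answer_b: str) -> str:
    """Return "A", "B", "tie", or "inconsistent" after judging both orders."""
    forward = judge(question, answer_a, answer_b)   # A shown first
    backward = judge(question, answer_b, answer_a)  # B shown first
    # Translate positional verdicts back to answer labels.
    forward_label = {"first": "A", "second": "B", "tie": "tie"}[forward]
    backward_label = {"first": "B", "second": "A", "tie": "tie"}[backward]
    if forward_label == backward_label:
        return forward_label
    return "inconsistent"  # the verdict flipped with position


def position_biased_judge(question: str, first: str, second: str) -> str:
    """Stub that always prefers whatever is shown first."""
    return "first"


def length_biased_judge(question: str, first: str, second: str) -> str:
    """Stub that prefers the longer answer regardless of position."""
    if len(first) == len(second):
        return "tie"
    return "first" if len(first) > len(second) else "second"


q = "What is the capital of France?"
print(compare_with_swap(position_biased_judge, q, "Paris.", "It is Paris."))
print(compare_with_swap(length_biased_judge, q, "Paris.", "It is Paris."))
```

The swap exposes the position-biased stub, whose verdict flips with the ordering. It does not expose the length-biased stub, which consistently picks the longer answer, so order swapping controls for one bias only and other checks are still needed.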