Evaluating LLMs — BLEU, ROUGE, LLM-as-Judge

How do you grade an open-ended answer?

For classification, accuracy and F1 work because there's one right label. For generated text there are countless good answers: "The cat sat" and "A cat is sitting" are both fine. Exact-match scoring breaks down, and evaluation gets genuinely hard.

There are three families of approaches: overlap metrics (BLEU, ROUGE) compare the output against reference texts, LLM-as-judge asks a strong model to rate quality, and human evaluation remains the gold standard.

Compare the approaches

BLEU counts matching n-grams (precision), ROUGE counts covered n-grams (recall), and an LLM judge scores quality directly.

The metrics

BLEU: n-gram precision

Of the n-grams the model produced, what fraction appear in the reference? Built for machine translation. Punishes wrong or extra words, and its brevity penalty discourages outputs that are too short.
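To make the precision idea concrete, here is a minimal sketch of the clipped n-gram precision at the core of BLEU. It assumes simple whitespace tokenization and a single reference, and the function names are illustrative. Full BLEU also combines the precisions for n = 1 to 4 with a geometric mean and applies a brevity penalty, neither of which is shown here.

```python
from collections import Counter

def ngrams(tokens, n):
    """Return the list of n-grams (as tuples) in a token list."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

def clipped_precision(candidate, reference, n=1):
    """Fraction of candidate n-grams that also appear in the reference.

    Each candidate n-gram is credited at most as many times as it occurs
    in the reference (the "clipping" step), so repeating a correct word
    does not inflate the score.
    """
    cand_counts = Counter(ngrams(candidate.lower().split(), n))
    ref_counts = Counter(ngrams(reference.lower().split(), n))
    total = sum(cand_counts.values())
    if total == 0:
        return 0.0
    matched = sum(min(count, ref_counts[gram]) for gram, count in cand_counts.items())
    return matched / total

# Illustrative sentences (assumed, not taken from a real dataset).
reference = "the cat sat on the mat"
candidate = "the cat sat on a rug"

p1 = clipped_precision(candidate, reference, n=1)  # 4 of 6 unigrams match
p2 = clipped_precision(candidate, reference, n=2)  # 3 of 5 bigrams match
print(f"unigram precision: {p1:.3f}, bigram precision: {p2:.3f}")
```

The two wrong words ("a", "rug") lower the unigram precision and break the bigrams that contain them, which is how BLEU punishes wrong words.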

ROUGE: n-gram recall

Of the reference's n-grams, what fraction did the model cover? Built for summarization. Punishes missing content.
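The recall view can be sketched the same way. The snippet below is a simplified ROUGE-N recall, assuming whitespace tokenization and one reference. The names are illustrative, and real ROUGE implementations add options such as stemming and also report precision and F-measure.

```python
from collections import Counter

def ngram_counts(text, n):
    """Count the n-grams (as tuples) in a whitespace-tokenized string."""
    tokens = text.lower().split()
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))

def rouge_n_recall(candidate, reference, n=1):
    """Fraction of reference n-grams that the candidate covers."""
    ref = ngram_counts(reference, n)
    cand = ngram_counts(candidate, n)
    total = sum(ref.values())
    if total == 0:
        return 0.0
    # A reference n-gram counts as covered at most as often as the candidate has it.
    overlap = sum(min(count, cand[gram]) for gram, count in ref.items())
    return overlap / total

# Illustrative texts (assumed): the summary is accurate but drops half the content.
reference = "the cat sat on the mat"
short_summary = "the cat sat"

r1 = rouge_n_recall(short_summary, reference, n=1)  # 3 of 6 unigrams covered
r2 = rouge_n_recall(short_summary, reference, n=2)  # 2 of 5 bigrams covered
print(f"ROUGE-1 recall: {r1:.3f}, ROUGE-2 recall: {r2:.3f}")
```

Every word in the short summary is correct, so its precision would be perfect, but recall is low because half of the reference is missing.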

LLM-as-judge: model rates model

Prompt a strong LLM to score or compare answers on helpfulness, correctness, and style. This scales human-like judgment at a fraction of the cost of human raters.

Also seen

Perplexity (how surprised the model is by held-out text), BERTScore (embedding similarity, not exact words), and benchmark suites like MMLU for knowledge.
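As a rough illustration of the "surprise" reading of perplexity, the sketch below computes it from per-token log-probabilities: the exponential of the average negative log-likelihood. The probabilities are made-up values for illustration. In practice they would come from the language model being evaluated on held-out text.

```python
import math

def perplexity(token_logprobs):
    """Perplexity = exp(average negative log-likelihood per token).

    Expects natural-log probabilities, one per token of the held-out text.
    """
    if not token_logprobs:
        raise ValueError("need at least one token log-probability")
    avg_nll = -sum(token_logprobs) / len(token_logprobs)
    return math.exp(avg_nll)

# Hypothetical per-token probabilities (assumed, not real model output).
confident = [math.log(0.5)] * 4   # model gives each token probability 0.5
uncertain = [math.log(0.1)] * 4   # model gives each token probability 0.1

ppl_confident = perplexity(confident)
ppl_uncertain = perplexity(uncertain)
print(f"confident: {ppl_confident:.2f}, uncertain: {ppl_uncertain:.2f}")
```

A perplexity of about 2 means the model was, on average, as unsure as if it were choosing between two equally likely tokens, and about 10 means ten. Lower is better.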

The blind spots

BLEU / ROUGE weaknesses

Reward surface overlap, not meaning
Penalize valid paraphrases
Need reference texts; weak for open-ended chat
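The first two weaknesses can be seen in a small experiment. The sketch below uses a plain unigram F1 as a stand-in for overlap metrics (an assumption for brevity, not BLEU or ROUGE proper) and compares a valid paraphrase with a sentence that contradicts the reference. The sentences are illustrative.

```python
from collections import Counter

def unigram_f1(candidate, reference):
    """Harmonic mean of unigram precision and recall (whitespace tokens)."""
    cand = Counter(candidate.lower().split())
    ref = Counter(reference.lower().split())
    overlap = sum((cand & ref).values())  # multiset intersection
    if overlap == 0:
        return 0.0
    precision = overlap / sum(cand.values())
    recall = overlap / sum(ref.values())
    return 2 * precision * recall / (precision + recall)

reference = "the cat sat on the mat"
paraphrase = "a cat is sitting on a rug"          # same meaning, different words
contradiction = "the cat never sat on the mat"    # opposite meaning, same words

paraphrase_score = unigram_f1(paraphrase, reference)
contradiction_score = unigram_f1(contradiction, reference)
print(f"paraphrase: {paraphrase_score:.3f}, contradiction: {contradiction_score:.3f}")
```

The contradiction scores far higher than the paraphrase because it reuses the reference's words, even though one added word reverses the meaning.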

LLM-as-judge weaknesses

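Biases: tends to prefer longer answers (verbosity bias), answers in its own style (self-preference), and the option shown first (position bias)
Can be inconsistent across runs or prompt wordings, and can be gamed by answers written to please the judge
The judge can be wrong; spot-check its verdicts against human judgment on hard cases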

Best practice

Triangulate: cheap automatic metrics for fast iteration, LLM-as-judge for nuanced comparisons (with bias controls like swapping option order), and human review for the final call. No single number captures "good text".
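The order-swapping control can be sketched as follows. The `judge` argument is a hypothetical callable standing in for a real LLM call, and the two toy judges exist only to make the example runnable. A verdict is accepted only if it survives the swap, and anything else is treated as a tie.

```python
def compare_with_swap(judge, question, answer_a, answer_b):
    """Ask the judge twice, swapping the order of the two answers.

    `judge(question, first, second)` is a hypothetical callable that
    returns "first" or "second". Returns "A", "B", or "tie" when the
    two verdicts disagree (a sign of position bias or a close call).
    """
    verdict_1 = judge(question, answer_a, answer_b)  # A shown first
    verdict_2 = judge(question, answer_b, answer_a)  # B shown first
    winner_1 = "A" if verdict_1 == "first" else "B"
    winner_2 = "B" if verdict_2 == "first" else "A"
    return winner_1 if winner_1 == winner_2 else "tie"

# Toy stand-ins for an LLM judge (assumptions for illustration only).
def position_biased_judge(question, first, second):
    """Always prefers whichever answer is shown first."""
    return "first"

def length_judge(question, first, second):
    """Prefers the longer answer, regardless of position."""
    return "first" if len(first) >= len(second) else "second"

question = "What is the capital of France?"
answer_a = "The capital of France is Paris."
answer_b = "Paris."

biased_result = compare_with_swap(position_biased_judge, question, answer_a, answer_b)
length_result = compare_with_swap(length_judge, question, answer_a, answer_b)
print(biased_result, length_result)
```

The swap exposes the position-biased judge, whose verdicts cancel out into a tie. It does not catch the length-preferring judge, which consistently picks the longer answer, so other biases still need their own controls and human spot checks.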