Evaluating LLMs — BLEU, ROUGE, LLM-as-Judge


How do you grade an open-ended answer?

For classification, accuracy and F1 work because each example has one right label. Generated text has many acceptable answers: "The cat sat" and "A cat is sitting" can both be fine, so there is no single string to compare against. That is what makes evaluation hard.

There are three families of approaches: overlap metrics (BLEU, ROUGE) compare the output against reference texts, LLM-as-judge asks a strong model to rate quality, and human evaluation remains the gold standard.

Compare the approaches

BLEU counts matching n-grams (precision), ROUGE counts covered n-grams (recall), and an LLM judge scores quality directly.

The metrics

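BLEU: n-gram precision

Of the n-grams the model produced, how many appear in the reference? BLEU was built for machine translation and penalizes words that are not in the reference. The full metric combines precisions for several n-gram sizes and adds a brevity penalty, so a very short output cannot score well on precision alone.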
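
The precision idea can be sketched in a few lines of Python. This is a minimal illustration, assuming whitespace tokenization and a single reference. The names ngrams and ngram_precision are made up for this example, and the code computes only clipped n-gram precision, not the full BLEU score (which also combines several n-gram sizes and applies a brevity penalty).

```python
from collections import Counter

def ngrams(tokens, n):
    """Count the n-grams in a list of tokens."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))

def ngram_precision(candidate, reference, n=1):
    """Clipped precision: matched candidate n-grams / all candidate n-grams."""
    cand = ngrams(candidate.lower().split(), n)
    ref = ngrams(reference.lower().split(), n)
    total = sum(cand.values())
    if total == 0:
        return 0.0
    # Clip each count so a repeated word cannot match more often
    # than it occurs in the reference.
    matched = sum(min(count, ref[gram]) for gram, count in cand.items())
    return matched / total

reference = "the cat sat on the mat"
print(ngram_precision("the cat sat", reference))       # 1.0
print(ngram_precision("the the the", reference))       # 2/3, "the" is clipped to 2
print(ngram_precision("a cat is sitting", reference))  # 0.25
```

Note that "the cat sat" scores a perfect 1.0 even though it drops half of the reference, and the paraphrase "a cat is sitting" scores only 0.25. Precision alone says nothing about missing content, and exact matching gives no credit for rewording.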

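ROUGE: n-gram recall

Of the reference's n-grams, how many did the model cover? ROUGE was built for summarization and penalizes missing content. This describes the ROUGE-N variant; ROUGE-L scores the longest common subsequence instead.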
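
The recall view is the same count with the denominator flipped. The sketch below assumes whitespace tokenization and a single reference. The name rouge_n_recall is hypothetical, and the code covers only the recall component of ROUGE-N, not the precision or F-measure that ROUGE implementations typically report as well.

```python
from collections import Counter

def ngrams(tokens, n):
    """Count the n-grams in a list of tokens."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))

def rouge_n_recall(candidate, reference, n=1):
    """Recall: reference n-grams covered by the candidate / all reference n-grams."""
    cand = ngrams(candidate.lower().split(), n)
    ref = ngrams(reference.lower().split(), n)
    total = sum(ref.values())
    if total == 0:
        return 0.0
    # A reference n-gram counts as covered at most as often as the candidate has it.
    covered = sum(min(count, cand[gram]) for gram, count in ref.items())
    return covered / total

reference = "the cat sat on the mat"
print(rouge_n_recall("the cat sat", reference))                   # 0.5
print(rouge_n_recall("the cat sat on the mat today", reference))  # 1.0
```

The short output that would get perfect unigram precision covers only half of the reference here, while padding the output with extra words costs nothing in recall. That asymmetry is why the two metrics are usually read together.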

LLM-as-judge: a model rates a model

Prompt a strong LLM to score or compare answers on helpfulness, correctness, and style. This scales human-like judgment cheaply.
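
In practice, this comes down to building a grading prompt and parsing the reply. The sketch below is one possible shape, not a standard API. The prompt wording, the 1 to 5 scale, the "Score:" reply format, and the names build_judge_prompt, parse_score, and fake_judge are all assumptions made for illustration. fake_judge stands in for a real model call so that the example runs offline.

```python
import re

CRITERIA = ("helpfulness", "correctness", "style")

def build_judge_prompt(question, answer, criteria=CRITERIA):
    """Assemble a grading prompt that asks for a machine-readable score line."""
    return (
        "You are grading an answer.\n"
        f"Question: {question}\n"
        f"Answer: {answer}\n"
        f"Rate the answer from 1 to 5 on: {', '.join(criteria)}.\n"
        "Reply with a line of the form 'Score: <integer>'."
    )

def parse_score(reply, low=1, high=5):
    """Extract the score; return None if it is missing or out of range."""
    match = re.search(r"Score:\s*(\d+)", reply)
    if match is None:
        return None
    score = int(match.group(1))
    return score if low <= score <= high else None

def fake_judge(prompt):
    """Hypothetical stand-in for a real LLM call; always returns the same reply."""
    return "The answer is accurate but terse.\nScore: 4"

prompt = build_judge_prompt("What is the capital of France?", "Paris.")
print(parse_score(fake_judge(prompt)))  # 4
```

Returning None for an unparseable or out-of-range reply, rather than guessing, keeps malformed judge output from silently entering the averages.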

Also seen

Perplexity (how surprised the model is by held-out text), BERTScore (embedding similarity, not exact words), and benchmark suites like MMLU for knowledge.
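
Of these, perplexity is the easiest to compute by hand: it is the exponential of the average negative log-likelihood per token. The sketch below assumes you already have per-token natural-log probabilities from the model on held-out text. The function name and the sample values are made up for illustration.

```python
import math

def perplexity(token_logprobs):
    """exp(average negative log-likelihood per token), natural log."""
    if not token_logprobs:
        raise ValueError("need at least one token log-probability")
    avg_nll = -sum(token_logprobs) / len(token_logprobs)
    return math.exp(avg_nll)

# A model that gives every held-out token probability 0.5 is "choosing
# between about 2 options" per token; probability 0.1 means about 10.
confident = [math.log(0.5)] * 4
unsure = [math.log(0.1)] * 4
print(round(perplexity(confident), 6))  # 2.0
print(round(perplexity(unsure), 6))     # 10.0
```

Lower is better: the less surprised the model is by the held-out text, the lower the perplexity.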

The blind spots

BLEU / ROUGE weaknesses
  • Reward surface overlap, not meaning
  • Penalize valid paraphrases
  • Need reference texts; weak for open-ended chat

LLM-as-judge weaknesses
  • Biases: prefers longer answers, its own style, and whichever option is shown first
  • Can be inconsistent or gamed
  • The judge can be wrong, so verify it on hard cases

Best practice

Triangulate: cheap automatic metrics for fast iteration, LLM-as-judge for nuanced comparisons (with bias controls like swapping option order), and human review for the final call. No single number captures "good text".
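
The order-swapping control mentioned above can be sketched as follows. This is a minimal illustration under stated assumptions: judge is any callable that takes (question, first_answer, second_answer) and returns the string "first" or "second". That interface, the name compare_with_swap, and both toy judges are hypothetical.

```python
def compare_with_swap(judge, question, answer_a, answer_b):
    """Query the judge in both orders and accept only a consistent verdict."""
    verdict_ab = judge(question, answer_a, answer_b)  # A is shown first
    verdict_ba = judge(question, answer_b, answer_a)  # B is shown first
    if verdict_ab == "first" and verdict_ba == "second":
        return "A"
    if verdict_ab == "second" and verdict_ba == "first":
        return "B"
    return "tie"  # the verdict followed the position, not the content

def position_biased_judge(question, first, second):
    """Toy judge that always prefers whatever is shown first."""
    return "first"

def keyword_judge(question, first, second):
    """Toy judge that looks at content: prefers the answer mentioning Paris."""
    return "first" if "Paris" in first else "second"

q = "What is the capital of France?"
print(compare_with_swap(position_biased_judge, q, "Paris", "Lyon"))  # tie
print(compare_with_swap(keyword_judge, q, "Paris", "Lyon"))          # A
```

Treating an inconsistent pair of verdicts as a tie is one design choice among several; such cases could instead be routed to human review.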