Intellectual
AI & Enterprise AI | 2 April 2024 | 8 min read

LLM Evaluation — The Engineering Discipline Most Teams Skip

Without evaluation, every change to an LLM system is a guess. Teams that build evaluation discipline ship with confidence; teams that skip it operate on intuition until production incidents force the issue.

The single most common gap in enterprise LLM deployments we have walked into for remediation is the absence of evaluation discipline. The team has a working system, deployed somewhere between a fortnight and a year ago. Changes are made periodically — to prompts, to models, to retrieval logic. Whether each change improves or regresses the system is not measured. The team operates on intuition until a production incident forces a reckoning.

This is a practitioner view of LLM evaluation as an engineering discipline: what it consists of, what it costs to build, and why it pays for itself within a quarter of being in place.

What evaluation actually is

For an LLM system, evaluation is the practice of:

  1. Curating a representative test set of inputs paired with known good outputs or with quality criteria.
  2. Running the system against the test set to produce outputs.
  3. Scoring those outputs against the criteria, automatically where possible, with human review where necessary.
  4. Tracking the scores over time to detect improvements and regressions.
  5. Gating changes on evaluation results — a regression blocks a change from reaching production.

This is the same shape as software test discipline. The differences are in the test set (curated examples instead of synthesised tests) and the scoring (often probabilistic instead of binary).
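The five steps above can be sketched as a small harness. This is a minimal illustration under stated assumptions, not a prescribed implementation: `EvalCase`, `run_eval`, `gate`, and the stand-in `toy_system` are hypothetical names, and case-insensitive exact match is the simplest possible scoring criterion.

```python
from dataclasses import dataclass
from typing import Callable, List


@dataclass(frozen=True)
class EvalCase:
    case_id: str
    prompt: str
    expected: str  # known good output (step 1)


def run_eval(system: Callable[[str], str], cases: List[EvalCase]) -> float:
    """Run the system on every case (step 2) and return the pass rate (step 3)."""
    if not cases:
        raise ValueError("evaluation set is empty")
    passed = sum(
        1
        for c in cases
        if system(c.prompt).strip().lower() == c.expected.strip().lower()
    )
    return passed / len(cases)


def gate(candidate: float, baseline: float, tolerance: float = 0.0) -> bool:
    """Step 5: allow the change only if the score has not regressed."""
    return candidate >= baseline - tolerance


# Hypothetical stand-in for the real LLM system under test.
def toy_system(prompt: str) -> str:
    answers = {"Capital of France?": "Paris", "2 + 2?": "4"}
    return answers.get(prompt, "unknown")


CASES = [
    EvalCase("c1", "Capital of France?", "Paris"),
    EvalCase("c2", "2 + 2?", "4"),
    EvalCase("c3", "Capital of Peru?", "Lima"),
]

score = run_eval(toy_system, CASES)
history = [("v1", score)]  # step 4: keep one score per version to track over time
```

In a real system the exact-match comparison would be replaced by whichever scoring method fits the dimension, and `history` would live in persistent storage rather than a list.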

What evaluation is not

A few things that look like evaluation but aren't:

  • A few impressive demo examples. Whether the system answers three good questions correctly says little about whether it answers a thousand random questions correctly.
  • User feedback alone. User satisfaction is one signal but it lags, is biased toward easy questions, and doesn't catch silent failures.
  • Vibes-based assessment. "The output looks good." This catches obvious problems and misses subtle ones, including the most expensive ones.
  • Model benchmark scores. Public benchmarks measure model capability in general; they don't measure your system's quality on your workload.

Each of these has a place. None substitutes for a curated evaluation pipeline.

The evaluation set

The most consequential decision is what is in the evaluation set.

The criteria for a useful set:

  • Representative. The distribution of cases in the set mirrors the distribution in production. Easy cases, hard cases, edge cases, in roughly the proportions they appear.
  • Curated by people who know the domain. A test set generated by an LLM may not catch the cases that matter. Domain experts pick examples that exercise the system meaningfully.
  • Includes known failure modes. The cases that previously broke production make it into the set. Once a regression has happened, the regression test prevents it from happening again.
  • Versioned. The set changes over time. Track the version. Compare scores across versions thoughtfully.
  • Manageable in size. Big enough to be representative; small enough to run frequently and to review when needed. A few hundred cases for most systems; a few thousand for higher-stakes systems.

The set is built incrementally. The first version is small. Production traffic and incidents feed new cases. Over a year, the set matures into a reliable signal.
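One way to make the set versioned and auditable is to store cases as structured records and derive the version from their content. The sketch below is an assumption about how this could look, not a standard: the field names (`category`, `source`) and the helper names are hypothetical.

```python
import hashlib
import json
from collections import Counter
from dataclasses import asdict, dataclass
from typing import Dict, List


@dataclass(frozen=True)
class EvalCase:
    case_id: str
    prompt: str
    criteria: str
    category: str  # e.g. "easy", "hard", "edge"
    source: str    # e.g. "expert", "incident", "production-sample"


def set_version(cases: List[EvalCase]) -> str:
    """Content hash of the set: any added, removed, or edited case changes it."""
    payload = json.dumps(
        sorted((asdict(c) for c in cases), key=lambda d: d["case_id"]),
        sort_keys=True,
    )
    return hashlib.sha256(payload.encode("utf-8")).hexdigest()[:12]


def category_mix(cases: List[EvalCase]) -> Dict[str, float]:
    """Share of each category, for comparison with the production distribution."""
    counts = Counter(c.category for c in cases)
    return {name: n / len(cases) for name, n in counts.items()}


# Made-up illustration data.
CASES = [
    EvalCase("c1", "Summarise invoice 17", "mentions total and due date", "easy", "expert"),
    EvalCase("c2", "Summarise a scanned invoice", "does not invent a total", "hard", "incident"),
]
```

Because the version is computed from content, a report that records `set_version(CASES)` next to its scores makes it clear when two runs were scored against different sets.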

What gets scored

For an LLM system, the things worth scoring depend on the workload. Common dimensions:

  • Correctness — does the output answer the question correctly? Where the answer has a known ground truth, this is binary. Where it doesn't, it requires a more nuanced score.
  • Faithfulness — for retrieval-augmented systems, does the output stay grounded in the retrieved context? Hallucinations get caught here.
  • Completeness — does the output address all parts of the question?
  • Format compliance — for structured outputs, does the response conform to the expected schema?
  • Safety — does the output avoid producing content that violates policy?
  • Citation accuracy — for systems that cite sources, do the citations actually support the claims?

Each dimension is scored separately. The composite picture is more useful than a single overall score.
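A small sketch of why the per-dimension view matters. The scores are made-up illustration data and `by_dimension` is a hypothetical helper; the point is only that an acceptable-looking average can hide a weak dimension.

```python
from collections import defaultdict
from statistics import mean
from typing import Dict, List

# One dict per evaluated case: dimension name -> score in [0, 1].
results: List[Dict[str, float]] = [
    {"correctness": 1.0, "faithfulness": 1.0, "format": 1.0},
    {"correctness": 1.0, "faithfulness": 0.0, "format": 1.0},
    {"correctness": 0.0, "faithfulness": 1.0, "format": 1.0},
    {"correctness": 1.0, "faithfulness": 0.0, "format": 1.0},
]


def by_dimension(rows: List[Dict[str, float]]) -> Dict[str, float]:
    """Average each dimension separately across all cases."""
    buckets = defaultdict(list)
    for row in rows:
        for dim, value in row.items():
            buckets[dim].append(value)
    return {dim: mean(values) for dim, values in buckets.items()}


report = by_dimension(results)
overall = mean(report.values())  # single composite number, for contrast
```

Here the composite is 0.75, while the breakdown shows faithfulness at 0.5 and format at 1.0, which points at a specific problem to fix.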

Scoring methods

Three approaches, often combined:

Programmatic scoring

For dimensions that can be checked deterministically — schema compliance, presence of specific entities, numerical accuracy on aggregations, citation existence — programmatic checks are cheap, fast, and reproducible. Use them wherever possible.
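Two of the checks named above, schema compliance and citation existence, can be sketched with the standard library alone. The function names and the `[doc-id]` citation format are assumptions made for illustration; a production system would likely use its own schema validator and citation syntax.

```python
import json
import re
from typing import Dict, Set


def check_schema(raw: str, required: Dict[str, type]) -> bool:
    """True if raw parses as a JSON object with every required key of the right type."""
    try:
        data = json.loads(raw)
    except json.JSONDecodeError:
        return False
    if not isinstance(data, dict):
        return False
    return all(isinstance(data.get(key), kind) for key, kind in required.items())


def check_citations_exist(answer: str, retrieved_ids: Set[str]) -> bool:
    """True if the answer cites at least one source and every cited id was retrieved."""
    cited = set(re.findall(r"\[([A-Za-z0-9_-]+)\]", answer))
    return bool(cited) and cited <= retrieved_ids
```

Note that citation existence only confirms the cited document was in the retrieved context. Whether the citation supports the claim is a separate question that deterministic checks generally cannot answer.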

LLM-as-judge

A second LLM scores the output against the criteria. The judge LLM is given the input, the expected criteria, and the output, and asked to score.

This works for dimensions where deterministic checks don't apply: was the answer faithful to the context, was the tone appropriate, did the output address the intent of the question. The judge can be the same model as the one being evaluated, but LLM judges have been observed to favour their own outputs, so check any same-model setup against human scores or use a different model as the judge.

What to know:

  • LLM-as-judge is reliable for relative comparisons. Less reliable for absolute scores.
  • The judge's instructions matter as much as the judge model.
  • Calibration matters. Run the judge on a sample with human scores; check alignment.
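The calibration step can be made concrete by comparing judge labels with human labels on the same sample. The sketch below computes raw agreement and Cohen's kappa, which corrects agreement for chance. The function names and the pass/fail labels are illustrative assumptions, and the data is made up.

```python
from typing import List


def agreement(judge: List[int], human: List[int]) -> float:
    """Fraction of cases where judge and human gave the same label."""
    if not judge or len(judge) != len(human):
        raise ValueError("need two non-empty label lists of equal length")
    return sum(j == h for j, h in zip(judge, human)) / len(judge)


def cohens_kappa(judge: List[int], human: List[int]) -> float:
    """Agreement corrected for the agreement expected by chance."""
    n = len(judge)
    p_observed = agreement(judge, human)
    labels = set(judge) | set(human)
    p_expected = sum((judge.count(l) / n) * (human.count(l) / n) for l in labels)
    if p_expected == 1.0:
        return 1.0  # both raters used a single identical label throughout
    return (p_observed - p_expected) / (1.0 - p_expected)


# 1 = pass, 0 = fail, on the same ten cases.
judge_labels = [1, 1, 1, 0, 1, 0, 1, 1, 0, 1]
human_labels = [1, 1, 0, 0, 1, 0, 1, 0, 0, 1]

raw = agreement(judge_labels, human_labels)
kappa = cohens_kappa(judge_labels, human_labels)
```

On this sample the raw agreement is 0.8 and kappa is about 0.6. What threshold counts as acceptable is a team decision; the useful habit is recomputing these numbers whenever the judge prompt or judge model changes.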

Human scoring

For the highest-stakes dimensions or for calibrating the LLM-as-judge, humans score. Expensive but irreplaceable for some criteria.

The pattern: humans score a sample. The sample calibrates the automated scoring. Most cases use automated scoring; a fraction is human-reviewed to keep the automation honest.
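One way to pick the human-reviewed fraction is to hash each case id, so the same cases are selected on every run and reviewers can compare like with like. This is a sketch under assumptions: `needs_human_review` and the `salt` value are hypothetical, and changing the salt rotates the sample.

```python
import hashlib


def needs_human_review(case_id: str, fraction: float, salt: str = "review-round-1") -> bool:
    """Deterministically select roughly `fraction` of cases by hashing the case id."""
    if not 0.0 <= fraction <= 1.0:
        raise ValueError("fraction must be between 0 and 1")
    digest = hashlib.sha256(f"{salt}:{case_id}".encode("utf-8")).digest()
    # Map the first 8 bytes of the hash to a number in [0, 1).
    bucket = int.from_bytes(digest[:8], "big") / 2**64
    return bucket < fraction


review_queue = [cid for cid in (f"case-{i}" for i in range(1000)) if needs_human_review(cid, 0.1)]
```

Random sampling with a fixed seed would also work; hashing has the advantage that adding new cases to the set does not change which existing cases are selected.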

When to evaluate

The triggering events:

  • Every prompt change. Any change to a system prompt, a retrieval prompt, or an extraction prompt triggers an evaluation run.
  • Every model upgrade. A move from GPT-4 to GPT-4 Turbo, or from Claude 2 to Claude 3, triggers an evaluation run. Model upgrades are notorious for behaviour changes.
  • Every retrieval logic change. Changes to chunking, embeddings, reranking — all touch system quality.
  • Periodically. Even with no changes, evaluation runs weekly to catch drift from model provider-side changes.
  • On incident. Whenever a production incident reveals a failure mode, that failure mode goes into the evaluation set. Re-runs then confirm the fix and guard against regression.

The cadence matters. A team that evaluates only on major releases catches problems weeks late. A team that evaluates continuously catches problems hours late.

How to run it operationally

The mechanics:

  • A test harness. Code that runs the system against the evaluation set, captures outputs, scores them, produces a report.
  • A regression dashboard. Scores over time, by version, by dimension. Drops surface immediately.
  • CI integration. Pull requests trigger evaluation. Significant regressions block the merge.
  • Cost accounting. Evaluation is not free. Token costs add up. Budget accordingly; sample where the set is large.
  • Result archive. Past evaluation runs preserved. Comparisons over months become possible.

The infrastructure is conventional. Most teams build it themselves; tooling is emerging but still immature.
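The CI integration piece can be as small as a script that compares the candidate run with a stored baseline and returns a non-zero exit code on regression. A minimal sketch, assuming per-dimension scores are available as dictionaries; the function names and the default tolerance of 0.02 are illustrative choices, not recommendations.

```python
from typing import Dict, List


def find_regressions(
    baseline: Dict[str, float],
    candidate: Dict[str, float],
    tolerance: float = 0.02,
) -> List[str]:
    """List every dimension where the candidate falls below baseline minus tolerance."""
    problems = []
    for dim, base in sorted(baseline.items()):
        cand = candidate.get(dim)
        if cand is None:
            problems.append(f"{dim}: missing from candidate run")
        elif cand < base - tolerance:
            problems.append(f"{dim}: {base:.2f} -> {cand:.2f}")
    return problems


def ci_exit_code(baseline: Dict[str, float], candidate: Dict[str, float]) -> int:
    """Return 0 to allow the merge, 1 to block it."""
    problems = find_regressions(baseline, candidate)
    for line in problems:
        print("REGRESSION", line)
    return 1 if problems else 0
```

A CI job would call `sys.exit(ci_exit_code(baseline, candidate))` after loading both score files. The tolerance exists because scores from probabilistic scoring fluctuate between runs; setting it too wide defeats the gate.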

Where evaluation tends to under-perform

A few patterns where teams build evaluation and still struggle:

The set doesn't include the failure modes. Easy cases pass; the hard cases that produce production incidents aren't in the set. Add them every time a new failure mode appears.

Scoring is too generous. The criteria allow incorrect outputs to score well. Tighten the criteria until they catch the kind of errors that matter.

Evaluation runs but nobody acts on the results. Scores drop; the team notes it and continues. The discipline is in acting on regressions, not in recording them.

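The set is too small to detect quality differences. A few dozen cases produce a noisy signal: with 30 cases, a single flipped result moves the pass rate by more than three percentage points. Larger sets, with careful curation, produce a reliable signal.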
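The effect of set size can be quantified with a confidence interval on the pass rate. The sketch below uses the Wilson score interval, one common choice for binomial proportions; it treats cases as independent pass/fail trials, which is a simplifying assumption, and `wilson_interval` is a hypothetical helper name.

```python
import math
from typing import Tuple


def wilson_interval(passed: int, total: int, z: float = 1.96) -> Tuple[float, float]:
    """Wilson score interval for a pass rate (z = 1.96 gives roughly 95% coverage)."""
    if total <= 0 or not 0 <= passed <= total:
        raise ValueError("need 0 <= passed <= total and total > 0")
    p = passed / total
    denom = 1 + z * z / total
    centre = (p + z * z / (2 * total)) / denom
    half = z * math.sqrt(p * (1 - p) / total + z * z / (4 * total * total)) / denom
    return centre - half, centre + half


small = wilson_interval(24, 30)    # 80% pass rate on 30 cases
large = wilson_interval(400, 500)  # 80% pass rate on 500 cases
```

With the same 80% pass rate, the interval on 30 cases spans roughly 0.63 to 0.90, while on 500 cases it spans roughly 0.76 to 0.83. A five-point change is invisible in the first and detectable in the second.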

Evaluation scope creep. Every dimension that could be scored is scored. The signal-to-noise drops. Focus on dimensions that determine production fitness.

What we keep seeing

Patterns in teams that have built evaluation discipline:

They ship faster. Counter-intuitive but consistent. The team that knows whether a change improves or regresses ships with confidence; the team without that knowledge ships slowly and cautiously because the cost of regression is unbounded.

They handle model upgrades cleanly. A new model from a provider is treated as a planned migration. Evaluation runs; differences are characterised; the decision is informed.

Their production incidents shrink. Each production incident adds to the evaluation set; the next iteration prevents the same incident. Over months, the failure rate drops.

They communicate more effectively with stakeholders. Scores over time are concrete. Stakeholders trust progress they can see.

What we recommend

For an enterprise team building LLM systems in 2024:

  1. Build the evaluation harness in the first month. Skip it and you will rebuild it later under incident pressure.
  2. Start the evaluation set small with cases that matter most. Grow it as the system grows.
  3. Use programmatic scoring where deterministic; LLM-as-judge where not; human scoring for calibration and high stakes.
  4. Run evaluation on every change. If scores regress, the change doesn't ship.
  5. Calibrate automated scoring against human judgment periodically.
  6. Treat the evaluation set as a living artifact. Versioned, reviewed, owned.

Evaluation discipline is the difference between an LLM system that improves over time and one that drifts. The teams that build it ship with confidence; the teams that skip it ship on hope.
