LLM Evals: From Production Monitoring to Regression Tests

Key Takeaways
- The bottleneck in LLM evals isn't which scoring technique to use (LLM-as-judge vs. deterministic checks vs. human annotation), but rather building the operational workflows that turn production failures into reproducible test cases.
- Offline and online evals answer different questions because they operate on different data. Offline evals validate known scenarios against reference outputs, while online evals monitor production traces for quality patterns and safety without ground truth.
- For AI agents, the relevant unit of evaluation is the full conversation thread. Task completion and user satisfaction emerge across turns, not from individual responses.
- LangSmith creates a continuous improvement cycle: monitor production usage, identify problematic interactions, convert them into test cases, run evals to verify your fixes, deploy the improvements, and monitor production again to confirm the issue is resolved.
Most discussion around LLM evals focuses on scoring techniques. Teams debate LLM-as-judge versus deterministic checks versus human annotation. The specific technique matters less than building a system that closes the loop between production monitoring and regression testing.
Teams often focus on judge prompts and rubrics because that work is highly visible. However, the harder problems lie in the operational mechanics of turning production failures into reproducible test cases. An effective evaluation framework connects these methods to a continuous feedback loop where production traces become regression datasets.
Why LLM evals differ from model benchmarking
Standardized benchmarks measure raw model capabilities against fixed test sets. They answer questions like "How well does this model reason about math?" or "Can it follow complex instructions?" These benchmarks serve model developers and help practitioners compare foundation models before selection.
Evaluating AI agents requires a different approach. Large language models power a wide variety of applications, and each has distinct quality requirements. A RAG-powered application, a customer support agent, and a coding assistant each has its own definition of "good" that no benchmark can capture.
The retrieval step needs to surface relevant documents, and the generation step must synthesize them correctly for your domain. AI agents also need to call the right tools in the correct sequence. Each of these dimensions requires custom evaluation criteria.
Teams often start by searching for a universal quality metric before shifting toward measuring what matters for their specific use case. A legal research assistant and a customer support bot might use the same underlying model, but success looks completely different. The assistant needs precise citations; the bot needs conversational resolution.
Evaluating AI agents brings challenges that benchmark leaderboards miss: non-determinism, open-ended outputs, and new risk types.
What makes evaluating AI agents difficult
Building AI agents introduces three novel and overlapping challenges that traditional software testing doesn't address.
Non-deterministic outputs make reproducibility difficult
The same prompt can yield different responses across runs. Temperature settings, context length, and model updates all introduce variance. A test that passes today might fail tomorrow with identical inputs, not because your application broke, but because the model phrased its output differently.
Open-ended tasks lack single correct answers
When you ask a model to summarize a document, explain a concept, or draft a response, there's no reference string to match against. Summarization quality depends on whether the model captured key points, maintained accuracy, and matched the appropriate level of detail. "Correct" becomes a judgment call about tone, completeness, accuracy, and relevance, all at once. Traditional unit testing assumes you can specify expected outputs, but AI agents rarely offer that luxury.
New risk types require specialized testing
Hallucinations, jailbreaks, toxicity, and data leaks don't map to conventional error categories. A model might produce fluent, confident text that lacks factual accuracy. It might respond helpfully to an adversarial prompt it should have refused. These failure modes are hard to predict and need evaluation approaches designed for probabilistic systems. Many workflows implement guardrails to catch safety violations before responses reach users.
These three challenges explain why production AI agents require multiple evaluation approaches working together. No single method can address non-determinism, open-ended outputs, and novel risk types simultaneously, which is why successful AI agent development relies on a framework that combines deterministic checks, LLM-based judges, and human annotation into a complete quality assessment system.
Choosing the right evaluator for the failure mode
The right evaluator choice depends on what you're trying to catch and whether you have reference outputs to compare against.
Deterministic checks for format and safety
Deterministic checks catch format violations reliably. If your application should return JSON, a Python code evaluator can verify valid deserialization and perform validation on every response. If certain topics are off-limits, a classifier can flag them before they reach users. These checks are fast, cheap, and should run on every trace.
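As a minimal sketch, a deterministic format check can be a plain function that runs on every response. The required field names and the score-dict shape below are illustrative assumptions, not a specific framework's API:

```python
import json

def check_json_response(output: str, required_fields=("answer", "sources")) -> dict:
    """Deterministic evaluator: verify the response is valid JSON
    containing the required fields. Returns a score dict (key, score,
    comment); the exact shape is illustrative."""
    try:
        parsed = json.loads(output)
    except json.JSONDecodeError as e:
        return {"key": "valid_json", "score": 0, "comment": f"parse error: {e}"}
    missing = [f for f in required_fields if f not in parsed]
    if missing:
        return {"key": "valid_json", "score": 0, "comment": f"missing fields: {missing}"}
    return {"key": "valid_json", "score": 1, "comment": "ok"}
```

Because the check is pure and fast, it can gate every trace with negligible cost.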
LLM judges for qualitative assessment
LLM judges scale qualitative assessment where ground truth is sparse. You define criteria like "Is the response concise?", "Does it contain PII?", or "Is the tone professional?", and an LLM grades responses against your rubric. LangSmith supports custom evaluators that can score thousands of traces automatically, providing statistically significant feedback on system performance. As a tradeoff, judge prompts need tuning and periodic validation against human judgment.
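The wiring of an LLM-as-judge evaluator might look like the sketch below. The model call is deliberately abstracted as any `str -> str` callable so the example stays client-agnostic; the prompt wording and the PASS/FAIL protocol are assumptions you would tune for your own rubric:

```python
JUDGE_PROMPT = """You are grading an assistant response against a rubric.
Criterion: {criterion}
Response to grade:
{response}
Answer with exactly one word, PASS or FAIL, then a short reason."""

def build_judge_prompt(criterion: str, response: str) -> str:
    return JUDGE_PROMPT.format(criterion=criterion, response=response)

def parse_verdict(judge_output: str) -> bool:
    """Map the judge's free-text reply onto a boolean score."""
    return judge_output.strip().upper().startswith("PASS")

def judge(criterion: str, response: str, call_model) -> dict:
    """call_model is any function str -> str, e.g. a thin wrapper
    around your LLM client of choice."""
    verdict = call_model(build_judge_prompt(criterion, response))
    return {"key": criterion, "score": int(parse_verdict(verdict))}
```

Constraining the judge to a rigid output format (one word, then a reason) is what makes its verdicts parseable at scale; validating those verdicts against human labels is the periodic calibration step the text mentions.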
Human annotation for calibration and edge cases
Human annotation via Annotation Queues handles cases that require domain expertise. For specialized domains like law or medicine, expected output quality requires subject matter expertise that automated methods can't replicate. Annotation Queues streamline the workflow for clinicians, lawyers, analysts, product managers, and other non-engineering roles to review, label, and correct complex traces. This human feedback then becomes the basis for golden datasets that benchmark automated evaluators.
Deploying evaluators in combination
Use deterministic checks for format, LLM-as-judge for qualitative assessment at scale, and human annotation for calibration and edge cases. Most production systems need all three.
Knowing which evaluator to use is only half the problem. You also need to know when to run them, which depends on whether you're testing before deployment or monitoring in production.
Offline versus online evals
The decisive difference between offline and online evals is whether you have reference outputs.
Offline evals validate known scenarios
Offline evals run against datasets with curated test cases and expected answers. You control the examples and can check correctness against references because you've defined what correct looks like for each input. Regression testing happens here. You build a dataset of known scenarios, run your application against it, and verify that changes don't break existing behavior. You can compare model outputs against ground truth, which enables evaluation metrics that require reference answers.
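The regression loop described above can be sketched as a small harness: run the application over curated (input, reference) pairs and report a pass rate. The dataset shape and the exact-match metric are simplifying assumptions; real systems would plug in richer metrics:

```python
def exact_match(output: str, reference: str) -> bool:
    return output.strip().lower() == reference.strip().lower()

def run_regression(app, dataset, metric=exact_match):
    """Run `app` (any callable str -> str) over a list of
    {"input", "reference"} dicts and report the pass rate."""
    results = []
    for example in dataset:
        output = app(example["input"])
        results.append({
            "input": example["input"],
            "output": output,
            "passed": metric(output, example["reference"]),
        })
    pass_rate = sum(r["passed"] for r in results) / len(results)
    return pass_rate, results
```

Gating a release on `pass_rate` not dropping below the previous baseline is the simplest form of the regression testing the section describes.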
Online evals monitor production behavior
Online evals target production traces, which represent real user interactions without reference outputs. You don't control the examples, and users send whatever they want. Without expected answers, you can't check correctness in the traditional sense. Instead, online evals focus on quality patterns, safety, and real-world behavior. You ask questions like "Is this response helpful?" or "Did the agent stay on topic?" rather than "Does this match the reference?"
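A reference-free online eval might look like the sketch below: deterministically sample a fraction of production traces, then run checks that need no ground truth. The sampling rate, the specific checks, and their thresholds are all illustrative assumptions:

```python
import hashlib

def should_sample(trace_id: str, rate: float = 0.1) -> bool:
    """Deterministically sample a fraction of production traces.
    Hashing the trace id keeps the decision stable across
    reprocessing runs, unlike random sampling."""
    bucket = int(hashlib.sha256(trace_id.encode()).hexdigest(), 16) % 10_000
    return bucket < rate * 10_000

def reference_free_checks(response: str) -> dict:
    """Quality-pattern checks that need no reference answer
    (the criteria here are examples, not a fixed set)."""
    return {
        "non_empty": len(response.strip()) > 0,
        "within_length": len(response) < 4000,
        "no_apology_loop": response.lower().count("i apologize") <= 1,
    }
```

Note that none of these checks asks "is this correct?"; they ask whether the response fits the quality and safety patterns you expect from real traffic.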
Two common failure modes
Conflating the two leads to predictable problems.
- Over-engineering pre-deployment tests: Workflows often try to anticipate every production scenario in advance, building elaborate test suites for situations that only emerge from real user interactions.
- Measuring the unmeasurable in production: Production monitoring sometimes tries to assess correctness when no ground truth exists, asking "Is this right?" when the only valid questions are "Is this safe?" and "Is this coherent?"
LangSmith supports both evaluation types as complementary parts of a complete quality system. The platform combines observability with evaluation tooling to provide visibility across your agent's entire lifecycle. Offline evals validate known scenarios before deployment. Online evals run on sampled production traffic in real-time, detecting drift or quality degradation as it happens.
The real power comes from connecting them. Online monitoring surfaces issues that become offline test cases, and offline validation confirms fixes before they ship.
How production failures become regression tests
Most workflows follow a predictable pattern when handling production failures: A user reports a problem, an engineer investigates, finds the failed trace, fixes the issue, and ships. Three months later, a similar prompt triggers the same failure. The regression wasn't prevented because the fix was local and test coverage remained unchanged.
With LangSmith you can close this loop by adding a problematic trace identified in production to a dataset with a single click, and that dataset becomes the ground truth for regression testing. Once you fix a bug, it stays fixed. This workflow connects debugging, monitoring, and testing without manually reconstructing failing scenarios.
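The one-click workflow lives in the LangSmith UI, but the underlying idea is simple enough to sketch in plain Python. The trace and example field names below are assumptions for illustration, not LangSmith's schema; the key move is that a human-corrected output becomes the reference answer:

```python
def trace_to_example(trace: dict, corrected_output: str) -> dict:
    """Convert a failed production trace into a regression-dataset
    example. `trace` carries the fields a trace record typically has
    (names assumed); the corrected output, often supplied by a human
    reviewer, becomes the reference answer."""
    return {
        "inputs": {"question": trace["input"]},
        "outputs": {"answer": corrected_output},
        "metadata": {
            "source_trace_id": trace["id"],
            "original_output": trace["output"],  # keep the failure for context
        },
    }
```

Keeping the failing output in metadata preserves the debugging context, so future regressions on this example can be compared against the original failure mode.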
The data flywheel in practice
LangSmith’s data flywheel drives a continuous feedback loop for teams building agents:
- Traces flow into LangSmith from production
- Insights Agent surfaces usage pattern insights and potential issues
- Those insights inform datasets for testing
- Evals run against datasets to validate behavior
- Improvements ship to users
- New traces verify the changes in production
This cycle lets you iterate quickly and optimize agent quality over time. LLM evals become a continuous discipline, rather than a one-time check.
Evaluating AI agents across conversation threads
When you evaluate AI agents, complexity increases because the relevant unit is usually the full conversation thread. User sentiment and task completion emerge across turns rather than from individual outputs.
Why single-turn metrics fall short
When your customer support chatbot handles a multi-step request, the first response might be a clarifying question. That is correct behavior, but not a resolution. The second response might retrieve documentation to show progress, but the interaction remains incomplete. The final response might confirm the action taken, but this resolution only matters in the context of what came before. Evaluating any single turn in isolation misses whether the agent actually solved the user's problem.
LangSmith supports threads as a first-class evaluation target. Threads are sequences of related traces representing multi-turn conversations, and they provide the context window for evals. Online evaluators can run at the thread level to assess the full interaction rather than individual responses.
What to measure in multi-turn evals
Multi-turn evals measure what matters most for agent performance.
- Task completion: Did the agent accomplish what the user asked for?
- User outcome / satisfaction: Did the interaction end with the user's goal achieved?
- Agent trajectory: Did the agent take an efficient path, including appropriate tool calls?
These evals run automatically once a thread completes, using LLM-as-judge prompts configured for your specific criteria. Thread-level outcomes often reveal that single-turn metrics don't correlate with actual user success.
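A thread-level evaluator can be sketched as a function over the whole turn list rather than one response. The metric names mirror the bullets above, but their exact definitions here (and the stubbed judge callable) are illustrative assumptions:

```python
def evaluate_thread(turns, completion_judge) -> dict:
    """Score a full conversation thread rather than individual turns.
    `turns` is a list of {"role", "content"} dicts; `completion_judge`
    is any callable str -> bool that decides whether the final
    assistant message resolves the task (an LLM judge in practice)."""
    assistant_turns = [t for t in turns if t["role"] == "assistant"]
    final = assistant_turns[-1]["content"] if assistant_turns else ""
    return {
        "num_turns": len(turns),
        "task_completed": bool(completion_judge(final)),
        # crude trajectory proxy: fewer assistant turns to resolution is better
        "efficiency": 1.0 / max(len(assistant_turns), 1),
    }
```

The point of the structure is that `task_completed` is judged only once the thread ends, so a clarifying question early on is never penalized as a non-answer.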
Implementation phases for continuous evaluation
Building a continuous evaluation system doesn't require implementing everything at once. You can start with foundational capabilities and layer in sophistication as your application matures.
Phase 1: Establish the foundation
Enable tracing across all LLM calls, tool invocations, API interactions, and retrieval steps. Tracing costs very little and provides the raw material for everything else.
Add deterministic checks for critical format requirements like JSON validity, required fields, and length limits. These checks are also cheap and catch obvious failures on every run.
Build your first high-quality dataset from a handful of representative examples. Even ten well-chosen examples establish a baseline for regression detection.
Phase 2: Gate releases on regression tests
Turn production failures into regression tests so that each bug you fix becomes a test case through LangSmith's trace-to-dataset workflow.
Before shipping prompt changes or model swaps, establish baseline metrics on your core dataset and run experiments to compare outputs before and after each change. For high-stakes changes, use human review via Annotation Queues so domain experts can validate that improvements in one area don't introduce regressions elsewhere.
Phase 3: Monitor continuously
Run online evaluators to sample production traffic and assess quality patterns and safety. Configure thread-level evals for multi-turn agent interactions and set up alerting to catch score degradation or anomalous patterns.
For RAG applications specifically, evaluate both the retrieval and generation stages. Check that your embedding pipeline surfaces relevant documents and that the retrieved context actually supports the generated answer. The docs your system retrieves should directly inform the response quality.
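Scoring the two stages separately might look like the sketch below. Lexical token overlap stands in for the relevance and groundedness measures a real system would compute with embeddings or an LLM judge, and the thresholds are arbitrary assumptions:

```python
def token_overlap(a: str, b: str) -> float:
    """Crude lexical overlap; real systems would use embeddings
    or an LLM judge instead."""
    ta, tb = set(a.lower().split()), set(b.lower().split())
    return len(ta & tb) / max(len(ta), 1)

def evaluate_rag(question: str, retrieved_docs, answer: str) -> dict:
    """Score retrieval and generation stages separately
    (both thresholds here are placeholder assumptions)."""
    retrieval_relevance = max(
        (token_overlap(question, d) for d in retrieved_docs), default=0.0
    )
    # groundedness: how much of the answer is supported by retrieved context
    context = " ".join(retrieved_docs)
    groundedness = token_overlap(answer, context)
    return {
        "retrieval_relevant": retrieval_relevance > 0.2,
        "answer_grounded": groundedness > 0.5,
    }
```

Splitting the scores this way tells you which stage to fix: a relevant-but-ungrounded result points at generation, while an irrelevant retrieval points at the embedding pipeline.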
When you treat LLM evals as a continuous engineering discipline, you catch quality degradation before users do. The goal is a system that learns from production and prevents the same failure from happening twice.
Close the loop on AI agent quality
Teams that focus on closing the loop build a more durable quality system than those that optimize scoring techniques in isolation. The feedback loop from production traces to regression datasets makes quality improvements stick.
LangSmith connects these pieces into a repeatable system. Tracing captures every step of your application's behavior. Annotation Queues bring human expertise directly into the workflow. The Insights Agent surfaces patterns worth investigating. Datasets grow from real production failures rather than imagined scenarios. Together, these capabilities create an evaluation discipline that scales with your application.
Explore LangSmith to see how tracing, datasets, and evals work together, and start building the feedback loop that turns production issues into systematic quality improvements.