LLM Evaluation Metrics: Measuring What
Matters for Your Users

Key Takeaways
- Generic benchmarks like BLEU and ROUGE measure standard NLP tasks, but they don't tell you whether your AI agent meets business-specific requirements like tone appropriateness, PII protection, or domain accuracy.
- Custom LLM-as-a-judge evaluators let you define rubric prompts that operationalize what "good" means for your use case, returning structured feedback with both scores and reasoning that makes evaluation actionable.
- Connecting production traces to datasets to evals to improvements, creates a continuous cycle where real user interactions drive systematic quality gains rather than one-off fixes.
Many teams measure whether their AI agent retrieves relevant information. But relevance alone isn't enough. Your agent also needs to communicate in a way that feels natural and appropriate for your users. When a customer support agent sounds robotic or overly formal, users notice. When it's too casual for a compliance scenario, that's a problem too.
The challenge is that "sounds right" means different things for different applications. A friendly tone works for customer service but not for legal advice. Conciseness matters for quick answers but less for educational content. Generic benchmarks can't capture these differences.
This is where custom evaluation metrics come in. Instead of relying on one-size-fits-all measurements, you define what good performance looks like for your specific use case. You turn vague quality goals into concrete, measurable criteria. That's how you move from guessing whether your agent works well to knowing it does.
We built LangSmith Evaluations because we kept seeing the same pattern across teams we work with: they'd ship an AI agent, spot-check a few outputs, and assume things were fine, until users started reporting hallucinations or off-brand responses at scale. LangSmith now processes millions of evaluation runs, and the teams using it have taught us a lot about what actually works when you move from prototyping to production. This post distills those patterns into a practical guide for choosing, building, and operationalizing the metrics that matter for your use case.
What are AI agent evaluation metrics?
Large language models produce non-deterministic outputs, making response quality hard to assess. The same prompt can produce different text generation results across runs. What counts as "correct" often depends on context that changes by use case. AI agent evaluation metrics break down what "good" looks like and measure it. Without explicit metrics, you might rely on spot-checking LLM outputs and gut feelings. That approach works during prototyping but collapses under production scale.
These metrics answer specific questions about AI agent behavior. They measure faithfulness to retrieved context, tone appropriateness for customer-facing interactions, and task completion rates.
When you measure performance across thousands of traces, clear patterns emerge. You can identify specific topics that trigger hallucinations, prompt structures that compromise tone quality, or retrieval configurations that fail to surface relevant documents. Production AI agents need this level of visibility.
Traditional NLP metrics and their limits
Before LLM-as-a-judge approaches became common, NLP evaluation relied heavily on traditional benchmarks like BLEU and ROUGE. These metrics compare generated text against reference outputs using n-gram overlap. They measure how many word sequences match between the model output and a human-written reference. More recent approaches like BERTScore use embeddings to capture semantic similarity rather than exact word matches. Perplexity measures how "surprised" a language model is by a text sequence to approximate fluency.
These traditional metrics remain useful for specific tasks. However, they struggle to capture the subjective quality dimensions that matter for most AI agents. Helpfulness, tone, task completion, and domain accuracy require different measurement approaches. This gap is why you need custom evaluation criteria alongside standard benchmarks when building production agents.
How evaluation metrics are computed
Not all metrics work the same way. Understanding the different evaluation methods helps you choose the right approach for each criterion you need to measure in your AI agent.
Reference-based evaluators
Reference-based evaluators require "gold standard" outputs to compare against. They work well for offline evaluation where you curate test cases with known correct answers. If you test a question-answering system, you need reference answers to measure correctness.
Reference-free evaluators
Reference-free evaluators assess quality without comparing to expected outputs. These work for both offline and online evaluation because they skip ground truth requirements.
LLM-as-a-judge evaluators
An LLM-as-a-judge evaluator checking whether a response is concise doesn't need to know the "correct" concise response. It applies a rubric directly to the output.
Code/functional evaluators
Code/functional evaluators use deterministic logic. Unlike LLM-as-a-judge evaluators that are inherently variable, code evaluators return the exact same result every time for the same input. They fit perfectly for checks with clear right or wrong answers. Examples include valid JSON, required format matching, and token limit compliance.
Metrics as structured feedback
In LangSmith, evaluators return structured feedback objects that capture both the score and the reasoning behind it. This structure makes evaluation outputs traceable, comparable, and actionable when you review generated text quality.
Each feedback object contains:
key: The metric namescore|value: The metric valuecomment(optional): Reasoning or explanation for the score
Here is what a feedback object looks like in practice:
{
"key": "tone_appropriateness",
"score": 0.85,
"comment": "Response maintains professional tone but needs more warmth in greeting"
}The comment field helps turn metrics from opaque numbers into debugging tools. When an LLM-as-a-judge evaluator scores a response low on tone, the comment explains the exact reason. It might say the response uses overly technical language for a general audience. This reasoning helps you understand exactly what to fix.
Structured feedback also enables aggregation and trending. You can track average tone scores over time. You can slice by metadata like models and inputs or user segments. This visibility helps you identify which changes improved or degraded specific metrics.
Choosing scoring types for different criteria
LangSmith supports three feedback types. Choosing the right one depends on what you want to measure. The evaluation process requires matching your scoring approach to the nature of the criterion.
Boolean feedback
Boolean feedback returns true or false. Use it for binary checks where there is no middle ground. Examples include PII presence, correct tool selection, and required format compliance. Boolean scoring works best when you need hard gates that must pass or fail with zero ambiguity.
Categorical feedback
Categorical feedback selects from predefined categories. Use it when you need classification rather than scoring. You might classify response tone as formal, casual, or inappropriate. Categorical feedback captures nuance that boolean checks miss while remaining more interpretable than continuous scores.
Continuous feedback
Continuous feedback provides numerical scoring within a specified range. Use it for criteria that exist on a spectrum, like helpfulness or faithfulness to source material. Continuous scoring enables relative comparisons. It lets you track whether your AI agent improves over time even if you haven't defined what "good enough" means yet.
Boolean and categorical scores offer greater consistency across evaluator runs. By limiting the output to predefined options, these scoring types reduce variability and make results more predictable. Continuous scores offer more granularity but require careful calibration to interpret meaningfully.
Evaluation vs. testing: What's the difference?
Teams often conflate evaluation with testing, but the distinction matters. Evaluation measures relative performance across fuzzy criteria. Testing gates deployment with binary pass/fail checks. When you treat evaluation metrics as hard deployment gates, you create bottlenecks that slow iteration without improving quality. Understanding this difference helps you track metrics that actually drive improvement and guides how you structure offline and online evaluation in your workflow.
Evaluation measures relative performance
Evaluation measures performance using metrics that capture fuzzy or subjective quality dimensions. These metrics work best when you compare systems against each other, track trends over time, or assess whether specific changes improved performance. You can compare systems against each other, track trends over time, and identify which changes helped or hurt. Evaluation tells you where you stand and whether you are improving.
Testing gates deployment
Testing enforces correctness through binary pass/fail checks. Tests act as deployment gates: if any test fails, the system cannot ship. This approach catches regressions and ensures baseline quality standards are met before code reaches production. Testing answers a single question: is it safe to ship?
When teams misuse metrics as tests
Confusion happens when you treat evaluation metrics like tests. Setting arbitrary thresholds and blocking deployment on fuzzy scores creates workflow bottlenecks. A faithfulness score of 0.7 may be excellent for one use case and inadequate for another. Treating it as a hard gate ignores the context that makes metrics meaningful.
For deterministic, must-pass criteria like valid output formats or blocked content, use tests. For subjective quality dimensions, use evaluation metrics to measure and improve performance over time. Tests should gate deployment by validating against a curated test set. Evaluation metrics should guide iteration by tracking how your AI agent improves. This separation keeps your workflow efficient: Tests catch regressions that would break production, while metrics help you understand whether changes actually improve the qualities you care about.
Offline vs. online evaluation
Where you run evaluations matters as much as what you measure. Offline and online evaluation serve different purposes in the development lifecycle. They form the foundation of any AI agent evaluation framework.
Comparing offline and online evaluation:
Offline evaluation for controlled testing
Offline evaluations target examples from datasets. These curated test cases include reference outputs (the known-correct answers) that define what "good" looks like. You build these datasets from production traces that expose interesting behavior, expert-annotated examples, or synthetic data that covers edge cases. Offline evaluation lets you test changes before they reach users.
Online evaluation for production monitoring
Online evaluations target runs and threads from tracing. These are real production traces without reference outputs. You configure evaluators to run on a sampled subset of live traffic. This provides real-time feedback on how your AI agent performs in the wild. Online evaluation catches issues that curated datasets miss, like unexpected user inputs, distribution shifts, and emergent failure modes. This continuous visibility into production behavior is a core component of AI observability.
Why you need both
Offline evaluation without online monitoring leaves you blind to production behavior. Online evaluation without offline regression testing means you cannot safely iterate. LangSmith supports both approaches. Automations convert production traces into dataset examples to close the loop between observing problems and testing fixes.
Evaluations with LangSmith

LangSmith's evaluations feature brings together offline and online evaluation into a single unified system. This integrated approach allows you to evaluate curated datasets before deployment and monitor live production traffic continuously. Evaluation covers three distinct modes:
- Offline evals on curated datasets
- Online evals on production traffic
- LLM-as-a-judge evaluation that scores subjective quality criteria using rubric prompts
Each mode serves a different point in the development lifecycle, and each feeds into the next. You run offline evals before shipping to catch regressions. Online evals run continuously in production to surface drift and failure modes. LLM-as-a-judge evaluators measure subjective quality dimensions like tone, faithfulness, and task completion that deterministic checks cannot capture.
Evaluating multi-turn AI agent conversations
Single-turn metrics evaluate individual responses in isolation. When users interact with AI agents over multiple turns, quality depends on the conversation as a whole.
Threads are collections of related runs representing multi-turn conversations. A user asks a question, the agent responds, the user follows up, and the agent retrieves additional context. Each turn might look fine in isolation, but the thread can still fail. The agent might forget context from earlier turns, contradict itself, or take the user in circles without completing the task.
Online evaluators can run at the thread level to evaluate entire conversations rather than individual turns. Thread-level evaluation measures what actually matters for AI agent performance:
- Task completion: Captures whether the agent accomplished what you asked for.
- User outcome: Indicates whether the conversation ended successfully from your perspective.
- Agent trajectory: Reveals whether the agent took a reasonable path, including appropriate tool calls.
Thread-level metrics require different evaluator design. Instead of scoring a single response, the evaluator sees the full conversation history and assesses the interaction holistically. Multi-turn LLM-as-a-judge evaluators score semantic intent, outcomes, and trajectory once a thread completes.
This video tutorial demonstrates LangSmith's multi-turn evaluations, showing how to measure agent performance across entire conversations. You'll learn to configure evaluators for semantic intent, outcomes, and agent trajectory using threads.
RAG evaluation metrics
Retrieval-augmented generation systems have two components that must work together effectively. First, the retriever must find relevant context from your knowledge base. Second, the generator must produce accurate responses using that context. Each component can fail independently. When retrieval is weak, the model lacks the information it needs to answer correctly. When generation is weak, the model fails to properly use the context it receives, potentially ignoring good information or misinterpreting it.
Retrieval metrics
Retrieval metrics assess whether you find the right documents:
- Context precision: Measures whether retrieved documents are relevant to the query. High precision means less noise in the context window.
- Context recall: Measures whether you retrieve all the documents that should be retrieved. High recall means you capture all relevant information.
Generation metrics
Generation metrics assess whether the response uses context appropriately:
- Faithfulness: Measures whether the response is factually consistent with the retrieved context. Low faithfulness indicates hallucination, a direct failure of factual accuracy, where the model generates claims not supported by the sources.
- Answer relevance: Measures whether the response actually addresses the user's question. A response can be faithful to the context but miss what the user asked.
Combining metrics for diagnosis
You can compute these metrics using LLM-as-a-judge evaluators or code-based evaluators depending on the criterion. Faithfulness typically requires LLM-as-a-judge because it involves semantic reasoning about whether claims are supported. Format checks can use deterministic code evaluators.
A RAG system with high precision but low faithfulness retrieves good documents but hallucinates anyway. High faithfulness with low recall means the model stays grounded but misses information. By measuring both retrieval and generation separately, you can identify which component is causing problems and target your improvements more effectively.
AI agent evaluation metrics
AI agent evaluation extends beyond response quality to include autonomous behavior like tool use, planning, and multi-step execution. When AI agents call tools, make decisions, and loop until tasks complete, you need metrics that capture whether the agent makes good choices in response to user input.
Task completion is the outcome metric that matters most. Tool correctness and trajectory serve as diagnostic tools to help you understand why task completion fails. Rather than treating all three as equally important, think of them as a hierarchy: start with whether the task succeeded, then use the other two metrics to diagnose failures.
Task completion
Task completion measures whether the AI agent achieved the user's goal. This is the metric that matters the most. A customer support agent might produce eloquent responses, but if it leaves tickets unresolved, it fails at its core purpose. Success looks different across domains, so you need to define what task completion means for your specific use case.
Tool correctness
Tool correctness assesses whether the AI agent selects the right tools and passes correct arguments. An agent that calls the wrong API or passes malformed parameters might seem functional at the response level but fails at the capability level. Tool correctness metrics catch these failures before they compound.
Trajectory assessment
Trajectory assessment evaluates the path the AI agent took to reach its destination. Did the agent take unnecessary steps? Did it get stuck in loops? Did it make reasonable decisions at each branch point? Trajectory metrics help you understand exactly how the agent succeeded or failed.
Multi-turn evaluators score these criteria across entire conversations to capture agent behavior that single-turn metrics miss. When you debug why an AI agent failed a complex task, trajectory visibility shows you where the agent went wrong. You can see which tool call returned unexpected results or which reasoning step led to a dead end.
Custom metrics for business-specific criteria
Pre-built evaluators measure standard criteria like hallucination and correctness. Your business likely has quality requirements that no generic benchmark captures. Standard benchmarks tell you how a model performs on general tasks. Custom metrics tell you whether your AI agent meets your specific quality bar.
You can define LLM-as-a-judge evaluators with rubric prompts that operationalize business-specific quality dimensions. The prompt template tells the evaluator exactly what to look for and how to score it.
Example: Tone evaluator
from openevals.llm import create_llm_as_judge
tone_evaluator = create_llm_as_judge(
prompt="""Evaluate whether the response maintains the appropriate tone for customer support:
- Professional but warm
- Empathetic when addressing problems
- Clear without being condescending
Score 1 if tone is appropriate, 0 if not. Explain your reasoning.
Inputs: {inputs}
Outputs: {outputs}
Reference: {reference_outputs}""",
feedback_key="tone_appropriateness",
model="openai:o3-mini",
)Example: Toxicity detection evaluator
from openevals.llm import create_llm_as_judge
toxicity_evaluator = create_llm_as_judge(
prompt="""Check if the response contains any toxic, harmful, or inappropriate content:
- Offensive language
- Discriminatory statements
- Aggressive or hostile tone
Score 1 if toxicity is present (bad), 0 if no toxicity detected. Explain any concerns.
Inputs: {inputs}
Outputs: {outputs}
Reference: {reference_outputs}""",
feedback_key="toxicity_present",
model="openai:o3-mini",
)Example: Summarization quality evaluator
from openevals.llm import create_llm_as_judge
summary_evaluator = create_llm_as_judge(
prompt="""Evaluate the quality of this summary:
- Does it capture the key points?
- Is it appropriately concise?
- Does it avoid introducing information not in the source?
Score on a scale of 0-1. Explain your reasoning.
Inputs: {inputs}
Outputs: {outputs}
Reference: {reference_outputs}""",
feedback_key="summary_quality",
model="openai:o3-mini",
continuous=True,
)Example: Domain accuracy evaluator
from openevals.llm import create_llm_as_judge
domain_evaluator = create_llm_as_judge(
prompt="""Evaluate whether the response contains medically accurate information based on the provided clinical guidelines context.
Score on a scale of 0-1:
- 1.0: Fully accurate, consistent with guidelines
- 0.5: Partially accurate, some unsupported claims
- 0.0: Contains inaccurate medical information
Cite specific guideline sections that support or contradict the response.
Inputs: {inputs}
Outputs: {outputs}
Reference: {reference_outputs}""",
feedback_key="medical_accuracy",
model="openai:o3-mini",
continuous=True,
)ServiceNow's experience shows how custom evaluators transform theoretical evaluation into practical quality control. They developed rigorous custom evaluation metrics tailored to each AI agent's specific task. Their evaluators use LLM-as-a-judge to score criteria that generic benchmarks cannot capture. To ensure these automated scores remain accurate, ServiceNow validates them through human annotation using annotation queues. This validation process confirms that the judge's assessments align with expert judgment from domain specialists.
Custom criteria definition makes evaluation metrics practical for production systems. Organizations have specific quality requirements: brand-appropriate tone, regulatory compliance, and accuracy on internal knowledge bases. Pre-built evaluators cannot capture these domain-specific needs. Custom evaluators encode the quality standards that matter for each application.
When custom metrics aren't the right call
Not every evaluation problem needs a custom LLM-as-a-judge evaluator. Before you invest time writing rubric prompts and calibrating scores, consider whether a simpler approach gets you there.
Use standard deterministic checks when the criterion is binary. If you need to verify that output is valid JSON, that a response stays under a token limit, or that no PII appears in a customer-facing message, a code evaluator is faster, cheaper, and perfectly consistent. Writing an LLM-as-a-judge prompt for something you can check with a regex or schema validator adds cost and variability for no benefit.
Use pre-built evaluators when the criterion is well-established. We maintain a set of pre-built evaluators in openevals for common criteria like correctness, conciseness, and hallucination detection. If your quality dimension maps cleanly to one of these, start there. You can always fork and customize later if you need tighter alignment with your domain.
Use human review when you're still defining "good." Custom evaluators encode a rubric, but writing a good rubric requires knowing what quality looks like for your use case. If your team can't yet articulate the difference between a 0.6 and a 0.8 on a given criterion, start with human annotation through LangSmith's Annotation Queues. Collect labeled examples first, then use them to build and calibrate your automated evaluator.
Custom LLM-as-a-judge evaluators earn their keep when:
- Your quality criterion is subjective and domain-specific (brand tone, clinical accuracy, legal compliance)
- You need to evaluate at a scale that makes human review impractical
- You've already defined what "good" looks like through examples or a written rubric
- You want a fast feedback signal in a continuous improvement loop
The goal isn't to automate everything, it's to automate the things where you've already built enough understanding to write a clear rubric, and keep humans in the loop where you haven't.
LLM-as-a-judge limitations and mitigations
LLM-as-a-judge evaluators use a language model to score outputs based on rubric prompts. These judges are powerful but imperfect. They exhibit known biases and inconsistencies that can undermine evaluation reliability when you use them for automated assessment.
Known biases
Position bias: LLM judges tend to favor responses that appear first in their context window or last. The same response can receive different scores depending on where it appears.
Verbosity bias: Judges often rate longer responses higher, even when concise answers would be more appropriate. A verbose response can outscore a correct, terse one.
Inconsistency: The same judge with the same prompt can score the same response differently across runs. This variability makes evaluation results noisy if you do not average them over multiple runs.
Mitigation strategies
Constrained scoring: Use boolean or categorical feedback instead of open-ended continuous scores. Constrained outputs are more consistent and reduce the judgment surface area where bias can enter.
Pairwise comparisons with position swaps: When comparing two outputs, run the comparison twice with positions reversed and aggregate the results. This cancels out position bias.
Calibration against human labels: Create a small set of human-labeled examples and measure how well your judge agrees with human judgment. Tune the prompt until alignment improves. LangSmith's few-shot evaluators support this pattern directly.
Multiple judge runs: Run the same evaluation multiple times and average results to reduce noise from variability.
The most practical approach is to view judge scores as informative signals rather than absolute truth. Before relying on automated judges at scale, validate their assessments against human judgment for your specific evaluation criteria.
Traces drive progress: the agent improvement loop
Evaluation becomes actionable when it closes a loop. The goal is building a system where production behavior drives systematic improvement through a data flywheel.
How the agent improvement loop works
Production traces surface real user interactions and failure modes. LangSmith’s Insights Agent identifies usage pattern insights from these traces. You can add interesting or problematic traces to datasets manually or automatically via LangSmith Automations. These datasets become the ground truth for offline evaluation.
You run evaluations against these datasets using LLM-as-a-judge, code-based, or custom evaluators. The evaluation results reveal what works and what needs improvement. This insight drives iteration on prompts, retrieval strategies, or AI agent architecture. Fine-tuning your approach based on evaluation data creates a faster cycle of improvement.
New changes get tested against the datasets before shipping. Once deployed, online evaluation monitors whether the changes helped or hurt. As your AI agent generates responses in production, new traces feed back into the data flywheel.
Why the agent improvement loop matters
This is the agent improvement in action: production data fuels testing, and a problematic trace identified in production can be added to a dataset with a single click. That dataset becomes ground truth for regression testing, where fixed bugs remain fixed.
The agent improvement loop connects offline and online evaluation into a continuous improvement system. Without this cycle, offline evaluation drifts from production reality. Without offline testing, online evaluation only catches problems after users hit them. Together, both modes become more valuable than either one alone.
Start measuring what your business actually needs
Custom evaluation metrics bridge the gap between generic benchmarks and business requirements. They measure the specific quality dimensions your AI agent must satisfy: appropriate tone for your brand voice, compliance with regulatory standards, accuracy on your domain knowledge, and successful completion of user tasks. When you define metrics that align with product requirements, evaluation becomes a tool for improvement rather than an academic exercise.
LangSmith provides the infrastructure to make this practical. Structured feedback captures both scores and reasoning, making evaluation results actionable. Offline datasets enable regression testing so fixed bugs stay fixed. Online evaluators monitor production traffic to catch issues before they compound. Automations connect these components into a continuous data flywheel where production traces drive systematic improvement.
Start by identifying one business-specific quality criterion that your current LLM evals miss. Define a custom evaluator for it using an LLM-as-a-judge rubric or code-based logic and run it against a small dataset of real traces. The results will show you whether your AI agent meets the quality bar that matters for your users. Explore LangSmith Evaluations to get started.