LangSmith Evaluation

Continuously improve agent quality

Run evals before and after shipping, gather expert feedback on real performance, and iterate on prompts with your team.

Helping top teams ship reliable agents

Evaluate your agent’s performance

Test with offline evaluations on datasets, or run online evaluations on production traffic. Score performance with automated evaluators — LLM-as-judge, code-based, or any custom logic — across criteria that matter to your business.

Learn how to run an eval

Iterate & collaborate on prompts

Experiment with models and prompts in the Playground, and compare outputs across different prompt versions or providers. Use the Prompt Canvas UI to automatically improve prompts and compare results.

Create and test a prompt

Gather expert human feedback

Set up annotation queues so subject-matter experts can assess response relevance, correctness, and other custom criteria. Automatically assign runs for review, and annotate any part of your agent workflow to capture precise feedback on quality.

Streamline feedback with annotation queues

Ready to build better agents through continuous evaluation?

LangSmith works with any framework to help you test and iterate faster. Run automated evals, gather expert feedback, and collaborate on improvements — all without leaving your workflow.

FAQs for LangSmith Evaluation

What kind of evaluators does LangSmith support?

LangSmith's evaluation framework supports multiple evaluator types: human evaluation through annotation queues, heuristic checks (like validating outputs or checking if code compiles), LLM-as-judge evaluators that score against criteria you define, and pairwise comparisons. You can also write custom evaluators in Python or TypeScript with any business logic you need, from correctness and ground truth matching to hallucination detection and guardrails validation.
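A code-based evaluator can be as simple as a plain function that compares outputs against a reference and returns a keyed score. The sketch below is illustrative only: the function names and the `key`/`score` dict shape are assumptions for clarity, not the exact LangSmith SDK signatures (see the SDK docs for those).

```python
# Illustrative code-based evaluators: names and dict shape are
# hypothetical, simplified from common evaluator conventions.

def exact_match(outputs: dict, reference_outputs: dict) -> dict:
    """Score 1.0 if the predicted answer matches the reference (case-insensitive)."""
    predicted = outputs.get("answer", "").strip().lower()
    expected = reference_outputs.get("answer", "").strip().lower()
    return {"key": "exact_match", "score": 1.0 if predicted == expected else 0.0}

def is_concise(outputs: dict, reference_outputs: dict) -> dict:
    """Heuristic check: pass only if the answer stays under 200 characters."""
    return {"key": "is_concise", "score": len(outputs.get("answer", "")) <= 200}
```

Functions like these would be passed alongside LLM-as-judge evaluators when running an experiment, so deterministic checks and judged criteria are scored in the same run.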

How does human feedback and annotation work?

LangSmith makes it easy for AI teams to collect expert feedback through annotation queues. Flag runs for review, assign them to subject-matter experts, and use that feedback to calibrate automated evaluation, improve prompts, or augment datasets with high-quality test cases.

How reliable is LLM-as-judge, and how do I audit it?

LLM-as-judge evaluators don't always get it right. LangSmith lets you route samples to human reviewers who flag disagreements, helping you identify failure modes and edge cases. This feedback loop lets you iterate on and calibrate your automated evaluation metrics over time.

What's the difference between offline and online evaluation?

Offline evaluations run against curated datasets during development to catch regressions before deployment; they act as unit tests for your LLM application. Online evaluation scores real-world production traffic in real time to detect quality drift. LangSmith supports both as part of an end-to-end evaluation lifecycle.

Can I use LangSmith Evaluation without LangSmith Observability?

Yes. You can use LangSmith Evaluation with or without Observability. For all plan types, you'll get access to both and only pay for what you use.

How does LangSmith evaluate AI agents and multi-turn workflows?

Agent evaluation in LangSmith captures the full trajectory of steps, tool calls, and reasoning your agent took. Define evaluators that score intermediate decisions and agent behavior to debug complex agent workflows and pinpoint where things went wrong.
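One common trajectory-level metric is whether the agent's expected tool calls appear in order within the steps it actually took. A minimal sketch of such an evaluator (the function name and score shape are hypothetical, not the LangSmith API):

```python
def trajectory_subsequence(actual_tools: list[str], expected_tools: list[str]) -> dict:
    """Score the fraction of expected tool calls that appear, in order,
    within the agent's actual trajectory (ordered-subsequence match)."""
    it = iter(actual_tools)
    # `tool in it` advances the iterator, so matches must occur in order.
    matched = sum(1 for tool in expected_tools if tool in it)
    return {"key": "trajectory_match", "score": matched / len(expected_tools)}
```

A partial score here points directly at which step of the workflow diverged, rather than only flagging that the final answer was wrong.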

Can I run evaluations in my CI/CD pipeline?

Yes. LangSmith integrates with pytest, Vitest, and GitHub workflows so you can run evals on every PR or nightly build. Set thresholds on evaluation metrics and fail pipelines automatically when scores drop, bringing the same rigor as deterministic unit tests to your AI development process.
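The threshold gate itself can be an ordinary pytest assertion over per-example scores. A minimal sketch, assuming the scores have already been collected from an eval run (they are hard-coded here for illustration):

```python
# Hypothetical CI gate: fail the pipeline if mean correctness drops
# below a threshold. In practice the scores would come from an
# experiment run, not a literal list.

def mean_score(scores: list[float]) -> float:
    """Average a list of per-example evaluation scores."""
    return sum(scores) / len(scores)

def test_correctness_above_threshold():
    scores = [1.0, 1.0, 0.0, 1.0]  # illustrative per-example results
    assert mean_score(scores) >= 0.7, "correctness regressed below 0.7"
```

Running this under pytest in a GitHub workflow makes a score regression fail the build the same way a failing unit test would.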

How do I benchmark across prompts, models, or agent versions?

LangSmith's comparison view shows experiment results side by side. Run the same dataset against different prompt versions, model providers, or agent systems to visualize what's working and optimize performance with real benchmarks.

How does LangSmith evaluate RAG systems?

RAG evaluation separates retrieval quality from generation quality. LangSmith supports metrics like context precision (did you retrieve relevant documents?) and faithfulness (does the answer match the retrieved context?), helping you catch hallucinations and improve your retrieval pipelines independently.
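Context precision, for instance, reduces to a simple ratio over the retrieved set. An illustrative computation (function and parameter names are hypothetical):

```python
def context_precision(retrieved_ids: list[str], relevant_ids: list[str]) -> float:
    """Fraction of retrieved documents that are actually relevant,
    given a set of ground-truth relevant document IDs."""
    if not retrieved_ids:
        return 0.0
    relevant = set(relevant_ids)
    return sum(1 for doc in retrieved_ids if doc in relevant) / len(retrieved_ids)
```

Scoring retrieval this way, separately from an LLM-as-judge faithfulness check on the generated answer, tells you whether a bad response came from the retriever or the generator.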

Do I have to use LangChain or LangGraph to use LangSmith?

No. LangSmith is framework-agnostic. Evaluate AI applications built with LangGraph, custom Python, or any other framework. Use the SDK or API to send traces from whatever stack your team runs.

How do I get started if I don't have a labeled dataset?

Start by capturing production traces with LangSmith, then sample interesting or problematic runs into a dataset. Use LLM-as-judge evaluators to bootstrap initial labels, then refine with human annotation.

Will LangSmith add latency to my application?

No. The LangSmith SDK uses an async callback handler that sends traces to a distributed collector. Your application performance is never impacted. If LangSmith experiences an incident, your agent keeps running normally.

Can I self-host LangSmith? Where is my data stored?

LangSmith instances hosted at smith.langchain.com store data in GCP us-central1 or europe-west4. For enterprise-grade requirements, LangSmith can run on your Kubernetes cluster in AWS, GCP, or Azure, so it's fully self-hosted and data never leaves your environment. We will not train on your data. See our documentation for details.

Will you train on the data that I send LangSmith?

We will not train on your data, and you own all rights to your data. See LangSmith Terms of Service for more information.

How much does LangSmith cost?

LangSmith has a free tier for development and small-scale production. Paid plans scale with trace volume. See our pricing page for details, or contact us for enterprise pricing.