LangSmith Evaluation

Continuously improve agent quality

Run evals before and after shipping, gather expert feedback on real performance, and iterate on prompts with your team.

Helping top teams ship reliable agents

Evaluate your agent’s performance

Test with offline evaluations on datasets, or run online evaluations on production traffic. Score performance with automated evaluators — LLM-as-judge, code-based, or any custom logic — across criteria that matter to your business.

Learn how to run an eval
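
For example, a minimal offline eval with the Python SDK might look like the sketch below. The dataset name "Agent QA", the stub agent, and the exact helper signatures are assumptions; check the SDK docs for your version.

```python
# Offline eval sketch: run a target function over a dataset and score it
# with a code-based evaluator. Assumes LANGSMITH_API_KEY is set and a
# dataset named "Agent QA" (hypothetical) exists with "question" inputs
# and reference "answer" outputs.
from langsmith import Client

client = Client()

def my_agent(question: str) -> str:
    # Placeholder: swap in your real agent or chain call.
    return "Paris" if "capital of France" in question else "I don't know"

def target(inputs: dict) -> dict:
    return {"answer": my_agent(inputs["question"])}

def exact_match(run, example) -> dict:
    # Code-based evaluator: compare the agent's answer to the reference.
    predicted = (run.outputs or {}).get("answer", "")
    expected = (example.outputs or {}).get("answer", "")
    return {"key": "exact_match", "score": int(predicted == expected)}

results = client.evaluate(
    target,
    data="Agent QA",            # dataset to evaluate against
    evaluators=[exact_match],   # code-based, LLM-as-judge, or custom
    experiment_prefix="baseline",
)
```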

Iterate & collaborate on prompts

Experiment with models and prompts in the Playground, and compare outputs across different prompt versions or providers. Use the Prompt Canvas UI to automatically improve prompts and compare results.

Create and test a prompt
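
As one illustration, the SDK's prompt-management helpers let you version prompts from code as well as from the Playground. The prompt name below is hypothetical, and pushing a ChatPromptTemplate assumes langchain-core is installed.

```python
# Prompt versioning sketch with the LangSmith Python SDK.
from langchain_core.prompts import ChatPromptTemplate
from langsmith import Client

client = Client()

# Push an initial version (creates the prompt if it doesn't exist).
prompt = ChatPromptTemplate.from_messages([
    ("system", "You are a concise support agent."),
    ("user", "{question}"),
])
client.push_prompt("qa-agent-prompt", object=prompt)

# Later, pull the latest committed version to compare it against other
# models or prompt revisions in the Playground or in your eval runs.
latest = client.pull_prompt("qa-agent-prompt")
```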

Gather expert human feedback

Set up annotation queues, so subject-matter experts can assess response relevance, correctness, and other custom criteria. Automatically assign runs for review, and annotate any part of your agent workflow to capture precise feedback on quality.

Streamline feedback with annotation queues
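
A sketch of setting up a queue from code is shown below; the queue name and project name are hypothetical, and the same workflow is available in the UI without code.

```python
# Annotation queue sketch with the LangSmith Python SDK.
from langsmith import Client

client = Client()

# Create a queue that subject-matter experts review in the LangSmith UI.
queue = client.create_annotation_queue(
    name="Relevance review",
    description="SMEs rate response relevance and correctness.",
)

# Route some recent root runs from a project (name is hypothetical)
# into the queue for human review.
runs = client.list_runs(project_name="my-agent", is_root=True, limit=25)
client.add_runs_to_annotation_queue(queue.id, run_ids=[run.id for run in runs])
```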

Ready to build better agents through continuous evaluation?

LangSmith works with any framework to help you test and iterate faster. Run automated evals, gather expert feedback, and collaborate on improvements — all without leaving your workflow.

FAQs for LangSmith Evaluation

What kind of evaluators does LangSmith support?
  • Human: Use annotation queues or inline review.
  • Heuristic: Rule-based checks like “is the response empty?” or “does the code compile?”
  • LLM-as-judge: Use an LLM to score outputs against criteria you define (see the sketches after this list).
  • Pairwise: Compare two outputs to see which one is better.
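
For illustration, a heuristic evaluator and an LLM-as-judge evaluator might look like the sketches below. The (run, example) signature and the use of the openai package are assumptions; adapt them to your SDK version and judge model.

```python
# Two evaluator sketches: a rule-based heuristic and an LLM-as-judge.

def not_empty(run, example) -> dict:
    # Heuristic: fail any run whose answer is empty.
    answer = (run.outputs or {}).get("answer", "")
    return {"key": "not_empty", "score": int(bool(answer.strip()))}

def judged_relevance(run, example) -> dict:
    # LLM-as-judge: ask a model whether the answer addresses the question.
    # Assumes the `openai` package and OPENAI_API_KEY are available.
    from openai import OpenAI
    prompt = (
        f"Question: {example.inputs.get('question', '')}\n"
        f"Answer: {(run.outputs or {}).get('answer', '')}\n"
        "Reply with 1 if the answer addresses the question, otherwise 0."
    )
    verdict = OpenAI().chat.completions.create(
        model="gpt-4o-mini",
        messages=[{"role": "user", "content": prompt}],
    ).choices[0].message.content.strip()
    return {"key": "relevance", "score": int(verdict.startswith("1"))}
```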
Can I run both offline and online evaluations?

Yes. Offline evals run on datasets (great for benchmarking or regression testing). Online evals run on real production traffic in near real time and can be used to monitor deployed agents or LLM apps.

How does human feedback work?

LangSmith makes it easy to collect expert feedback through annotation queues. You can flag runs for review, assign them to subject-matter experts (SMEs), and use that feedback to improve prompts, refine evaluators, or augment datasets.

Can I use LangSmith Evaluation without LangSmith Observability?

Yes. You can use LangSmith Evaluation with or without Observability. For all plan types, you'll get access to both and only pay for what you use.

I can’t have data leave my environment. Can I self-host LangSmith?

Yes, we allow customers to self-host LangSmith on our Enterprise plan. We deliver the software to run on your Kubernetes cluster, and data will not leave your environment. For more information, check out our documentation.

Where is my data stored?

When using LangSmith hosted at smith.langchain.com, data is stored in GCP us-central-1. If you’re on the Enterprise plan, we can deliver LangSmith to run on your Kubernetes cluster in AWS, GCP, or Azure so that data never leaves your environment. For more information, check out our documentation.

Will LangSmith add latency to my application?

No, LangSmith does not add any latency to your application. In the LangSmith SDK, there’s a callback handler that sends traces to a LangSmith trace collector which runs as an async, distributed process. Additionally, if LangSmith experiences an incident, your application performance will not be disrupted.
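
As a sketch, tracing in the Python SDK is opt-in and happens off the request path; the example below assumes the tracing environment variables (LANGSMITH_TRACING, LANGSMITH_API_KEY) are set, and the traced function is hypothetical.

```python
# Tracing sketch: the @traceable decorator records the call and its
# inputs/outputs, while trace export is batched and sent in the
# background so the function's return is not blocked on network I/O.
from langsmith import traceable

@traceable(name="answer_question")
def answer_question(question: str) -> str:
    # Your agent logic runs exactly as before.
    return "42"

print(answer_question("What is the meaning of life?"))
```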

Will you train on the data that I send LangSmith?

We will not train on your data, and you own all rights to your data. See LangSmith Terms of Service for more information.

How much does LangSmith cost?

See our pricing page for more information, and find a plan that works for you.