Harden your application with LangSmith evaluation

Don’t ship on “vibes” alone. Measure your LLM application’s performance by testing it across its development lifecycle.

Continuously improve your LLM system by capturing new metrics for style and accuracy, identifying regressions and errors, and fixing them quickly.

Test early, test often

LangSmith helps you test your application code before release and while it runs in production.

Offline Evaluation

Test your application on reference LangSmith datasets. Use a combination of human review and auto-evals to score your results.
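
For illustration, here is a minimal offline-evaluation sketch using the LangSmith Python SDK; `my_app`, the dataset name, and the `exact_match` scorer are placeholder stand-ins for your own application and scoring logic.

```python
from langsmith.evaluation import evaluate


def my_app(inputs: dict) -> dict:
    # Call your chain, agent, or model here (placeholder).
    return {"answer": "Paris"}


def exact_match(run, example) -> dict:
    # Auto-eval: score 1 if the app's answer matches the reference output.
    predicted = run.outputs["answer"]
    expected = example.outputs["answer"]
    return {"key": "exact_match", "score": int(predicted == expected)}


results = evaluate(
    my_app,
    data="my-reference-dataset",      # name of a LangSmith dataset
    evaluators=[exact_match],         # auto-evals; human review can be layered on in the UI
    experiment_prefix="offline-eval",
)
```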

Integrate with CI

Understand how changes to your prompt, model, or retrieval strategy impact your app before they hit prod. Catch regressions in CI and prevent them from impacting users.
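
As a sketch, the same evaluation run can gate a CI pipeline; the dataset name, evaluator, threshold, and result-row structure below are assumptions to adapt to your own setup and SDK version.

```python
import sys

from langsmith.evaluation import evaluate

from my_app import my_app, exact_match  # hypothetical module: your app and evaluator

results = evaluate(
    my_app,
    data="my-reference-dataset",
    evaluators=[exact_match],
    experiment_prefix="ci",
)

# Aggregate per-example scores; the row structure follows the SDK docs and
# may vary slightly between SDK versions.
scores = [
    res.score
    for row in results
    for res in row["evaluation_results"]["results"]
    if res.key == "exact_match" and res.score is not None
]
mean_score = sum(scores) / len(scores)

# Fail the build if quality drops below the agreed threshold.
if mean_score < 0.9:
    print(f"Eval score {mean_score:.2f} fell below 0.90")
    sys.exit(1)
```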

Online Evaluation

Continuously monitor qualitative characteristics of your live application to spot problems or drift.
001

Dataset Construction

A strong testing framework starts with building a reference dataset, often a tedious task. LangSmith streamlines this by letting you save debugging and production traces to datasets.

Datasets are collections of exemplary or problematic inputs and outputs that should be replicated or corrected, respectively.

Go to Docs
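
A minimal sketch of building a dataset programmatically with the LangSmith client; the dataset name and example payloads here are illustrative, not prescribed.

```python
from langsmith import Client

client = Client()

# Create a reference dataset to collect exemplary and problematic cases.
dataset = client.create_dataset(
    dataset_name="my-reference-dataset",
    description="Cases captured from debugging and production traces.",
)

# Add input/output pairs, e.g. copied from interesting traces.
client.create_examples(
    inputs=[{"question": "What is the capital of France?"}],
    outputs=[{"answer": "Paris"}],
    dataset_id=dataset.id,
)
```
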
002

Regression Testing

With so many moving parts in an LLM app, it can be hard to attribute a regression to a specific model, prompt, or other system change. LangSmith lets you track how different versions of your app stack up against the evaluation criteria you’ve defined.

Go to Docs
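
As a sketch, running two versions of an app against the same dataset produces two experiments that can be compared side by side in LangSmith; `app_v1`, `app_v2`, and the `correctness` scorer are placeholders for your own code.

```python
from langsmith.evaluation import evaluate


def app_v1(inputs: dict) -> dict:
    return {"answer": "..."}  # e.g. the current prompt/model (placeholder)


def app_v2(inputs: dict) -> dict:
    return {"answer": "..."}  # e.g. a revised prompt or different model (placeholder)


def correctness(run, example) -> dict:
    match = run.outputs["answer"] == example.outputs["answer"]
    return {"key": "correctness", "score": int(match)}


# Each call creates a separate experiment on the same dataset, so the two
# versions can be compared against the criteria you've defined.
evaluate(app_v1, data="my-reference-dataset", evaluators=[correctness],
         experiment_prefix="baseline")
evaluate(app_v2, data="my-reference-dataset", evaluators=[correctness],
         experiment_prefix="candidate")
```
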
003

Human Annotation

While LangSmith has many options for automatic evaluation, sometimes you need a human touch. LangSmith significantly speeds up the human labeling workflow with configurable feedback criteria and annotation queues of traces that reviewers can work through, scoring application responses as they go.
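
Annotation queues and feedback criteria are typically configured in the LangSmith UI; as a sketch, human feedback can also be attached to a traced run with the client, where the run ID and feedback key below are placeholders.

```python
from langsmith import Client

client = Client()

# Record a human reviewer's score on a traced run.
client.create_feedback(
    run_id="00000000-0000-0000-0000-000000000000",  # ID of the run being annotated
    key="helpfulness",                               # your feedback criterion
    score=1,                                         # e.g. 1 = helpful, 0 = not helpful
    comment="Accurate and concise answer.",
)
```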

004

Online Evaluation

Testing needs to happen continuously for any live application. LangSmith helps you monitor not only latency, errors, and cost, but also qualitative measures, so you can make sure your application responds effectively and meets your company’s expectations.
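
Online evaluators and monitoring rules are configured on a tracing project in LangSmith; the sketch below shows the tracing side that feeds them, with the project name and function being assumptions for illustration.

```python
import os

from langsmith import traceable

os.environ["LANGCHAIN_TRACING_V2"] = "true"      # enable tracing (an API key is also required)
os.environ["LANGCHAIN_PROJECT"] = "my-prod-app"  # project to monitor


@traceable(name="answer_question")
def answer_question(question: str) -> str:
    # Call your chain, agent, or model here; each run is logged with latency,
    # errors, and token usage, and can be scored by online evaluators.
    return "Paris"


answer_question("What is the capital of France?")
```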

Don’t fly blind. Easily benchmark performance.

Evaluation gives developers a framework to make trade-off decisions between cost, latency, and quality.

Go to Docs

Ready to start shipping reliable GenAI apps faster?

Get started with LangChain, LangSmith, and LangGraph to enhance your LLM app development, from prototype to production.