Harden your application with LangSmith evaluation

Don’t ship on “vibes” alone. Measure your LLM application's
performance by testing across its development lifecycle.


Continuously improve your LLM system by capturing new metrics for style and accuracy, identifying regressions and errors, and fixing them quickly.

Test early, test often

LangSmith helps you test application code
before release and while it runs in production.

Offline evaluation

Test your application on reference LangSmith datasets. Use a combination of human review and auto-evals to score your results.
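As a rough illustration of the idea, an offline run loops over a reference dataset, calls the application on each input, and scores the result. The sketch below is plain Python rather than the LangSmith SDK; `DATASET`, `toy_app`, `exact_match`, and `run_offline_eval` are hypothetical names, and the `needs_review` flag is only one assumed way to route low-scoring rows to a human reviewer.

```python
from statistics import mean

# Hypothetical reference dataset: each example pairs an input with an expected output.
DATASET = [
    {"input": "2 + 2", "expected": "4"},
    {"input": "3 * 3", "expected": "9"},
    {"input": "10 / 4", "expected": "2.5"},
]


def toy_app(question: str) -> str:
    """Stand-in for the application under test (an assumption, not a real LLM call)."""
    answers = {"2 + 2": "4", "3 * 3": "9", "10 / 4": "2"}
    return answers.get(question, "")


def exact_match(output: str, expected: str) -> float:
    """Auto-eval: 1.0 when the output equals the reference after trimming whitespace."""
    return 1.0 if output.strip() == expected.strip() else 0.0


def run_offline_eval(app, dataset, evaluator):
    """Run the app on every example, score it, and flag imperfect rows for human review."""
    rows = []
    for example in dataset:
        output = app(example["input"])
        score = evaluator(output, example["expected"])
        rows.append(
            {
                "input": example["input"],
                "output": output,
                "score": score,
                "needs_review": score < 1.0,
            }
        )
    return {"mean_score": mean(row["score"] for row in rows), "rows": rows}


report = run_offline_eval(toy_app, DATASET, exact_match)
```

Rows with `needs_review` set are the ones a person would look at; everything else is covered by the automatic score.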

Integrate with CI

Understand how changes to your prompt, model, or retrieval strategy impact your app before they hit prod. Catch regressions in CI and prevent them from impacting users.

Online evaluation

Continuously monitor qualitative characteristics of your live application to spot problems or drift.

Lots of options to ensure full testing coverage

Evaluation is a critical, yet difficult, part of shipping quality applications. We make it easy to add automatic and human evaluation on every trace.

001

AI-Judge evaluation

Use an LLM and a prompt to evaluate your application’s response against any custom rubric.
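A minimal sketch of an LLM-as-judge evaluator might look like the following. It is not LangSmith's implementation: the prompt wording, the 1 to 5 scale, and the `call_llm` parameter (any function that takes a prompt string and returns the model's text) are all assumptions, and `fake_llm` is a placeholder so the example runs without a model.

```python
import re
from typing import Callable, Optional

# Hypothetical judge prompt; the rubric is supplied by the caller.
JUDGE_PROMPT = """You are grading an AI assistant's response.
Rubric: {rubric}
Question: {question}
Response: {response}
Reply with a single line of the form "SCORE: <integer from 1 to 5>"."""


def judge_response(
    question: str,
    response: str,
    rubric: str,
    call_llm: Callable[[str], str],
) -> Optional[int]:
    """Ask the judge model for a 1-5 score; return None if the reply cannot be parsed."""
    prompt = JUDGE_PROMPT.format(rubric=rubric, question=question, response=response)
    reply = call_llm(prompt)
    match = re.search(r"SCORE:\s*([1-5])\b", reply)
    return int(match.group(1)) if match else None


def fake_llm(prompt: str) -> str:
    """Placeholder judge that always answers 4; swap in a real model client."""
    return "SCORE: 4"


score = judge_response(
    question="How do I reset my password?",
    response="Click 'Forgot password' on the login page.",
    rubric="The answer is concise, polite, and actionable.",
    call_llm=fake_llm,
)
```

Returning `None` on an unparseable reply keeps a malformed judge output from being silently counted as a score.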

002

Gold standard evaluation

Build up a labeled dataset of inputs and gold standard outputs in LangSmith, and then evaluate the similarity of your application’s response compared to the reference output.
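One simple way to score similarity against a reference output is a character-level ratio from Python's standard library, sketched below. This is an illustrative stand-in, not the metric LangSmith uses; the `similarity` helper and its normalization step are assumptions, and an embedding distance or an LLM judge may suit free-form text better.

```python
from difflib import SequenceMatcher


def similarity(response: str, reference: str) -> float:
    """Return a 0.0-1.0 similarity ratio after lowercasing and collapsing whitespace."""

    def normalize(text: str) -> str:
        return " ".join(text.lower().split())

    return SequenceMatcher(None, normalize(response), normalize(reference)).ratio()


# Hypothetical gold-standard example.
reference_output = "Paris is the capital of France"
app_output = "The capital is Paris"
score = similarity(app_output, reference_output)
```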

003

Functional tests

Write a custom evaluator to test that the application’s response meets your expectations. For example, if you expect the response to be formatted in JSON, write a test to check for proper deserialization.
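The JSON check described above can be sketched in a few lines. The function name `is_valid_json_object` and the optional `required_keys` argument are hypothetical additions for illustration.

```python
import json
from typing import Iterable


def is_valid_json_object(response: str, required_keys: Iterable[str] = ()) -> bool:
    """Pass only if the response deserializes to a JSON object with the required keys."""
    try:
        parsed = json.loads(response)
    except (json.JSONDecodeError, TypeError):
        return False
    return isinstance(parsed, dict) and all(key in parsed for key in required_keys)


good = is_valid_json_object('{"answer": "42"}', required_keys=("answer",))
bad = is_valid_json_object('Sure! Here is the JSON: {"answer": "42"}')
```

The second call fails because the model wrapped the JSON in extra prose, a common formatting failure this kind of test is meant to catch.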

001

Dataset Construction

A strong testing framework starts with building a reference dataset, often a tedious task. LangSmith streamlines this by letting you save debugging and production traces to datasets.

Datasets are collections of exemplary or problematic inputs and outputs that should be replicated or corrected, respectively.


002

Regression Testing

With so many moving parts in an LLM app, it can be hard to attribute a regression to a specific model, prompt, or other system change. LangSmith lets you track how different versions of your app stack up against the evaluation criteria that you’ve defined.
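To make the comparison concrete, the sketch below diffs per-example scores from two versions of an app and lists the examples that got worse. The score dictionaries, the example IDs, and the `find_regressions` helper are hypothetical; in practice the scores would come from whatever evaluators you have defined.

```python
from typing import Dict, List, Tuple

# Hypothetical per-example scores for two versions of the same app.
BASELINE = {"q1": 0.9, "q2": 0.8, "q3": 0.7}
CANDIDATE = {"q1": 0.9, "q2": 0.5, "q3": 0.75}


def find_regressions(
    baseline: Dict[str, float],
    candidate: Dict[str, float],
    tolerance: float = 0.0,
) -> List[Tuple[str, float, float]]:
    """Return (example_id, old_score, new_score) wherever the score dropped by more than tolerance."""
    regressions = []
    for example_id, old_score in baseline.items():
        new_score = candidate.get(example_id)
        if new_score is not None and new_score < old_score - tolerance:
            regressions.append((example_id, old_score, new_score))
    return regressions


regressions = find_regressions(BASELINE, CANDIDATE)
```

A CI job could fail the build whenever this list is non-empty, which is one way to stop a regression before it reaches users.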


003

Human Annotation

While LangSmith has many options for automatic evaluation, sometimes you need a human touch. LangSmith speeds up the human labeling workflow with a feedback config and a queue of traces that labelers can work through, annotating each application response with a score.

004

Online Evaluation

Testing needs to happen continuously for any live application. LangSmith helps you monitor not only latency, errors, and cost, but also qualitative measures to make sure your application responds effectively and meets company expectations.
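As a sketch of what continuous monitoring can involve, the example below keeps a rolling window of recent traces and raises alerts when the error rate or an average quality score crosses a threshold. Every name here (`TraceRecord`, `RollingMonitor`, the thresholds, the 0 to 1 `quality` score) is an assumption made for illustration, not part of LangSmith.

```python
from collections import deque
from dataclasses import dataclass
from typing import List


@dataclass
class TraceRecord:
    latency_s: float  # wall-clock time for the request
    error: bool       # whether the request failed
    quality: float    # hypothetical 0.0-1.0 score from an online evaluator


class RollingMonitor:
    """Track the most recent traces and report which thresholds are breached."""

    def __init__(self, window: int = 100, min_quality: float = 0.7,
                 max_error_rate: float = 0.05) -> None:
        self.records = deque(maxlen=window)  # old records fall off automatically
        self.min_quality = min_quality
        self.max_error_rate = max_error_rate

    def add(self, record: TraceRecord) -> None:
        self.records.append(record)

    def alerts(self) -> List[str]:
        if not self.records:
            return []
        count = len(self.records)
        error_rate = sum(record.error for record in self.records) / count
        avg_quality = sum(record.quality for record in self.records) / count
        triggered = []
        if error_rate > self.max_error_rate:
            triggered.append("error_rate")
        if avg_quality < self.min_quality:
            triggered.append("quality")
        return triggered


monitor = RollingMonitor(window=3)
for _ in range(3):
    monitor.add(TraceRecord(latency_s=0.4, error=False, quality=0.9))
healthy_alerts = monitor.alerts()
```

Because the window is bounded, a run of poor responses shows up quickly instead of being averaged away by older healthy traffic.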