LLM Observability Tools to Monitor & Eval Agents

A breakdown of the leading 8 LLM observability platforms for agent debugging, tracing, and evaluation.

Error logs tell you what broke. They don't flag hallucinations or when the model drifts from its intended behavior.

Basic monitoring catches obvious failures. The harder problem to solve is building workflows where subject matter experts can review specific runs, rate output quality, and add context that engineering teams can act on.

The best observability tools turn this feedback into a structured process, not scattered Slack alerts and spreadsheets.

This guide compares eight LLM observability platforms on their tracing, evaluation, and collaboration features.

While some focus on traditional infrastructure-level monitoring, others help you answer the harder question: whether your LLM agent actually produces good outputs for your specific use case.

Summary

  • Choose LangSmith if you need comprehensive agent debugging, observability, and evals, with structured workflows for domain experts to review and annotate production traces.
  • Choose Datadog LLM Observability if you need unified infrastructure and LLM monitoring within an existing Datadog stack.
  • Choose Langfuse if you need an open-source, self-hostable platform combining observability with prompt management.
  • Choose Helicone if you need a low-latency proxy that adds observability and caching with minimal code changes.
The bottom line: The best observability tools go beyond showing errors and create feedback loops where subject matter experts can contribute domain knowledge that engineers can actually use.
Get a demo of LangSmith's agent engineering platform

The best LLM observability tools at a glance

Tool | Type | Pricing | Open Source | LangChain Integration | Best For
LangSmith | Observability and Evaluation Platform | Freemium (from $0/seat/mo) | No | Native | Agent debugging, production monitoring, and LLM-as-judge evaluations
Datadog LLM Observability | Observability | Contact Sales | No | Yes | Unified infrastructure and LLM monitoring
Lunary | Observability & Prompt Management | Freemium (from $0/mo) | Yes (Apache-2.0) | Yes | Lightweight RAG pipeline and chatbot observability
Helicone | LLM Observability & AI Gateway | Freemium (from $0/mo) | Yes (Apache-2.0) | Native | Low-latency proxy for observability and caching
Langfuse | LLM Engineering Platform | Freemium (from $0/mo) | Yes (MIT) | Native | All-in-one observability, prompts, and evaluations
TruLens | Observability & Evaluation | Free (Open Source) | Yes (MIT) | Community | RAG Triad evaluation metrics
Arize Phoenix | AI Observability & Evaluation | Freemium (from $0/mo) | Yes (ELv2) | Native | Local-first RAG observability and evaluation
Portkey | AI Gateway / LLM Routing | Freemium (from $49/mo) | Yes (MIT) | Native | Production gateway with routing and fallbacks

What to look for in an LLM observability tool?

Catching errors is table stakes. The real challenge is knowing when outputs are technically valid but wrong for your domain. The best LLM observability tools surface these ambiguous cases for human review.

Tracing depth matters. You need visibility into every step of complex agent workflows: tool calls, retrieved documents, and intermediate reasoning. Black-box monitoring doesn't work for multi-step agents.

The best tools close the loop between production and development. They let domain experts annotate specific runs, then turn that feedback into evaluation datasets.

How we evaluated these tools

We analyzed official documentation, GitHub repositories, and public pricing pages for each platform. We gathered community sentiment from Reddit, Hacker News, and GitHub discussions, because real user feedback often surfaces nuances that official docs don't.

For this analysis, we focused on tracing, evaluation features, integration flexibility, and collaboration workflows.

LangSmith is built by LangChain, the publisher of this guide. We believe in LangSmith, but we have done our best to give every tool here a fair assessment. If LangSmith is not the right fit, one of these alternatives probably is.

LangSmith

What is LangSmith?

Quick Facts:

  • Type: Observability and Evaluation Platform
  • Company: LangChain
  • Pricing: Free tier available; Plus at $39/seat/month; custom Enterprise pricing
  • Open Source: No
  • Website: https://smith.langchain.com

Q: Can subject matter experts who aren't engineers use LangSmith? Yes. Annotation Queues are designed for domain experts to review specific traces, rate output quality, and add context without needing engineering skills. This feedback integrates directly into evaluation datasets.

LangSmith is a unified agent engineering platform that provides observability, evaluations, and prompt engineering for any LLM application or AI agent. It is framework agnostic and works with any agent stack, including the OpenAI SDK, Anthropic, LangChain, LangGraph, and fully custom implementations.

The platform creates high-fidelity traces that render the complete execution tree of an agent. You see tool selections, retrieved documents, and exact parameters at every step.
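
As a minimal sketch, instrumenting a plain Python function with LangSmith's traceable wrapper looks roughly like this (it assumes the langsmith and openai packages are installed and that LANGSMITH_TRACING and LANGSMITH_API_KEY are set; the function and model names are illustrative):

```python
# Minimal tracing sketch. Assumes LANGSMITH_TRACING=true and LANGSMITH_API_KEY are set;
# the model and function names below are placeholders.
from langsmith import traceable
from langsmith.wrappers import wrap_openai
from openai import OpenAI

client = wrap_openai(OpenAI())  # OpenAI calls become child runs in the trace tree

@traceable(run_type="chain", name="answer_question")
def answer_question(question: str) -> str:
    # Inputs, outputs, latency, and token usage are attached to this run automatically.
    response = client.chat.completions.create(
        model="gpt-4o-mini",
        messages=[{"role": "user", "content": question}],
    )
    return response.choices[0].message.content

answer_question("What does step-level tracing capture?")
```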

LangSmith turns observability into a collaborative workflow. Annotation Queues let subject matter experts review, label, and correct complex traces. This domain knowledge flows directly into evaluation datasets, creating a structured feedback loop between production behavior and engineering improvements.

LangSmith also lets you run offline evaluations on datasets or online evaluations on production traffic, with automated scoring via evaluators, including LLM-as-judge.
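
A rough sketch of an offline evaluation with the Python SDK is shown below. It reuses the answer_question function from the tracing sketch above and assumes a LangSmith dataset named "support-questions" with a "question" input and an "answer" reference output; the evaluator is a toy example, and import paths can differ slightly across SDK versions.

```python
# Hedged sketch of an offline evaluation run (dataset name, keys, and evaluator are illustrative).
from langsmith import evaluate

def exact_match(run, example):
    # Toy evaluator: compare the target's output with the dataset's reference answer.
    return {
        "key": "exact_match",
        "score": int(run.outputs.get("output") == example.outputs.get("answer")),
    }

results = evaluate(
    lambda inputs: {"output": answer_question(inputs["question"])},  # target under test
    data="support-questions",      # existing dataset of input/reference pairs
    evaluators=[exact_match],      # LLM-as-judge evaluators can be added the same way
    experiment_prefix="baseline",
)
```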

Who should use LangSmith?

  • Engineering teams building complex agents or multi-step workflows who need deep visibility into execution steps.
  • Teams where domain experts (not just engineers) need to review and rate LLM outputs through annotation and evaluation workflows.
  • Organizations running millions of traces per day who need cost and latency attribution at the step level.

Standout features

  • Full-stack tracing that captures the "internal monologue" of agents, including tool calls, document retrieval, and model parameters.
  • Polly, an embedded AI debugging assistant that analyzes traces and answers natural language questions like "Why did the agent enter this loop?"
  • Annotation Queues that create structured workflows for human experts to review, label, and correct production traces.
  • LLM-as-a-judge evaluators that automatically grade thousands of historical runs using custom criteria.
  • Multi-turn evaluation support measuring agent performance across entire conversation threads.
  • Insights Agent that auto-categorizes behavior patterns and prioritizes improvements by frequency and impact.

Pros and cons

Pros:
  • Deep visibility into complex agent workflows with step-level cost and latency attribution
  • Structured evaluation workflows with LLM-as-judge and human annotation capabilities
  • Works with any LLM framework, not just LangChain

Cons:
  • Self-hosting restricted to Enterprise tier
  • BAA signing restricted to Enterprise tier
  • Product support only for Plus tier and above

FAQ

Q: Does LangSmith only work with LangChain applications? No. LangSmith is a standalone platform that works with any LLM framework, including OpenAI SDK, Anthropic, and custom implementations. It uses a traceable wrapper for automatic instrumentation regardless of your stack.

Q: How does LangSmith handle high-volume production traffic? LangSmith processes millions of traces per day for enterprise customers. The platform offers 14-day retention for base traces and 400-day extended retention, with volume-based pricing that scales with usage.

Q: What evaluation approaches does LangSmith support? LangSmith supports offline evals (testing known scenarios before production), online evals (testing over real-time production data), and multi-turn evaluations for conversation-based agent applications. You can use LLM-as-judge evaluators or human annotation workflows.

Datadog

What is Datadog LLM Observability?

Quick Facts:

  • Type: Observability
  • Company: Datadog, Inc.
  • Pricing: Contact Sales
  • Open Source: No
  • Website: https://www.datadoghq.com

Datadog LLM Observability extends Datadog's existing monitoring platform to cover LLM applications. It correlates LLM spans with standard APM traces, showing how model latency affects overall application performance.

The platform supports agentless deployment via environment variables. This makes it accessible for serverless environments. Teams can view LLM performance alongside infrastructure metrics, error rates, and traditional application monitoring. Datadog provides Jupyter notebook examples for common patterns like RAG pipelines and agents.
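
As a hedged illustration, enabling agentless LLM Observability from Python might look roughly like the following (the ml_app name and site value are placeholders; check Datadog's docs for the exact options your ddtrace version supports):

```python
# Rough sketch of enabling Datadog LLM Observability in agentless mode.
# DD_API_KEY is assumed to be set in the environment; "my-rag-app" is a placeholder.
from ddtrace.llmobs import LLMObs

LLMObs.enable(
    ml_app="my-rag-app",       # logical application name shown in the Datadog UI
    agentless_enabled=True,    # send spans directly to Datadog, no local Agent required
    site="datadoghq.com",
)

# With dd-trace-py's LangChain integration, chain and LLM calls are then traced
# automatically; custom steps can be annotated via the LLMObs decorators (see the docs).
```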

Who should use Datadog LLM Observability?

  • Teams already invested in the Datadog ecosystem who want to add LLM monitoring without adopting a new platform
  • Organizations that prioritize correlating LLM performance with infrastructure metrics

Standout features

  • Correlation between LLM spans and standard APM traces for end-to-end latency analysis
  • Agentless deployment mode for serverless and restricted environments
  • Pre-built Jupyter notebooks demonstrating RAG pipeline and agent instrumentation
  • Automatic instrumentation of LangChain applications via dd-trace-py

Pros and cons

Pros:
  • Unified view of LLM and infrastructure metrics in a single platform
  • Familiar interface for teams already using Datadog
  • Agentless mode simplifies deployment in restricted environments

Cons:
  • Feels like infrastructure monitoring bolted onto LLMs; lacks specialized evaluation features
  • Payload size limits (~1 MB) can cause dropped spans
  • Pricing transparency concerns; known to be expensive at scale

FAQ

Q: Do I need the Datadog Agent to use LLM Observability? No. Datadog supports an agentless mode via environment variables, though running the full agent provides additional capabilities.

Q: How does Datadog LLM Observability compare to dedicated LLM tools? Community feedback suggests it excels at infrastructure correlation but lacks the depth of evaluation and annotation features found in purpose-built LLM platforms. It's best for teams prioritizing unified monitoring over specialized LLM workflows.

Q: What metadata can I attach to LLM spans? Datadog supports basic tags like temperature and model parameters. Users note that metadata support is less flexible than some dedicated alternatives.

Q: Is pricing publicly available? No. Contact Datadog sales for pricing information. Community members have expressed concerns about potential cost increases at scale.

Lunary

What is Lunary?

Quick Facts:

  • Type: Observability & Prompt Management Platform
  • Company: Lunary LLC
  • Pricing: Free tier (10k events/month); Team and Enterprise tiers (contact for pricing)
  • Open Source: Yes (Apache-2.0)
  • Website: https://lunary.ai

Lunary is a lightweight observability platform focused on RAG pipelines and chatbots. Setup takes about two minutes. It offers SDKs for JavaScript (Node.js, Deno, Vercel Edge, Cloudflare Workers) and Python.

The platform provides specialized tracing for retrieval-augmented generation, including embedding metrics and latency visualization. The generous free tier (10k events/month with 30-day retention) makes Lunary accessible for early-stage projects. Its open-source core (Apache-2.0) allows self-hosting, though some features require Enterprise licensing.
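
As a hedged sketch of that quick setup with the Python SDK: the snippet below assumes a LUNARY_PUBLIC_KEY environment variable and that wrapping a client with lunary.monitor() is still the documented integration point; check Lunary's quick-start for the current call.

```python
# Hedged sketch: wrap an OpenAI client so its calls are logged to Lunary.
# Assumes LUNARY_PUBLIC_KEY is set; monitor() reflects Lunary's quick-start at time of writing.
import lunary
from openai import OpenAI

client = OpenAI()
lunary.monitor(client)  # requests, responses, token counts, and latency are reported to Lunary

client.chat.completions.create(
    model="gpt-4o-mini",
    messages=[{"role": "user", "content": "Summarize this support ticket."}],
)
```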

Who should use Lunary?

  • Teams building RAG pipelines or chatbots who need cost-effective observability without enterprise complexity
  • Startups and small teams looking for a generous free tier to get started
  • Developers working with JavaScript runtimes like Deno, Vercel Edge, or Cloudflare Workers

Standout features

  • Rapid two-minute integration via lightweight SDKs
  • Specialized RAG tracing with embedding metrics and latency heatmaps
  • JavaScript SDK designed for compatibility with LangChain JS
  • Generous free tier with 10k events/month and 30-day retention

Pros and cons

Pros:
  • Fast setup and lightweight SDKs across multiple JavaScript runtimes
  • Specialized RAG visualization features
  • Cost-effective compared to enterprise alternatives

Cons:
  • Advanced features like playground and evaluators limited in lower tiers
  • Self-hosting requires Enterprise license for some features
  • Limited support for tracing images/attachments

FAQ

Q: What JavaScript runtimes does Lunary support? Lunary's JavaScript SDK works with Node.js, Deno, Vercel Edge, and Cloudflare Workers.

Q: Can I self-host Lunary? The core is open source under Apache-2.0, but convenient Docker/Kubernetes configurations and some compliance features require an Enterprise license.

Q: Does Lunary support exporting traces for fine-tuning? Users have reported gaps in dataset integration for exporting traces to fine-tuning workflows. Check current documentation for the latest capabilities.

Q: What's included in the free tier? 10k events/month, 3 projects, and 30 days of log retention.

Helicone

What is Helicone?

Quick Facts:

  • Type: LLM Observability & AI Gateway
  • Company: Helicone, Inc.
  • Pricing: Free tier (10k requests/month); Pro tier with 7-day trial; Enterprise contact sales
  • Open Source: Yes (Apache-2.0)
  • Website: https://www.helicone.ai

Helicone is a proxy-based observability solution. It sits between your application and LLM providers. Swap your API's base URL, and you gain observability, caching, and cost tracking with minimal code changes.

The platform adds negligible latency overhead according to Helicone’s own docs, making it suitable for production workloads.

The AI Gateway supports 100+ models across OpenAI, Azure OpenAI, Anthropic, AWS Bedrock, Gemini, and more. Intelligent caching and automatic failover help reduce API costs and improve reliability. The fully open-source core supports managed cloud, self-hosted Docker, and enterprise Helm chart deployments.
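
In practice, the integration is a base-URL swap plus an auth header, roughly as sketched below (assumes the openai Python SDK and a Helicone API key in the environment; the cache header is optional):

```python
# Minimal sketch of routing OpenAI traffic through Helicone's proxy.
# HELICONE_API_KEY and OPENAI_API_KEY are assumed to be set in the environment.
import os
from openai import OpenAI

client = OpenAI(
    base_url="https://oai.helicone.ai/v1",  # route requests through Helicone
    default_headers={
        "Helicone-Auth": f"Bearer {os.environ['HELICONE_API_KEY']}",
        "Helicone-Cache-Enabled": "true",   # optional: serve repeated prompts from cache
    },
)

client.chat.completions.create(
    model="gpt-4o-mini",
    messages=[{"role": "user", "content": "Hello"}],
)
```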

Who should use Helicone?

  • Teams wanting observability without complex SDK integration
  • Organizations prioritizing low latency overhead in production
  • Teams needing caching and failover capabilities alongside monitoring
  • Developers who prefer proxy-based approaches over code instrumentation

Standout features

  • One-line integration by swapping the API base URL
  • Low latency overhead suitable for production
  • Intelligent caching and automatic failover across providers
  • Support for 100+ models via unified gateway
  • Fully open-source core with flexible deployment options

Pros and cons

Pros:
  • Minimal code changes required; proxy-based approach
  • Cost-saving caching reduces API spend
  • Open-source with multiple deployment options

Cons:
  • Missing advanced governance features like granular RBAC and audit trails
  • Self-hosted setups can be complex in Kubernetes environments
  • Less depth in evaluation features compared to specialized platforms

FAQ

Q: How much latency does Helicone add? According to Helicone's docs, the proxy adds negligible latency overhead, which users report is acceptable for most production workloads.

Q: What LLM providers does Helicone support? OpenAI, Azure OpenAI, Anthropic, AWS Bedrock, Gemini, Ollama, Vercel AI, Groq, and 100+ additional models.

Q: Can I self-host Helicone? Yes. The open-source core supports Docker and Helm chart deployments.

Q: Does Helicone work with LangChain? Yes. Helicone provides a LangChain provider that routes calls through its gateway while maintaining observability and caching features.

Langfuse

What is Langfuse?

Quick Facts:

  • Type: LLM Engineering Platform
  • Company: Langfuse (acquired by ClickHouse); continued product investment is uncertain
  • Pricing: Free tier (50k units/month, 2 users); Enterprise from $2,499/month
  • Open Source: Yes (MIT, except ee folders)
  • Website: https://langfuse.com

Langfuse combines observability, prompt management, and evaluations in a single platform. The MIT-licensed core makes it popular with teams wanting full control over their data through self-hosting.

Automated instrumentation via callback handlers captures traces without modifying business logic. Community adoption is strong, with over 21,000 GitHub stars. The platform supports OpenAI SDK, LangChain, LlamaIndex, LiteLLM, Vercel AI SDK, Haystack, and Mastra.
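
For LangChain apps, the callback-handler pattern looks roughly like the sketch below. The import path has moved between Langfuse SDK versions, so treat this as an illustration and check the docs for your version; the chain itself is a toy example.

```python
# Hedged sketch: trace a LangChain chain via Langfuse's callback handler.
# Assumes LANGFUSE_PUBLIC_KEY, LANGFUSE_SECRET_KEY, and LANGFUSE_HOST are set.
# Older SDKs expose the handler at `langfuse.callback`; newer releases use a different module.
from langfuse.callback import CallbackHandler
from langchain_core.prompts import ChatPromptTemplate
from langchain_openai import ChatOpenAI

handler = CallbackHandler()
chain = ChatPromptTemplate.from_template("Summarize: {text}") | ChatOpenAI(model="gpt-4o-mini")

# Passing the handler via config traces the run without modifying business logic.
chain.invoke(
    {"text": "Langfuse combines tracing, prompt management, and evals."},
    config={"callbacks": [handler]},
)
```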

Who should use Langfuse?

  • Teams seeking an open-source, self-hostable alternative to proprietary platforms
  • Organizations wanting observability, prompt management, and evaluations in one place
  • Developers using Python or TypeScript who value drop-in SDK integration

Standout features

  • Unified platform combining observability, prompt management, and evaluations
  • MIT-licensed core with Docker-based self-hosting options
  • Automated instrumentation via LangChain callback handlers
  • Support for multiple frameworks: OpenAI SDK, LlamaIndex, LiteLLM, Vercel AI SDK, Haystack, Mastra
  • 21,000+ GitHub stars as of February 2026, indicating strong community adoption

Pros and cons

Pros:
  • All-in-one platform reduces tool fragmentation
  • MIT license provides flexibility for self-hosting
  • Strong community and active development

Cons:
  • Self-hosted version has occasional bugs with webhooks, dataset runs, and filtering
  • Native SDK support limited to Python and TypeScript
  • Enterprise tier starts at $2,499/month; continued product support and investment uncertain after the ClickHouse acquisition

FAQ

Q: Is Langfuse fully open source? The core is MIT-licensed. Enterprise features in ee folders have separate licensing. Check the repository for current details.

Q: What languages does Langfuse support? Native SDKs exist for Python and TypeScript. Other languages require building wrappers around the API.

Q: How does self-hosting work? Langfuse provides Docker-based deployment options. Users report occasional bugs in the self-hosted version, so factor in maintenance time.

Q: Can I use Langfuse with LangChain? Yes. Langfuse provides a callback handler for automated instrumentation of LangChain applications.

TruLens

What is TruLens?

Quick Facts:

  • Type: Observability & Evaluation
  • Company: TruEra
  • Pricing: Free (Open Source)
  • Open Source: Yes (MIT)
  • Website: https://www.trulens.org

TruLens focuses on systematic evaluation of RAG pipelines. It uses the "RAG Triad" framework: Context Relevance, Answer Relevance, and Groundedness. These metrics measure whether retrieved context is relevant, whether answers address the question, and whether responses are grounded in provided context.

The platform integrates with experiment tracking tools like Weights & Biases. Teams can log evaluation tables and A/B test model-prompt combinations. TruLens provides "chain-aware" feedback functions that capture metadata and intermediate steps better than ad-hoc evaluation scripts.
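
As a hedged sketch (TruLens 1.x module paths assumed; selector and method names change between releases, and the chain here is a trivial stand-in for a real RAG pipeline), wiring one leg of the RAG Triad onto a LangChain app looks roughly like this:

```python
# Hedged sketch of a TruLens feedback function and recorder. Import paths follow TruLens 1.x;
# "rag-demo" and the toy chain are placeholders for a real RAG application.
from langchain_core.prompts import ChatPromptTemplate
from langchain_openai import ChatOpenAI
from trulens.core import Feedback, TruSession
from trulens.providers.openai import OpenAI as OpenAIProvider
from trulens.apps.langchain import TruChain

rag_chain = ChatPromptTemplate.from_template("Answer briefly: {question}") | ChatOpenAI(model="gpt-4o-mini")

session = TruSession()
provider = OpenAIProvider()

# Answer Relevance (one leg of the RAG Triad): does the response address the question?
f_answer_relevance = Feedback(provider.relevance, name="Answer Relevance").on_input_output()

recorder = TruChain(rag_chain, app_name="rag-demo", feedbacks=[f_answer_relevance])
with recorder:
    rag_chain.invoke({"question": "What does the RAG Triad measure?"})

session.get_leaderboard()  # aggregated feedback scores per app
```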

Who should use TruLens?

  • Teams building RAG applications who need structured evaluation metrics
  • ML engineers who want to integrate evaluation with experiment tracking platforms like Weights & Biases
  • Organizations focused on measuring groundedness and context relevance specifically

Standout features

  • RAG Triad evaluation framework: Context Relevance, Answer Relevance, Groundedness
  • Chain-aware feedback functions capturing intermediate steps and metadata
  • Integration with Weights & Biases for experiment tracking
  • OpenTelemetry support for broader observability integration

Pros and cons

Pros:
  • Comprehensive RAG-specific metrics via the RAG Triad framework
  • Deep instrumentation captures intermediate steps
  • Experiment tracking integration with Weights & Biases

Cons:
  • Similarity-based metrics can yield high scores for contextually incorrect outputs
  • Continuous scores make it difficult to write clear pass/fail assertions
  • Configuration challenges with advanced LangChain agents and streaming callbacks

FAQ

Q: What is the RAG Triad? Three metrics for evaluating RAG pipelines: Context Relevance (is the retrieved context relevant?), Answer Relevance (does the answer address the question?), and Groundedness (is the response grounded in the provided context?).

Q: How does TruLens handle false positives in evaluation? Similarity-based metrics can sometimes yield high scores for contextually incorrect outputs.

Q: Does TruLens integrate with LangChain? Yes, though as a community-supported integration. TruEra provides a migration guide for incorporating TruLens into LangChain v1.x applications.

Q: Is TruLens free? The open-source library is free under MIT license. Commercial/enterprise pricing requires contacting TruEra.

Arize Phoenix

What is Arize Phoenix?

Quick Facts:

  • Type: AI Observability & Evaluation
  • Company: Arize AI, Inc.
  • Pricing: Open source (free self-hosted); AX Free tier (25k spans/month); AX Pro/Enterprise contact sales
  • Open Source: Yes (Elastic License 2.0)
  • Website: https://phoenix.arize.com/

Arize Phoenix emphasizes local-first, notebook-friendly observability. It runs locally, in Jupyter notebooks, or via Docker with zero external dependencies. Privacy-focused teams find this attractive.

The platform uses OpenInference (OpenTelemetry-based) instrumentation to support multiple frameworks without vendor lock-in. Phoenix supports LlamaIndex, LangChain, Haystack, DSPy, and smolagents. The notebook-first experience lets ML engineers trace and visualize data directly during experimentation, shortening feedback loops before production deployment.
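
A minimal local sketch of that workflow is below (assumes the arize-phoenix and openinference-instrumentation-langchain packages are installed; the project name is a placeholder):

```python
# Hedged sketch: launch Phoenix locally and auto-instrument LangChain via OpenInference.
import phoenix as px
from phoenix.otel import register
from openinference.instrumentation.langchain import LangChainInstrumentor

px.launch_app()  # starts the Phoenix UI locally (works inside a Jupyter notebook too)

tracer_provider = register(project_name="local-rag")  # point OTel at the local Phoenix collector
LangChainInstrumentor().instrument(tracer_provider=tracer_provider)

# From here, any LangChain chain or agent you run is traced into the local Phoenix UI,
# with no data leaving the machine.
```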

Who should use Arize Phoenix?

  • ML engineers who work primarily in Jupyter notebooks and want observability during experimentation
  • Privacy-focused teams requiring fully local observability with no external dependencies
  • Teams using multiple frameworks (LlamaIndex, Haystack, DSPy) who want vendor-agnostic instrumentation

Standout features

  • Local-first deployment: runs in Jupyter, locally, or via Docker with zero external dependencies
  • Notebook-friendly experience designed for ML engineering workflows
  • OpenInference instrumentation supports LlamaIndex, LangChain, Haystack, DSPy, smolagents
  • Vendor-agnostic approach using OpenTelemetry-based standards

Pros and cons

Pros:
  • Runs fully locally with no external dependencies
  • Notebook-first design shortens experimentation feedback loops
  • Vendor-agnostic instrumentation via OpenInference

Cons:
  • Reported deployment challenges for remote hosting
  • Cost tracking focuses on tokens rather than dollar amounts
  • Fewer default evaluation metrics compared to some competitors

FAQ

Q: Can Phoenix run completely locally? Yes. Phoenix can run in Jupyter notebooks, locally, or via Docker with zero external dependencies.

Q: What is OpenInference? OpenInference is an OpenTelemetry-based instrumentation standard that Phoenix uses. It enables vendor-agnostic tracing across multiple frameworks.

Q: What's the difference between Phoenix (open source) and AX (cloud)? Phoenix is the open-source, self-hosted version. AX provides managed cloud hosting with tiered limits: Free (25k spans/month), Pro, and Enterprise.

Q: Does Phoenix support cost tracking? Yes, Phoenix focuses on token-based cost tracking.

Portkey

What is Portkey?

Quick Facts:

  • Type: AI Gateway / LLM Routing Framework
  • Company: Portkey.ai
  • Pricing: Developer free (10k logs/month); Production $49/month; Enterprise contact sales
  • Open Source: Yes (MIT)
  • Website: https://portkey.ai

Portkey is primarily an AI Gateway. It handles routing, fallbacks, and load balancing for LLM applications. Its lightweight architecture (~122 KB footprint) adds sub-millisecond latency overhead, making it suitable for high-performance production environments.

Observability comes as a built-in feature of the gateway rather than the primary focus. Teams often adopt Portkey to replace custom LLM management code. The unified SDKs for JavaScript and Python handle failovers, retries, and routing logic that would otherwise require significant engineering effort.
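
A hedged sketch of a fallback configuration through the Python SDK follows (the virtual key names are placeholders for keys created in the Portkey dashboard, and the config dict follows Portkey's gateway config format):

```python
# Rough sketch: route a request through Portkey with automatic fallback between providers.
# The API key and virtual key IDs below are placeholders.
from portkey_ai import Portkey

client = Portkey(
    api_key="PORTKEY_API_KEY",
    config={
        "strategy": {"mode": "fallback"},   # try targets in order until one succeeds
        "targets": [
            {"virtual_key": "openai-primary"},
            {"virtual_key": "anthropic-backup"},
        ],
    },
)

client.chat.completions.create(
    model="gpt-4o-mini",
    messages=[{"role": "user", "content": "Hello"}],
)
```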

Who should use Portkey?

  • Teams building production applications that need reliable routing, fallbacks, and load balancing
  • Organizations with custom LLM management code they want to simplify
  • Developers prioritizing gateway functionality who also want basic logging and observability

Standout features

  • High-performance gateway with ~122 KB footprint and sub-millisecond latency overhead
  • Automatic failovers, custom routing, retries, and load balancing
  • Unified SDKs (JavaScript, Python) simplify multi-provider management
  • Integration with LangChain, LlamaIndex, Autogen, and CrewAI

Pros and cons

Pros:
  • Minimal latency overhead makes it suitable for production routing
  • Built-in reliability features replace custom code
  • MIT-licensed gateway with 10,000+ GitHub stars

Cons:
  • Observability features feel secondary to gateway functionality
  • UI lacks polished evaluation suite for prompt benchmarking
  • Pricing unclear for high-volume enterprise use beyond standard plans

FAQ

Q: Is Portkey primarily an observability tool or a gateway? Portkey is primarily an AI Gateway. Observability (logging, tracing) is a built-in feature but not the primary focus. Teams needing deep evaluation workflows may want to pair it with a dedicated observability platform.

Q: How much latency does Portkey add? Sub-millisecond overhead with a ~122 KB footprint.

Q: Can Portkey replace custom LLM management code? Yes. Users report removing thousands of lines of custom failover, retry, and routing code by switching to Portkey's unified SDKs.

Q: What's included in the free Developer tier? 10k logs/month, 3 days log retention, and 3 prompt templates.

Get started with LangSmith

There are plenty of strong options for LLM observability, but the right choice depends on the problem you're solving.

If you just need to know when things break, most tools listed here will work. But if your challenge is knowing when outputs are technically correct but wrong for your domain, you need more than monitoring. You need a workflow where subject matter experts can review specific runs, rate quality, and provide context that engineers can act on.

LangSmith is built for this feedback loop. Annotation Queues let domain experts review production traces without needing engineering skills. That feedback flows directly into evaluation datasets and allows teams to optimize their AI applications.

What you get:

  • Full-stack tracing that reveals the complete execution tree of any agent, regardless of framework.
  • Annotation Queues where subject matter experts review, label, and correct specific traces.
  • LLM-as-a-judge evaluators that automatically grade thousands of runs using your custom criteria.
  • Production-to-development workflows that turn real issues into systematic improvements.
LangSmith works with your existing stack, whether that's OpenAI SDK, Anthropic, custom implementations, or any orchestration tool. No migration required.
Get a demo of LangSmith's agent engineering platform

The information provided in this article is accurate at the time of publication. Tool capabilities, pricing, and availability may change. Always verify current specifications on official websites.

Get started with agent observability & evals

LangSmith helps teams observe, evaluate, and deploy agents.
Get a demo