comparison
Why "Claude-stack-only" is a feature. Metric AI vs Datadog/Honeycomb/Grafana, vs Langfuse/Helicone, vs Anthropic's own dashboards.
tl;dr
Metric AI is the OpenTelemetry backend purpose-built for the Claude stack — Claude Code, Claude Agent SDK, and Claude Cowork. We accept generic OTLP, but every dashboard, every aggregate, every renderer is shaped exactly like the data Anthropic’s tools emit. That’s the wedge.
If you want one backend for your Postgres, your Kubernetes nodes, and your Claude usage, you want Datadog or Grafana. If you want the Claude-stack story rendered correctly — subagent trees, prompt-id correlation, Cowork approval flows — you want us.
why “Claude-stack-only” is a feature, not a limitation
Generic OTel backends optimise for breadth. They render every span the same way: a name, a duration, a parent id, an attribute bag. That’s the right shape for a microservice mesh. It’s the wrong shape for an agent loop.
A Claude Code session is not a flat list of spans. It’s a tree:
- one claude_code.interaction per user turn
- N claude_code.llm_request children, with model + token + cost attributes
- M claude_code.tool children, some tool.blocked_on_user, some tool.execution
- correlated by prompt.id across the whole tree, including Agent SDK subprocesses (TRACEPARENT propagated automatically)
- and, if you use Cowork, interleaved approval, file-access, and skills/plugins events
Modeling this generically means every panel reduces to “list of spans with these attributes.” Modeling it specifically means we can show “for prompt id X, here is the tree, here are the tool decisions, here is the cost broken out by model, and here is the approval flow that gated it.”
That’s what we do. Generic backends won’t, and shouldn’t — it’s not their job.
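To make the difference concrete, here is a minimal sketch of turning a flat span list into a per-prompt tree. It is not our renderer: the `Span` class, the `span_id`/`parent_id` field names, the sample data, and the choice to nest tool.blocked_on_user and tool.execution under claude_code.tool are all assumptions made for illustration.

```python
from collections import defaultdict
from dataclasses import dataclass, field
from typing import Optional


@dataclass
class Span:
    # Hypothetical minimal span shape; real OTLP spans carry far more fields.
    span_id: str
    parent_id: Optional[str]
    name: str
    attrs: dict = field(default_factory=dict)


def build_trees(spans):
    """Group spans by prompt.id, then nest each group by parent id."""
    by_prompt = defaultdict(list)
    for span in spans:
        by_prompt[span.attrs.get("prompt.id")].append(span)

    trees = {}
    for prompt_id, group in by_prompt.items():
        ids = {s.span_id for s in group}
        children = defaultdict(list)
        roots = []
        for s in group:
            # A span whose parent is not in this group is treated as a root.
            if s.parent_id in ids:
                children[s.parent_id].append(s)
            else:
                roots.append(s)
        trees[prompt_id] = (roots, children)
    return trees


def render(roots, children, depth=0):
    """Return the tree as indented lines, one span name per line."""
    lines = []
    for s in roots:
        lines.append("  " * depth + s.name)
        lines.extend(render(children[s.span_id], children, depth + 1))
    return lines


# Illustrative data only: one user turn with one LLM request and one tool call.
sample = [
    Span("a", None, "claude_code.interaction", {"prompt.id": "p1"}),
    Span("b", "a", "claude_code.llm_request", {"prompt.id": "p1"}),
    Span("c", "a", "claude_code.tool", {"prompt.id": "p1"}),
    Span("d", "c", "tool.blocked_on_user", {"prompt.id": "p1"}),
    Span("e", "c", "tool.execution", {"prompt.id": "p1"}),
]

roots, children = build_trees(sample)["p1"]
tree_lines = render(roots, children)
```

A generic backend stops at the flat `sample` list; the tree in `tree_lines` is the shape a Claude-aware view starts from.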
vs generic OTel backends
| Vendor | Strengths | Why it’s the wrong shape for the Claude stack |
|---|---|---|
| Datadog | Mature, complete signal coverage, alerting, RUM | Pricing scales with custom-metric cardinality. A 200-dev shop on Claude Code easily exceeds the budget. No Claude-aware dashboards. Cowork approval events arrive as opaque log lines. |
| Honeycomb | Best-in-class trace exploration, BubbleUp | Same cardinality cost shape. Wide-event model is great for backend services, awkward for prompt.id-scoped agent trees. |
| Grafana Cloud | Ships an official Claude Code integration with basic dashboards. Best of the generics. | Still pays per active series. No Cowork approval rendering. EU residency requires the higher tier. |
| New Relic / Elastic / SigNoz / Sealos / Quesma | Solid OTel support, self-host options | All of them published Claude Code dashboards in early 2026, but none specialises in cost attribution per dev/repo. Agent SDK traces fragment across subprocess boundaries. |
The common pattern: ingest fine, render generically, bill by volume or cardinality. None of them know what a prompt.id is supposed to mean across Claude Code + Agent SDK + Cowork.
vs LLM-native tools
| Vendor | Strengths | Why it’s the wrong shape for the Claude stack |
|---|---|---|
| Langfuse | Open-source, popular for prompt experiments, has a Claude Agent SDK guide | Treats traces generically — single LLM call per row. No subagent tree. Cowork’s tool/file/approval events are out of scope. |
| Helicone | Cheap proxy-based model, clean UI | Sits in the API path, not the OTLP path. Doesn’t see claude_code.tool decisions, hooks, or Cowork’s approval flow. |
| LangSmith / Phoenix / Traceloop / Braintrust / Weave | Each has a niche (evals, dataset curation, framework tracing) | All are LLM-API-call-shaped. Agent loops are second-class at best. Cowork is unsupported. |
These tools are great if your unit of analysis is “one LLM call.” The Claude stack’s unit of analysis is a prompt.id — one user turn that fans out into dozens of tool calls, sub-agents, hooks, and (with Cowork) human approvals.
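As a rough illustration of that change in unit, the sketch below rolls individual span rows up into one summary per prompt.id. The dict layout and the `model` and `cost_usd` keys are hypothetical stand-ins, not the attribute names Anthropic's tools emit, and the model names and costs are invented sample values.

```python
from collections import defaultdict


def rollup_by_prompt(spans):
    """Summarise spans per prompt.id instead of per LLM call."""
    turns = defaultdict(
        lambda: {
            "llm_requests": 0,
            "tool_calls": 0,
            "cost_by_model": defaultdict(float),
        }
    )
    for span in spans:
        turn = turns[span["prompt.id"]]
        if span["name"] == "claude_code.llm_request":
            turn["llm_requests"] += 1
            # "model" and "cost_usd" are assumed keys for this sketch.
            turn["cost_by_model"][span["model"]] += span["cost_usd"]
        elif span["name"] == "claude_code.tool":
            turn["tool_calls"] += 1
    return turns


# Illustrative rows: one user turn that fans out into several spans.
rows = [
    {"prompt.id": "p1", "name": "claude_code.llm_request", "model": "model-a", "cost_usd": 0.25},
    {"prompt.id": "p1", "name": "claude_code.tool"},
    {"prompt.id": "p1", "name": "claude_code.tool"},
    {"prompt.id": "p1", "name": "claude_code.llm_request", "model": "model-a", "cost_usd": 0.5},
    {"prompt.id": "p1", "name": "claude_code.llm_request", "model": "model-b", "cost_usd": 0.125},
]

summary = rollup_by_prompt(rows)
```

A call-shaped tool would show three unrelated rows for the LLM requests; the rollup shows one turn with three requests, two tool calls, and a cost split by model.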
vs Anthropic’s own dashboards
Anthropic ships first-class OTel instrumentation across Claude Code, Agent SDK, and Cowork. They explicitly do not ship a hosted backend — the docs say “bring your own”. Metric AI is that backend. If Anthropic ever flips on a hosted dashboard toggle for the SMB segment, our wedge becomes price + EU residency + cross-tool correlation (Claude Code plus Cowork in the same view).
use both — the recommended pattern
You probably already run Datadog or Grafana for your services. Keep doing that. Point Claude Code, Agent SDK, and Cowork at us via standard OTLP env vars. The two backends don’t compete — they cover different parts of your telemetry.
[your microservices, databases, K8s] ────────► Datadog / Honeycomb / Grafana
[Claude Code + Agent SDK + Cowork] ────────► Metric AI
Both use OTLP. Both honour W3C trace context. Different unit of analysis, different bill, different rendering.
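A minimal sketch of the second arrow, assuming a bearer-token header and a placeholder endpoint. The three `OTEL_EXPORTER_OTLP_*` variable names are the standard OpenTelemetry ones; the endpoint URL, the key, and the `Authorization` scheme are hypothetical, and any tool-specific switches needed to turn telemetry on are left to Anthropic's docs.

```python
import os


def otlp_env(endpoint, api_key, protocol="http/protobuf"):
    """Build the standard OTLP exporter variables for a child process."""
    return {
        "OTEL_EXPORTER_OTLP_ENDPOINT": endpoint,
        # Valid values per the OpenTelemetry spec: grpc, http/protobuf, http/json.
        "OTEL_EXPORTER_OTLP_PROTOCOL": protocol,
        # Assumed auth scheme; the header your backend expects may differ.
        "OTEL_EXPORTER_OTLP_HEADERS": f"Authorization=Bearer {api_key}",
    }


# Placeholder endpoint and key, for illustration only.
claude_env = {**os.environ, **otlp_env("https://otlp.example.invalid", "YOUR_API_KEY")}

# Pass claude_env as env= when launching the tool, for example with
# subprocess.run([...], env=claude_env). Your services keep their own
# exporter settings and continue to report to the existing backend.
```

Scoping the variables to the child process, rather than exporting them globally, is one way to keep the two pipelines from overwriting each other's endpoint.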
price comparison
For a 100-developer shop running Claude Code at moderate volume:
| Option | Plan or billing basis | Approx. monthly cost |
|---|---|---|
| Generic OTel (Datadog) | full retention, all spans | $4–10k |
| Generic OTel (Grafana Cloud) | basic Claude Code integration | $1.5–3k |
| LLM-native (Langfuse cloud) | per-event billing | $500–1.5k (no subagent trees, no Cowork) |
| Metric AI Metrics tier | $3 / dev / mo | $300 |
| Metric AI Traces tier | $8 / dev / mo | $800 |
Numbers are illustrative — every shop’s volume is different — but the ratio holds.
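The arithmetic behind the table, as a short sketch you can rerun with your own headcount. The per-dev prices and the range endpoints are taken from the table; the function and variable names are ours for illustration.

```python
def per_dev_monthly(devs, usd_per_dev):
    """Flat per-developer pricing: headcount times monthly rate."""
    return devs * usd_per_dev


def ratio_range(low_usd, high_usd, baseline_usd):
    """How many times the baseline a quoted low-high range works out to."""
    return (low_usd / baseline_usd, high_usd / baseline_usd)


DEVS = 100

metrics_tier = per_dev_monthly(DEVS, 3)  # 100 * 3
traces_tier = per_dev_monthly(DEVS, 8)   # 100 * 8

# Illustrative ranges from the table, compared against the Traces tier.
datadog_vs_traces = ratio_range(4000, 10000, traces_tier)
grafana_vs_traces = ratio_range(1500, 3000, traces_tier)
```

With the table's figures, the Datadog range comes out at 5x to 12.5x the Traces tier and the Grafana Cloud range at roughly 1.9x to 3.75x.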
what we won’t beat
If you need Datadog’s RUM, log correlation, infra metrics, and APM in one pane of glass: keep Datadog. We are not trying to replace it. We are the focused observability layer for the Claude stack — not the SRE platform.