
comparison

Metric AI vs generic OTel (Datadog / Honeycomb / Grafana) vs LLM-native (Langfuse / Helicone).

tl;dr

There are two existing categories of backend you could point Claude Code at. Generic OTel (Datadog, Honeycomb, Grafana) ingests the data fine but doesn’t model the agent loop, and the bill scales with span volume. LLM-native (Langfuse, Helicone, LangSmith) models the LLM call but not the agent: no subagent trees, no per-developer cost attribution, no per-repo rollups.

Metric AI is the third category: the cheap, opinionated, Claude-native backend.

generic OTel

| Vendor | Strengths | Weaknesses for Claude Code |
|---|---|---|
| Datadog | Mature, complete signal coverage, alerting, RUM | Pricing scales with custom-metric cardinality. A 200-dev shop on Claude Code easily exceeds a typical observability budget. No Claude-aware dashboards. |
| Honeycomb | Best-in-class trace exploration, BubbleUp | Same cardinality cost shape. Wide-event model is great for backend services, awkward for agent-loop spans. |
| Grafana Cloud | Ships an official Claude Code integration with basic dashboards. Best of the generics. | You still pay per active series. Subagent tree viewer is not included. EU residency requires the higher tier. |
| New Relic / Elastic / SigNoz | Solid OTel support, self-host options | Same shape as the above. Elastic published a Cowork-monitoring blog post but the backend is still generic. |

The common pattern: ingest fine, render generically, bill by span volume. None of them know what a prompt.id is supposed to mean.
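
To make the cost shape concrete: tag a single token-usage counter by developer, repo, and model, and a 200-developer shop with, say, 50 active repos and 3 models emits up to 200 × 50 × 3 = 30,000 active series for that one metric alone. Under per-series billing (the shop size is from the table above; the repo and model counts are our illustrative assumptions), a handful of such metrics is enough to reach the figures in the price table below.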

LLM-native

| Vendor | Strengths | Weaknesses for Claude Code |
|---|---|---|
| Langfuse | Open-source, popular for prompt experiments, has a Claude Agent SDK guide | Treats traces generically: single LLM call per row. No subagent tree, no per-developer cost attribution. |
| Helicone | Cheap proxy-based model, clean UI | Sits in the API path, not the OTLP path. Doesn’t see tool calls, hooks, or approval flows. |
| LangSmith / Phoenix / Traceloop / Braintrust / Weave | Each has a niche (evals, dataset curation, framework tracing) | All are LLM-API-call-shaped. Agent loops are second-class at best. |

These tools are great if your unit of analysis is “one LLM call.” Claude Code’s unit of analysis is a prompt.id: one user turn that fans out into dozens of tool calls, subagents, and hooks.
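
To make that concrete, here is a minimal sketch of the grouping a prompt.id implies. The span shape and function names are ours for illustration, not a real SDK API; only the prompt.id attribute comes from Claude Code's telemetry.

```typescript
// Group a flat list of OTLP-style spans into one tree per prompt.id:
// the user turn is the root, and every tool call, hook, and subagent
// span hangs off it via parentSpanId.
interface Span {
  spanId: string;
  parentSpanId?: string;
  name: string; // e.g. "llm_call", "tool_use", "subagent" (illustrative)
  attributes: Record<string, string | number>;
}

interface SpanNode extends Span {
  children: SpanNode[];
}

function treesByPrompt(spans: Span[]): Map<string, SpanNode[]> {
  const nodes = new Map<string, SpanNode>();
  for (const s of spans) nodes.set(s.spanId, { ...s, children: [] });

  const roots = new Map<string, SpanNode[]>(); // prompt.id -> root spans
  for (const node of nodes.values()) {
    const parent = node.parentSpanId ? nodes.get(node.parentSpanId) : undefined;
    if (parent) {
      parent.children.push(node); // belongs to another span in the same turn
    } else {
      const promptId = String(node.attributes["prompt.id"] ?? "unknown");
      roots.set(promptId, [...(roots.get(promptId) ?? []), node]);
    }
  }
  return roots;
}
```

An LLM-native backend renders each `Span` as its own row; the agent-native view renders each entry of `treesByPrompt` as one turn.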

Metric AI

We specialise where the generics generalise:

  • Subagent trees keyed by prompt.id. Render the parent/child span tree on one screen.
  • Cost attribution per developer, per repo, per prompt. Surface the top 1% of prompts burning your budget (see the sketch after this list).
  • Cache-hit ratios per developer. Coachable target, not a hidden ratio.
  • Tool decision and blocked_on_user rates. Human-in-the-loop friction, made visible.
  • Cloudflare-only data plane. EU / UK residency is a region flip.
  • No raw bodies stored by us. Forensics tier signs URLs into your own bucket.
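
For the cost-attribution and top-prompt bullets, a rollup like the sketch below is the underlying idea. Attribute keys other than prompt.id (developer, repo, cost.usd) are hypothetical stand-ins, not our actual schema.

```typescript
// Roll span-level cost up to developer, repo, and prompt totals,
// then keep the most expensive ~1% of prompts.
interface CostSpan {
  attributes: Record<string, string | number>;
}

function costRollup(spans: CostSpan[]) {
  const byDeveloper = new Map<string, number>();
  const byRepo = new Map<string, number>();
  const byPrompt = new Map<string, number>();

  const add = (m: Map<string, number>, key: string, cost: number) =>
    m.set(key, (m.get(key) ?? 0) + cost);

  for (const s of spans) {
    const cost = Number(s.attributes["cost.usd"] ?? 0); // hypothetical key
    add(byDeveloper, String(s.attributes["developer"] ?? "unknown"), cost);
    add(byRepo, String(s.attributes["repo"] ?? "unknown"), cost);
    add(byPrompt, String(s.attributes["prompt.id"] ?? "unknown"), cost);
  }

  // Sort prompts by total cost and keep the head: the "top 1%".
  const topPrompts = [...byPrompt.entries()]
    .sort((a, b) => b[1] - a[1])
    .slice(0, Math.max(1, Math.ceil(byPrompt.size / 100)));

  return { byDeveloper, byRepo, topPrompts };
}
```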

The trade-off is that we are not a general-purpose OTel backend. If you want to instrument your Postgres database alongside Claude Code, point that at Honeycomb; point Claude Code at us.

price comparison

For a 100-developer shop running Claude Code at moderate volume:

| Backend | Notes | Approx. monthly |
|---|---|---|
| Generic OTel (Datadog) | full retention, all spans | $4–10k |
| Generic OTel (Grafana Cloud) | basic Claude Code integration | $1.5–3k |
| LLM-native (Langfuse cloud) | per-event billing | $500–1.5k (no subagent trees) |
| Metric AI Metrics tier | $3 / dev / mo | $300 |
| Metric AI Traces tier | $8 / dev / mo | $800 |

Numbers are illustrative — every shop’s volume is different — but the ratio holds.

what we won’t beat

If you need Datadog’s RUM, log correlation, infra metrics, and APM in a single pane of glass, keep Datadog. We are not trying to replace it. We are the focused observability layer for the agent loop, not the SRE platform.