comparison
Metric AI vs. the generic OTel backends (Datadog / Honeycomb / Grafana) vs. the LLM-native tools (Langfuse / Helicone).
tl;dr
There are two existing categories of backend you could point Claude Code at. Generic OTel (Datadog, Honeycomb, Grafana) ingests the data fine but doesn’t model the agent loop, and the bill scales with span volume. LLM-native (Langfuse, Helicone, LangSmith) models the LLM call but not the agent: no subagent trees, no per-developer cost attribution, no per-repo rollups.
Metric AI is the third category: the cheap, opinionated, Claude-native backend.
generic OTel
| Vendor | Strengths | Weaknesses for Claude Code |
|---|---|---|
| Datadog | Mature, complete signal coverage, alerting, RUM | Pricing scales with custom-metric cardinality; a 200-dev shop on Claude Code can easily blow past a typical observability budget. No Claude-aware dashboards. |
| Honeycomb | Best-in-class trace exploration, BubbleUp | Same cardinality cost shape. Wide-event model is great for backend services, awkward for agent-loop spans. |
| Grafana Cloud | Ships an official Claude Code integration with basic dashboards. Best of the generics. | Still pays per active series. Subagent tree viewer is not included. EU residency requires the higher tier. |
| New Relic / Elastic / SigNoz | Solid OTel support, self-host options | Same shape as the above. Elastic published a Cowork-monitoring blog post but the backend is still generic. |
The common pattern: ingest fine, render generically, bill by span volume. None of them know what a prompt.id is supposed to mean.
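For context on what "point Claude Code at a backend" means mechanically: Claude Code exports standard OTLP, so switching backends is an environment-variable change. The variable names below follow the Claude Code telemetry docs; the endpoint and token values are placeholders, not a real backend.

```shell
# Turn on Claude Code telemetry and export over OTLP.
# Endpoint and bearer token below are placeholders.
export CLAUDE_CODE_ENABLE_TELEMETRY=1
export OTEL_METRICS_EXPORTER=otlp
export OTEL_LOGS_EXPORTER=otlp
export OTEL_EXPORTER_OTLP_PROTOCOL=grpc
export OTEL_EXPORTER_OTLP_ENDPOINT=https://otlp.example.com:4317
export OTEL_EXPORTER_OTLP_HEADERS="Authorization=Bearer <token>"
```

Any OTLP-speaking backend in the table above will accept this stream; the differences are in what they do with it once ingested.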
LLM-native
| Vendor | Strengths | Weaknesses for Claude Code |
|---|---|---|
| Langfuse | Open-source, popular for prompt experiments, has a Claude Agent SDK guide | Treats traces generically — single LLM call per row. No subagent tree, no per-developer cost attribution. |
| Helicone | Cheap proxy-based model, clean UI | Sits in the API path, not the OTLP path. Doesn’t see tool calls, hooks, or approval flows. |
| LangSmith / Phoenix / Traceloop / Braintrust / Weave | Each has a niche (evals, dataset curation, framework tracing) | All are LLM-API-call-shaped. Agent loops are second-class at best. |
These tools are great if your unit of analysis is “one LLM call.” Claude Code’s unit of analysis is a prompt.id — one user turn that fans out into dozens of tool calls, sub-agents, and hooks.
Metric AI
We specialise where the generics generalise:
- Subagent trees keyed by prompt.id. Render the parent/child span chain on one screen.
- Cost attribution per developer, per repo, per prompt. Surface the top 1% of prompts burning your budget.
- Cache-hit ratios per developer. A coachable target, not a hidden ratio.
- Tool-decision and blocked_on_user rates. Human-in-the-loop friction, made visible.
- Cloudflare-only data plane. EU / UK residency is a region flip.
- No raw bodies stored by us. The Forensics tier signs URLs into your own bucket.
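As a sketch of what per-developer cost attribution and cache-hit ratios mean mechanically, here is a minimal aggregation over span records. The field names (developer, cost_usd, cache_read_tokens, input_tokens) and the numbers are hypothetical illustrations, not Metric AI's actual schema:

```python
from collections import defaultdict

# Hypothetical span records, flattened from OTLP. Fields are
# illustrative only.
spans = [
    {"developer": "alice", "prompt_id": "p1", "cost_usd": 0.42,
     "cache_read_tokens": 9_000, "input_tokens": 10_000},
    {"developer": "alice", "prompt_id": "p2", "cost_usd": 1.10,
     "cache_read_tokens": 1_000, "input_tokens": 10_000},
    {"developer": "bob", "prompt_id": "p3", "cost_usd": 0.30,
     "cache_read_tokens": 6_000, "input_tokens": 8_000},
]

# Roll up cost and cache usage per developer.
cost = defaultdict(float)
cache_read = defaultdict(int)
total_in = defaultdict(int)
for s in spans:
    dev = s["developer"]
    cost[dev] += s["cost_usd"]
    cache_read[dev] += s["cache_read_tokens"]
    total_in[dev] += s["input_tokens"]

for dev in sorted(cost):
    ratio = cache_read[dev] / total_in[dev]
    print(f"{dev}: ${cost[dev]:.2f}, cache-hit {ratio:.0%}")
```

The same rollup keyed by repo or prompt.id instead of developer gives the per-repo and top-prompt views.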
The trade-off is that we are not a general-purpose OTel backend. If you also want to instrument your Postgres database, point that at Honeycomb and point Claude Code at us.
price comparison
For a 100-developer shop running Claude Code at moderate volume:
| Backend | Notes | Approx. monthly |
|---|---|---|
| Generic OTel (Datadog) | full retention, all spans | $4–10k |
| Generic OTel (Grafana Cloud) | basic Claude Code integration | $1.5–3k |
| LLM-native (Langfuse cloud) | per-event billing | $500–1.5k (no subagent trees) |
| Metric AI Metrics tier | $3 / dev / mo | $300 |
| Metric AI Traces tier | $8 / dev / mo | $800 |
Numbers are illustrative — every shop’s volume is different — but the ratio holds.
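The Metric AI rows are simple per-seat arithmetic, which is the point: cost is a function of headcount, not span volume. A quick sanity check of the 100-developer example, using the tier prices from the table above:

```python
devs = 100
metrics_price = 3  # $/dev/mo, Metrics tier
traces_price = 8   # $/dev/mo, Traces tier

print(devs * metrics_price)  # 300
print(devs * traces_price)   # 800

# Ratio vs. the low end of the Datadog estimate ($4k/mo):
print(round(4_000 / (devs * metrics_price), 1))  # 13.3
```

Even against the bottom of the Datadog range, the per-seat ratio is over 13x, and it does not grow with span volume.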
what we won’t beat
If you need Datadog’s RUM, log correlation, infra metrics, and APM in a single pane of glass: keep Datadog. We are not trying to replace it. We are the focused observability layer for the agent loop, not the SRE platform.