architecture
Cloudflare-only data path. Ingest worker, queue, consumer, cron, dashboard.
All three Anthropic-stack sources — Claude Code (per-dev CLI), Claude Agent SDK (subprocess traces, TRACEPARENT-propagated from the parent Claude Code session), and Claude Cowork (org-level OTLP from the admin portal) — feed the same /v1/{traces,metrics,logs} endpoint with the same shared org bearer. The consumer detects the source from service.name (or Cowork-specific markers) at ingest and tags every row with it; the dashboard segments each panel via the source chip.
[claude code · agent sdk · cowork]
│ OTLP/HTTP JSON or protobuf + Authorization: Bearer <org-token>
▼
otlp.metric-ai.nativekloud.com ── Worker: ingest (Hono)
│ validate token (SHA-256 → KV) → push raw to Queue
▼
[CF Queue: metric-ai-otlp-ingest] (DLQ: -dlq, max_retries=3)
│
▼
Worker: queue consumer
├─ detectSource() from resource attrs (claude_code | agent_sdk | cowork)
├─ decode traces (JSON or protobufjs/light) → thin span rows → D1.spans (with source)
├─ decode metrics → 1-min pre-agg → WAE.metric_ai_metrics (8 blobs incl. source)
└─ decode logs → event rows → D1.events; tool_decisions row when name matches
[CF Cron: every 15 min]
├─ WAE SQL (today + yesterday UTC, GROUP BY source) → upsert D1.rollups_metrics_daily
├─ D1 self-aggregation of tool_decisions → D1.rollups_decisions_daily
└─ evaluate alerts → POST webhook (Slack-compatible)
app.metric-ai.nativekloud.com ── Worker: /auth/* + /api/* + Wrangler [assets] SPA
├─ /auth/{request,verify,logout,me} (email-OTP via Resend, sessions in D1)
├─ /api/me, /api/rollups, /api/spans/tree
├─ /api/cost-trend, /api/cache-hit, /api/tool-decisions,
│ /api/active-users, /api/top-prompts, /api/subagent-stats
├─ /api/billing/{status,checkout,cancel,events} (Mollie)
└─ /api/admin/{tokens,members} (admin role required)
otlp.metric-ai.nativekloud.com/webhooks/mollie (no auth — re-fetches resource by id)
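The detectSource() step in the diagram could look roughly like the sketch below. The markers it matches on are hypothetical placeholders: the "cowork." attribute prefix, the "agent-sdk" substring in service.name, and the fallback to claude_code are all assumptions made for illustration, since the actual markers are not listed here.

```typescript
type Source = "claude_code" | "agent_sdk" | "cowork";

// Flattened OTLP resource attributes, e.g. { "service.name": "..." }.
type ResourceAttrs = Record<string, string | undefined>;

function detectSource(attrs: ResourceAttrs): Source {
  const serviceName = (attrs["service.name"] ?? "").toLowerCase();

  // ASSUMPTION: Cowork resources carry at least one attribute under a "cowork." prefix.
  const hasCoworkMarker = Object.keys(attrs).some((key) => key.startsWith("cowork."));
  if (hasCoworkMarker || serviceName.includes("cowork")) return "cowork";

  // ASSUMPTION: Agent SDK subprocesses report a service.name containing "agent-sdk".
  if (serviceName.includes("agent-sdk")) return "agent_sdk";

  // ASSUMPTION: everything else is treated as Claude Code.
  return "claude_code";
}
```

Checking the Cowork markers first means a Cowork payload is still tagged correctly even if its service.name looks like one of the other sources.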
ingest worker
A single Hono Worker behind otlp.metric-ai.nativekloud.com. It does the cheapest possible job: bearer validation against a KV-stored SHA-256 hash, then push to a Cloudflare Queue. No decoding, no DB writes, no analytics calls. Hot path stays under 10ms and the queue absorbs burst traffic during long agent sessions.
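A minimal sketch of that hot path, using in-memory stand-ins so it runs outside Cloudflare. The `tokens` map and `queue` array are hypothetical substitutes for the TOKENS KV namespace and the INGEST_QUEUE binding, `handleIngest` is an illustrative name, and Node's synchronous `createHash` is used here where a Worker would more likely call Web Crypto.

```typescript
import { createHash } from "node:crypto";

interface TokenRecord {
  orgId: string;
  userEmail: string;
}

interface QueuedPayload extends TokenRecord {
  contentType: string;
  rawBody: string; // pushed as-is; decoding happens in the consumer
}

// Hypothetical stand-ins for the KV namespace and the Cloudflare Queue.
const tokens = new Map<string, TokenRecord>();
const queue: QueuedPayload[] = [];

function sha256Hex(value: string): string {
  return createHash("sha256").update(value).digest("hex");
}

// Returns the HTTP status the ingest endpoint would answer with.
function handleIngest(
  authorization: string | undefined,
  contentType: string,
  rawBody: string,
): number {
  const match = /^Bearer\s+(.+)$/i.exec(authorization ?? "");
  if (!match) return 401;

  // Only the hash of the bearer is ever used as a lookup key.
  const record = tokens.get(sha256Hex(match[1]));
  if (!record) return 401;

  // No decoding, no DB writes: just hand the raw payload to the queue.
  queue.push({ orgId: record.orgId, userEmail: record.userEmail, contentType, rawBody });
  return 200;
}
```

Storing only the hash means a leaked KV dump does not expose usable bearer tokens.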
queue consumer
A second Worker bound to the queue. For each batch it detects the source per resource block, decodes OTLP/JSON or protobuf, splits by signal type, and fans out:
- Traces become thin rows in D1.spans: trace id, span id, parent, name, duration, source, plus a handful of indexed attributes (prompt.id, user.email, repo, model, tool_name, token columns). Cost is computed at read time from the token columns and the pricing table and is never persisted, so price changes never silently corrupt history.
- Metrics become Workers Analytics Engine writes, pre-aggregated to 1-minute buckets keyed by (org, user, repo, model, tool, status, metric_name, source).
- Logs become event rows in D1.events; events that match claude_code.tool_decision additionally land in D1.tool_decisions for fast approval-rate queries. Bodies are not stored.
A dead-letter queue catches malformed payloads after three retries.
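The 1-minute pre-aggregation of metrics can be sketched as a pure function. The `MetricPoint` and `Bucket` shapes and the `preAggregate` name are assumptions for illustration; only the eight key dimensions come from the description above.

```typescript
interface MetricPoint {
  timeUnixMs: number;
  value: number;
  org: string;
  user: string;
  repo: string;
  model: string;
  tool: string;
  status: string;
  metricName: string;
  source: string;
}

interface Bucket {
  minuteStartMs: number;
  blobs: string[]; // the 8 key dimensions, in a fixed order
  sum: number;
  count: number;
}

function preAggregate(points: MetricPoint[]): Bucket[] {
  const buckets = new Map<string, Bucket>();
  for (const p of points) {
    // Truncate the timestamp to the start of its minute.
    const minuteStartMs = Math.floor(p.timeUnixMs / 60_000) * 60_000;
    const blobs = [p.org, p.user, p.repo, p.model, p.tool, p.status, p.metricName, p.source];
    // JSON keeps the key unambiguous even if a dimension contains a separator.
    const key = JSON.stringify([minuteStartMs, ...blobs]);
    const existing = buckets.get(key);
    if (existing) {
      existing.sum += p.value;
      existing.count += 1;
    } else {
      buckets.set(key, { minuteStartMs, blobs, sum: p.value, count: 1 });
    }
  }
  return Array.from(buckets.values());
}
```

Each resulting bucket would then map to one Analytics Engine write, with `blobs` filling the eight blob slots.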
cron rollup + alerts worker
Every 15 minutes a single cron handler does three things in sequence:
- WAE → D1 metric rollup: SQL groups by (day, org, user, repo, model, source), sums tokens + cost + cache columns, upserts rollups_metrics_daily.
- D1 → D1 decision rollup: groups tool_decisions by (day, org, user, source), counts approved/denied, upserts rollups_decisions_daily.
- Alert evaluator: for each active alert past its cooldown, computes the metric over its window, posts a Slack-compatible webhook on threshold breach. SSRF-safe (RFC1918/loopback blocked, https-only).
Dashboard panels read from the rollup tables — never from WAE directly — so panel queries stay sub-100ms.
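The SSRF guard on alert webhooks can be illustrated with a small URL check. This is a sketch under stated assumptions: `isSafeWebhookUrl` is a hypothetical name, and it only inspects the literal hostname. It does not resolve DNS, so a public hostname that points at a private address would need a separate defense.

```typescript
function isSafeWebhookUrl(raw: string): boolean {
  let url: URL;
  try {
    url = new URL(raw);
  } catch {
    return false; // not a parseable absolute URL
  }

  // https-only
  if (url.protocol !== "https:") return false;

  const host = url.hostname.toLowerCase();

  // Loopback by name or IPv6 literal.
  if (host === "localhost" || host.endsWith(".localhost")) return false;
  if (host === "[::1]") return false;

  // IPv4 literals: block loopback and the RFC1918 private ranges.
  const ipv4 = /^(\d{1,3})\.(\d{1,3})\.(\d{1,3})\.(\d{1,3})$/.exec(host);
  if (ipv4) {
    const a = Number(ipv4[1]);
    const b = Number(ipv4[2]);
    if (a === 127) return false; // 127.0.0.0/8
    if (a === 10) return false; // 10.0.0.0/8
    if (a === 172 && b >= 16 && b <= 31) return false; // 172.16.0.0/12
    if (a === 192 && b === 168) return false; // 192.168.0.0/16
  }

  return true;
}
```

Running the check both when an alert is saved and again right before each POST would keep a URL that was edited later from slipping through.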
dashboard worker
app.metric-ai.nativekloud.com serves a React SPA from a Wrangler [assets] binding plus /auth/* + /api/* Hono routers. Email-OTP login (POST /auth/request → 6-digit code via Resend → POST /auth/verify → mai_sess HttpOnly Secure SameSite=Lax cookie scoped to .metric-ai.nativekloud.com, 14-day TTL). Every /api/* request resolves the user’s org once in middleware via the members table.
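For reference, the session cookie described above would serialize to a Set-Cookie value along these lines. `buildSessionCookie` is a hypothetical helper, and `Path=/` is an assumption; the cookie name, domain, flags, and 14-day TTL come from the description.

```typescript
// 14 days expressed in seconds (1,209,600).
const SESSION_TTL_SECONDS = 14 * 24 * 60 * 60;

function buildSessionCookie(sessionId: string): string {
  return [
    `mai_sess=${encodeURIComponent(sessionId)}`,
    "Domain=.metric-ai.nativekloud.com",
    "Path=/", // ASSUMPTION: cookie applies to the whole site
    `Max-Age=${SESSION_TTL_SECONDS}`,
    "HttpOnly",
    "Secure",
    "SameSite=Lax",
  ].join("; ");
}
```

The leading dot in the domain scopes the cookie to every subdomain, which includes both the app and otlp hosts.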
resources
| Resource | Binding | Purpose |
|---|---|---|
| D1 metric-ai-db | DB | spans, events, rollups, sessions, otp_challenges, alerts, subscriptions, billing_events, members, orgs, ingest_tokens, tool_decisions |
| KV TOKENS | TOKENS | SHA-256 bearer → {org_id, user_email} lookup |
| Queue metric-ai-otlp-ingest | INGEST_QUEUE | ingest → consumer fan-out |
| Analytics Engine metric_ai_metrics | WAE | 1-min pre-aggregated metrics, 8 blob slots |
| Static assets | ASSETS | dashboard SPA |
| Cron */15 * * * * | (none) | rollups + alerts |
why this shape
- Pre-aggregate at ingest. A 12-hour session is ~50k spans; we collapse it to ~200 WAE rows. Storage cost stays linear in dev-count, not span-count.
- D1 only for thin metadata. The subagent-tree viewer needs span lineage; we don’t need the full attribute bag.
- Daily roll-ups for dashboard reads. Panel queries hit a small table, not a span firehose.
- Cost computed at read, never stored. Pricing changes don’t silently corrupt history.
- Source as first-class column. Every aggregate honors the source filter — no “this view aggregates across sources” weasel-words.
- Cloudflare-only. No cross-cloud egress, no third-party SaaS in the hot path. EU residency is a region flip, not a re-architecture.
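To make the "cost computed at read" point concrete, here is a minimal sketch. The column names, the `example-model` entry, and every price in it are made-up placeholders, not real pricing or the actual schema.

```typescript
interface TokenRow {
  model: string;
  inputTokens: number;
  outputTokens: number;
  cacheReadTokens: number;
  cacheCreationTokens: number;
}

// USD per million tokens.
interface ModelPrice {
  input: number;
  output: number;
  cacheRead: number;
  cacheCreation: number;
}

// ASSUMPTION: placeholder pricing table; values are illustrative only.
const PRICING: Record<string, ModelPrice> = {
  "example-model": { input: 2, output: 10, cacheRead: 0.5, cacheCreation: 2.5 },
};

// Derives cost from stored token counts; returns null when the model has no price.
function costUsd(
  row: TokenRow,
  pricing: Record<string, ModelPrice> = PRICING,
): number | null {
  const price = pricing[row.model];
  if (!price) return null;
  const total =
    row.inputTokens * price.input +
    row.outputTokens * price.output +
    row.cacheReadTokens * price.cacheRead +
    row.cacheCreationTokens * price.cacheCreation;
  return total / 1_000_000;
}
```

Because only token counts are stored, swapping the pricing table changes every historical figure consistently the next time it is read, with no backfill.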