I've used Google Analytics since it was called Urchin — the web analytics software Google acquired in 2005. (I'm not sure how many people realize that UTM stands for Urchin Tracking Module.) Before the rebrand, before every marketing team on earth had a gtag.js snippet in their <head>. So when I finally ripped it out of this site — gone in one commit — I expected to feel something. Loss, maybe. Nostalgia for the old real-time dashboard with the live world map.
I felt relief.
GA4 had been dragging my Lighthouse score into the mid-80s. I even tried offloading it to a web worker with Partytown — limited improvement. The analytics tool was the biggest performance problem on the site. There's something deeply ironic about your measurement system being the thing that degrades the experience you're trying to measure.
But it wasn't just the performance tax. It was the realization that I'd been collecting data I never looked at, for an audience model that doesn't apply to a personal blog, using an interface redesigned for enterprise marketing teams who need attribution funnels and audience segments. I don't have funnels. I have essays.
I want a small, legible system that tells me what landed, what didn't, and what changed — without turning my site into spyware or slowing it down. This is a draft spec for that system.
#Problem framing
Most analytics tools optimize for one of two extremes:
- Marketing surveillance suites (powerful, heavy, identity-centric)
- Minimal privacy counters (lightweight, sometimes too shallow)
A personal blog needs a third thing:
- privacy-first,
- edge-native,
- cheap enough to run free,
- and queryable in natural language without outsourcing judgment.
The goal is not more dashboards. The goal is better questions.
#First principles
- Signal over exhaust. What posts are actually read? Which referrers produce meaningful attention?
- Trend shape over user identity. Momentum, decay curves, evergreen vs spike.
- No trust tax. No fingerprinting, no cross-site identifiers, no covert profiling.
- No performance tax. No Lighthouse regressions, no render-path blockers. The irony of GA failing this one still stings.
- No ops tax. Runs on Cloudflare's free tier for low/moderate traffic. Deploys via GitHub Actions on push. No servers to maintain, no containers to manage, no uptime to monitor.
- AI as interface, not authority. Conversational query layer over deterministic metrics.
#Prior art (and what to steal)
I looked at everything. The privacy-native tools — Plausible, Umami, GoatCounter, Ackee — get the lightweight collection right: small scripts, no cookies, respect for the reader. Plausible in particular would be the obvious choice if I didn't want to build my own. But I do. Matomo and PostHog are impressive and wildly overkill for a solo blog.
The interesting steals are from adjacent systems:
- OpenTelemetry: not the SDK (enormous) or the collector (a whole deployment), but the naming discipline. http.request.method, url.path, user_agent.original — standardized field names that make data portable without tribal knowledge. If the schema looks like OTel, any tool can query it later.
- Axiom.co: the ingest-first philosophy. Raw-ish event logs in a columnar format you can query ad hoc. This maps directly to the R2 monthly exports — NDJSON that DuckDB, BigQuery, or Axiom itself can ingest. If this system outgrows D1, Axiom is the natural escape hatch.
- Sentry: not the 180KB SDK, but the pattern. window.onerror + onunhandledrejection as a ~15-line error signal. Errors are analytics too — "this page throws a TypeError on Safari mobile" is useful data.
The synthesis: steal from the lightweight tools for collection posture, from the heavy tools for naming conventions and philosophy. Skip everyone's infrastructure.
#Architecture judgment
The system should reflect how I build things, not just what it does.
- TypeScript, Bun. Type safety everywhere. If a function can return the wrong shape, it will — at 3am, in production.
- Pure functions and composition over classes. classifyReferrer(host) → Source is a function. ReferrerClassifier is a class that exists so someone can write a constructor. Prefer the function.
- Never write directly to the database from the hot path. Every event goes through the DO aggregator first. The DO is the write buffer. D1 only sees batched, coalesced flushes — never raw request-time writes. If the flush fails, the DO retries. If D1 is contended, the DO backs off. The database is never the thing that breaks under load.
- Backpressure is not optional. If ingestion outpaces aggregation, the DO queues. If aggregation outpaces D1, the DO backs off exponentially. If everything's on fire, governor mode kicks in and sheds load gracefully. No silent data loss. No "it seemed fine in dev."
- Modularity through boundaries, not abstractions. Ingestion, aggregation, storage, query, and AI are separate Workers/DOs with typed interfaces between them. Not a monolith with "clean architecture" folders. Actual deployment boundaries.
- Idempotent everything. Every flush has a batch ID. Every write is safe to retry. The system should be correct after a crash, not just during normal operation.
- Append-only where it matters, mutable where it serves. Raw events are append-only at the ingestion boundary — never mutate what the client sent. Aggregates are mutable by design (upserts that coalesce counters). R2 exports are immutable — write-once NDJSON, never overwritten. Three different postures for three different concerns: integrity at the edge, efficiency in the middle, portability at the exit.
#Assumptions
- Cloudflare Workers + D1 + Durable Objects are available
- JavaScript may be blocked — baseline still works server-side for pageviews where possible
- Coarse-grained analytics (daily/hourly aggregates) are enough
- No individual user/session re-identification. Period.
- Reading quality and trend direction matter more than ad attribution precision
- Script budget: ≤1.5KB gzipped (aspirational), hard cap ≤2KB
#Requirements
Because this is a design exercise as much as a product spec, I want to be explicit about what this system must do versus how it must feel doing it. These aren't aspirational — they're the acceptance criteria I'll hold myself to.
#Functional requirements
These are the capabilities. What the system does when everything's working.
| ID | Requirement | Acceptance |
|---|---|---|
| FR-1 | Pageview collection — capture every non-bot page load with path, referrer, and timestamp | Ingestion Worker returns 204; fact table increments within flush interval |
| FR-2 | Engagement scoring — composite signal from scroll depth, dwell time, and outbound clicks | engaged event fires once per pageview when any threshold is met |
| FR-3 | Scroll milestone tracking — quartile depth markers (25/50/75/100%) per article | Four IntersectionObserver instances; each fires once and disconnects |
| FR-4 | Outbound click tracking — capture external link clicks from article body | outbound event with href and anchor text in meta |
| FR-5 | Referrer classification — normalize raw referrers into a bounded enum (≤15 sources) | classifyReferrer() maps hostnames; unknown → "other" |
| FR-6 | Traffic classification — categorize requests as human/bot/AI crawler | 5-class system using cf.botManagement.score + UA pattern matching |
| FR-7 | Core Web Vitals — LCP, INP, CLS at p75 per path per device class | web-vitals library → custom events → DDSketch aggregation in DO |
| FR-8 | JS error capture — window.onerror and onunhandledrejection as lightweight error signal | Error count per path per day; filename-only source (no full URLs) |
| FR-9 | Ranked queries — top posts, top referrers, period comparison, timeseries | Typed API endpoints returning deterministic results |
| FR-10 | AI query layer — natural language questions answered with cited evidence | POST /api/ask → planner → ≤8 tool calls → evidence-backed answer |
| FR-11 | MCP tools — agent-queryable analytics surface | 6 typed tools matching the query API |
| FR-12 | Snapshot page — glanceable /analytics dashboard (auth-gated) | 4 cards: 7-day pageviews, top posts, top referrers, week-over-week delta |
| FR-13 | Data export — monthly NDJSON snapshots to R2 | OTel-aligned field names; importable by DuckDB/BigQuery/Axiom |
#Nonfunctional requirements
These are the constraints. How the system behaves under pressure, at the edges, and over time — the stuff that determines whether you actually trust it.
| ID | Requirement | Target | Enforcement |
|---|---|---|---|
| NFR-1 | Client weight | ≤2KB gzipped (hard cap) | CI size gate fails the build |
| NFR-2 | Lighthouse impact | Zero measurable regression in LCP/INP/CLS p75 | No synchronous XHR, no DOM writes, no layout-affecting elements |
| NFR-3 | Privacy | No cookies, no localStorage IDs, no fingerprinting, no raw IP storage | Enforced at ingestion — geo derived server-side, then discarded |
| NFR-4 | Write integrity | Every flush is idempotent and transactional | flush_batch table; duplicate batch_id → skip |
| NFR-5 | Backpressure | Graceful degradation under load (governor mode) | 4-step ladder: disable custom events → drop breakdowns → sample → daily-only |
| NFR-6 | Cardinality bounds | No unbounded dimension growth | Hard limits: 10K paths, 1K referrers, 10 meta keys per event |
| NFR-7 | Storage cost | Proportional to content count, not traffic volume | rrdtool-inspired rollup: hourly → daily → core; Cron-enforced retention |
| NFR-8 | Free-tier operation | Runs entirely on Cloudflare's free plan at low/moderate traffic | Aggregate-first writes, bounded dimensions, query caching |
| NFR-9 | Validation | Every boundary has a runtime schema | Zod at ingestion, DO, and query layer; types derived from validators |
| NFR-10 | Portability | Data is always exportable in a standard format | Monthly R2 NDJSON with OTel-aligned naming |
| NFR-11 | Ops overhead | Zero servers to maintain | Deploys via wrangler in CI; no containers, no uptime monitoring |
| NFR-12 | Recovery | Correct after crash, not just during normal operation | Idempotent flushes, forward-only migrations, append-only raw events |
A few things to notice about this split: the FRs are roughly ordered by collection → query → export (data flows downhill). The NFRs are roughly ordered by what your reader cares about (weight, speed, privacy) → what your system cares about (integrity, pressure, bounds) → what future you cares about (portability, ops, recovery). That ordering is intentional — it mirrors who screams first when something breaks.
#Non-goals (v1)
- Session replay, heatmaps
- Fingerprinting
- Cross-device identity stitching
- Real-time per-user journey visualization
- Freeform SQL generated by the LLM
- Return visitor tracking — can't do it honestly without cookies or fingerprinting, and the proxy signals (growing direct traffic, RSS subscribers, stable engagement rates across posts) already tell you whether you have an audience. A post with 65% completion doesn't care whether the reader was new or returning
#RFC: AI-Native Analytics for Blogs (v0.1)
#1) High-level architecture
graph TD
subgraph Client
A["a.js (≤2KB)"]
end
subgraph Ingestion
B[Worker: /a/collect]
end
subgraph Aggregation
C[Durable Object] --> D[(D1)]
D -.->|monthly snapshot| E[(R2 exports)]
end
subgraph Query
F[Worker: /api/metrics]
G[MCP Tools]
H["analytics snapshot"]
end
A -->|sendBeacon| B
B -->|validate + normalize| C
D --> F & G & H
Ingestion plane
- Worker endpoint: POST /a/collect
- Accepts compact events (pv, engaged, outbound, optional custom)
- Validates payload via Zod, normalizes dimensions, forwards to per-site aggregator DO
Aggregation plane
- Durable Object per site_id
- Maintains short-lived counters in memory
- Flushes batched aggregates to D1 (interval and size thresholds)
- rrdtool-inspired: coalesce first, persist later
Storage plane
- D1 stores aggregate facts + compact dimensions
- Fixed-size storage inspired by rrdtool/Graphite whisper (see below)
- Optional: R2 monthly snapshots / exports
- Optional: KV for query cache
Query plane
- Worker endpoint: GET /api/metrics/*
- Serves typed metric queries from D1
- Caches common dashboard reads
AI plane
- POST /api/ask with constrained query planner
- MCP server exposes typed analytics tools
- LLM summarizes deterministic outputs only
#2) Collection protocol
Event envelope
export type EventKind = "pv" | "engaged" | "outbound" | "milestone" | "error" | "vital" | "custom";
export interface CollectEvent {
kind: EventKind;
ts: number; // epoch ms
path: string; // normalized pathname
ref?: string; // raw referrer (optional)
title?: string; // optional page title hashable
utm?: {
source?: string;
medium?: string;
campaign?: string;
};
meta?: Record<string, string | number | boolean | null>;
}
export interface CollectRequest {
site: string; // public site key
v: 1; // protocol version
events: CollectEvent[];
}
Default events (fire automatically)
The client script emits these without any configuration:
| Event | When | Data |
|---|---|---|
| pv | DOMContentLoaded | path, ref, title |
| engaged | Composite trigger (scroll/dwell/click) | maxDepth (0-100) |
| outbound | Click on an <a> with an external href | path, meta.href, meta.text |
| milestone | Each scroll quartile crossed | meta.depth (25/50/75/100) |
| error | window.onerror / onunhandledrejection | meta.message, meta.source, meta.line, meta.col |
| vital | web-vitals callback | meta.vital (lcp/inp/cls), meta.value |
JS error capture (Sentry-lite)
Not a full error monitoring product. Just a signal: "something broke on this page."
window.onerror = (message, source, line, col) => {
queue({
kind: "error" as EventKind,
ts: Date.now(),
path: location.pathname,
meta: {
message: String(message).slice(0, 256),
source: source?.split("/").pop() ?? "unknown", // filename only, no full URLs
line: line ?? 0,
col: col ?? 0,
},
});
};
window.onunhandledrejection = (event) => {
queue({
kind: "error" as EventKind,
ts: Date.now(),
path: location.pathname,
meta: {
message: String(event.reason).slice(0, 256),
source: "unhandled_rejection",
},
});
};
What this answers:
- "Is my site broken on Safari mobile?" — check error rate by device class
- "Did my last deploy introduce a regression?" — compare error rates across dates
- "Is this a real problem or a browser extension?" —
sourcefilename distinguishes your code from injected scripts
What this doesn't do: stack traces, source maps, breadcrumbs, user replay. For that, use Sentry. For a blog, error rate is the signal. Error detail is a debugging tool you reach for when the rate spikes.
Stored as custom events, aggregated into a simple error counter on fact_daily:
ALTER TABLE fact_daily ADD COLUMN error_count INTEGER NOT NULL DEFAULT 0;
Custom events (custom) require explicit instrumentation:
// Only if someone wants to track something specific
window.__a?.track("signup_click", { cta: "header" });
The API surface is one function. If window.__a doesn't exist (script blocked, not loaded yet), the call silently no-ops.
OTel-inspired semantic naming
We don't use the OpenTelemetry SDK. But we steal its naming discipline. When data leaves this system — via R2 export, MCP tool output, or the ask endpoint — field names should be recognizable to anyone who's worked with observability tooling.
| Our field | OTel semantic convention | Why it matters |
|---|---|---|
| path | url.path | Portable to any analytics tool |
| ref | http.request.header.referer | Standard, not invented |
| device_class | user_agent.device.type | DuckDB/BigQuery can join on this |
| country_code | client.geo.country_iso_code | ISO 3166-1 alpha-2, universally understood |
| traffic_class | http.request.bot_score_class | Our addition - no OTel equivalent yet |
The R2 monthly exports use OTel-aligned field names even if the internal D1 schema uses shorter column names. Export is the portability boundary — that's where naming discipline pays off.
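A sketch of that boundary: renaming internal columns to the OTel-aligned names from the table above at export time. The InternalRow shape and function names are assumptions about the internal schema, not part of the spec:

```typescript
// Hypothetical internal D1 row (short column names stay internal).
interface InternalRow {
  path: string;
  ref: string;
  device_class: string;
  country_code: string;
  traffic_class: string;
  pageviews: number;
}

// Rename to OTel semantic-convention field names at the export boundary.
function toExportRecord(row: InternalRow): Record<string, string | number> {
  return {
    "url.path": row.path,
    "http.request.header.referer": row.ref,
    "user_agent.device.type": row.device_class,
    "client.geo.country_iso_code": row.country_code,
    "http.request.bot_score_class": row.traffic_class, // our addition, no OTel equivalent
    pageviews: row.pageviews,
  };
}

// One NDJSON line per fact row, ready for the monthly R2 object.
const ndjsonLine = (row: InternalRow): string => JSON.stringify(toExportRecord(row));
```

The rename lives in exactly one function, so the internal schema can stay terse while everything that leaves the system speaks the standard vocabulary.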
Lightweight RUM (Real User Monitoring)
Core Web Vitals are too useful to skip, and too cheap to collect to justify skipping. Google's web-vitals library is ~1.5KB and gives you LCP, INP, and CLS — the three metrics that determine whether your analytics script is harming the site it's measuring.
import { onLCP, onINP, onCLS } from "web-vitals"; // tree-shaken
onLCP((metric) => queue({ kind: "custom", ts: Date.now(), path, meta: { vital: "lcp", value: metric.value } }));
onINP((metric) => queue({ kind: "custom", ts: Date.now(), path, meta: { vital: "inp", value: metric.value } }));
onCLS((metric) => queue({ kind: "custom", ts: Date.now(), path, meta: { vital: "cls", value: metric.value } }));
This isn't a full RUM product. It's a canary:
- "Did my last deploy regress LCP?" — check the trend
- "Is the analytics script itself causing CLS?" — it better not be
- "What's my real INP on mobile?" — useful for any interactive elements
Stored as custom events, aggregated into a fact_daily_vitals table:
CREATE TABLE IF NOT EXISTS fact_daily_vitals (
site_id TEXT NOT NULL,
date_utc TEXT NOT NULL,
path_id INTEGER NOT NULL,
device_class TEXT NOT NULL DEFAULT 'unknown',
lcp_p75 REAL, -- milliseconds
inp_p75 REAL, -- milliseconds
cls_p75 REAL, -- unitless score
sample_count INTEGER NOT NULL DEFAULT 0,
PRIMARY KEY (site_id, date_utc, path_id, device_class),
FOREIGN KEY(site_id) REFERENCES site(id),
FOREIGN KEY(path_id) REFERENCES dim_path(id)
);
p75 aggregation happens in the DO — not by storing raw values, but by maintaining a DDSketch or simpler quantile estimator in memory and flushing the percentile. Fixed memory regardless of sample count.
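As a sketch of the "simpler quantile estimator" option (not DDSketch): a fixed-size reservoir sampler. Memory stays constant no matter how many vitals arrive, and the p75 is computed from the reservoir at flush time. Class and capacity are illustrative, not part of the spec.

```typescript
// Sketch: fixed-memory quantile estimation for the DO aggregator.
// Accuracy degrades gracefully at high volume; memory never grows.
class ReservoirQuantile {
  private samples: number[] = [];
  private seen = 0;
  constructor(private capacity = 256) {}

  add(value: number): void {
    this.seen++;
    if (this.samples.length < this.capacity) {
      this.samples.push(value);
    } else {
      // classic reservoir sampling: keep with probability capacity/seen
      const i = Math.floor(Math.random() * this.seen);
      if (i < this.capacity) this.samples[i] = value;
    }
  }

  // q in [0, 1]; returns null if no samples were recorded
  quantile(q: number): number | null {
    if (this.samples.length === 0) return null;
    const sorted = [...this.samples].sort((a, b) => a - b);
    const idx = Math.min(sorted.length - 1, Math.floor(q * sorted.length));
    return sorted[idx];
  }
}
```

At flush, the DO would call quantile(0.75) per (path, device_class) key and write that single number into fact_daily_vitals.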
Budget impact: web-vitals is ~1.5KB gzipped. Combined with the core client (~500 bytes), total stays under the 2KB hard cap.
Protocol compatibility:
- server currently accepts v=1
- unknown fields are ignored (forward-compatible reads)
- future protocol bumps should be explicit (v=2 handling path + migration notes)
Engagement definition
An engaged event is a composite boolean, true when any of these conditions hold:
- Reader crosses 60% article depth (single IntersectionObserver marker)
- Reader keeps the tab visible for >= 20s on an article route
- Reader clicks an outbound link from the article body
Why not only time-on-page? It lies. A tab left open in the background inflates engagement. Why not only scroll depth? It can miss accessibility and non-scroll reading patterns. Composite engagement is still simple, but less brittle.
This signal also helps bot filtering, but it is not the only bot defense.
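The composite definition above reduces to a pure predicate, which keeps it testable (per the "pure functions over classes" principle). Field names here are illustrative:

```typescript
// Signals accumulated by the client during one page visit.
interface EngagementSignals {
  maxDepthPct: number;   // deepest scroll milestone reached (0-100)
  visibleMs: number;     // dwell time accumulated while the tab was visible
  outboundClicks: number;
}

// True when any of the three conditions in the list above holds.
// The client fires the `engaged` event once, on the first true result.
function isEngaged(s: EngagementSignals): boolean {
  return s.maxDepthPct >= 60 || s.visibleMs >= 20_000 || s.outboundClicks > 0;
}
```

Note that visibleMs must only accrue while document.visibilityState is "visible", which is exactly what keeps the background-tab case from inflating engagement.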
Scroll milestones
Beyond the binary engagement signal, we track how far readers get. Four IntersectionObserver instances on invisible markers at article quartiles (25%, 50%, 75%, 100%). Each fires exactly once and disconnects.
const MILESTONES = [25, 50, 75, 100] as const;
type ScrollMilestone = typeof MILESTONES[number];
function observeScrollDepth(article: HTMLElement) {
const height = article.offsetHeight;
let maxDepth = 0;
for (const pct of MILESTONES) {
const marker = document.createElement("div");
marker.style.cssText =
"height:1px;width:1px;position:absolute;pointer-events:none;";
marker.style.top = `${(pct / 100) * height}px`;
article.style.position ||= "relative";
article.appendChild(marker);
const observer = new IntersectionObserver(
([entry]) => {
if (entry.isIntersecting) {
maxDepth = Math.max(maxDepth, pct);
observer.disconnect();
}
},
{ threshold: 0 },
);
observer.observe(marker);
}
return () => maxDepth;
}
Why IntersectionObserver instead of scroll listeners:
- Zero main-thread work during scroll. Runs on the compositor thread — can't cause jank.
- No throttle/debounce decision to get wrong. The browser tells you when the marker enters the viewport.
- No getBoundingClientRect() calls. No forced reflow, no layout thrashing.
- Four callbacks total for the entire page visit. Then silence.
What this unlocks for a writer:
- "My analytics RFC has 90% reach-25 but only 30% reach-75 — people bail at the schema section"
- "The pervasive AI essay has 65% completion rate — the long version holds better than expected"
- "Posts under 1,500 words: 80% completion. Over 2,500: 45%. But the 45% that finish have 3x the outbound click rate"
The engagement event becomes: { kind: "engaged", maxDepth: 75, ... } instead of just a boolean.
Add max_depth to the daily fact:
ALTER TABLE fact_daily ADD COLUMN avg_max_depth REAL; -- 0-100
ALTER TABLE fact_daily ADD COLUMN completions INTEGER NOT NULL DEFAULT 0; -- reached 100%
Transport
let pending: CollectEvent[] = [];
let flushTimer: number | null = null;
function queue(event: CollectEvent) {
pending.push(event);
if (!flushTimer) {
flushTimer = setTimeout(flush, 2000); // 2s debounce
}
if (pending.length >= 10) flush(); // size cap
}
function flush() {
if (!pending.length) return;
navigator.sendBeacon(
"/a/collect",
JSON.stringify({ site: SITE_KEY, v: 1, events: pending }),
);
pending = [];
if (flushTimer) clearTimeout(flushTimer);
flushTimer = null;
}
document.addEventListener("visibilitychange", () => {
if (document.visibilityState === "hidden") flush();
});
- sendBeacon on visibilitychange with fetch({ keepalive: true }) fallback
- 2-second debounce to batch rapid events (pageview + milestone + engagement)
- Hard cap at 10 events per batch
- Never blocks navigation
Why no Service Worker
A Service Worker could queue failed beacons offline and retry via Background Sync. Sounds responsible. Not worth it.
- sendBeacon already survives page navigation — the only failure case is total network loss, and losing one pageview during a subway tunnel is not a data integrity problem for a blog
- Service Workers add a whole lifecycle (install, activate, update, cache versioning) for a client that's supposed to be <2KB and understandable in 60 seconds
- Service Workers can't call sendBeacon — you'd reimplement the same reliability with fetch, wrapping a reliable API in a more complex one for the same result
- The analytics script is <2KB with Cache-Control: immutable — the browser's HTTP cache handles this without a SW cache layer
- Cloudflare Workers already are the server-side "service worker" — intelligence on both ends creates two places to debug the same problem
The entire client architecture:
a.js (< 2KB gzipped)
├── IntersectionObserver × 4 (scroll milestones)
├── event queue (plain array)
├── debounced flush (2s / 10 events)
├── sendBeacon on visibilitychange
└── that's it
No Service Worker. No IndexedDB. No Background Sync. No scroll listeners. Understandable in 60 seconds.
#3) Privacy model
- No cookies by default
- No localStorage identifiers
- No fingerprinting
- No raw IP storage (derive coarse geo server-side then discard)
- Configurable retention by table
- Optional per-site "strict mode" that disables custom events entirely
Collection integrity (anti-poisoning):
- signed payload mode for first-party script (HMAC(body + ts))
- replay window enforcement (reject stale timestamps)
- optional origin allowlist for collection endpoint
#4) Data model (D1)
Schema strategy is intentionally split:
- v1: optimize for low cardinality and cheap queries (one row per site, date, path)
- v2: add breakdown tables for referrer/country/device when needed
-- Sites
CREATE TABLE IF NOT EXISTS site (
id TEXT PRIMARY KEY,
slug TEXT NOT NULL UNIQUE,
created_at INTEGER NOT NULL
);
-- Normalized dimensions
CREATE TABLE IF NOT EXISTS dim_path (
id INTEGER PRIMARY KEY AUTOINCREMENT,
site_id TEXT NOT NULL,
path TEXT NOT NULL,
path_hash TEXT NOT NULL,
UNIQUE(site_id, path_hash),
FOREIGN KEY(site_id) REFERENCES site(id)
);
CREATE TABLE IF NOT EXISTS dim_referrer (
id INTEGER PRIMARY KEY AUTOINCREMENT,
site_id TEXT NOT NULL,
source TEXT NOT NULL, -- normalized enum (see below)
raw_host TEXT, -- optional debug host
referrer_query TEXT, -- AI search query context (Phase 2, nullable)
UNIQUE(site_id, source, raw_host),
FOREIGN KEY(site_id) REFERENCES site(id)
);
-- Referrer source enum (≤15 values, normalize at ingestion):
-- direct | google | bing | duckduckgo | x | hn | reddit | linkedin
-- mastodon | bluesky | newsletter | rss | github | chatgpt | perplexity | other
--
-- Normalization: classifyReferrer(host) maps hostnames to sources.
-- e.g. t.co, twitter.com, x.com → "x"
-- e.g. news.ycombinator.com → "hn"
-- e.g. chatgpt.com, chat.openai.com → "chatgpt"
-- e.g. perplexity.ai → "perplexity"
-- Unknown hosts → "other".
-- raw_host is retained only for bounded debugging/reporting, not unbounded growth.
--
-- AI search referrer enrichment (Phase 2):
-- Some AI search engines pass query context in URL params (?q=, ?query=, etc.)
-- When referrer is an AI search source AND query params are present,
-- extract and store as referrer_query in dim_referrer.
-- This answers "why did someone land here?" - not just "where from?"
-- Bounded: only captured for AI search sources, not all referrers.
-- v1 core daily fact (low cardinality)
CREATE TABLE IF NOT EXISTS fact_daily (
site_id TEXT NOT NULL,
date_utc TEXT NOT NULL, -- YYYY-MM-DD
path_id INTEGER NOT NULL,
pageviews INTEGER NOT NULL DEFAULT 0,
engaged_visits INTEGER NOT NULL DEFAULT 0,
outbound_clicks INTEGER NOT NULL DEFAULT 0,
custom_events INTEGER NOT NULL DEFAULT 0,
PRIMARY KEY (site_id, date_utc, path_id),
FOREIGN KEY(site_id) REFERENCES site(id),
FOREIGN KEY(path_id) REFERENCES dim_path(id)
);
-- v2 optional breakdown tables
CREATE TABLE IF NOT EXISTS fact_daily_referrer (
site_id TEXT NOT NULL,
date_utc TEXT NOT NULL,
path_id INTEGER NOT NULL,
referrer_id INTEGER NOT NULL,
pageviews INTEGER NOT NULL DEFAULT 0,
engaged_visits INTEGER NOT NULL DEFAULT 0,
PRIMARY KEY (site_id, date_utc, path_id, referrer_id),
FOREIGN KEY(site_id) REFERENCES site(id),
FOREIGN KEY(path_id) REFERENCES dim_path(id),
FOREIGN KEY(referrer_id) REFERENCES dim_referrer(id)
);
CREATE TABLE IF NOT EXISTS fact_daily_country (
site_id TEXT NOT NULL,
date_utc TEXT NOT NULL,
path_id INTEGER NOT NULL,
country_code TEXT NOT NULL DEFAULT 'XX',
pageviews INTEGER NOT NULL DEFAULT 0,
engaged_visits INTEGER NOT NULL DEFAULT 0,
PRIMARY KEY (site_id, date_utc, path_id, country_code),
FOREIGN KEY(site_id) REFERENCES site(id),
FOREIGN KEY(path_id) REFERENCES dim_path(id)
);
CREATE TABLE IF NOT EXISTS fact_daily_device (
site_id TEXT NOT NULL,
date_utc TEXT NOT NULL,
path_id INTEGER NOT NULL,
device_class TEXT NOT NULL DEFAULT 'unknown', -- mobile|desktop|tablet|unknown
pageviews INTEGER NOT NULL DEFAULT 0,
engaged_visits INTEGER NOT NULL DEFAULT 0,
PRIMARY KEY (site_id, date_utc, path_id, device_class),
FOREIGN KEY(site_id) REFERENCES site(id),
FOREIGN KEY(path_id) REFERENCES dim_path(id)
);
-- Hourly optional short window (e.g. last 7-14 days)
CREATE TABLE IF NOT EXISTS fact_hourly (
site_id TEXT NOT NULL,
hour_utc TEXT NOT NULL, -- YYYY-MM-DDTHH
path_id INTEGER NOT NULL,
pageviews INTEGER NOT NULL DEFAULT 0,
engaged_visits INTEGER NOT NULL DEFAULT 0,
PRIMARY KEY (site_id, hour_utc, path_id),
FOREIGN KEY(site_id) REFERENCES site(id),
FOREIGN KEY(path_id) REFERENCES dim_path(id)
);
-- Flush idempotency
CREATE TABLE IF NOT EXISTS flush_batch (
batch_id TEXT PRIMARY KEY,
site_id TEXT NOT NULL,
created_at INTEGER NOT NULL,
FOREIGN KEY(site_id) REFERENCES site(id)
);
CREATE INDEX IF NOT EXISTS idx_fact_daily_site_date ON fact_daily(site_id, date_utc);
CREATE INDEX IF NOT EXISTS idx_fact_daily_path ON fact_daily(site_id, path_id, date_utc);
CREATE INDEX IF NOT EXISTS idx_fact_hourly_site_hour ON fact_hourly(site_id, hour_utc);
Idempotency rule: every aggregator flush includes a batch_id; writes occur in one transaction that first inserts into flush_batch. If batch_id already exists, skip (already applied).
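The classifyReferrer(host) normalization described in the dim_referrer comments above can be sketched as a plain lookup plus a couple of suffix rules. The host map here is deliberately partial; only the examples from the schema comments are filled in:

```typescript
// Sketch of referrer normalization: bounded enum, unknown hosts
// collapse to "other". HOST_MAP entries beyond the schema-comment
// examples are elided.
type Source =
  | "direct" | "google" | "bing" | "duckduckgo" | "x" | "hn" | "reddit"
  | "linkedin" | "mastodon" | "bluesky" | "newsletter" | "rss" | "github"
  | "chatgpt" | "perplexity" | "other";

const HOST_MAP: Record<string, Source> = {
  "t.co": "x",
  "twitter.com": "x",
  "x.com": "x",
  "news.ycombinator.com": "hn",
  "chatgpt.com": "chatgpt",
  "chat.openai.com": "chatgpt",
  "perplexity.ai": "perplexity",
  "github.com": "github",
  // ... remaining hosts elided
};

function classifyReferrer(host: string | null): Source {
  if (!host) return "direct"; // no referrer header at all
  const h = host.toLowerCase().replace(/^www\./, "");
  if (HOST_MAP[h]) return HOST_MAP[h];
  // country/subdomain variants of a known engine
  if (h === "google.com" || h.endsWith(".google.com")) return "google";
  return "other"; // the cardinality bound: everything else is one bucket
}
```

Because the return type is a closed enum, the dim_referrer table can never grow past the bounded source list no matter what referrers arrive.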
#5) Durable Object aggregator contract
export interface AggregateKey {
siteId: string;
dateUtc: string;
hourUtc?: string;
pathKey: string;
refKey?: string;
countryCode?: string;
deviceClass?: "mobile" | "desktop" | "tablet" | "unknown";
}
export interface AggregateDelta {
pageviews?: number;
engagedVisits?: number;
outboundClicks?: number;
customEvents?: number;
}
export interface AggregatorMessage {
key: AggregateKey;
delta: AggregateDelta;
}
export interface FlushPolicy {
maxBuffered: number; // e.g. 1_000
flushEveryMs: number; // e.g. 10_000
}Behavior:
- batch id required per flush (idempotent apply via flush_batch)
- coalesce in memory
- periodic/alarm-based flush to D1
- backpressure mode: if D1 contention, exponential backoff + queue in DO
- flush is transactional: register batch -> upsert facts -> commit
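The "register batch -> upsert facts -> commit" sequence can be sketched as a pure statement builder; the DO would hand this list to D1's batch(), which applies statements atomically, after first checking flush_batch for a duplicate batch_id. The SqlStmt shape and function name are illustrative:

```typescript
interface SqlStmt {
  sql: string;
  params: Array<string | number>;
}

// Build the atomic statement list for one flush: the flush_batch
// registration goes first, then one upsert per coalesced fact row.
// If the batch_id already exists, the caller skips the whole list.
function buildFlushBatch(
  siteId: string,
  batchId: string,
  rows: Array<{ dateUtc: string; pathId: number; pageviews: number }>,
  now: number,
): SqlStmt[] {
  return [
    {
      sql: "INSERT INTO flush_batch (batch_id, site_id, created_at) VALUES (?, ?, ?)",
      params: [batchId, siteId, now],
    },
    ...rows.map((r) => ({
      sql:
        "INSERT INTO fact_daily (site_id, date_utc, path_id, pageviews) VALUES (?, ?, ?, ?) " +
        "ON CONFLICT(site_id, date_utc, path_id) " +
        "DO UPDATE SET pageviews = pageviews + excluded.pageviews",
      params: [siteId, r.dateUtc, r.pathId, r.pageviews],
    })),
  ];
}
```

Keeping the builder pure means the idempotency and ordering logic is unit-testable without a database; only the thin batch() call touches D1.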
#6) Query API interfaces
export interface Period {
from: string; // YYYY-MM-DD
to: string; // YYYY-MM-DD
}
export interface TopPost {
path: string;
pageviews: number;
engagedVisits: number;
engagementRate: number;
}
export interface TopPostsResponse {
period: Period;
rows: TopPost[];
}
export interface ReferrerRow {
source: string;
pageviews: number;
engagedVisits: number;
}
export interface ComparePeriodsResponse {
current: {
pageviews: number;
engagedVisits: number;
};
previous: {
pageviews: number;
engagedVisits: number;
};
delta: {
pageviewsPct: number;
engagedVisitsPct: number;
};
}
Endpoints:
- GET /api/metrics/top-posts?from=...&to=...&limit=...
- GET /api/metrics/referrers?from=...&to=...
- GET /api/metrics/compare?from=...&to=...&prevFrom=...&prevTo=...
- GET /api/metrics/timeseries?path=...&granularity=day|hour
API constraints (hard limits):
- max query range: 365 days
- max limit for ranked endpoints: 200
- hourly granularity range cap: 14 days
- per-IP/token rate limiting on analytics read and ask endpoints
- cache-by-shape with TTL for repeated queries
#7) AI query layer
The AI calls typed metric functions. It does not write SQL.
Deterministic analysis functions
export interface AnalyticsService {
topPosts(period: Period, limit: number): Promise<TopPostsResponse>;
referrerBreakdown(period: Period): Promise<ReferrerRow[]>;
comparePeriods(current: Period, previous: Period): Promise<ComparePeriodsResponse>;
postTrend(path: string, period: Period): Promise<Array<{ date: string; pageviews: number }>>;
evergreenScore(path: string, trailingDays: number): Promise<number>;
}
Ask endpoint
export interface AskRequest {
question: string;
siteId: string;
}
export interface AskResponse {
answer: string;
evidence: Array<{
metric: string;
value: number | string;
source: string; // function + params
}>;
}
Guardrails:
- planner maps question -> allowed function calls
- result summaries must cite evidence blocks
- no speculative claims without corresponding metric
- answer claims are tagged with confidence classes (high, medium, low) based on evidence directness
#8) MCP surface
Tools for agent clients:
- analytics.get_top_posts
- analytics.get_referrer_breakdown
- analytics.compare_periods
- analytics.get_post_trend
- analytics.get_evergreen_score
- analytics.ask (optional convenience wrapper)
Tool outputs should be machine-readable first, narrative second.
#9) Performance / Lighthouse constraints
Hard requirements:
- Script is defer and non-blocking
- No synchronous XHR
- No DOM writes from analytics script
- No layout-affecting injected elements
- Beacon dispatch on visibilitychange and pagehide fallback
- JS parse/execute budget tracked in CI
Perf acceptance targets:
- No measurable regression in LCP/INP/CLS p75 beyond noise
- CLS impact from analytics: zero
#10) Free-tier operating mode
Staying on the free tier:
- aggregate-first writes (no raw event warehouse)
- bounded cardinality dimensions
- query caching for common dashboard reads
- optional sampling when approaching quota
- hourly table retention cap (e.g. 14 days)
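The hourly retention cap is the kind of thing a Cron Trigger enforces. A sketch, assuming hour_utc is stored as YYYY-MM-DDTHH (so lexicographic comparison matches chronological order); the function name is illustrative:

```typescript
// Compute the pruning statement for the hourly table's retention cap.
// A scheduled() handler would run the returned SQL against D1 with
// the cutoff bound as the single parameter.
function hourlyRetentionSql(
  retentionDays: number,
  now: Date,
): { sql: string; cutoff: string } {
  const cutoffMs = now.getTime() - retentionDays * 24 * 60 * 60 * 1000;
  // hour_utc is YYYY-MM-DDTHH, so string comparison works for ordering
  const cutoff = new Date(cutoffMs).toISOString().slice(0, 13);
  return { sql: "DELETE FROM fact_hourly WHERE hour_utc < ?", cutoff };
}
```

Storage stays proportional to content count, per NFR-7: the daily table keeps one bounded row per path, and the hourly table is continuously trimmed to its window.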
Safety valve: "governor mode" step-down ladder under sustained high volume:
- disable custom event ingestion
- drop country/device breakdown writes
- increase sampling factor (for pageviews only)
- disable hourly writes (daily only)
Each transition emits an ops event so reduced fidelity is explicit, not silent.
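The ladder is small enough to model as an ordered list of levels. A sketch, with hypothetical level names standing in for the four step-down transitions above:

```typescript
// Ordered fidelity levels: index 0 is full fidelity, the last index is
// the floor (daily-only writes). Each step down drops one capability.
const GOVERNOR_LEVELS = [
  "normal",
  "no_custom_events", // step 1: disable custom event ingestion
  "no_breakdowns",    // step 2: drop country/device breakdown writes
  "sampled",          // step 3: sample pageviews
  "daily_only",       // step 4: disable hourly writes
] as const;
type GovernorLevel = (typeof GOVERNOR_LEVELS)[number];

// Move one rung down the ladder; the floor is absorbing.
function stepDown(current: GovernorLevel): GovernorLevel {
  const i = GOVERNOR_LEVELS.indexOf(current);
  return GOVERNOR_LEVELS[Math.min(i + 1, GOVERNOR_LEVELS.length - 1)];
}

// Decide the next level from buffer pressure. Stepping back up would be
// a separate, slower path so the ladder doesn't oscillate. The caller
// emits an ops event whenever the returned level differs from current.
function nextLevel(
  current: GovernorLevel,
  bufferedEvents: number,
  maxBuffered: number,
): GovernorLevel {
  return bufferedEvents > maxBuffered ? stepDown(current) : current;
}
```

Encoding the ladder as data rather than branching logic keeps the ordering auditable and makes "which step are we on?" a single index lookup.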
#11) Migration strategy
Phase 1: pageviews + engaged + top posts + referrers + MCP tools + tiny /analytics snapshot
Phase 2: period compare + trend series + cache/rate-limit hardening
Phase 3: conversational ask with evidence citations
Phase 4: optional exports (R2) + scheduled digests
#12) Enforced constraints (not advisory)
Advisory constraints get ignored under pressure. These are enforced in CI, at build time, or at runtime — not in a README.
Schema validation (Zod)
Every boundary gets a Zod schema. Not just "for documentation" — for runtime parsing.
const CollectEventSchema = z.object({
kind: z.enum(["pv", "engaged", "outbound", "custom"]),
ts: z.number().int().positive(),
path: z.string().min(1).max(2048).startsWith("/"),
ref: z.string().max(2048).optional(),
meta: z.record(z.union([z.string(), z.number(), z.boolean(), z.null()])).optional(),
});
const CollectRequestSchema = z.object({
site: z.string().min(1).max(64),
v: z.literal(1),
events: z.array(CollectEventSchema).min(1).max(50),
});
The ingestion Worker parse()s with Zod before anything else. Malformed payloads get a 400 and a counter bump — never silent drops, never partial ingestion.
Linting and formatting
Biome. Not ESLint. Single tool for lint + format. Runs in CI as a gate, not a suggestion. Zero warnings policy — if it warns, either fix it or configure it out. No // biome-ignore without a comment explaining why.
Type boundaries
Every Worker ↔ DO ↔ D1 boundary has an explicit type. AggregatorMessage isn't just an interface in a file — it's the actual shape that gets validated at the DO's fetch() handler. If the ingestion Worker sends the wrong shape, it fails loudly at the boundary, not silently deep inside the aggregator.
// This is a type AND a runtime validator
const AggregatorMessageSchema = z.object({ ... });
type AggregatorMessage = z.infer<typeof AggregatorMessageSchema>;
Types and validators derived from the same source. No drift.
Client script budget
The analytics client (a.js) has a hard size gate in CI:
# In CI pipeline
GZIP_SIZE=$(gzip -c dist/a.js | wc -c)
if [ "$GZIP_SIZE" -gt 2048 ]; then
echo "❌ a.js exceeds 2KB gzipped ($GZIP_SIZE bytes)"
exit 1
fi
Not aspirational. A gate. If the script grows past 2KB gzipped, the build fails.
#13) Development process
Testing strategy
Three layers, each with a clear job:
| Layer | Tool | What it tests | When it runs |
|---|---|---|---|
| Unit | Vitest | Pure functions: classifyReferrer(), coalesceDeltas(), computeEngagement() | Every commit (CI) |
| Integration | Miniflare | Worker ↔ DO ↔ D1 round-trips: "event in → aggregate out → query returns correct count" | Every PR |
| Contract | Zod assertions | Boundary shapes: "the DO accepts exactly this shape and rejects everything else" | Every commit (CI) |
No E2E browser tests for v1. The client script is small enough to unit test the beacon logic. The server is tested through integration. Browser testing is cost that doesn't pay for itself at this scale.
What doesn't get tested: D1 SQL syntax (trust SQLite), Cloudflare routing config (tested by deployment), UI layout (there's barely any UI).
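The unit layer's job, sketched with classifyReferrer. This implementation is hypothetical, written only to show the shape of a pure, trivially testable function; the real enum and host rules live in the data model.

```typescript
// Hypothetical sketch of classifyReferrer: raw referrer host in, small
// source enum out. Pure function, so the unit layer covers it completely.
function classifyReferrer(host: string): string {
  if (host === "") return "direct";
  if (/(^|\.)google\.|(^|\.)bing\.|(^|\.)duckduckgo\./.test(host)) return "search";
  if (/(^|\.)news\.ycombinator\.com$/.test(host)) return "hn";
  return "other"; // unknown hosts collapse into the catch-all
}
```

Functions like this are why the unit layer carries most of the weight here: no mocks, no Workers runtime, just inputs and outputs on every commit.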
Environments
| Env | Purpose | D1 instance | URL |
|---|---|---|---|
| local | Development + integration tests | Miniflare (in-memory) | localhost:8787 |
| production | Live | Production D1 | analytics.bristanback.com |
Why not ephemeral deploys per branch? Solo developer, one blog. Miniflare replicates D1/DO behavior well enough for integration tests. Ephemeral preview environments are team infrastructure — real value when multiple people need to review deployments, overhead when it's just you. If this grows beyond a solo project, add a preview env with branch-scoped D1. The wrangler --env flag makes it trivial to add later.
CI pipeline
biome check → tsc --noEmit → vitest run → size gate (a.js) →
wrangler deploy --env preview (on PR)
wrangler deploy --env production (on merge to main)
Every step is a gate. No "allowed to fail" steps.
Secrets management
No .env files in the repo. Ever. Wrangler secrets for API keys. Environment-specific config (site IDs, rate limits) via wrangler.jsonc env blocks. The only thing that differs between preview and production is the D1 binding and the domain — everything else is code.
#14) Data evolution and governance
Schema migrations
D1 doesn't have a runtime migration runner (Wrangler's migrations are CLI-driven, not applied at startup). So we build one — minimal:
const MIGRATIONS = [
{ version: 1, sql: `CREATE TABLE IF NOT EXISTS site ...` },
{ version: 2, sql: `CREATE TABLE IF NOT EXISTS fact_daily_referrer ...` },
// ...
] as const;
async function migrate(db: D1Database): Promise<void> {
await db.exec(`CREATE TABLE IF NOT EXISTS _migrations (
version INTEGER PRIMARY KEY,
applied_at INTEGER NOT NULL
)`);
const applied = await db.prepare(`SELECT version FROM _migrations`).all();
const appliedSet = new Set(applied.results.map(r => r.version));
for (const m of MIGRATIONS) {
if (!appliedSet.has(m.version)) {
await db.exec(m.sql);
await db.prepare(`INSERT INTO _migrations (version, applied_at) VALUES (?, ?)`)
.bind(m.version, Date.now()).run();
}
}
}
Runs on Worker startup (cached after first run). Forward-only — no down migrations. If a migration is wrong, ship a new forward migration that fixes it.
The rrdtool principle: fixed storage, automatic rollup
The oldest good idea in time-series storage is rrdtool's Round Robin Database (1999): allocate fixed space, and as data ages, roll it into coarser resolution. Graphite's whisper format does the same thing. The insight is that you never need per-minute granularity from six months ago — you need the shape, not the points.
This system applies the same principle to D1:
Resolution tiers:
0-14 days → hourly (fact_hourly)
0-90 days → daily + breakdowns (fact_daily_referrer, _country, _device)
0-∞ → daily core (fact_daily) — just path × pageviews × engaged
As data ages, it loses dimensions but never disappears. The Cron Trigger isn't just "deleting old data" — it's the rollup engine. Before it deletes fact_hourly rows, it verifies they've been absorbed into fact_daily. Before it drops breakdown tables, it confirms the core fact still holds their totals.
The result: storage cost is roughly proportional to the number of paths (content), not the number of events (traffic). A blog with 50 posts and 5 years of history fits comfortably in D1's free tier. That's the rrdtool promise: time passes, storage doesn't grow.
Retention policy
| Table | Retention | Rollup target | Enforcement |
|---|---|---|---|
| fact_hourly | 14 days | fact_daily | Cron: verify daily absorbed, then delete |
| fact_daily_referrer | 90 days | fact_daily (totals preserved) | Cron trigger |
| fact_daily_country | 90 days | fact_daily (totals preserved) | Cron trigger |
| fact_daily_device | 90 days | fact_daily (totals preserved) | Cron trigger |
| fact_daily_vitals | 90 days | - (historical vitals less useful) | Cron trigger |
| fact_daily | Indefinite | - | None needed; cost ≈ paths × days, tiny rows |
| dim_referrer.raw_host | 30 days | source enum preserved | Cron nullifies raw_host |
| flush_batch | 7 days | - | Cron trigger |
Retention is enforced by a scheduled Worker (Cron Trigger), not by hope. Each purge logs what it deleted for auditability. The Cron runs daily and is itself idempotent — safe to re-run, safe to miss a day.
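The verify-then-delete contract for fact_hourly reduces to one pure check plus three statements. A sketch: the SQL strings are assumptions layered over the spec's table names, and the pure predicate is what the unit layer would test.

```typescript
// Sketch of the hourly purge. Verify absorption first, delete second;
// if the check fails, skip the day and let the next (idempotent) run retry.
const SUM_HOURLY_SQL = `SELECT COALESCE(SUM(pageviews), 0) AS pv FROM fact_hourly WHERE date_utc = ?`;
const SUM_DAILY_SQL = `SELECT COALESCE(SUM(pageviews), 0) AS pv FROM fact_daily WHERE date_utc = ?`;
const DELETE_HOURLY_SQL = `DELETE FROM fact_hourly WHERE date_utc = ?`;

// Pure decision: hourly rows for a day may be deleted only once fact_daily
// holds at least the same pageview total, i.e. the rollup absorbed them.
function safeToPurgeHourly(hourlyTotal: number, dailyTotal: number): boolean {
  return dailyTotal >= hourlyTotal;
}
```

Keeping the decision pure and the SQL dumb is what makes the cron safe to re-run and safe to miss a day: every run recomputes the check from the tables themselves, not from state.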
Cardinality governance
Unbounded dimensions are how analytics systems die. Hard limits:
- dim_path: max 10,000 per site (after that, new paths map to /other)
- dim_referrer: max 1,000 per site (unknown hosts → other)
- custom event meta keys: max 10 per event, values max 256 chars
These aren't advisory. The ingestion Worker enforces them at write time. If a dimension table is full, new values get bucketed into the catch-all. An ops event fires so you know it happened.
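At write time the bucketing can be this simple. A sketch: the in-memory Map stands in for a dim_* table, and the catch-all key and ops hook are illustrative.

```typescript
// Sketch: bounded dimension resolution. Known values resolve normally;
// once the limit is hit, new values collapse into the catch-all bucket
// and an ops event fires so the bucketing is visible, not silent.
function makeDimResolver(limit: number, onOverflow: () => void) {
  const ids = new Map<string, number>([["/other", 0]]); // catch-all is pre-seeded
  return function resolve(value: string): number {
    const existing = ids.get(value);
    if (existing !== undefined) return existing;
    if (ids.size >= limit) {
      onOverflow();
      return ids.get("/other")!; // table full: bucket, don't grow
    }
    const id = ids.size;
    ids.set(value, id);
    return id;
  };
}
```

The invariant worth noticing: reads never fail and writes never grow past the cap, so a crawler inventing ten million URLs costs one dimension row, not ten million.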
Traffic classification
Cloudflare's Bot Management gives us a cf.botManagement.score (1-99, higher means more likely human) on every request. We should use it, but not as a binary filter — as a dimension.
CREATE TABLE IF NOT EXISTS fact_daily_traffic_class (
site_id TEXT NOT NULL,
date_utc TEXT NOT NULL,
path_id INTEGER NOT NULL,
traffic_class TEXT NOT NULL, -- human | likely_human | likely_bot | verified_bot | ai_crawler
pageviews INTEGER NOT NULL DEFAULT 0,
PRIMARY KEY (site_id, date_utc, path_id, traffic_class),
FOREIGN KEY(site_id) REFERENCES site(id),
FOREIGN KEY(path_id) REFERENCES dim_path(id)
);
Classification logic:
type TrafficClass = "human" | "likely_human" | "likely_bot" | "verified_bot" | "ai_crawler";
function classifyTraffic(req: Request): TrafficClass {
const score = req.cf?.botManagement?.score ?? 50;
const verifiedBot = req.cf?.botManagement?.verifiedBot ?? false;
const ua = req.headers.get("user-agent") ?? "";
// AI crawlers: GPTBot, ClaudeBot, Google-Extended, Bytespider, CCBot, etc.
if (/GPTBot|ClaudeBot|Claude-Web|Google-Extended|Bytespider|CCBot|PerplexityBot|Amazonbot/i.test(ua)) {
return "ai_crawler";
}
if (verifiedBot) return "verified_bot"; // Googlebot, Bingbot, etc.
if (score >= 80) return "human";
if (score >= 30) return "likely_human";
return "likely_bot";
}
This gives you answers to questions that actually matter in 2026:
- "How much of my traffic is AI crawlers vs humans?"
- "Is GPTBot reading my posts? Which ones?"
- "What's my real human readership after filtering bots?"
- "Are search engines still sending me traffic, or has AI eaten that?"
The traffic_class dimension is a v2 breakdown table — same pattern as referrer/country/device. Not in the core fact table, not on the hot path, but available when you want to ask the question.
Data export and portability
Monthly R2 snapshots (Phase 4) export the full D1 state as NDJSON. This is your escape hatch — if you outgrow D1, if Cloudflare changes pricing, if you want to move to ClickHouse someday. The data is always yours, always portable, always in a format that any system can ingest.
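The export format itself is trivially simple, which is the point. A sketch: tagging each line with its source table is an assumption, made so one snapshot file stays self-describing.

```typescript
// Sketch: serialize table rows as NDJSON, one JSON object per line,
// each tagged with its source table so the snapshot needs no sidecar schema.
function toNdjson(table: string, rows: Array<Record<string, unknown>>): string {
  return rows.map((row) => JSON.stringify({ _table: table, ...row })).join("\n");
}
```

Any target system (ClickHouse, DuckDB, a shell pipeline with jq) can stream this line by line, which is what "always portable" means in practice.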
#Decisions
- Engagement: Composite boolean (scroll depth OR visible dwell OR outbound click), emitted once per pageview. (See collection protocol.)
- Bot filtering: Edge heuristics first (Cloudflare bot score + UA), engagement as secondary quality signal. No custom bot signature maintenance.
- Referrer normalization: Small enum (≤15 values), classified at ingestion. (See data model.)
- Multi-site: Shared DB with site_id partitioning. One blog now, column's already there if that changes.
- Fact cardinality: v1 uses a low-cardinality core fact table; breakdowns are separate optional tables.
- Write integrity: Aggregator flushes are idempotent and transactional via flush_batch.
- Validation: Zod at every boundary. Types and validators derived from the same schema — no drift.
- Testing: Unit (pure functions) + integration (Miniflare round-trips) + contract (Zod). No E2E browser tests at this scale.
- Environments: Miniflare for local dev and integration tests. No ephemeral deploys until the project outgrows solo development.
- Migrations: Forward-only, versioned, runs on startup. No down migrations.
- Traffic classification: 5-class system (human → ai_crawler) using cf.botManagement.score + UA patterns. Stored as a v2 breakdown dimension, not in core fact.
- Data portability: Monthly NDJSON exports to R2. The data is always yours.
#Open questions
- How much hourly granularity is genuinely useful for bloggers? (14-day window feels right — enough for "is this post spiking?" without long-term storage cost.)
- Should ask run only against precomputed views to prevent query abuse? Decided: hard budget of 8 tool calls per ask request. The AI analyst gets a fixed number of moves — enough for compare + breakdown + follow-up, not enough to accidentally run up the D1 bill. If it can't answer in 8 calls, the question is too vague; ask the human to narrow it.
- Dashboard surface: should this be MCP-only, or include a web UI? Decision: MCP-first + tiny snapshot page.
- Phase 1 ships MCP tools as the primary interface (Telegram/OpenClaw queries)
- Also ship a minimal /analytics snapshot page (auth-gated) with 4 cards:
- 7-day pageviews
- top posts
- top referrers
- week-over-week delta
Rationale: MCP-only is fastest and matches current workflow, but a tiny snapshot preserves glanceability without adding a heavy dashboard surface. Both MCP and snapshot use the same typed metrics/query layer, so no duplicate logic.
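The hard tool-call budget for ask is mechanical to enforce. A sketch: the wrapper shape and error message are assumptions; the budget of 8 comes from the decision above.

```typescript
// Sketch: a budgeted caller handed to the ask planner. Every tool call
// spends one unit; past the budget, the run fails loudly instead of
// quietly accumulating D1 reads.
function makeBudgetedCaller(budget: number) {
  let used = 0;
  return {
    call<T>(tool: () => T): T {
      if (used >= budget) {
        throw new Error(`ask budget exhausted (${budget} tool calls); narrow the question`);
      }
      used++;
      return tool();
    },
    remaining(): number {
      return budget - used;
    },
  };
}
```

Because the planner only ever sees the wrapper, there is no code path where a vague question turns into an unbounded query fan-out.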
#What the client actually looks like
<script defer src="/a.js" data-site="bri-blog"></script>
a.js does three things: capture a pageview, queue an engagement event, flush via beacon when the tab hides. Nothing else.
The point of this system is not to know everything.
It's to know enough, quickly, privately, and truthfully — then use AI to ask better questions over that evidence.
Not surveillance. Not theater. Just instrumentation with taste.
One of the first things I ever built for the web was a hit counter — Perl/CGI, flat file, incremented on every page load. I still get emails about it. Twenty-something years later, after the entire Google Analytics era — Urchin to UA to GA4 to ripping it all out — I'm back to building my own way to answer "did anyone read this?" The tools are different. The question hasn't changed.
#A note on scale and honesty
This system is for a personal blog that gets maybe 200 visits a day. At that volume, most of this architecture is overkill. You could write directly to D1 from the ingestion Worker with upserts and it'd be fine for years. And yes — writing a 3,000-word spec with Zod schemas, a Mermaid diagram, non-goals, and a migration plan for something most people would solve with Plausible and a weekend is peak RFC-maxxing. Noted.
I know that. The Durable Object aggregation layer, the governor mode, the flush batching — none of it is justified by current traffic. A blog doesn't need backpressure. It needs a database and a query.
But this is also a learning exercise. I want to understand Cloudflare's primitives — DOs, D1, Workers, Cron Triggers, R2 — by building something real with them, not by reading docs. And "real" means making architectural decisions that would matter at scale, even if they don't matter yet. The DO isn't here because I need a write buffer. It's here because I want to understand how write buffers behave on this platform — what the alarm API feels like, how DO ↔ D1 coordination works under contention, what happens when you actually need backpressure.
One known tradeoff: DO cold starts. At low traffic, the DO will frequently be evicted and wake cold for the first event of a session — maybe ~50ms of latency. But the client fires beacons and doesn't wait for responses, so the user never feels it. The cost is invisible; the buffer is just slightly lazier waking up.
The tradeoff is honest: I'm trading simplicity for education. The system is designed to handle tens to hundreds of thousands of events responsibly and resource-efficiently — not because my blog will generate them, but because the next thing I build on this stack might, and I'd rather learn the failure modes now with low stakes.
Where this gets dangerous is if the learning exercise calcifies into production complexity. The commitment: if the DO layer turns out to be pure overhead with no educational payoff, rip it out and go direct-to-D1. Architecture should serve the system, not the architect's curiosity. But right now, the curiosity is the point.
This is a working spec — part of the workshop. I'll follow up with what actually survived contact with reality once it's built.