ADR-0039 — Continuous tail: sampling model, persistence target, alerting boundary #
- Status: Proposed (graduates to Accepted per PRD-21 §21.5.3 after one week of live tail data validates the page-ladder threshold, the dampening clause, and the new-event watchlist warn against prod baseline variance)
- Date: 2026-05-13
- Amendments:
  - 2026-05-13 (commit 84b539a): table namespace renamed from snappy.sqa_runs to sqa.sqa_runs (and snappy.sqa_sweep_summary to sqa.sqa_sweep_summary) to separate SQA's own scoreboard tables from snappy's SUT data. The DDL and query examples in §2 / §6 still read snappy.sqa_runs for historical fidelity; all live runbooks, dashboards, and alert rules now reference sqa.sqa_runs. Treat the namespace prefix as the only delta; schema and engine are unchanged.
- Deciders: Natan
- Source: PRD-21 §21.1.1 (docs/prds/21-continuous-tail.md), post-sweep confidence assessment (2026-05-13).
Context #
PRD-21 turns SQA from a manual inspection tool (make sweep, human-driven, point-in-time) into a continuous production guardrail: a sampled fraction of real prod activations runs through SQA's verifier stack as activations land, with results persisted to ClickHouse, surfaced on a Grafana dashboard, and alerted on three ladders.
The empirical motivation is the 2026-05-12 499-domain F500 sweep. That sweep caught P6 / P8 / P10 only because someone ran it manually. The failure shape is concrete: a snappy regression ships at 09:00, the next manual sweep runs at 14:00, we learn at 14:30 — five hours of degraded prod activations before SQA's verifier sees them. Synthetic monitoring tools (Pingdom, Datadog Synthetics) close this gap by running checks on a schedule; the analogue for SQA is a continuous tail.
Three design questions distinguish the tail from the sweep, and they cannot be answered independently of each other. The sampling discipline determines what shape of data lands in storage; the persistence target determines what kind of queries the alerting layer can ask; the alerting layer determines what volume of data we have to retain. They are one coupled decision, and this ADR records it as one.
Adjacent boundaries:
- ADR-0007: the tail observes the same operational invariants the sweep verifies; it does not expand scope into adversarial state or audience-impression work.
- ADR-0028 R1: load-bearing decisions get an ADR. Sampling model, persistence target, and alerting boundary all qualify and ship in one document rather than three, because they are coupled.
- ADR-0037: @metaintro/snappy-contracts is the typed home for claims C1–C7 and the STRUCTURED_EVENT vocabulary. The tail runner consumes those same claims; the C7 forward-highlight (a new structured event in the package automatically surfaces in SQA's trace) is the same forcing function the dashboard's new-event-watchlist panel and the warn-ladder alert rely on.
Decision #
Five coupled rules. Each is load-bearing; relaxing any of them either re-opens the drift surface PRD-17 closed, breaks the reproducibility property the sweep relied on, or makes the alert layer untrustworthy.
1. Sampling: hash-on-domainId modulo 100 #
sampledDomainId(domainId) = hash(domainId) % 100 < sampleRatePct. Default sampleRatePct = 1. The hash function is FNV-1a over the UTF-8 bytes of the domainId string (a 24-character Mongo ObjectId hex string). FNV-1a is chosen for two reasons: (a) it is parameterless, so there is no key-rotation surface to manage, and (b) it is stable across process restarts and across language boundaries — a future Go tail runner reading the same domainId would produce the same sample bit.
Three properties follow:
- Deterministic per-domain. A given domainId is always sampled or always skipped within a config epoch (an epoch is a sampleRatePct value). The same domain that produced a failure yesterday is the same domain we re-verify today, and diffs across snappy versions are pinned by domainId. This is the reproducibility property the sweep relied on, preserved for the tail.
- Uniform across the domain population. Hot domains (single
tenant slammed with reactivations) do not oversample, because the hash key is domainId, not the activation event. Cold domains do not undersample for the same reason.
- Bumping the rate widens the window. Moving from 1 → 5
changes the inequality from < 1 to < 5. No resampling logic, no replay buffer, no migration. The 1% subset is a prefix of the 5% subset.
Alternatives considered are recorded under §Alternatives below.
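The sampling predicate can be sketched as follows. This is a minimal sketch, assuming the 32-bit FNV-1a variant with its standard offset basis and prime (the ADR names FNV-1a but not a width); the helper name fnv1a32 is illustrative and not taken from the codebase.

```typescript
// Assumption: 32-bit FNV-1a. The ADR does not state the hash width.
const FNV_OFFSET_BASIS_32 = 0x811c9dc5;
const FNV_PRIME_32 = 0x01000193;

function fnv1a32(input: string): number {
  const bytes = new TextEncoder().encode(input); // UTF-8 bytes of the id
  let hash = FNV_OFFSET_BASIS_32;
  for (let i = 0; i < bytes.length; i++) {
    hash ^= bytes[i];
    // Math.imul keeps the multiplication in 32-bit integer space.
    hash = Math.imul(hash, FNV_PRIME_32);
  }
  return hash >>> 0; // reinterpret the result as unsigned
}

// True when the domain falls inside the sampled buckets [0, sampleRatePct).
function sampledDomainId(domainId: string, sampleRatePct: number = 1): boolean {
  return fnv1a32(domainId) % 100 < sampleRatePct;
}
```

Because the test is `bucket < sampleRatePct`, every id sampled at rate 1 is also sampled at rate 5, which is the prefix property described above. A runner in another language would produce the same bit provided it uses the same width and the same UTF-8 encoding.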
2. Persistence: ClickHouse snappy.sqa_runs, ReplacingMergeTree on traceId #
The tail writes each verifier outcome to a new ClickHouse table snappy.sqa_runs, schema:
CREATE TABLE snappy.sqa_runs (
traceId String,
domainId String,
fqdn String,
snappyVersion LowCardinality(String),
outcome LowCardinality(String), -- pass | warn | fail | skip | unknown | error
terminal LowCardinality(String), -- active | blocked | unresolved | <unset>
gapCount UInt16,
failingClaim LowCardinality(String), -- '' | C1 | C2 | C4 | C5 | C6 | C7
failingReason String, -- first failing claim's reason string
structuredEvents Array(LowCardinality(String)),
durationMs UInt32,
startedAt DateTime64(3),
completedAt DateTime64(3),
result_json String CODEC(ZSTD(3))
)
ENGINE = ReplacingMergeTree(completedAt)
PARTITION BY toYYYYMM(completedAt)
ORDER BY (traceId)
TTL completedAt + INTERVAL 90 DAY
SETTINGS index_granularity = 8192;

Three properties this schema buys:
- Idempotency lives in the storage engine, not the application.
ReplacingMergeTree(completedAt) ORDERed by traceId means that if the tail runner inserts the same traceId twice (a restart mid-flush, a retry from a transient network error), the table eventually collapses to one row per traceId, with the newest completedAt winning. The application code does not need a "have I already written this traceId?" check before insert. Per Natan's note: the temptation to author application-side dedup is wrong — CH already has the right primitive, and reimplementing it at the runner layer adds a failure mode (lost dedup state on runner restart) without adding any property the engine doesn't already provide.
- Slice-by-time and slice-by-version queries are one-liners.
Partition by month + LowCardinality on snappyVersion and outcome means the dashboard queries (panel set in §21.3) and alert rules (§21.4) read from the same partitioned table with no special-casing.
- Retention is bounded. 90d TTL is the default; long enough to cover the "what changed three weeks ago when version 0.X.Y shipped?" question, short enough that storage cost stays in the noise (~1% of prod × mean 8 KB per row × 90d ≈ low GB). Bumping retention is a single ALTER TABLE … MODIFY TTL ….
Alternatives considered (Loki, a new Postgres / Mongo store, application-level dedup) are recorded under §Alternatives below.
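To make the collapse rule concrete, the sketch below models in TypeScript what a ReplacingMergeTree merge converges to for this table: one row per traceId, keeping the row with the greatest completedAt. It is a model of the semantics only, not runner code; the SqaRunRow subset and the function name collapseByTraceId are illustrative assumptions.

```typescript
// Illustrative subset of the table's columns.
interface SqaRunRow {
  traceId: string;
  outcome: string;
  completedAt: number; // epoch millis, standing in for DateTime64(3)
}

// One row per sorting key (traceId), keeping the greatest version (completedAt).
function collapseByTraceId(rows: SqaRunRow[]): SqaRunRow[] {
  const newest = new Map<string, SqaRunRow>();
  for (const row of rows) {
    const current = newest.get(row.traceId);
    // On a version tie the later-inserted row wins.
    if (current === undefined || row.completedAt >= current.completedAt) {
      newest.set(row.traceId, row);
    }
  }
  return Array.from(newest.values());
}
```

In ClickHouse the collapse happens during background merges and only among rows in the same partition, so a read that must see exactly one row per traceId before a merge has run would typically use FINAL or aggregate by traceId.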
3. Insert path: batched writes, warn-loud on failure, no crash #
The new helper at src/components/clickhouse/insert.ts is the sibling of clickhouse/ready.ts (which is read-only). Three rules:
- Batched. Configurable flush interval, default 1–2s. The
runner buffers rows in memory and flushes either when the interval elapses or when the buffer crosses a max-rows threshold (default 100). At 1% of prod (~thousands/day), this amortizes the network round-trip without holding rows in memory long enough to lose them on a crash. The interval is configurable because the right value depends on prod volume, which we will learn after a week of live tail data.
- Warn-loud on failure. A CH insert failure logs at WARN with the failing batch's traceIds, increments a counter exposed to the dashboard, and does not crash the tail runner. The reason: the tail is a guardrail, not a critical path. If CH is down, dropping a few minutes of tail observations is acceptable; taking down the runner means we drop everything until someone notices and restarts it. Warn-loud means the dashboard's CH-insert-failure-rate panel goes red and we see the gap; silent drop would be a worse failure mode than crash, but warn-loud is better than both.
- No application-level dedup. Per §2, idempotency is the
storage engine's job. The insert helper retries failed batches with exponential backoff; if the second insert lands a duplicate of the first, ReplacingMergeTree collapses it on merge. The runner never has to remember which traceIds it has already flushed.
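The three rules can be sketched together as a small batcher. This is a minimal sketch under stated assumptions: the class name InsertBatcher, its options, and the injected insert function are hypothetical and do not describe the real contents of insert.ts, and dropping a batch once retries are exhausted is one possible policy, not one the ADR specifies.

```typescript
type InsertFn<Row> = (rows: Row[]) => Promise<void>;

interface BatcherOptions {
  maxRows: number; // flush once the buffer holds this many rows
  maxAttempts: number; // total tries per batch, including the first
  baseBackoffMs: number; // doubled after each failed attempt
  warn: (message: string) => void;
}

const sleep = (ms: number): Promise<void> =>
  new Promise<void>((resolve) => setTimeout(resolve, ms));

class InsertBatcher<Row extends { traceId: string }> {
  failedBatches = 0; // counter a dashboard panel could read
  private buffer: Row[] = [];
  private readonly insert: InsertFn<Row>;
  private readonly opts: BatcherOptions;

  constructor(insert: InsertFn<Row>, opts: BatcherOptions) {
    this.insert = insert;
    this.opts = opts;
  }

  async add(row: Row): Promise<void> {
    this.buffer.push(row);
    if (this.buffer.length >= this.opts.maxRows) await this.flush();
  }

  // Also meant to be called from a timer, covering the interval-elapsed case.
  async flush(): Promise<void> {
    if (this.buffer.length === 0) return;
    const batch = this.buffer;
    this.buffer = [];
    let lastError = "";
    for (let attempt = 1; attempt <= this.opts.maxAttempts; attempt++) {
      try {
        await this.insert(batch); // a duplicate landing here is fine, see §2
        return;
      } catch (err) {
        lastError = String(err);
        if (attempt < this.opts.maxAttempts) {
          await sleep(this.opts.baseBackoffMs * Math.pow(2, attempt - 1));
        }
      }
    }
    // Warn loudly and carry on; the failure never propagates to the caller.
    this.failedBatches += 1;
    const ids = batch.map((row) => row.traceId).join(",");
    this.opts.warn(`CH insert failed (${lastError}); dropped traceIds: ${ids}`);
  }
}
```

Note what is absent: the batcher keeps no record of which traceIds it has already flushed. A retry that double-writes is left to ReplacingMergeTree to collapse.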
4. Event stream: ClickHouse polling, tactical #
The tail runner discovers candidate domainIds by polling the existing snappy.domains_analytics table on insert-time, with a configurable interval (default 5s). Pseudocode:
let cursor = await fetchMaxCompletedAt(); // resume point, read from snappy.sqa_runs
while (true) {
  // $cursor is bound as a query parameter by the (assumed) query helper.
  const newRows = await query(`
    SELECT id, fqdn, status, updated_at
    FROM snappy.domains_analytics
    WHERE updated_at > $cursor AND status IN ('active','blocked','unresolved')
    ORDER BY updated_at ASC
    LIMIT 1000
  `, { cursor });
  for (const row of newRows) {
    if (!sampledDomainId(row.id)) continue; // deterministic gate from §1
    await runPostHocVerify(row.id);
  }
  // Rows arrive in ascending updated_at order, so the last row carries the max.
  cursor = newRows[newRows.length - 1]?.updated_at ?? cursor;
  await sleep(5_000); // poll interval, default 5s
}

This choice is tactical, not architectural. Hatchet workflow-completion hooks would give lower latency (seconds instead of seconds-plus-poll-interval) and would remove the CH poll load. We pick polling for the bootstrap because it requires zero snappy code changes, the latency cost is acceptable inside a 30min alert window, and the load impact of a 5s poll against the already-warm domains_analytics partition is small. The trade-off (operational simplicity now, latency cost later) is explicit and reversible. The Falsifiability section names the empirical signal that would prove this wrong and force the switch to Hatchet hooks.
5. Alerting: three ladders, conservative dampening #
Three ladder rules query snappy.sqa_runs and fire on threshold + duration. Each rule's evidence payload includes the recent failing claims grouped by reason so the on-call has a starting point.
| Ladder | Threshold | Window | Destination |
|---|---|---|---|
| Page | fail rate ≥ 5% | 30min sustained | PagerDuty |
| Warn | non-pass rate ≥ 15% OR new structured event name | 60min sustained | Slack #sqa-tail |
| Info | everything else (per-domain anomalies, single fails) | — | dashboard only |
Three properties:
- Page ladder is the "snappy regression actually shipped" signal. 5% over 30min eliminates per-domain noise and one-off C7 cascades from weird F500 tenants. It is sized to fire on a real regression that affects multiple domains for tens of minutes, which is the failure mode the manual sweep was catching at a five-hour lag.
- Warn ladder catches drift from snappy without us looking. The "new structured event name" branch is the C7 forward-highlight realized as alerting: STRUCTURED_EVENT_NAMES is imported from @metaintro/snappy-contracts, so when snappy publishes a new event in the package, the tail's vocabulary updates on dep bump; until that bump lands in SQA, the new event name is unknown and trips the warn. This is the intended forcing function: a snappy contract change without a matching SQA dep bump is a Slack message, not a silent miss.
- Info ladder is the dashboard. Single-row fails during the
workday don't page and don't Slack; they show up on the panel set in §21.3 and on the on-call's regular dashboard scan.
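The threshold logic in the table can be sketched as a pure function over the rows in one alert window. Assumptions, none of which the ADR states: every outcome other than pass counts as non-pass, the sustained-duration check is left to the alerting engine, and the names evaluateLadder, RunSummary, and LadderResult are hypothetical. Dampening is deliberately left out of this sketch.

```typescript
type Outcome = "pass" | "warn" | "fail" | "skip" | "unknown" | "error";
type Ladder = "page" | "warn" | "info";

interface RunSummary {
  outcome: Outcome;
  structuredEvents: string[];
}

interface LadderResult {
  ladder: Ladder;
  newEvents: string[]; // event names absent from the pinned vocabulary
}

function evaluateLadder(
  rows: RunSummary[],
  knownEvents: ReadonlySet<string>
): LadderResult {
  let fails = 0;
  let nonPass = 0;
  const unseen = new Set<string>();
  for (const row of rows) {
    if (row.outcome === "fail") fails += 1;
    if (row.outcome !== "pass") nonPass += 1; // assumption: skip etc. count too
    for (const name of row.structuredEvents) {
      if (!knownEvents.has(name)) unseen.add(name);
    }
  }
  const newEvents = Array.from(unseen).sort();
  const total = rows.length;
  // Integer comparisons avoid floating-point edge cases at the thresholds.
  if (total > 0 && fails * 100 >= 5 * total) {
    return { ladder: "page", newEvents };
  }
  if ((total > 0 && nonPass * 100 >= 15 * total) || newEvents.length > 0) {
    return { ladder: "warn", newEvents };
  }
  return { ladder: "info", newEvents };
}
```

In the real system knownEvents would be built from STRUCTURED_EVENT_NAMES in the pinned contracts package, so an event name published by snappy but not yet pinned in SQA lands in newEvents and trips the warn branch.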
Dampening: conservative by construction. Suppress paging only when all failures in the alert window are attributable to the suppression set (canonical-merge via the activate.canonical-merged structured event, known-soft drift like the cdn enum miss or vocab evolution). Not "most," not "majority": all.
The reasoning is asymmetric: a real outage that incidentally produces some canonical-merge noise is the failure mode the "majority" rule would suppress, and that is the failure we cannot afford to miss. Over-dampening produces silent under-paging during an incident; under-dampening produces a chatty Slack channel for a week. The chatty channel is recoverable (tune the rule); a missed page is not (the incident burns down before anyone looks). We pick under-dampening.
When dampening fires it logs the suppression decision with the list of traceIds suppressed and the structured event names they carried. A weekly review of the dampening log catches over-suppression by inspection: if real fails are being suppressed, they show up in the log alongside the noise. The dampening rule is observable, not opaque.
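The "all, not majority" rule and its log payload can be sketched as below. This is a sketch under assumptions: the type and function names are hypothetical, and the example predicate covers only the canonical-merge case; known-soft drift would need its own predicate, which the ADR does not spell out.

```typescript
interface FailedRun {
  traceId: string;
  structuredEvents: string[];
}

interface DampeningDecision {
  suppressPage: boolean;
  suppressedTraceIds: string[]; // logged for the weekly review
  suppressedEventNames: string[]; // event names the suppressed runs carried
}

function decideDampening(
  failures: FailedRun[],
  inSuppressionSet: (run: FailedRun) => boolean
): DampeningDecision {
  // "All", not "most": one unattributed failure keeps the page live.
  // An empty window has nothing to suppress.
  const suppressPage = failures.length > 0 && failures.every(inSuppressionSet);
  if (!suppressPage) {
    return { suppressPage: false, suppressedTraceIds: [], suppressedEventNames: [] };
  }
  const names = new Set<string>();
  for (const run of failures) {
    for (const name of run.structuredEvents) names.add(name);
  }
  return {
    suppressPage: true,
    suppressedTraceIds: failures.map((run) => run.traceId),
    suppressedEventNames: Array.from(names).sort(),
  };
}

// Hypothetical predicate: attributes a failure to canonical-merge noise.
const isCanonicalMerge = (run: FailedRun): boolean =>
  run.structuredEvents.indexOf("activate.canonical-merged") !== -1;
```

With this shape, a window of ninety-nine canonical-merge failures and one unexplained failure still pages, which is the asymmetry argued for above.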
Consequences #
Architectural #
- The tail is a sibling of the sweep, not a replacement. Sweep
takes a domain list and creates activations; tail takes existing activations and verifies them. Both share the same verifier surface (@metaintro/snappy-contracts claims C1–C7, the validateClaim API, the gap log). Sweep stays the diagnostic tool for "I want to inspect domain X under a specific snappy version"; tail is the always-on guardrail.
- The post-hoc verifier mode lands as a refactor of src/systems/snappy/domain-activation.ts. The five verify* functions today are already shape-correct: each takes a domainId and returns a Result. The refactor extracts them (plus buildClaimContext, readTerminalState, resultFromClaim) into named exports, then the tail runner seeds scenario state (snappy.domainId, snappy.terminalState, snappy.canonicalMerged) from snappy's truth (read via Mongo on the sampled domainId) instead of from SQA's drive phase. The verifier functions themselves are unchanged. This preserves the sweep code path bit-for-bit; the tail enters at step 2.3.
- One forcing function, three consumers. The C7 forward-highlight from PRD-17 (adding an event name to STRUCTURED_EVENT in the contracts package automatically surfaces in SQA's trace) is now the source for three downstream consumers: the C7 verifier in domain-activation.ts (which iterates STRUCTURED_EVENT_NAMES), the new-event-watchlist panel in the Grafana dashboard, and the warn-ladder alert rule. A snappy team adding a new structured event lights up all three at once without an SQA code change.
- CH is the data layer for both the tail and its own observability. The dashboard's CH-insert-failure-rate panel reads from a CH counter; the alert rules read from snappy.sqa_runs. CH being healthy is a prerequisite for the tail being trustworthy. The tail does not depend on CH for its verifier surface (C1, C2, C4, C5, C7 all read other stores), so a CH outage degrades the alert layer but does not blind the verifier itself. The verifier still runs; results buffer in memory and best-effort batch on recovery.
Behavioural #
- Snappy regressions surface in minutes, not hours. At 1%
sampling with thousands of prod activations per day, the expected time to first observation of any given regression class is on the order of minutes. The page-ladder window (30 min sustained) means the on-call is paged inside an hour of the regression shipping, not five hours after the next manual sweep.
- Dashboard becomes the daily-ops surface. The four-panel set
(outcome rate, C-claim failure rate, terminal-state distribution, new-event watchlist) is the morning scan. New structured events showing up on the watchlist without a matching dep-bump PR are the visible record of contract evolution that has not yet propagated to SQA.
- The runbook becomes load-bearing. PRD-21 §21.5.2 specifies
docs/runbooks/continuous-tail.md covering tail config (sample rate, retention), silencing alerts during planned deploys, inspecting a failing run from dashboard back to source result_json, widening sampling for deep-dives. The first on-call rotation that takes a tail page will be reading that runbook; it has to be operationally complete on day one.
Falsifiability #
Each bullet names a concrete empirical signal that would prove the corresponding decision wrong.
- Sampling. If a future audit shows the 1% subset is biased
against a class of domainId shapes — for example, a systematic under-sampling of domains created by a specific snappy code path because their ObjectIds cluster in a hash bucket — this ADR has failed and the sampling function needs to move to a keyed hash (e.g. SipHash with a rotated key per epoch). FNV-1a is chosen on the assumption that ObjectId entropy is uniform across the 24-char hex space; that assumption is empirical and can be invalidated.
- Persistence. If the application code grows a seen-traceIds set (in-memory or in Redis) to dedupe inserts, this ADR has failed: ReplacingMergeTree was supposed to be the single dedup point, and a parallel application-side dedup is the signal that the storage engine's primitive was insufficient and we silently grew a second one.
- Persistence (TTL). If retention is bumped to >180d without
a documented reason, this ADR has failed — 90d is the intentional choice (cheap, bounded), and a quiet bump indicates either uncritical retention growth or a use case this ADR didn't anticipate (in which case the use case belongs in a new ADR).
- Insert path. If a CH insert failure crashes the tail
runner, this ADR has failed — warn-loud-don't-crash is the documented contract. A crash on insert failure is the alternative we explicitly rejected.
- Event stream (the tactical decision). If production event
volume causes the 5s CH poll to consume above 2% of CH's CPU budget on the snappy cluster sustained for >24h, this ADR has failed and we move to Hatchet workflow-completion hooks. The 2% threshold is conservative: the poll hits a warm partition (snappy.domains_analytics ORDERed on updated_at), the LIMIT is bounded, and the typical query cost is microseconds. If empirical load exceeds the budget, the latency-cost trade-off we picked is wrong and Hatchet hooks become non-negotiable.
- Event stream (the other failure mode). If the 5s polling
interval produces a tail-to-activation latency p95 above 60s sustained for a week, this ADR has failed — the alerting windows (30/60min) assume single-digit-minute observation latency, and slipping to 60s+ means we are not paging on the regression we are alerting on; we are paging on a stale picture of it.
- Alerting (page ladder). If a tail page fires for a
non-incident (an audit-traceable suppression case where the failure was fully canonical-merge or fully known-soft drift), this ADR has failed — dampening is supposed to prevent this, and a page on a non-incident is the on-call's loss-of-trust signal. Conversely, if a real incident produces no page because dampening suppressed it, this ADR has failed in the opposite direction — the conservative-by-construction rule was meant to make under-paging impossible.
- Alerting (warn ladder, new-event watchlist). If a new structured event name appears in production for >24h without either (a) tripping the warn-ladder alert or (b) being present in the SQA-pinned version of @metaintro/snappy-contracts, this ADR has failed: the C7 forward-highlight was supposed to be the source for this branch, and a silent miss means either the dep-bump cadence is broken or the vocabulary import is.
- Dampening (over-suppression). If a weekly review of the
dampening log shows a real failure being suppressed because the conservative-by-construction "all" rule was relaxed to "majority" or "most" without an ADR change, this ADR has failed — the rule is the contract, and quietly weakening it loses the property that motivated it.
Alternatives considered #
Sampling #
- Uniform random sampling per activation event. Rejected.
Gives smoother variance reduction (the Bernoulli draw is independent per event) but loses reproducibility — a specific domain's results across snappy versions cannot be compared without a separate replay log. We picked hash-on-domainId for the reproducibility property, matching how the sweep relied on a deterministic domain list.
- Reservoir sampling. Rejected. Reservoir sampling is for
bounded-memory uniform sampling from an unbounded stream when storage is the constraint. Here the constraint is verifier compute, not storage; reservoir doesn't apply.
- Stratified sampling (sample by snappy-version / terminal-state cohort). Rejected for the bootstrap. Stratification adds complexity (the strata definitions become an independent surface to maintain) and the 1% baseline already over-samples low-frequency cohorts in absolute terms. Stratification is a future ADR if cohort imbalance becomes the limiting factor on signal quality, which can be measured from the tail itself.
Persistence #
- Loki. Rejected. Loki is line-oriented; full result.json trees fit poorly. The dashboard queries we actually want (group-by snappyVersion, time-series of C-claim failure rate) are not Loki's strong suit. Loki stays SQA's verifier target for C7, but not SQA's tail storage.
- Postgres or Mongo (a new store). Rejected. Adds infra
(deploy, backup, monitoring, retention policy) without solving a problem ClickHouse doesn't already solve. ClickHouse is in the stack; the operational footprint of the tail is one new table on an existing cluster, not a new deploy.
- Application-level dedup (in-memory Set or Redis on traceId). Rejected; see Decision §2. Dedup belongs in the storage engine. Application-side dedup adds a failure mode (lost state on restart, race between runner instances) without adding any property ReplacingMergeTree doesn't provide.
- MergeTree + manual OPTIMIZE TABLE … DEDUPLICATE. Rejected. ReplacingMergeTree does the same thing without the cron job. The "manual dedup" approach trades a single-table operational primitive for an external scheduler.
Insert path #
- Synchronous insert per row. Rejected. At 1% of prod
(~thousands/day) the per-row HTTP round-trip is acceptable in absolute terms but pointlessly inflates CH's request rate when a 1–2s batch absorbs it cleanly. The latency cost (batch is held in memory for ≤2s) is invisible to the dashboard, which queries on completedAt not flushedAt.
- CH INSERT … ASYNC. Rejected for the bootstrap. CH's async-insert mode does the batching server-side, which is correct in principle, but requires a server-side configuration change and gives weaker per-batch acknowledgement semantics. Application-side batching is the simpler bootstrap; async-insert is a follow-up if the buffer-in-memory loss window becomes a real concern.
- Crash-on-insert-failure (no warn-loud). Rejected. Crashing
the runner on a CH outage degrades the failure mode from "a gap in tail observations" to "no tail at all until someone restarts." The guardrail-not-critical-path framing is the reason: the verifier is the value, the CH sink is one of three downstream consumers (dashboard, alert, future query), and losing the sink is recoverable.
Event stream #
- Hatchet workflow-completion hook. Deferred, not rejected.
Lower latency, no CH poll load. Requires a snappy code change (workflow emits an HTTP POST or a Kafka message on completion, tail consumes it). Worth doing if the Falsifiability bullet about CH poll load fires; not worth doing at bootstrap when CH polling is zero-snappy-touch.
- Subscribing to MongoDB change streams on the snappy.domains collection. Rejected. Doable but pulls SQA into a snappy internal seam (the Mongo collection name is implementation, not contract). CH snappy.domains_analytics is the published surface; reading it is the same direction as a downstream consumer of snappy's data, not an internal observer of snappy's storage.
- Periodic batch (e.g. every 5min run a SELECT and verify the batch). Rejected. Loses the "continuous" property. The point of the tail is to close the manual-sweep latency gap; a 5min batch is a smaller version of the same problem.
Alerting #
- Static thresholds (page on any single fail). Rejected.
Per-domain noise produces pages on non-incidents; on-call loses trust in the alert.
- Anomaly detection (ML on the failure-rate time series).
Rejected for the bootstrap. Anomaly detection has tuning surface (training window, sensitivity) that a static threshold avoids. Three ladders with explicit thresholds are inspectable: an on-call can read the alert evidence and verify the threshold was actually crossed. ML black-box alerting is a much larger trust surface and a future ADR if the static-threshold false-positive rate proves intolerable.
- Dampen on "majority of failures are in suppression set."
Rejected — see Decision §5. Under-pages during an incident that incidentally produces canonical-merge noise. The cost asymmetry (missed page vs chatty Slack) makes the "all-or-none" rule the right default.
- PagerDuty + Slack + email tiering. Rejected. Email is a
silent channel that nobody reads. The two-tier (PagerDuty + Slack) plus dashboard is enough; adding a third produces a tier that does nothing different from the dashboard.
See also #
- PRD-21: the PRD this ADR ships under, including the twelve sub-items across §21.0–§21.5.
- ADR-0007: scope boundary; the tail observes operational invariants only.
- ADR-0037: @metaintro/snappy-contracts is the authorship home for C1–C7 and the STRUCTURED_EVENT vocabulary the tail verifies against and the new-event watchlist surfaces.
- src/runners/sweep/index.ts: the sibling subsystem whose parent-spawn shape src/runners/tail/index.ts mirrors.
- src/systems/snappy/domain-activation.ts: the verifier file the post-hoc mode refactors (extracts the five verify* exports + context helpers).
- src/components/clickhouse/ready.ts: the read-only sibling whose oauth-and-fetch pattern clickhouse/insert.ts mirrors.