ADR-0020 — Six-state outcome model: split timeout from probe-error
- Status: Accepted
- Date: 2026-05-08
- Deciders: Natan
- Source: Martin Kleppmann, Designing Data-Intensive Applications, 2nd ed. (O'Reilly, 2026), Chapter 9, "The Trouble with Distributed Systems". Audit and synthesis captured in PRD-06.
- Supersedes (vocabulary, not body): the five-state model in
ADR-0012. ADR-0012's body is preserved per the immutability rule; the Outcome type and severity ordering are widened by this ADR.
Context
ADR-0012 ratified five outcomes: pass / warn / fail / error / skip. The split between fail and error was load-bearing: the system answered wrongly vs. we couldn't ask. That split caught the JUnit failed/error distinction and resolved the original "everything is false" ambiguity.
The May 2026 audit against Designing Data-Intensive Applications surfaced a finer epistemic distinction inside error that ADR-0012 had not named. Kleppmann is explicit (DDIA Ch. 9):
"When a request times out, six things could be true and you cannot tell which: request lost, request queued, remote crashed, remote paused (GC), response lost, response delayed... The only information you have is that you haven't received a response yet. If you send a request to another node and don't receive a response, it is impossible to tell why."
So a probe timeout is the absence of information, not failure. But ADR-0012's error includes both:
- Network failure: `AbortSignal.timeout` fires, ECONNREFUSED, DNS failure. We don't know what the SUT did. Could mean it succeeded; could mean it crashed; could mean the network ate the response.
- Probe-side bug — JSON parse failure, type mismatch, missing
field, code error. The probe's own logic broke; the SUT may be perfectly fine.
The two collapse into error, which exits 1 in CI. Three concrete costs of the conflation:
1. **CI flakes on transient network issues become indistinguishable from real SUT bugs.** The same exit code; the same red checkmark. Operators who get paged at 3am can't tell which.
2. **Future SLO calculations inflate.** If we ever compute "fraction of api.ready events that passed in the last 30 days," error events count against the rate even though half of them were our flaky office wifi.
3. **Retry policy can't be expressed cleanly.** DDIA Ch. 2 says retry with backoff and jitter, but only on cases where retrying might help. A timeout might, because the SUT might come back. A parse bug won't, because the next attempt will fail the same way. Without a name for the difference, the retry rule has nowhere to attach.
We considered three options.
1. Status quo. Accept the conflation. Cheapest. Fails on cost (1): CI flakes will continue.
2. Treat `unknown` as a sub-context within `error`. Add a `Result.context.errorClass: "timeout" | "probe-bug" | ...` discriminator without changing the outcome enum. Lower-risk migration, but consumers (renderer, exit-code mapper, future SLO computer) all have to inspect a nested field. Drift-prone.
3. Add `unknown` as a sixth outcome. Higher-risk migration (touches every probe + every test + every outcome consumer), but the discriminator is at the top level where every downstream tool naturally finds it.
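
To make the difference between the nested-discriminator option and the sixth-outcome option concrete, here is a minimal sketch of what a consumer would write under each. The type and function names (`NestedResult`, `FlatResult`, `isTransportNested`, `isTransportFlat`) are hypothetical and exist only for illustration; only the outcome names and the `context.errorClass` field come from this ADR, and the ADR leaves the list of error classes open.

```typescript
// Nested-discriminator shape (hypothetical): the outcome enum stays
// five-valued and the timeout/probe-bug split lives in an optional field.
type FiveOutcome = "pass" | "warn" | "fail" | "error" | "skip";
interface NestedResult {
  outcome: FiveOutcome;
  // The ADR elides further classes; only these two are named.
  context?: { errorClass?: "timeout" | "probe-bug" };
}

// Sixth-outcome shape: the split is visible at the top level.
type SixOutcome = "pass" | "warn" | "fail" | "unknown" | "error" | "skip";
interface FlatResult {
  outcome: SixOutcome;
}

// Every consumer of the nested shape has to remember to look inside
// `context`; forgetting the second check silently treats timeouts as bugs.
function isTransportNested(r: NestedResult): boolean {
  return r.outcome === "error" && r.context?.errorClass === "timeout";
}

// With the flat shape the same question is one top-level comparison.
function isTransportFlat(r: FlatResult): boolean {
  return r.outcome === "unknown";
}
```

The nested version also has a silent failure mode: a result with `outcome: "error"` and no `context` cannot be classified at all, which is the drift risk the second option carries.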
Decision
Adopt option 3. The Outcome enum becomes six-valued:

```ts
export type Outcome = "pass" | "warn" | "fail" | "unknown" | "error" | "skip";
```

unknown semantics:
The probe could not get a successful response within its budget, but no probe-side bug occurred. Network timeout, ECONNREFUSED, DNS failure, TLS handshake timeout, or any thrown exception identifiable as transport-level. We don't know what the SUT did.
error semantics tighten:
The probe itself failed. JSON parse failure, missing field a probe assumed, type mismatch, code bug, internal invariant violation. Our code broke; the SUT's state is unknown but probably fine.
Severity ordering and exit code
The new severity table:
| Outcome | Severity | DDIA framing | Exit code |
|---|---|---|---|
| pass | 0 | Safety holds | 0 |
| skip | 1 | Liveness deferred | 0 |
| warn | 2 | Eventual consistency | 0 |
| unknown | 3 | Absence of information (Ch. 9) | 0 |
| fail | 4 | Safety violation | 1 |
| error | 5 | Probe-side fault | 1 |
unknown exits 0 because we don't know if there's a problem. Treating it as exit 1 inflates the error rate during incidents that aren't ours. Operators who want to alert on unknown runs treat it as a separate availability signal, not a correctness one. The JSON envelope carries the count.
The skip-promotion rule from ADR-0012 ("composites whose worst child is skip promote to warn") is unchanged. unknown does not promote.
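
The severity table and exit-code mapping above can be sketched as follows. This is an illustration under stated assumptions, not the contents of `src/lib/result.ts`: the `SEVERITY` constant and the `compositeOutcome` helper are hypothetical names, and the signature of `exitCodeFor` is assumed (the ADR only says it has six explicitly mapped branches). The composite helper takes a non-empty list so that no behavior has to be invented for the empty case.

```typescript
type Outcome = "pass" | "warn" | "fail" | "unknown" | "error" | "skip";

// Severity values copied from the table in this ADR.
const SEVERITY: Record<Outcome, number> = {
  pass: 0, skip: 1, warn: 2, unknown: 3, fail: 4, error: 5,
};

// Six branches mapped explicitly. The `never` assignment makes the compiler
// reject any outcome added later without an exit-code decision.
function exitCodeFor(outcome: Outcome): 0 | 1 {
  switch (outcome) {
    case "pass":
    case "skip":
    case "warn":
    case "unknown":
      return 0;
    case "fail":
    case "error":
      return 1;
    default: {
      const unreachable: never = outcome;
      throw new Error(`unhandled outcome: ${String(unreachable)}`);
    }
  }
}

// Hypothetical composite helper: pick the worst child by severity, then
// apply the ADR-0012 rule that a worst child of `skip` promotes to `warn`.
// `unknown` is returned as-is because it does not promote.
function compositeOutcome(children: readonly [Outcome, ...Outcome[]]): Outcome {
  const worst = children.reduce((a, b) => (SEVERITY[b] > SEVERITY[a] ? b : a));
  return worst === "skip" ? "warn" : worst;
}
```

Note that severity and exit code are deliberately separate: `unknown` outranks `warn` when choosing a composite's worst child, yet still exits 0.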
Catch-block discrimination
Every probe's catch (err) block discriminates by error class:
```ts
} catch (err) {
  if (isTransportError(err)) {
    return unknown(NAME, transportReason(err), ctx);
  }
  return error(NAME, err, ctx);
}
```

The isTransportError predicate (added to @lib/result.ts as part of PRD-06) recognizes:
- `AbortError` / `TimeoutError` (from `AbortSignal.timeout`)
- `Error` whose `code` is in `{ "ECONNREFUSED", "ECONNRESET", "ETIMEDOUT", "EHOSTUNREACH", "ENETUNREACH", "ENOTFOUND" }`
- Driver-specific equivalents the major SDKs raise (mongo's
MongoServerSelectionError, ioredis's connection-refused exception, etc.) where they're greppable as a string match (heuristic, like the existing isAuthError patterns)
Anything else falls through to error. The classification is explicit and conservative: when in doubt, it's error, not unknown. We'd rather over-attribute to ourselves than to the network.
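
One way the predicate described above could look is sketched below. This is not the code in `@lib/result.ts`; it is a minimal illustration assuming the three recognition rules listed in this ADR. The constants `TRANSPORT_CODES` and `DRIVER_NAME_PATTERNS` are hypothetical names, and matching driver errors by their `name` is an assumed mechanism (the ADR only says they are matched heuristically as strings). The check is duck-typed rather than relying on `instanceof Error`.

```typescript
// Transport-level errno codes listed in this ADR.
const TRANSPORT_CODES: ReadonlySet<string> = new Set([
  "ECONNREFUSED", "ECONNRESET", "ETIMEDOUT",
  "EHOSTUNREACH", "ENETUNREACH", "ENOTFOUND",
]);

// Assumed heuristic: match driver errors by class name. Only the mongo name
// is taken from the ADR; a real list would mirror the isAuthError patterns.
const DRIVER_NAME_PATTERNS: readonly RegExp[] = [/^MongoServerSelectionError$/];

function isTransportError(err: unknown): boolean {
  // Conservative default: anything that is not an error-like object is not
  // classified as transport, so it falls through to `error`.
  if (typeof err !== "object" || err === null) return false;
  const { name, code } = err as { name?: unknown; code?: unknown };

  // AbortSignal.timeout() aborts with a "TimeoutError"; a manual abort
  // produces an "AbortError".
  if (name === "AbortError" || name === "TimeoutError") return true;

  // Node-style system errors carry the errno name in `code`.
  if (typeof code === "string" && TRANSPORT_CODES.has(code)) return true;

  return typeof name === "string" && DRIVER_NAME_PATTERNS.some((p) => p.test(name));
}
```

Because every branch is an allow-list, an unrecognized exception returns `false` and is attributed to the probe, which matches the "when in doubt, it's error" rule.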
Migration cost
ADR-0012's body is preserved (immutability rule). This ADR's Decision is the new vocabulary; ADR-0012's framing of "five outcomes, one envelope shape" is updated downstream:
- `src/lib/result.ts`: `Outcome` widens; `unknown()` constructor added; severity table updated.
- All eight probe `catch` blocks: discriminate on error class.
- `exitCodeFor()`: unchanged shape, six branches mapped explicitly.
- All consumer surfaces: renderer (`render.ts`), summary block, five-case truth tables in tests.
- Glossary, CONTRIBUTING: `unknown` row added.
PRD-06 carries the migration as item 06.1.
Falsifiability
This ADR holds if and only if:
- (a) Within six months, at least one CI flake is caught by `unknown` (exit 0) that would have been an `error` (exit 1) under ADR-0012's model. Predicted because the audit identified live cases.
- (b) No probe's `catch` block lumps a timeout and a parse error under one outcome after PRD-06 closes. Mechanically enforceable via Rule 40 in `make standards`.
- (c) Future retry logic, when it lands, only retries on `unknown` outcomes (never on `fail`, never on `error`). Pinned in event-shape.md as the contract.
Revisit if (a) doesn't hold within six months — the conflation wasn't actually causing the false positives the audit predicted, so the migration cost wasn't justified.
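
Condition (c) can be illustrated with a small retry wrapper. Retry logic has not landed, so everything here is a hypothetical sketch: `ProbeResult`, `shouldRetry`, `backoffDelayMs`, `runWithRetry`, and all default values are assumptions, and the full-jitter backoff is one common choice rather than something this ADR prescribes. The `sleep` and `rand` parameters are injectable so the behavior can be checked without real delays.

```typescript
type Outcome = "pass" | "warn" | "fail" | "unknown" | "error" | "skip";
interface ProbeResult {
  outcome: Outcome;
}

// The contract from (c): only `unknown` is worth retrying.
function shouldRetry(outcome: Outcome): boolean {
  return outcome === "unknown";
}

// Full-jitter exponential backoff: a value in [0, min(cap, base * 2^attempt)).
function backoffDelayMs(
  attempt: number,
  baseMs = 100,
  capMs = 5000,
  rand: () => number = Math.random,
): number {
  return rand() * Math.min(capMs, baseMs * Math.pow(2, attempt));
}

// Runs the probe up to `maxAttempts` times, stopping as soon as the outcome
// is anything other than `unknown`.
async function runWithRetry(
  probe: () => Promise<ProbeResult>,
  maxAttempts = 3,
  sleep: (ms: number) => Promise<void> = (ms) =>
    new Promise<void>((resolve) => setTimeout(resolve, ms)),
): Promise<ProbeResult> {
  let result = await probe();
  for (let attempt = 1; attempt < maxAttempts && shouldRetry(result.outcome); attempt++) {
    await sleep(backoffDelayMs(attempt - 1));
    result = await probe();
  }
  return result;
}
```

A `fail` or `error` result returns after the first attempt, so a parse bug is never retried into the same failure.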
Consequences
- Pro: The CI exit code reflects correctness, not
availability. A flaky probe network doesn't break the build.
- Pro: Future retry logic has a clean predicate
(outcome === "unknown").
- Pro: Future SLO calculation has a clean filter: `unknown` events are excluded from the correctness rate and included in the availability rate, as two separate metrics. Both queryable.
- Pro: The model lines up with DDIA Ch. 9's epistemic frame
exactly. New contributors who've read the book recognize the vocabulary.
- Pro: Better postmortems. A historical run that exited 0 with
unknown=3 tells you "there were three transient transport issues" without conflating them with "three SUT-side bugs".
- Con: Migration touches every probe (8 × `catch` block) and every test that asserts on outcome shape.
- Con: Heuristic error-class detection (regex on driver
exception strings) is fragile when SDKs change error messages. Mitigated: same pattern as the existing isAuthError classification; reviewable, fixable.
- Con: The five-case truth-table tests in
src/__tests__/lib/result.test.ts become six-case. Trivial cost.
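
The SLO split named in the list above could be computed along these lines. No SLO computation exists yet, so this is only one possible reading: the `sloRates` function and the `Rates` shape are hypothetical, correctness is assumed to be passes over events that produced an answer, and availability is assumed to be answered events over all events. How `skip`, `warn`, and `error` should count toward either rate is not decided by this ADR; here they simply count as answered and not passed.

```typescript
type Outcome = "pass" | "warn" | "fail" | "unknown" | "error" | "skip";

interface Rates {
  correctness: number;  // passes / events that were not `unknown`
  availability: number; // events that were not `unknown` / all events
}

function sloRates(outcomes: readonly Outcome[]): Rates {
  const total = outcomes.length;
  const answered = outcomes.filter((o) => o !== "unknown").length;
  const passed = outcomes.filter((o) => o === "pass").length;
  return {
    // NaN signals "no data" rather than pretending the rate is 0 or 1.
    correctness: answered === 0 ? NaN : passed / answered,
    availability: total === 0 ? NaN : answered / total,
  };
}
```

For example, two passes, one fail, and one unknown give a correctness rate of 2/3 and an availability rate of 3/4; under the five-state model the same run would have reported a single rate of 2/4.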
See also
- ADR-0012 — the
five-state model this ADR widens.
- ADR-0014 — trace + span
model that carries the outcome.
- ADR-0019 — pipeline
egress shape; consumes the six-state events.
- event-shape.md: the contract for events carrying this outcome.
- Rule 40 enforces the catch-block discrimination.
- PRD-06: the work that landed the migration plus retry/idempotency hygiene.
- DDIA Ch. 9: source of the epistemic frame.
the source for ADR-0012's original fail/error split this ADR builds on.