ADR-0020 — Six-state outcome model: split timeout from probe-error



Status: Accepted
Date: 2026-05-08
Deciders: Natan
Source: Martin Kleppmann, Designing Data-Intensive Applications, 2nd ed. (O'Reilly, 2026), Chapter 9, "The Trouble with Distributed Systems". Audit and synthesis captured in PRD-06.

Supersedes (vocabulary, not body): the five-state model in ADR-0012. ADR-0012's body is preserved per the immutability rule; the Outcome type and severity ordering are widened by this ADR.

Context


ADR-0012 ratified five outcomes: pass / warn / fail / error / skip. The split between fail and error was load-bearing: the system answered wrongly vs. we couldn't ask. That split caught the JUnit failed/error distinction and resolved the original "everything is false" ambiguity.

The May 2026 audit against Designing Data-Intensive Applications surfaced a finer epistemic distinction inside error that ADR-0012 had not named. Kleppmann is explicit (DDIA Ch. 9):

"When a request times out, six things could be true and you cannot tell which: request lost, request queued, remote crashed, remote paused (GC), response lost, response delayed... The only information you have is that you haven't received a response yet. If you send a request to another node and don't receive a response, it is impossible to tell why."

So a probe timeout is the absence of information, not failure. But ADR-0012's error includes both:

- **Network failure:** AbortSignal.timeout fires, ECONNREFUSED, DNS failure. We don't know what the SUT did. Could mean it succeeded; could mean it crashed; could mean the network ate the response.
- **Probe-side bug:** JSON parse failure, type mismatch, missing field, code error. The probe's own logic broke; the SUT may be perfectly fine.

The two collapse into error, which exits 1 in CI. Three concrete costs of the conflation:

1. **CI flakes on transient network issues become indistinguishable from real SUT bugs.** The same exit code; the same red checkmark. Operators who get paged at 3am can't tell which.
2. **Future SLO calculations inflate.** If we ever compute "fraction of api.ready events that passed in the last 30 days," error events count against the rate even though half of them were our flaky office wifi.
3. **Retry policy can't be expressed cleanly.** DDIA Ch. 2 says retry with backoff and jitter, but only on cases where retrying might help. A timeout might: the SUT might come back. A parse bug won't: the next attempt will fail the same way. Without a name for the difference, the retry rule has nowhere to attach.

We considered three options.

1. **Status quo.** Accept the conflation. Cheapest. Fails on cost (1): CI flakes will continue.
2. **Treat unknown as a sub-context within error.** Add a Result.context.errorClass: "timeout" | "probe-bug" | ... discriminator without changing the outcome enum. Lower-risk migration, but consumers (renderer, exit-code mapper, future SLO computer) all have to inspect a nested field. Drift-prone.
3. **Add unknown as a sixth outcome.** Higher-risk migration (touches every probe + every test + every outcome consumer), but the discriminator is at the top level where every downstream tool naturally finds it.
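To make the difference between options 2 and 3 concrete, here is a minimal sketch of one consumer, a renderer label, written against each shape. The names `NestedResult`, `FlatResult`, `labelNested`, `labelFlat` and `LABELS` are hypothetical and do not come from the codebase; the nested union is limited to the two members the option names, since the rest are elided.

```typescript
// Option 2 (hypothetical shape): five outcomes plus a nested discriminator.
type FiveOutcome = "pass" | "warn" | "fail" | "error" | "skip";
type ErrorClass = "timeout" | "probe-bug"; // further members are elided in the ADR

interface NestedResult {
  outcome: FiveOutcome;
  context?: { errorClass?: ErrorClass };
}

// Every consumer must remember this nested check. Forgetting it compiles
// cleanly and silently restores the old conflation.
function labelNested(r: NestedResult): string {
  if (r.outcome === "error" && r.context?.errorClass === "timeout") {
    return "UNKNOWN";
  }
  return r.outcome.toUpperCase();
}

// Option 3: the distinction is a top-level outcome.
type SixOutcome = "pass" | "warn" | "fail" | "unknown" | "error" | "skip";

interface FlatResult {
  outcome: SixOutcome;
}

// Record<SixOutcome, string> must list every outcome, so a consumer that
// omits "unknown" fails type-checking instead of drifting.
const LABELS: Record<SixOutcome, string> = {
  pass: "PASS",
  warn: "WARN",
  fail: "FAIL",
  unknown: "UNKNOWN",
  error: "ERROR",
  skip: "SKIP",
};

function labelFlat(r: FlatResult): string {
  return LABELS[r.outcome];
}
```

Under the nested shape the compiler cannot tell whether a consumer has handled the timeout case; under the flat shape it can. That is one reading of "drift-prone" in option 2.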

Decision


Adopt option 3. The Outcome enum becomes six-valued:

export type Outcome = "pass" | "warn" | "fail" | "unknown" | "error" | "skip";

unknown semantics:

The probe could not get a successful response within its budget, but no probe-side bug occurred. Network timeout, ECONNREFUSED, DNS failure, TLS handshake timeout, or any thrown exception identifiable as transport-level. We don't know what the SUT did.

error semantics tighten:

The probe itself failed. JSON parse failure, missing field a probe assumed, type mismatch, code bug, internal invariant violation. Our code broke; the SUT's state is unknown but probably fine.
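As a rough illustration of the two semantics, the constructors might look like the sketch below. The `Result` envelope shown here is an assumption (the real shape in src/lib/result.ts is not reproduced in this ADR); only the call shapes `unknown(name, reason, ctx)` and `error(name, err, ctx)` follow the catch-block usage in this document.

```typescript
type Outcome = "pass" | "warn" | "fail" | "unknown" | "error" | "skip";

// Hypothetical envelope: the real Result may carry more or different fields.
interface Result {
  name: string;
  outcome: Outcome;
  message: string;
  context: Record<string, unknown>;
}

// Transport-level failure: no statement is made about the SUT.
function unknown(
  probeName: string,
  reason: string,
  context: Record<string, unknown> = {},
): Result {
  return { name: probeName, outcome: "unknown", message: reason, context };
}

// Probe-side fault: the probe's own logic broke.
function error(
  probeName: string,
  err: unknown,
  context: Record<string, unknown> = {},
): Result {
  const message = err instanceof Error ? err.message : String(err);
  return { name: probeName, outcome: "error", message, context };
}

// A refused connection says nothing about the SUT's correctness...
const refused = unknown("api.ready", "ECONNREFUSED", { attempt: 1 });
// ...while a parse failure is the probe's own problem.
const parseBug = error("api.ready", new SyntaxError("Unexpected token <"));
```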

Severity ordering and exit code

The new severity table:

| Outcome | Severity | DDIA framing | Exit code |
|---|---|---|---|
| `pass` | 0 | Safety holds | 0 |
| `skip` | 1 | Liveness deferred | 0 |
| `warn` | 2 | Eventual consistency | 0 |
| `unknown` | 3 | Absence of information (Ch. 9) | 0 |
| `fail` | 4 | Safety violation | 1 |
| `error` | 5 | Probe-side fault | 1 |

unknown exits 0 because we don't know whether there is a problem. Treating it as exit 1 would inflate the error rate during incidents that aren't ours. Operators who want to alert on unknown runs can treat them as a separate availability signal, not a correctness one; the JSON envelope carries the count of unknown outcomes.

The skip-promotion rule from ADR-0012 ("composites whose worst child is skip promote to warn") is unchanged. unknown does not promote.
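The severity table and the skip-promotion rule can be sketched as follows. `exitCodeFor` is named in this ADR, but its body here is an illustration; `SEVERITY`, `worstOf` and `compositeOutcome` are hypothetical names. Two assumptions are not stated in the ADR: a composite takes the worst outcome among its children, and an empty composite counts as pass.

```typescript
type Outcome = "pass" | "warn" | "fail" | "unknown" | "error" | "skip";

// Severity per outcome, transcribed from the table above.
const SEVERITY: Record<Outcome, number> = {
  pass: 0,
  skip: 1,
  warn: 2,
  unknown: 3,
  fail: 4,
  error: 5,
};

// Six branches, mapped explicitly; only fail and error break the build.
function exitCodeFor(outcome: Outcome): 0 | 1 {
  switch (outcome) {
    case "pass":
    case "skip":
    case "warn":
    case "unknown":
      return 0;
    case "fail":
    case "error":
      return 1;
  }
}

// Highest-severity outcome among the children (assumption: empty => "pass").
function worstOf(children: Outcome[]): Outcome {
  return children.reduce<Outcome>(
    (worst, o) => (SEVERITY[o] > SEVERITY[worst] ? o : worst),
    "pass",
  );
}

// Skip-promotion from ADR-0012: a composite whose worst child is skip
// becomes warn. unknown is passed through unchanged.
function compositeOutcome(children: Outcome[]): Outcome {
  const worst = worstOf(children);
  return worst === "skip" ? "warn" : worst;
}
```

With these definitions, `compositeOutcome(["pass", "skip"])` is `"warn"`, while `compositeOutcome(["pass", "unknown"])` stays `"unknown"` and still maps to exit code 0.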

Catch-block discrimination

Every probe's catch (err) block discriminates by error class:

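try {
  // ... probe body: query the SUT and evaluate its response ...
} catch (err) {
  if (isTransportError(err)) {
    // Transport-level failure: we cannot tell what the SUT did.
    return unknown(NAME, transportReason(err), ctx);
  }
  // Everything else is treated as a probe-side fault.
  return error(NAME, err, ctx);
}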

The isTransportError predicate (added to @lib/result.ts as part of PRD-06) recognizes:

- TimeoutError (raised when AbortSignal.timeout fires) and AbortError (raised when a signal is aborted manually)
- Error whose code is in `{ "ECONNREFUSED", "ECONNRESET", "ETIMEDOUT", "EHOSTUNREACH", "ENETUNREACH", "ENOTFOUND" }`
- Driver-specific equivalents the major SDKs raise (mongo's MongoServerSelectionError, ioredis's connection-refused exception, etc.) where they're greppable as a string match (heuristic, like the existing isAuthError patterns)

Anything else falls through to error. The classification is explicit and conservative: when in doubt, it's error, not unknown. We'd rather over-attribute to ourselves than to the network.
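A minimal sketch of such a predicate, under stated assumptions, is shown below. The code set is the one listed above; `TRANSPORT_CODES` and `DRIVER_PATTERNS` are hypothetical names, the only driver pattern included is the MongoServerSelectionError named in this ADR, and the body of `transportReason` is a guess at what a reason string could contain.

```typescript
// Error codes listed in this ADR.
const TRANSPORT_CODES = new Set<string>([
  "ECONNREFUSED",
  "ECONNRESET",
  "ETIMEDOUT",
  "EHOSTUNREACH",
  "ENETUNREACH",
  "ENOTFOUND",
]);

// Heuristic string matches for driver-specific errors (assumption: only the
// one driver error the ADR names explicitly is listed here).
const DRIVER_PATTERNS: RegExp[] = [/MongoServerSelectionError/];

interface ErrorLike {
  name?: unknown;
  code?: unknown;
  message?: unknown;
}

function isTransportError(err: unknown): boolean {
  // Conservative default: anything that is not an object is a probe-side fault.
  if (typeof err !== "object" || err === null) return false;
  const { name, code, message } = err as ErrorLike;

  if (name === "AbortError" || name === "TimeoutError") return true;
  if (typeof code === "string" && TRANSPORT_CODES.has(code)) return true;

  const haystack = `${typeof name === "string" ? name : ""} ${
    typeof message === "string" ? message : ""
  }`;
  return DRIVER_PATTERNS.some((pattern) => pattern.test(haystack));
}

// Hypothetical: prefer the error code, then the error name, as the reason.
function transportReason(err: unknown): string {
  const { name, code } = (err ?? {}) as ErrorLike;
  if (typeof code === "string") return code;
  if (typeof name === "string") return name;
  return "transport error";
}
```

Because the predicate only returns true on a positive match, an unrecognized exception lands in error, which matches the "when in doubt, it's error" rule.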

Migration cost

ADR-0012's body is preserved (immutability rule). This ADR's Decision is the new vocabulary; ADR-0012's framing of "five outcomes, one envelope shape" is updated downstream:

- src/lib/result.ts: Outcome widens; unknown() constructor added; severity table updated.
- All eight probe catch blocks: discriminate on error class.
- exitCodeFor(): unchanged shape, six branches mapped explicitly.
- All consumer surfaces: renderer (render.ts), summary block, five-case truth tables in tests.
- Glossary, CONTRIBUTING: unknown row added.

PRD-06 carries the migration as item 06.1.

Falsifiability

This ADR holds if and only if:

- (a) Within six months, at least one CI flake is caught as unknown (exit 0) that would have been an error (exit 1) under ADR-0012's model. Predicted because the audit identified live cases.
- (b) No probe's catch block lumps a timeout and a parse error under one outcome after PRD-06 closes. Mechanically enforceable via Rule 40 in make standards.
- (c) Future retry logic, when it lands, retries only on unknown outcomes (never on fail, never on error). Pinned in event-shape.md as the contract.

Revisit if (a) doesn't hold within six months — the conflation wasn't actually causing the false positives the audit predicted, so the migration cost wasn't justified.
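Condition (c) could eventually be expressed as in the sketch below. No retry logic exists yet, so everything here is hypothetical: the names `shouldRetry`, `backoffDelayMs` and `runWithRetry`, the default base, cap and attempt count, and the choice of "full jitter" (a uniformly random delay below an exponentially growing ceiling) as the backoff scheme.

```typescript
type Outcome = "pass" | "warn" | "fail" | "unknown" | "error" | "skip";

// The contract from condition (c): only unknown is retryable.
const shouldRetry = (outcome: Outcome): boolean => outcome === "unknown";

// Full-jitter exponential backoff: a uniform delay in
// [0, min(capMs, baseMs * 2^attempt)). `rand` is injectable for testing.
function backoffDelayMs(
  attempt: number,
  baseMs = 100,
  capMs = 5_000,
  rand: () => number = Math.random,
): number {
  const ceiling = Math.min(capMs, baseMs * 2 ** attempt);
  return Math.floor(rand() * ceiling);
}

// Re-runs the probe while its outcome is unknown, up to maxAttempts runs.
async function runWithRetry(
  probe: () => Promise<Outcome>,
  maxAttempts = 3,
  sleep: (ms: number) => Promise<void> = (ms) =>
    new Promise((resolve) => setTimeout(resolve, ms)),
): Promise<{ outcome: Outcome; attempts: number }> {
  let outcome = await probe();
  let attempts = 1;
  while (shouldRetry(outcome) && attempts < maxAttempts) {
    await sleep(backoffDelayMs(attempts - 1));
    outcome = await probe();
    attempts += 1;
  }
  return { outcome, attempts };
}
```

A probe that returns error or fail is run exactly once, since repeating it would fail the same way; a probe that keeps returning unknown stops after `maxAttempts` runs and reports unknown.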

Consequences


- **Pro:** The CI exit code reflects correctness, not availability. A flaky probe network doesn't break the build.
- **Pro:** Future retry logic has a clean predicate (outcome === "unknown").
- **Pro:** Future SLO calculation has a clean filter: unknown events excluded from correctness rate, included in availability rate, two separate metrics. Both queryable.
- **Pro:** The model lines up with DDIA Ch. 9's epistemic frame exactly. New contributors who've read the book recognize the vocabulary.
- **Pro:** Better postmortems. A historical run that exited 0 with unknown=3 tells you "there were three transient transport issues" without conflating them with "three SUT-side bugs".
- **Con:** Migration touches every probe (8 × catch block) and every test that asserts on outcome shape.
- **Con:** Heuristic error-class detection (regex on driver exception strings) is fragile when SDKs change error messages. Mitigated: same pattern as the existing isAuthError classification; reviewable, fixable.
- **Con:** The five-case truth-table tests in src/__tests__/lib/result.test.ts become six-case. Trivial cost.
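The SLO consequence can be illustrated with a small calculation. No SLO computation exists yet, so `SloRates` and `sloRates` are hypothetical. The definitions used are one possible reading: correctness is passes over events that got an answer (everything except unknown), and availability is answered events over all events. How warn, skip and error should be weighted is left open here and would need its own decision.

```typescript
type Outcome = "pass" | "warn" | "fail" | "unknown" | "error" | "skip";

interface SloRates {
  correctness: number | null; // null when no event got an answer
  availability: number | null; // null when there are no events
}

function sloRates(outcomes: Outcome[]): SloRates {
  const total = outcomes.length;
  // unknown events are filtered out of the correctness denominator...
  const answered = outcomes.filter((o) => o !== "unknown").length;
  const passed = outcomes.filter((o) => o === "pass").length;
  return {
    correctness: answered === 0 ? null : passed / answered,
    // ...and are the only events that count against availability.
    availability: total === 0 ? null : answered / total,
  };
}

// Ten api.ready events: seven passes, one real failure, two timeouts.
const sample: Outcome[] = [
  ...Array<Outcome>(7).fill("pass"),
  "fail",
  "unknown",
  "unknown",
];
const rates = sloRates(sample);
// rates.correctness is 7 / 8 = 0.875; rates.availability is 8 / 10 = 0.8.
// Under the five-state model the two timeouts would have been errors, and a
// single pass rate would have read 7 / 10 = 0.7.
```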