ADR-0015 — Segment concurrency: parallel by default, sequential by declaration
ADR-0015 — Segment concurrency: parallel by default, sequential by declaration #
- Status: Accepted
- Date: 2026-05-07
- Deciders: Natan
Context #
A segment runs N child checks. Today's runPreflight runs them in sequence with await between each step() call — first the API probe, then S3, then Mongo, etc. Every probe waits for the previous one to settle.
That's correct only if there is a causal dependency between probes. In snappy-api/preflight there isn't: the API probe's result has no bearing on whether to ask S3, or Mongo, or Redis. The sequence is just the order Natan typed them in.
The cost of needless sequencing is wall time. The current preflight takes ~10 s end-to-end against production because two probes (Mongo, ClickHouse) hit their 5-s timeouts and the rest queue behind them. Run in parallel, the same eight probes finish in ~5 s — bounded by the single slowest probe, not the sum.
Two reasons not to default to parallel:
- Real dependencies exist — a future "data probes" segment
might want to confirm a DB is reachable before counting rows.
- Log interleaving — sibling probes that run concurrently
produce logs in unpredictable order, which used to make traces hard to read.
ADR-0014 already solved (2): every log line carries traceId, spanId, parentSpanId, so a grep reconstructs causation regardless of physical line order. The cost of interleaving dropped to zero.
(1) is real but rare. The right default flips: parallel when children are independent (the common case), sequential when the segment author declares a dependency.
Decision #
A segment is a function returning Promise<Result>. It declares its concurrency mode by which composition helper it calls:
parallelize(name, factories)— runs all child factories with
Promise.all and aggregates. Use when children are independent.
sequentialize(name, factories)— runs each factory after the
previous one settles. Use when a child consumes another's result or must observe the system in a known state.
Both helpers live in @lib/result.ts (or a sibling @lib/segment.ts if it grows) and produce the same composite Result shape — the difference is execution, not output schema.
runPreflight switches to parallel: its eight probes are independent.
Naming #
The two helpers are deliberately verbs (parallelize, sequentialize) rather than properties on a config object ({ mode: "parallel" }). The verb form makes the choice obvious at the segment's only call site — you can't accidentally inherit the "wrong" mode from a parent context, and a code reviewer can spot the choice in one glance.
Step IDs survive #
Step IDs (1.1, 1.2, …) keep their declaration order even when spans run concurrently. The number is the author's intended order, not the physical execution order. Walking the Result tree by step ID gives the canonical reading; walking it by observedStartedAt gives the actual execution timeline. Both are useful, both are present.
Failure mode #
In parallel, every child runs to completion even if one fails. Aggregation is unchanged: the parent's outcome is the worst child's outcome (per ADR-0012's severity ordering). This is deliberate — the value of running 8 probes in parallel is that you see all 8 results, not just the first failure. Fail-fast short-circuiting would defeat the point.
Promise.all rejecting on the first throw isn't an issue because checks are pure (ADR-0012); they never throw. If a buggy probe does throw, the segment's catch reports it as the segment's own error outcome and the whole tree still aggregates cleanly.
Trace concurrency #
Each child runs inside its own AsyncLocalStorage scope (step() calls runInSpan per-child), so sibling spans get distinct spanIds with the same parentSpanId. The trace is correct even when logs interleave.
Consequences #
- Pro: ~10 s preflight → ~5 s preflight against production.
Bounded by the slowest probe, not the sum. Wins compound as more probes are added.
- Pro: Children always all run, even if one fails. A first-time
reader of a failing run sees every diagnosis at once instead of having to fix one issue, re-run, fix the next.
- Pro: The choice between parallel and sequential is explicit
per segment, declared at the call site by which helper is used. No global flag, no inherited config.
- Pro: Step IDs (
1.1,1.2, …) stay meaningful as
declaration order. Tracing fields tell the actual execution story.
- Con: Parallel runs put 8 simultaneous connections to 8
external systems. Mostly fine in production (these are health probes, not load tests), but a runaway concurrency pattern in a future "many-system" PRD could DDOS infrastructure. Mitigation: PRDs that fan out to dozens of probes use a bounded parallelize variant with a concurrency limit (not built today; build when needed).
- Con: A segment author who reaches for
parallelizewhen a
real dependency exists will silently produce wrong results. Mitigated by: the rule of thumb is "can I write this segment as N independent factories with no shared state?" If yes, parallel. If no, sequential. The PR review for any new segment names which helper was chosen.
- Falsifiability: Revisit if (a) we hit a real DDOS-shape
concurrency issue against a real provider, at which point we add bounded parallelism; or (b) a segment author repeatedly uses sequentialize without a real dependency, suggesting the default flipped the wrong way (it didn't).
See also #
the first parallel segment.
src/lib/result.ts↗ —parallelize,
sequentialize, aggregate.
- ADR-0012 — pure
functions returning Result. Required: parallel composition only works because checks don't throw or share state.
- ADR-0014 — tracing makes
interleaved sibling logs readable. Required: parallel made interleaving normal, tracing made interleaving free to read.