ADR-0014 — Every run is a trace; every step is a span
- Status: Accepted
- Date: 2026-05-07
- Deciders: Natan
Context
A run isn't one event — it's a causal tree. Each leaf probe runs inside a segment, inside a system flow, inside a top-level run, with a trigger that caused the whole thing to start. SQA already produced a Result tree (ADR-0012) and observed-timing fields (ADR-0013), but nothing in the run carried:
- A run identifier. No way to grep a single run out of the logs.
- Span identity per step(). A log line emitted inside mongo.ready couldn't be tied back to "step 1.3 of run R" beyond the human step ID, which isn't unique across runs.
- Parent-child causation. Reconstruction of the tree from logs
needed each span to point at its parent.
- Concurrency disambiguation. The moment preflight goes parallel
(which is the right default), sibling spans overlap in the log; without span IDs you can't tell which [mongo.ready] line belongs to which run, let alone which span.
- The trigger. Not all systems have request IDs or session IDs
pointing at SQA. SQA has to carry who/why initiated this run itself — humans, cron, CI, incident response — or the data disappears at the moment we need it most.
The problem statement, in Natan's words: "not all systems have tracing IDs, request IDs. When we put the flow or system together we're the one who needs to understand who triggered what and the chain of reaction. We have to map it in the end."
This is distributed tracing, scoped to one process. The shape is industry-standard (W3C Trace Context, OpenTelemetry); we're adopting the vocabulary so that future export to a real backend (Tempo / Honeycomb / Loki) is mechanical.
Decision
Every SQA run is a trace. Every step() call inside the run is a span. The shape mirrors OpenTelemetry's data model so future export costs nothing semantic.
IDs (W3C / OTel format)
| Field | Format | Where it's set |
|---|---|---|
| traceId | 16 hex chars (64-bit; pad to 128 on export) | Once per run, in the runner. |
| spanId | 8 hex chars (32-bit) | Once per step() call. |
| parentSpanId | 8 hex chars | Inherited from the enclosing span. |
We use OTel-style hex IDs rather than UUIDs so that, on the day we export to a real backend, the IDs need at most zero-padding to the standard widths. No remapping.
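A minimal sketch of ID generation and export padding, assuming the widths in the table above (16 and 8 hex chars). The helper names (`newTraceId`, `newSpanId`, `toW3cTraceId`, `toW3cSpanId`, `toTraceparent`) are hypothetical, and padding the span ID to 16 hex chars on export is an assumption; the source only mentions padding the trace ID.

```typescript
import { randomBytes } from "node:crypto";

// Run-local IDs, sized as in the table above.
const newTraceId = (): string => randomBytes(8).toString("hex"); // 16 hex chars
const newSpanId = (): string => randomBytes(4).toString("hex"); // 8 hex chars

// Hypothetical export helpers: left-pad with zeros to the W3C Trace Context
// widths (32 hex chars for trace-id, 16 hex chars for parent-id).
const toW3cTraceId = (traceId: string): string => traceId.padStart(32, "0");
const toW3cSpanId = (spanId: string): string => spanId.padStart(16, "0");

// Hypothetical: render a W3C `traceparent` value
// (version "00", trace-id, parent-id, flags "01" = sampled).
function toTraceparent(traceId: string, spanId: string): string {
  return `00-${toW3cTraceId(traceId)}-${toW3cSpanId(spanId)}-01`;
}
```

Because padding only prepends zeros, the short ID stays greppable as a suffix of the exported one.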
Trigger taxonomy
Every run carries a Trigger that is captured at startup and never thrown away. It lives on the root span and surfaces in the top-level Result.context.trigger.
| kind | When it's set |
|---|---|
| human | TTY detected, or SQA_TRIGGER=human. |
| cron | SQA_TRIGGER=cron (never auto-detected). |
| ci | CI env vars detected, or SQA_TRIGGER=ci. |
| incident | SQA_TRIGGER=incident (operator running it). |
| unknown | None of the above; warn at startup. |
Detection priority: explicit SQA_TRIGGER env > CI heuristic (CI=true, GITHUB_ACTIONS=true, etc.) > TTY heuristic (stdout.isTTY) > unknown with a warning. Never default to cron: explicit is better, and a missing trigger during a real outage shouldn't masquerade as a routine cron run.
The trigger record carries optional identity: user ($USER), host ($HOSTNAME), ciBuild ($GITHUB_RUN_ID / equivalent), cronId ($SQA_CRON_ID), incidentRef ($SQA_INCIDENT_REF).
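The priority order above can be sketched as a pure function. This is an illustration, not the code in src/lib/trace.ts: the name `detectTriggerSketch`, its signature, the injected `env` / `isTTY` parameters, and the choice to ignore an unrecognised SQA_TRIGGER value are all assumptions.

```typescript
type TriggerKind = "human" | "cron" | "ci" | "incident" | "unknown";

interface Trigger {
  kind: TriggerKind;
  user?: string;
  host?: string;
  ciBuild?: string;
  cronId?: string;
  incidentRef?: string;
}

type Env = Record<string, string | undefined>;

const EXPLICIT_KINDS: readonly string[] = ["human", "cron", "ci", "incident"];

// env and isTTY are injected so the priority order can be exercised
// without touching the real process state.
function detectTriggerSketch(env: Env, isTTY: boolean): Trigger {
  const explicit = env["SQA_TRIGGER"];
  let kind: TriggerKind;
  if (explicit !== undefined && EXPLICIT_KINDS.indexOf(explicit) !== -1) {
    kind = explicit as TriggerKind; // 1. explicit env wins
  } else if (env["CI"] === "true" || env["GITHUB_ACTIONS"] === "true") {
    kind = "ci"; // 2. CI heuristic
  } else if (isTTY) {
    kind = "human"; // 3. TTY heuristic
  } else {
    kind = "unknown"; // 4. never default to cron; caller warns at startup
  }
  return {
    kind,
    user: env["USER"],
    host: env["HOSTNAME"],
    ciBuild: env["GITHUB_RUN_ID"],
    cronId: env["SQA_CRON_ID"],
    incidentRef: env["SQA_INCIDENT_REF"],
  };
}
```

A caller would pass `process.env` and `Boolean(process.stdout.isTTY)`, then log the startup warning when `kind` is `"unknown"`.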
Propagation: AsyncLocalStorage
Parent-span context propagates via Node's built-in AsyncLocalStorage. step() reads the active span on entry and pushes a child context for fn's execution. Probe signatures stay clean — no ctx argument threaded through every call. This matches the OTel SDK convention and is zero-dependency.
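A minimal sketch of that propagation, assuming a single module-level `AsyncLocalStorage` instance. The names `currentSpan`, `runInSpan`, and `SpanContext` appear in this ADR, but the signatures and bodies below are assumptions, and `probe` is a hypothetical stand-in for a real probe.

```typescript
import { AsyncLocalStorage } from "node:async_hooks";
import { randomBytes } from "node:crypto";

interface SpanContext {
  traceId: string;
  spanId: string;
  parentSpanId?: string;
}

const storage = new AsyncLocalStorage<SpanContext>();

// The active span, or undefined when called outside any span.
const currentSpan = (): SpanContext | undefined => storage.getStore();

// Run fn inside a child of the active span, or inside a fresh root span
// (new traceId, no parent) when nothing is active.
function runInSpan<T>(fn: () => Promise<T>): Promise<T> {
  const parent = currentSpan();
  const span: SpanContext = {
    traceId: parent?.traceId ?? randomBytes(8).toString("hex"),
    spanId: randomBytes(4).toString("hex"),
    parentSpanId: parent?.spanId,
  };
  return storage.run(span, fn);
}

// Hypothetical probe: no ctx parameter, yet it still sees its own span,
// even after an await.
async function probe(): Promise<SpanContext | undefined> {
  await new Promise<void>((resolve) => setTimeout(resolve, 1));
  return currentSpan();
}
```

Two `runInSpan(probe)` calls started side by side from the same parent each get their own span ID while sharing the parent's trace ID, which is what keeps parallel siblings distinguishable.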
Logging
Every log line emitted inside a step() carries:
- traceId, spanId, parentSpanId (when present)
- stepId (the human handle: 1.3, 2.1.4)
- description (free-text from the step() caller)
- phase: start (debug) / ok (info) / error (error)
- existing observed-timing fields (durationMs, etc.)
grep traceId=<hex> reconstructs the whole run. grep spanId=<hex> isolates one step. grep parentSpanId=<hex> finds children of a given span.
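To illustrate how those fields turn a flat log back into a tree, here is a sketch that filters parsed log lines by traceId and links spans through parentSpanId. `SpanLogLine`, `SpanNode`, and `buildSpanTree` are hypothetical names, and the line shape is reduced to the fields the reconstruction needs.

```typescript
interface SpanLogLine {
  traceId: string;
  spanId: string;
  parentSpanId?: string;
  stepId: string;
  phase: "start" | "ok" | "error";
}

interface SpanNode {
  spanId: string;
  stepId: string;
  children: SpanNode[];
}

// Rebuild one run's causal tree from interleaved log lines.
function buildSpanTree(lines: SpanLogLine[], traceId: string): SpanNode[] {
  const nodes = new Map<string, SpanNode>();
  const parentOf = new Map<string, string | undefined>();

  // First pass: one node per span of the requested run (the grep traceId step).
  for (const line of lines) {
    if (line.traceId !== traceId || nodes.has(line.spanId)) continue;
    nodes.set(line.spanId, { spanId: line.spanId, stepId: line.stepId, children: [] });
    parentOf.set(line.spanId, line.parentSpanId);
  }

  // Second pass: attach each span to its parent; spans with no known
  // parent (the root, or a parent missing from the log) become roots.
  const roots: SpanNode[] = [];
  nodes.forEach((node, spanId) => {
    const parentId = parentOf.get(spanId);
    const parent = parentId === undefined ? undefined : nodes.get(parentId);
    if (parent) parent.children.push(node);
    else roots.push(node);
  });
  return roots;
}
```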
Result enrichment
The runner pins traceId, rootSpanId, trigger, startedAt, completedAt, and observedDurationMs onto the top-level Result's context so the JSON envelope is self-correlating: a downstream reader doesn't need the logs to know which run it's looking at.
Inner Results carry their own observedStartedAt / observedCompletedAt / observedDurationMs per ADR-0013, plus the ambient traceId/spanId via the structured-log fields above.
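As a rough illustration of the enrichment step, the sketch below merges the run-level fields into a result's context. `Result` here is a hypothetical minimal stand-in for the ADR-0012 envelope, `RunContext` and `pinRunContext` are invented names, and ISO-8601 strings for the timestamps are an assumption.

```typescript
// Hypothetical stand-in for the ADR-0012 Result envelope.
interface Result {
  ok: boolean;
  context?: Record<string, unknown>;
}

// The run-level fields this ADR pins onto the top-level Result.
interface RunContext {
  traceId: string;
  rootSpanId: string;
  trigger: { kind: string };
  startedAt: string; // assumed ISO-8601
  completedAt: string; // assumed ISO-8601
  observedDurationMs: number;
}

// Return a copy of the result with the run identity merged into its context.
// Existing context keys are preserved unless a run-level field shares the name.
function pinRunContext(result: Result, run: RunContext): Result {
  return { ...result, context: { ...result.context, ...run } };
}
```

Returning a copy rather than mutating keeps inner Results untouched; only the envelope the runner emits carries the run identity.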
Consequences
- Pro: A single grep turns a flat log into a causal tree.
Sibling spans in parallel preflight don't confuse the reader.
- Pro: The top-level JSON envelope is self-correlating — a
postmortem reader knows trace-id, trigger, and outcome from one blob without having to find the logs.
- Pro: OTel-format IDs mean future export to Tempo / Honeycomb /
Loki is mechanical. Adding a span exporter is a follow-up; we don't pay for it now.
- Pro: Trigger captured cheaply at startup. If the question
"why did this run happen?" arrives during an incident, the answer is on the run itself.
- Con: Every log line gets longer (5–6 extra fields). Acceptable
in JSON mode (the default for non-TTY runs); the pino-pretty config in lib/logger.ts already filters most fields out of the rendered console output.
- Con: AsyncLocalStorage is implicit propagation. A future contributor who awaits across an unusual boundary (workers, spawned processes) might lose context. Mitigated by: (a) step() is the only producer; (b) we log a warning if a step runs with traceId === "untraced".
- Falsifiability: Revisit if (a) a real consumer needs OTel
semantic conventions on attributes (we currently emit ad-hoc field names), at which point we add a thin attribute-namespace layer; or (b) the unknown trigger fires during a real cron run repeatedly, meaning the heuristic isn't catching legitimate cron triggers — fix the heuristic, not the classification.
See also
- src/lib/trace.ts: startTrace, currentSpan, runInSpan, detectTrigger, ID generators, Trigger and SpanContext types.
- src/lib/step.ts: opens a span around every fn, stamps log lines with trace+span+parent.
- src/index.ts: calls startTrace() once, pins the trigger and trace identity onto the root Result.
- ADR-0012: Result envelope (the what).
- ADR-0013: observed timing (the when).
- W3C Trace Context: the industry-standard ID format we adopted.
- OpenTelemetry: vocabulary and structure we mirror.