ADR-0014 — Every run is a trace; every step is a span

Updated 2026-05-07 15:53 EDT

Status: Accepted
Date: 2026-05-07
Deciders: Natan

Context

A run isn't one event — it's a causal tree. Each leaf probe runs inside a segment, inside a system flow, inside a top-level run, with a trigger that caused the whole thing to start. SQA already produced a Result tree (ADR-0012) and observed-timing fields (ADR-0013), but nothing in the run carried:

A run identifier. No way to grep a single run out of the logs.
Span identity per step(). A log line emitted inside mongo.ready couldn't be tied back to "step 1.3 of run R" beyond the human step ID, which isn't unique across runs.
Parent-child causation. Reconstruction of the tree from logs needed each span to point at its parent.
Concurrency disambiguation. The moment preflight goes parallel (which is the right default), sibling spans overlap in the log; without span IDs you can't tell which [mongo.ready] line belongs to which run, let alone which span.
The trigger. Not all systems have request IDs or session IDs pointing at SQA. SQA itself has to carry who or what initiated the run, and why (humans, cron, CI, incident response), or the data disappears at the moment we need it most.

The problem statement, in Natan's words: "not all systems have tracing IDs, request IDs. When we put the flow or system together we're the one who needs to understand who triggered what and the chain of reaction. We have to map it in the end."

This is distributed tracing, scoped to one process. The shape is industry-standard (W3C Trace Context, OpenTelemetry); we're adopting the vocabulary so that future export to a real backend (Tempo / Honeycomb / Loki) is mechanical.

Decision

Every SQA run is a trace. Every step() call inside the run is a span. The shape mirrors OpenTelemetry's data model, so a future export needs no semantic translation.

IDs (W3C / OTel format)

Field	Format	Where it's set
`traceId`	16 hex chars (64-bit; pad to 128 on export)	Once per run, in the runner.
`spanId`	8 hex chars (32-bit; pad to 64 on export)	Once per `step()` call.
`parentSpanId`	8 hex chars	Inherited from the enclosing span.

We chose OTel-format hex IDs over UUIDs so that, the day we export to a real backend, the IDs are already in the right shape: padding only, no remapping.
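
As a rough illustration of the ID formats in the table, the sketch below generates IDs with Node's built-in crypto module and widens them to the W3C traceparent field widths (32 hex chars for trace-id, 16 for parent-id). The function names are hypothetical, and left-padding with zeros is an assumption; the ADR only says "pad to 128 on export".

```typescript
import { randomBytes } from "node:crypto";

// Hypothetical helpers; the ADR fixes only the ID lengths, not these names.
function newTraceId(): string {
  return randomBytes(8).toString("hex"); // 8 bytes -> 16 hex chars
}

function newSpanId(): string {
  return randomBytes(4).toString("hex"); // 4 bytes -> 8 hex chars
}

// Assumption: widen by left-padding with zeros at export time.
function toW3CTraceId(traceId: string): string {
  return traceId.padStart(32, "0"); // W3C trace-id is 32 hex chars
}

function toW3CSpanId(spanId: string): string {
  return spanId.padStart(16, "0"); // W3C parent-id is 16 hex chars
}

// W3C traceparent layout: version-traceid-parentid-flags ("01" = sampled).
function toTraceparent(traceId: string, spanId: string): string {
  return `00-${toW3CTraceId(traceId)}-${toW3CSpanId(spanId)}-01`;
}
```

Because the in-process IDs are already lowercase hex, widening them is a pure string operation and the short form remains a suffix of the exported form, so a grep for the short ID still matches exported data.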

Trigger taxonomy

Every run carries a Trigger that is captured at startup and never thrown away. It lives on the root span and surfaces in the top-level Result.context.trigger.

`kind`	When it's set
`human`	TTY detected, or `SQA_TRIGGER=human`.
`cron`	`SQA_TRIGGER=cron` (never auto-detected).
`ci`	CI env vars detected, or `SQA_TRIGGER=ci`.
`incident`	`SQA_TRIGGER=incident` (operator running it).
`unknown`	None of the above; warn at startup.

Detection priority: explicit SQA_TRIGGER env > CI heuristic (CI=true, GITHUB_ACTIONS=true, etc.) > TTY heuristic (stdout.isTTY) > unknown with a warning. Never default to cron: explicit is better, and a missing trigger during a real outage shouldn't masquerade as a routine cron run.

The trigger record carries optional identity: user ($USER), host ($HOSTNAME), ciBuild ($GITHUB_RUN_ID / equivalent), cronId ($SQA_CRON_ID), incidentRef ($SQA_INCIDENT_REF).
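
The detection order above could be sketched roughly as follows. The `detectTrigger` name, the `Trigger` type shape, and the warning text are hypothetical. The sketch also assumes that an unrecognized SQA_TRIGGER value falls through to the heuristics, which the ADR does not specify, and it checks only the two CI variables named above.

```typescript
type TriggerKind = "human" | "cron" | "ci" | "incident" | "unknown";

// Hypothetical record shape; field names follow the identity list above.
type Trigger = {
  kind: TriggerKind;
  user?: string;
  host?: string;
  ciBuild?: string;
  cronId?: string;
  incidentRef?: string;
};

const EXPLICIT_KINDS: readonly string[] = ["human", "cron", "ci", "incident"];

function detectTrigger(
  env: Record<string, string | undefined>,
  isTTY: boolean,
  warn: (msg: string) => void = console.warn,
): Trigger {
  let kind: TriggerKind;
  const explicit = env.SQA_TRIGGER;
  if (explicit !== undefined && EXPLICIT_KINDS.includes(explicit)) {
    kind = explicit as TriggerKind; // 1. explicit env always wins
  } else if (env.CI === "true" || env.GITHUB_ACTIONS === "true") {
    kind = "ci"; // 2. CI heuristic
  } else if (isTTY) {
    kind = "human"; // 3. TTY heuristic
  } else {
    kind = "unknown"; // 4. never guess "cron"
    warn("SQA trigger is unknown; set SQA_TRIGGER explicitly");
  }
  return {
    kind,
    user: env.USER,
    host: env.HOSTNAME,
    ciBuild: env.GITHUB_RUN_ID,
    cronId: env.SQA_CRON_ID,
    incidentRef: env.SQA_INCIDENT_REF,
  };
}
```

A runner would presumably call this once at startup, for example `detectTrigger(process.env, Boolean(process.stdout.isTTY))`. Passing the environment and TTY flag as arguments keeps the function pure and easy to test.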

Propagation: AsyncLocalStorage

Parent-span context propagates via Node's built-in AsyncLocalStorage. step() reads the active span on entry and pushes a child context for fn's execution. Probe signatures stay clean — no ctx argument threaded through every call. This matches the OTel SDK convention and is zero-dependency.
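
A minimal sketch of this propagation, assuming a `step(stepId, description, fn)` signature and a `runTrace` entry point (both hypothetical; the ADR does not show the real signatures). The "untraced" fallback mirrors the warning mentioned under Consequences.

```typescript
import { AsyncLocalStorage } from "node:async_hooks";
import { randomBytes } from "node:crypto";

// Hypothetical context shape carried through async calls.
type SpanContext = {
  traceId: string;
  spanId: string;
  parentSpanId?: string;
  stepId?: string;
  description?: string;
};

const storage = new AsyncLocalStorage<SpanContext>();

// Probes and loggers read the ambient span instead of taking a ctx argument.
function currentSpan(): SpanContext | undefined {
  return storage.getStore();
}

// Opens the root span for a run; everything awaited inside fn inherits it.
function runTrace<T>(fn: () => Promise<T>): Promise<T> {
  const root: SpanContext = {
    traceId: randomBytes(8).toString("hex"),
    spanId: randomBytes(4).toString("hex"),
  };
  return storage.run(root, fn);
}

// Reads the active span on entry and runs fn inside a child context.
function step<T>(stepId: string, description: string, fn: () => Promise<T>): Promise<T> {
  const parent = storage.getStore();
  if (!parent) {
    console.warn(`step ${stepId} is running outside a trace`);
  }
  const child: SpanContext = {
    traceId: parent?.traceId ?? "untraced",
    spanId: randomBytes(4).toString("hex"),
    parentSpanId: parent?.spanId,
    stepId,
    description,
  };
  return storage.run(child, fn);
}
```

Because each `storage.run` call scopes its store to the callback and the async work started from it, sibling steps launched with `Promise.all` each see their own span while sharing the same parent and traceId.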

Logging

Every log line emitted inside a step() carries:

traceId, spanId, parentSpanId (when present)
stepId (the human handle: 1.3, 2.1.4)
description (free-text from the step() caller)
phase: start (debug) / ok (info) / error (error)
existing observed-timing fields (durationMs, etc.)

grep traceId=<hex> reconstructs the whole run. grep spanId=<hex> isolates one step. grep parentSpanId=<hex> finds children of a given span.
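
To illustrate how those fields turn a flat log back into a tree, here is a small sketch that groups already-parsed log records by traceId and links them through parentSpanId. It assumes one JSON object per log line using the field names listed above; `buildSpanTree` and the `SpanNode` shape are hypothetical.

```typescript
type Phase = "start" | "ok" | "error";

// Subset of the structured-log fields listed above.
type LogRecord = {
  traceId: string;
  spanId: string;
  parentSpanId?: string;
  stepId: string;
  phase: Phase;
};

type SpanNode = {
  spanId: string;
  stepId: string;
  phase: Phase; // last phase seen for this span
  children: SpanNode[];
};

function buildSpanTree(records: LogRecord[], traceId: string): SpanNode[] {
  const nodes = new Map<string, SpanNode>();
  const parentOf = new Map<string, string | undefined>();

  // Pass 1: one node per span in the requested trace; later phases overwrite earlier ones.
  for (const r of records) {
    if (r.traceId !== traceId) continue;
    const existing = nodes.get(r.spanId);
    if (existing) {
      existing.phase = r.phase;
    } else {
      nodes.set(r.spanId, { spanId: r.spanId, stepId: r.stepId, phase: r.phase, children: [] });
      parentOf.set(r.spanId, r.parentSpanId);
    }
  }

  // Pass 2: attach each node to its parent; spans whose parent never logged become roots.
  const roots: SpanNode[] = [];
  nodes.forEach((node, spanId) => {
    const parentId = parentOf.get(spanId);
    const parent = parentId === undefined ? undefined : nodes.get(parentId);
    if (parent) parent.children.push(node);
    else roots.push(node);
  });
  return roots;
}
```

Interleaved lines from other runs are dropped by the traceId filter, and overlapping sibling spans stay separate because they are keyed by spanId rather than by stepId.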

Result enrichment

The runner pins traceId, rootSpanId, trigger, startedAt, completedAt, and observedDurationMs onto the top-level Result's context so the JSON envelope is self-correlating: a downstream reader doesn't need the logs to know which run it's looking at.

Inner Results carry their own observedStartedAt / observedCompletedAt / observedDurationMs per ADR-0013, plus the ambient traceId/spanId via the structured-log fields above.
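
The fields pinned onto the top-level context might look like the sketch below. The `RunContext` type, the `buildRunContext` helper, ISO-8601 string timestamps, and the wall-clock duration are all assumptions; the actual Result shape and timing source are defined in ADR-0012 and ADR-0013, which are not reproduced here.

```typescript
// Hypothetical minimal trigger shape; see the trigger taxonomy for the kinds.
type Trigger = {
  kind: "human" | "cron" | "ci" | "incident" | "unknown";
};

// Hypothetical shape of the run-level fields on the top-level Result's context.
type RunContext = {
  traceId: string;
  rootSpanId: string;
  trigger: Trigger;
  startedAt: string; // assumption: ISO-8601 timestamps
  completedAt: string;
  observedDurationMs: number;
};

function buildRunContext(
  traceId: string,
  rootSpanId: string,
  trigger: Trigger,
  startedAt: Date,
  completedAt: Date,
): RunContext {
  return {
    traceId,
    rootSpanId,
    trigger,
    startedAt: startedAt.toISOString(),
    completedAt: completedAt.toISOString(),
    // Assumption: duration is the wall-clock difference between the two timestamps.
    observedDurationMs: completedAt.getTime() - startedAt.getTime(),
  };
}
```

The runner would then merge this object into the top-level Result's context before serializing the envelope, so the JSON blob alone identifies the run.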

Consequences

Pro: A single grep turns a flat log into a causal tree. Sibling spans in parallel preflight don't confuse the reader.

Pro: The top-level JSON envelope is self-correlating: a postmortem reader knows trace-id, trigger, and outcome from one blob without having to find the logs.

Pro: OTel-format IDs mean future export to Tempo / Honeycomb / Loki is mechanical. Adding a span exporter is a follow-up; we don't pay for it now.

Pro: Trigger captured cheaply at startup. If the question "why did this run happen?" arrives during an incident, the answer is on the run itself.

Con: Every log line gets longer (5–6 extra fields). Acceptable in JSON mode (the default for non-TTY runs); the pino-pretty config in lib/logger.ts already filters most fields out of the rendered console output.

Con: AsyncLocalStorage is implicit propagation. A future contributor who awaits across an unusual boundary (workers, spawned processes) might lose context. Mitigated by: (a) step() is the only producer; (b) we log a warning if a step runs with traceId === "untraced".

Falsifiability: Revisit if (a) a real consumer needs OTel semantic conventions on attributes (we currently emit ad-hoc field names), at which point we add a thin attribute-namespace layer; or (b) the unknown trigger fires during a real cron run repeatedly, meaning the heuristic isn't catching legitimate cron triggers. In that case, fix the heuristic, not the classification.