ADR-0048 — `score` outcome kind and two-axis composition

ADRsUpdated 2026-05-22 EDT4 min readEdit on GitHub ↗

ADR-0048 — `score` outcome kind and two-axis composition #

Status: Accepted
Date: 2026-05-22
Deciders: Natan
Source: PRD-18 (metaintro-chat job-search SUT) needs a verdict

shape that the six-state outcome model (ADR-0020) cannot express: a continuous [0,1] number with a banded interpretation, emitted alongside (not in place of) the operational outcome.

Context #

SQA's verdict vocabulary today is ADR-0020's six discrete states: pass / warn / fail / skip / unknown / error. Every probe, every claim, every segment terminates in one of those six. This works for operational verification — "did the snapshot land in ClickHouse within tolerance?" is a yes/no question with a few edge-case escape hatches.

PRD-18 introduces a new shape of verdict. The user runs the job-search workflow on metaintro-chat; SQA renders a verdict like "Relevancy = 0.62, Coverage = 0.71, JSI = 0.65 (YELLOW)." The number is the verdict. There is no version of this where "0.62" collapses to pass without losing the signal that PRD-18 was filed to capture.

Two shapes were considered:

Option	Shape	Trade-off
A	New `score` outcome kind: `Result<{ value: number; band: 'green'	'yellow'	'red' }>`	Adds a seventh state. Forces ADR-0020 readers to learn one new word. Keeps the existing six untouched and unambiguous.
B	Stuff score into `pass` context as a metadata field	Zero vocabulary change. But the existing six states all mean "operational verdict on a binary question"; overloading `pass` to also carry a continuous score breaks that.
C	Two parallel pipelines (one for operational, one for score)	Doubles the renderer, doubles the persistence shape, doubles the tracker grammar. Fork cost is permanent.

Option A is the smallest cut that preserves the existing vocabulary's clarity. The "seventh state" is honest: score-bearing verdicts are a genuinely different kind of claim.

Decision #

Introduce a seventh outcome kind, score, with this shape:

type ScoreOutcome = {
  kind: 'score'
  value: number          // [0, 1]
  band: 'green' | 'yellow' | 'red'
  context: Record<string, unknown>  // axis-specific evidence
}

score outcomes never escalate to fail. A RED band is a RED band — it's a real signal that the SUT performed badly, but it doesn't crash the run or block the segment chain. Operational failures (network down, JME 500'd, browser crashed) still emit fail / error via the existing six states; those are separate from the score.

Two-axis composition #

A score-bearing claim emits both an operational outcome (from the existing six) and a score outcome. The parent step aggregates them as:

text

operational outcome = worst of children's operational outcomes
score outcome       = JSI-composed from children's scores
                      (per JSI v1.0 paper: geometric mean
                       with gate cap at 0.30)

This lets a single PRD-18 trial say "operationally pass (everything ran, no crashes), scored 0.42 RED (search results were bad)" — which is exactly the truthful verdict.

Falsifiability #

A score outcome whose band doesn't match its value

per the agreed thresholds (GREEN ≥0.75, YELLOW 0.55–0.75, RED <0.55) is malformed and must be rejected by result.ts.

A score outcome with value outside [0, 1] is malformed.
A parent step that aggregates score children but emits no

score of its own is a renderer bug.

Consequences #

Extending the existing six-state model is a clean extension, not a refactor. Total surface area: 5 files, ~15 LOC, all backward-compatible (2026-05-22 probe of src/lib/result.ts + src/lib/render/):

src/lib/result.ts:32 — extend the Outcome union from

src/lib/result.ts:59–84 — add a score() constructor function

alongside pass() / warn() / fail() / error() / unknown() / skip(). Signature: score(name: string, value: number, band: 'green'|'yellow'|'red', context?: …): Result.

src/lib/result.ts:120–127 — extend the SEVERITY ladder. Score

outcomes map to severity by band (green → 0 ≈ pass, yellow → 2 ≈ warn, red → 2 ≈ warn — never to 4 (fail) or 5 (error) per D6/this ADR's no-escalation rule). Also extend outcomeForSeverity() at result.ts:129–144.

src/lib/render/palette.ts:11–18 — extend the GLYPH map with

a score glyph (proposal: ◆ colored by band, or [G/Y/R] text). Extend the PALETTE map at lines 44–54 with score-band colors.

src/lib/render/index.ts:162–223 — extend the counts object

initialization (line 162) and the summary line (line 223) to include score=${counts.score}.

No standards-gate rule hard-codes the six-state list (confirmed by inspection of scripts/checks/standards.ts), so the gate needs no update. The CHANGELOG entry for W4 calls out the vocabulary change so future readers understand the new word.

ADR-0048 — score outcome kind and two-axis composition #