ADR-0048 — `score` outcome kind and two-axis composition
- Status: Accepted
- Date: 2026-05-22
- Deciders: Natan
- Source: PRD-18 (`metaintro-chat` job-search SUT) needs a verdict shape that the six-state outcome model (ADR-0020) cannot express: a continuous [0,1] number with a banded interpretation, emitted alongside (not in place of) the operational outcome.
Context #
SQA's verdict vocabulary today is ADR-0020's six discrete states: pass / warn / fail / skip / unknown / error. Every probe, every claim, every segment terminates in one of those six. This works for operational verification — "did the snapshot land in ClickHouse within tolerance?" is a yes/no question with a few edge-case escape hatches.
PRD-18 introduces a new shape of verdict. The user runs the job-search workflow on metaintro-chat; SQA renders a verdict like "Relevancy = 0.62, Coverage = 0.71, JSI = 0.65 (YELLOW)." The number is the verdict. There is no version of this where "0.62" collapses to pass without losing the signal that PRD-18 was filed to capture.
Three shapes were considered:
| Option | Shape | Trade-off |
|---|---|---|
| A | New `score` outcome kind: `Result<{ value: number; band: 'green' \| 'yellow' \| 'red' }>` | Adds a seventh state. Forces ADR-0020 readers to learn one new word. Keeps the existing six untouched and unambiguous. |
| B | Stuff score into `pass` context as a metadata field | Zero vocabulary change. But the existing six states all mean "operational verdict on a binary question"; overloading `pass` to also carry a continuous score breaks that. |
| C | Two parallel pipelines (one for operational, one for score) | Doubles the renderer, doubles the persistence shape, doubles the tracker grammar. Fork cost is permanent. |
Option A is the smallest cut that preserves the existing vocabulary's clarity. The "seventh state" is honest: score-bearing verdicts are a genuinely different kind of claim.
Decision #
Introduce a seventh outcome kind, score, with this shape:
```ts
type ScoreOutcome = {
  kind: 'score'
  value: number // [0, 1]
  band: 'green' | 'yellow' | 'red'
  context: Record<string, unknown> // axis-specific evidence
}
```

`score` outcomes never escalate to `fail`. A RED band is a RED band: it is a real signal that the SUT performed badly, but it doesn't crash the run or block the segment chain. Operational failures (network down, JME 500'd, browser crashed) still emit `fail` / `error` via the existing six states; those are separate from the score.
Two-axis composition #
A score-bearing claim emits both an operational outcome (from the existing six) and a score outcome. The parent step aggregates them as:
```
operational outcome = worst of children's operational outcomes
score outcome       = JSI-composed from children's scores
                      (per JSI v1.0 paper: geometric mean
                       with gate cap at 0.30)
```

This lets a single PRD-18 trial say "operationally pass (everything ran, no crashes), scored 0.42 RED (search results were bad)", which is exactly the truthful verdict.
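The two aggregation rules above can be sketched as follows. This is a minimal illustration under stated assumptions, not the JSI v1.0 composition itself: the severity ordering of the six operational states is a guess for illustration (only fail = 4 and error = 5 are stated in this ADR), the helper names are hypothetical, and a plain unweighted geometric mean stands in for the JSI formula. The 0.30 gate cap is deliberately left out because its exact rule lives in the JSI paper.

```ts
type Operational = 'pass' | 'skip' | 'warn' | 'unknown' | 'fail' | 'error'

// ASSUMPTION: illustrative ordering only; the real ladder is SEVERITY in src/lib/result.ts.
const SEVERITY_SKETCH: Record<Operational, number> = {
  pass: 0, skip: 1, warn: 2, unknown: 3, fail: 4, error: 5,
}

// Operational axis: the parent takes the worst child outcome.
// An empty child list falls back to 'pass' (a choice made for this sketch).
function worstOperational(children: Operational[]): Operational {
  return children.reduce<Operational>(
    (worst, o) => (SEVERITY_SKETCH[o] > SEVERITY_SKETCH[worst] ? o : worst),
    'pass',
  )
}

// Score axis: unweighted geometric mean of child scores in [0, 1].
// ASSUMPTION: stand-in for the JSI composition; no gate cap is applied here.
function geometricMean(values: number[]): number {
  if (values.length === 0) throw new Error('no child scores to compose')
  const product = values.reduce((acc, v) => acc * v, 1)
  return Math.pow(product, 1 / values.length)
}

type ChildVerdict = { operational: Operational; score: number }

// The two axes are aggregated independently: a low score never changes
// the operational outcome, and an operational failure never changes the score.
function aggregate(children: ChildVerdict[]): ChildVerdict {
  return {
    operational: worstOperational(children.map((c) => c.operational)),
    score: geometricMean(children.map((c) => c.score)),
  }
}
```

With two children that both ran cleanly but scored 0.36 and 0.49, `aggregate` returns an operational `pass` and a score of roughly 0.42, which falls in the RED band under this ADR's thresholds.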
Falsifiability #
- A `score` outcome whose `band` doesn't match its `value` per the agreed thresholds (GREEN ≥0.75, YELLOW ≥0.55 and <0.75, RED <0.55) is malformed and must be rejected by `result.ts`.
- A `score` outcome with `value` outside [0, 1] is malformed.
- A parent step that aggregates score children but emits no `score` of its own is a renderer bug.
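The first two checks can be expressed as a small validator. This is a sketch only: `bandFor` and `validateScore` are hypothetical names, not functions known to exist in `result.ts`, and the boundary at 0.75 is read as GREEN because the GREEN threshold is stated as ≥0.75.

```ts
type Band = 'green' | 'yellow' | 'red'

type ScoreOutcome = {
  kind: 'score'
  value: number
  band: Band
  context: Record<string, unknown>
}

// Thresholds from this ADR: GREEN >= 0.75, YELLOW in [0.55, 0.75), RED < 0.55.
function bandFor(value: number): Band {
  if (value >= 0.75) return 'green'
  if (value >= 0.55) return 'yellow'
  return 'red'
}

// Returns a list of problems; an empty list means the outcome is well-formed.
function validateScore(o: ScoreOutcome): string[] {
  const problems: string[] = []
  if (!Number.isFinite(o.value) || o.value < 0 || o.value > 1) {
    // Out-of-range values have no meaningful band, so stop here.
    problems.push(`value ${o.value} is outside [0, 1]`)
    return problems
  }
  const expected = bandFor(o.value)
  if (o.band !== expected) {
    problems.push(`band '${o.band}' does not match value ${o.value} (expected '${expected}')`)
  }
  return problems
}
```

Returning a list of problems rather than throwing keeps the sketch neutral about how `result.ts` would actually reject a malformed outcome.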
Consequences #
Adding `score` to the existing six-state model is a clean extension, not a refactor. Total surface area: 5 edit sites across 3 files, ~15 LOC, all backward-compatible (2026-05-22 probe of `src/lib/result.ts` + `src/lib/render/`):
- `src/lib/result.ts:32`: extend the `Outcome` union from `"pass" | "warn" | "fail" | "unknown" | "error" | "skip"` to also include `"score"`.
- `src/lib/result.ts:59–84`: add a `score()` constructor function alongside `pass()` / `warn()` / `fail()` / `error()` / `unknown()` / `skip()`. Signature: `score(name: string, value: number, band: 'green'|'yellow'|'red', context?: …): Result`.
- `src/lib/result.ts:120–127`: extend the `SEVERITY` ladder. Score outcomes map to severity by band (green → 0 ≈ pass, yellow → 2 ≈ warn, red → 2 ≈ warn; never to 4 (fail) or 5 (error), per D6 and this ADR's no-escalation rule). Also extend `outcomeForSeverity()` at `result.ts:129–144`.
- `src/lib/render/palette.ts:11–18`: extend the `GLYPH` map with a score glyph (proposal: ◆ colored by band, or [G/Y/R] text). Extend the `PALETTE` map at lines 44–54 with score-band colors.
- `src/lib/render/index.ts:162–223`: extend the `counts` object initialization (line 162) and the summary line (line 223) to include `score=${counts.score}`.
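As a rough picture of the `result.ts` changes listed above, the constructor and the band-to-severity mapping might look like the sketch below. `ScoreResult` is a hypothetical stand-in for the real `Result` type, whose fields are not shown in this ADR, and the `context` parameter type is assumed to match the `ScoreOutcome` shape from the Decision section, since the signature above elides it. Only the severity numbers are taken from this ADR.

```ts
type Band = 'green' | 'yellow' | 'red'

// ASSUMPTION: hypothetical stand-in for Result in src/lib/result.ts.
type ScoreResult = {
  name: string
  outcome: 'score'
  value: number
  band: Band
  context: Record<string, unknown>
}

// Band-to-severity mapping as stated in this ADR:
// green behaves like pass (0); yellow and red both cap at warn (2).
const SCORE_SEVERITY: Record<Band, number> = { green: 0, yellow: 2, red: 2 }

// Sketch of the score() constructor; context defaults to an empty object.
function score(
  name: string,
  value: number,
  band: Band,
  context: Record<string, unknown> = {},
): ScoreResult {
  return { name, outcome: 'score', value, band, context }
}

// A score outcome never reaches severity 4 (fail) or 5 (error).
function severityOf(r: ScoreResult): number {
  return SCORE_SEVERITY[r.band]
}
```

Because the mapping is a closed `Record<Band, number>`, adding a new band later would fail type-checking until its severity is chosen, which keeps the no-escalation rule explicit.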
No standards-gate rule hard-codes the six-state list (confirmed by inspection of scripts/checks/standards.ts), so the gate needs no update. The CHANGELOG entry for W4 calls out the vocabulary change so future readers understand the new word.
See also #
- ADR-0020 (six-state outcome model — this ADR extends it)
- ADR-0027 (gap-row evidence model; score's `context` field is where JSI's per-axis evidence lands)
- PRD-18 §18.X D6 (decision row that locked this shape)
- `~/Lab/workspaces/_live/jsi/paper/paper.md` (JSI v1.0 math)