ADR-0033 — LLM outputs are evaluated against an external rubric, not asserted against fixed strings

Updated 2026-05-20 11:30 EDT



Status: Accepted
Date: 2026-05-20
Deciders: Natan
Source: ai-engineering Proposal 1 (docs/research/2026-05-09-ai-engineering-dossier.md, lines 271–331).

Pre-commitment ratification ahead of the PRD-09 §09.2.* verifier code, per the Convergence F framing in the master synthesis (docs/research/2026-05-09-master-synthesis.md §2.6). The claim shape lands before the first LLM-touched claim is authored, so verifier code does not inherit assertEquals(expected, actual) muscle memory from the activation contract.

Context


SQA today has zero LLM-touching claims. The DomainDiscoveryEngine contract (docs/contracts/snappy/domain-discovery-engine.md) is functional-correctness only: every field it asserts is a number in a typed range, an enum value, a regex match, or a sha256 equality. All of these are things a string-equality assertion can express without category error.

PRD-09 (DataExtractionEngine) is the first SQA contract that observes a field an LLM wrote. Snappy's auto-extract workflow runs a Gemini Flash enrichment pass on snapshots whose CQI lands in 10 ≤ cqi < 50 (per apps/snappy-api/docs/extraction/llm-enrichment-boundary.md) and persists the LLM's outputs to organizations.fieldMeta[F] with source === "llm". The verifier that observes those fields cannot use actual === "Acme Inc" as the assertion: the LLM can return "Acme, Inc.", "Acme Inc.", "Acme Incorporated", or "ACME INC" — all correct in the sense that matters, all unequal to a fixed expectation string.
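As a rough illustration of the boundary described above, the sketch below selects the fields a verifier would have to treat as LLM-touched. The `FieldMeta` interface and both function names are hypothetical; only the `source === "llm"` marker and the `10 ≤ cqi < 50` band come from the text.

```typescript
// Hypothetical shape of one fieldMeta entry. Only `source` is named in this ADR;
// the real snappy type may carry more properties.
interface FieldMeta {
  source: string; // "llm" marks a field written by the enrichment pass
  sourceSpan?: string;
}

// The enrichment band quoted above: 10 <= cqi < 50.
function inEnrichmentBand(cqi: number): boolean {
  return cqi >= 10 && cqi < 50;
}

// Names of the fields that fall under the LLM-output rules of this ADR.
function llmTouchedFields(fieldMeta: Record<string, FieldMeta>): string[] {
  return Object.entries(fieldMeta)
    .filter(([, meta]) => meta.source === "llm")
    .map(([name]) => name);
}
```

Fields with any other `source` value stay eligible for ordinary equality assertions.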

ai-engineering Proposal 1 (lines 271–331) names the failure mode and the alternative: claims on LLM-touched fields are expressed as rubrics (typed predicates), AI-judge calls (with declared stability), or comparative evaluations (two SUT runs scored against each other), never as equality to a fixed string. The deterministic instance of the rubric category is the substring-citation rule at Phoenix Ch 3 line 4085 (prompt-engineering Proposal 1): "for every non-empty field, the matching entry in sourceSpans must be a literal substring of the input." Snappy's prompt already imposes this rule on the LLM (see snappy-api-deep-dive §8.2, org-extraction.ts:160-177 rule 2); SQA's verifier closes the loop from outside by asserting snapshotBody.toLowerCase().includes(sourceSpan.toLowerCase()).

Without this ADR landing before PRD-09 §09.2.* code, the first verifier author would reach for expect(value).toBe("Acme Inc") and the contract would immediately false-fail against legitimate LLM-correct outputs. Convergence F's "framing ADR first" pattern exists precisely to head that off.

Decision


SQA adopts ai-engineering Proposal 1's three clauses verbatim as the binding stance on LLM-output verification.

Clause 1 — Claim-shape

Any SQA contract claim that observes a field produced by an LLM (any field snappy wrote with fieldMeta[F].source === "llm") MUST be expressible as one of:

1a. Rubric-based: a typed predicate over the field value, such as a range check, enum check, regex match, format match, or presence of a structural sibling field (e.g. a sourceSpan).

1b. AI-judge-based: a named external judge model returns a numeric or categorical score; the judge's identity and parameters are pinned per Clause 2.

1c. Comparative: two SUT runs (or two LLM outputs from the same SUT) are scored against each other, not against a fixed expectation.
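A comparative (1c) claim might look like the sketch below, which scores two runs against each other by token overlap. The Jaccard measure, the `runsAgree` name, and the 0.8 threshold are illustrative assumptions; this ADR does not prescribe a scoring function.

```typescript
// Lowercase, replace punctuation with spaces, and split into a set of tokens.
// Assumption: ASCII-only tokenisation is enough for this illustration.
function tokenSet(value: string): Set<string> {
  const tokens = value
    .toLowerCase()
    .replace(/[^a-z0-9\s]/g, " ")
    .split(/\s+/)
    .filter((token) => token.length > 0);
  return new Set(tokens);
}

// Jaccard similarity of the two token sets: |A ∩ B| / |A ∪ B|.
function jaccard(a: string, b: string): number {
  const setA = tokenSet(a);
  const setB = tokenSet(b);
  if (setA.size === 0 && setB.size === 0) return 1; // two empty outputs agree
  let shared = 0;
  setA.forEach((token) => {
    if (setB.has(token)) shared += 1;
  });
  return shared / (setA.size + setB.size - shared);
}

// Hypothetical 1c claim: two runs agree when their overlap clears a threshold.
function runsAgree(runA: string, runB: string, threshold = 0.8): boolean {
  return jaccard(runA, runB) >= threshold;
}
```

Neither run is treated as the expected value; the claim only states that the two outputs are consistent with each other.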

Naive string-equality (actual === "expected string") on an LLM-touched field is forbidden. A verifier that asserts the LLM returned the literal string "Apple Inc" commits a category error; the right assertion is either value matches /^Apple/i (1a rubric) or judge(value, "is this Apple Inc?") >= threshold (1b judge).
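To make the contrast concrete, here is a minimal sketch of 1a-style typed predicates. The `Rubric` type and the `matches`, `oneOf`, and `inRange` helpers are hypothetical names, not part of any existing SQA library.

```typescript
// A rubric is a typed predicate over a field value.
type Rubric<T> = (value: T) => boolean;

// Regex rubric. Avoid the `g` flag here: it makes RegExp.test stateful.
const matches = (pattern: RegExp): Rubric<string> => (value) => pattern.test(value);

// Enum rubric: the value must be one of the allowed members.
const oneOf = <T>(allowed: readonly T[]): Rubric<T> => (value) => allowed.includes(value);

// Range rubric with inclusive bounds.
const inRange = (min: number, max: number): Rubric<number> => (value) =>
  value >= min && value <= max;

// The four spellings from the Context section all satisfy one rubric,
// while none of them equals the fixed string "Acme Inc".
const nameRubric = matches(/^acme\b/i);
const variants = ["Acme, Inc.", "Acme Inc.", "Acme Incorporated", "ACME INC"];
const allPassRubric = variants.every(nameRubric);
const anyEqualsFixed = variants.some((v) => v === "Acme Inc");
```

A single predicate accepts every legitimate spelling, which is the property a fixed-string assertion cannot offer.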

The deterministic instance of Clause 1a is Phoenix Ch 3 line 4085's substring-citation rule: for every LLM-filled field F, fieldMeta[F].sourceSpan is a literal substring of the snapshot's input HTML. This is a two-String.prototype.includes assertion — deterministic, cheap, no judge required. PRD-09 §09.2.4 (C2) is the first consumer.
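The substring-citation rule could be checked roughly as follows. The `FieldMeta` shape and the `citationViolations` name are assumptions for illustration; the case-insensitive comparison mirrors the assertion quoted in the Context section. Treating a missing or empty sourceSpan as a violation is also an assumption.

```typescript
// Hypothetical shape of one fieldMeta entry.
interface FieldMeta {
  source: string;
  sourceSpan?: string;
}

// Returns the names of LLM-filled fields whose sourceSpan is missing, empty,
// or not a (case-insensitive) substring of the snapshot body.
function citationViolations(
  snapshotBody: string,
  fieldMeta: Record<string, FieldMeta>,
): string[] {
  const haystack = snapshotBody.toLowerCase();
  return Object.entries(fieldMeta)
    .filter(([, meta]) => meta.source === "llm") // only LLM-touched fields
    .filter(([, meta]) => {
      const span = meta.sourceSpan;
      return span === undefined || span.length === 0 || !haystack.includes(span.toLowerCase());
    })
    .map(([name]) => name);
}
```

An empty result means every LLM-filled field cites text that really appears in the input; no judge model is involved.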

Clause 2 — Judge-stability

Any Clause 1b-shaped claim MUST persist a stability triple alongside the score:


judge: {
  modelId: string;            // e.g. "anthropic/claude-opus-4@v20260520"
  promptSha: string;          // sha256 of the judge prompt template
  samplingParamsSha: string;  // sha256 of {temperature, topP, topK, seed?}
}

A judge call is not stable until those three fields are pinned. A gap row from a 1b-shaped claim that lacks the triple is itself contract-invalid — the verifier emits error, not fail/pass.
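One way to express that gating is sketched below. The `Outcome` type and the `isPinned` and `judgeOutcome` names are hypothetical; the three-field triple and the error-versus-pass/fail distinction come from the clause above.

```typescript
interface JudgeTriple {
  modelId: string;
  promptSha: string;
  samplingParamsSha: string;
}

type Outcome = "pass" | "fail" | "error";

// A triple counts as pinned only when all three fields are non-empty strings.
function isPinned(judge: Partial<JudgeTriple> | undefined): judge is JudgeTriple {
  if (judge === undefined) return false;
  return [judge.modelId, judge.promptSha, judge.samplingParamsSha].every(
    (field) => typeof field === "string" && field.length > 0,
  );
}

// Without a pinned triple the score is not trusted at all: the row is an
// error, never a pass or a fail.
function judgeOutcome(
  score: number,
  threshold: number,
  judge?: Partial<JudgeTriple>,
): Outcome {
  if (!isPinned(judge)) return "error";
  return score >= threshold ? "pass" : "fail";
}
```

Checking the triple before looking at the score keeps an unpinned judge from ever producing a pass.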

SQA does not run AI judges today; Clause 2 is a pre-emption so when the first judge-based claim lands (likely as a follow-up to PRD-09 once the framework cost is amortised), the field shape is already settled and the gap-row schema (docs/concepts/event-shape.md) does not need to be re-rev'd.
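When that first judge-based claim does land, the two hash fields of the triple could be produced along these lines. This is a sketch under stated assumptions: it uses Node's built-in `node:crypto` module, and the key-sorted JSON serialisation is an illustrative choice, since this ADR does not specify how the sampling parameters are serialised before hashing.

```typescript
import { createHash } from "node:crypto";

// Sampling parameters named in the triple's comment; `seed` is optional there.
interface SamplingParams {
  temperature: number;
  topP: number;
  topK: number;
  seed?: number;
}

// Hex-encoded sha256 of a UTF-8 string.
function sha256Hex(text: string): string {
  return createHash("sha256").update(text, "utf8").digest("hex");
}

// Hash of the judge prompt template, byte for byte.
function hashPrompt(promptTemplate: string): string {
  return sha256Hex(promptTemplate);
}

// Assumption: serialise as key-sorted [key, value] pairs so that property
// order in the caller's object literal cannot change the hash.
function hashSamplingParams(params: SamplingParams): string {
  const sorted = Object.entries(params)
    .filter(([, value]) => value !== undefined)
    .sort(([a], [b]) => (a < b ? -1 : a > b ? 1 : 0));
  return sha256Hex(JSON.stringify(sorted));
}
```

Sorting the keys matters because two callers that pass the same parameters in a different order would otherwise record different samplingParamsSha values for the same judge configuration.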

The sentinel promptSha "vendor-managed:<vendor>:<endpoint>" is allowed for managed-prompt endpoints (e.g. OpenAI moderation, where the prompt is private to the vendor). The sentinel preserves the field's structural presence; downstream consumers can route those rows separately if needed.
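A downstream consumer might separate the two promptSha forms as in the sketch below. The `classifyPromptSha` name, the lowercase-hex assumption, and the example endpoint string are illustrative; only the `vendor-managed:<vendor>:<endpoint>` layout comes from the text.

```typescript
// Assumption: real hashes are stored as 64 lowercase hex characters.
const SHA256_HEX = /^[0-9a-f]{64}$/;
// Sentinel layout from the clause: vendor-managed:<vendor>:<endpoint>.
const VENDOR_SENTINEL = /^vendor-managed:([^:]+):(.+)$/;

type PromptShaKind =
  | { kind: "hash"; sha256: string }
  | { kind: "vendor-managed"; vendor: string; endpoint: string }
  | { kind: "invalid" };

function classifyPromptSha(promptSha: string): PromptShaKind {
  if (SHA256_HEX.test(promptSha)) {
    return { kind: "hash", sha256: promptSha };
  }
  const match = VENDOR_SENTINEL.exec(promptSha);
  if (match !== null) {
    const [, vendor = "", endpoint = ""] = match;
    return { kind: "vendor-managed", vendor, endpoint };
  }
  return { kind: "invalid" };
}
```

For example, `classifyPromptSha("vendor-managed:openai:moderation")` yields a vendor-managed result that a consumer could route to a separate report, while a malformed value is flagged rather than silently accepted.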

Clause 3 — Reference-data out-of-scope

Reference-based evaluation — "compare the SUT's output to a canonical correct answer from a curated dataset" — is out of scope for SQA. SQA observes production; there is no canonical "right answer" to a snappy extraction. Ground truth lives in snappy's own offline eval framework or in upstream benchmark suites; SQA observes outputs against rubric / judge / comparative criteria only.

A future contract author who wants to assert "the LLM should return X for input Y" is implicitly asking for a reference dataset. Clause 3 routes that request out — either snappy adds the assertion to its offline eval, or the claim is re-shaped as a rubric (1a), judge (1b), or comparative (1c).

Consequences


Positive

PRD-09 inherits the right vocabulary. The §09.2.* verifier code (C1 rubric over typed predicates; C2 substring-citation rubric; C3 hash-equality, which is not LLM-touched; C5 mongo side-effect rubric) composes under Clauses 1–3 without exception.

Future LLM-touched contracts compose. When detect-career or classify-industry lands as its own SUT contract, its authors land in the same vocabulary; no re-ratification is needed.

Gap-row schema is stable. Clause 2's triple shape is decided before the first judge call writes one, so no schema migration is needed when the first 1b claim ships.

One ADR closes a class of false-fails. Without this stance, every LLM-touched verifier risks the "Apple, Inc." vs "Apple Inc" false-fail; with it, that bug class is fenced at the contract layer.

Negative

No quick expect(value).toBe("Acme Inc") shortcuts for LLM fields. Verifier code carries more rubric scaffolding (typed predicates, regex matchers, structural-sibling checks) than the activation contract did. This is the intended cost.

Some functionally-correct fields are harder to assert. Industry classification (an enum among ~30 categories per snappy's registry) is a 1a rubric (value ∈ ENUM), which is easy. But a free-text description's "correctness" is rubric-hard: the verifier asserts presence + length-bound + sourceSpan, not semantic correctness. ADR-0033 accepts that bound: SQA observes what is structurally assertable from outside; semantic correctness routes to snappy's offline eval (Clause 3).
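The two cases above might be sketched like this. The three industry values, the length bounds, and the function names are hypothetical placeholders; the real registry and limits live in snappy.

```typescript
// Hypothetical subset of the industry registry (the real list has ~30 values).
const INDUSTRIES = ["software", "healthcare", "finance"] as const;

// Easy case: a 1a enum rubric.
function industryRubric(value: string): boolean {
  return (INDUSTRIES as readonly string[]).includes(value);
}

// Hypothetical length bounds for a free-text description.
const DESCRIPTION_MIN = 20;
const DESCRIPTION_MAX = 500;

// Hard case: assert only what is structurally checkable from outside.
// Returns the list of failed checks; an empty list means the rubric holds.
function descriptionRubric(
  value: string,
  sourceSpan: string | undefined,
  snapshotBody: string,
): string[] {
  const failures: string[] = [];
  const text = value.trim();
  if (text.length === 0) {
    failures.push("presence");
  } else if (text.length < DESCRIPTION_MIN || text.length > DESCRIPTION_MAX) {
    failures.push("length");
  }
  const cited =
    sourceSpan !== undefined &&
    sourceSpan.length > 0 &&
    snapshotBody.toLowerCase().includes(sourceSpan.toLowerCase());
  if (!cited) failures.push("sourceSpan");
  return failures;
}
```

Nothing in `descriptionRubric` judges whether the description is true or well written; that question stays with snappy's offline eval.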

Neutral

Judge-call infrastructure deferred. Clause 2 names the field shape but no Clause 1b claim lands in this PRD. The first AI-judge claim will land in a follow-up PRD once the framework cost (judge-runner harness, prompt-template versioning, sampling-param hashing) is justified by ≥2 consuming claims.

No code change today. ADR-0033 is an additive doc. Its consequences materialise as PRD-09 §09.2.* lands.

Alternatives considered


Allow assertEquals on "obvious" fields (e.g. name, headquarters). Rejected. The LLM can hallucinate any field, including obvious ones, and a rubric must scale uniformly. Per-field carve-outs create cognitive load on the verifier author ("is name covered? is tagline?") that the uniform rubric rule avoids.

Single rubric clause without Clause 2 (judge-stability). Rejected. An unstable judge, one whose model id, prompt, or sampling params drift between runs, is worse than no judge, because the score looks authoritative but the call is irreproducible. Landing Clause 2 alongside Clause 1 keeps the judge surface honest from the first claim onward.

Lift the reference-data clause (Clause 3) and let SQA curate fixture corpora. Rejected. SQA's external-observer charter (ADR-0027) precludes maintaining ground-truth datasets; that responsibility lives with the SUT owner. Folding it into SQA doubles the maintenance surface and creates a "two sources of truth" problem on extraction quality.