Glossary
Diátaxis form: reference. The single canonical vocabulary for SQA. Every other doc - PRDs, ADRs, guides, README, hardening-run, the web portal - uses these words and only these words for the concepts they name. If a term feels missing or wrong, change this file first, then sweep. This file merges the three former glossary docs (the layer glossary, the glossary-and-IA doc, and the deep-dives) into one source of truth; the long-form treatments for Contract, Verdict, and Metrics live in glossary-deep-dives.md.
The terms below cover three layers: the structure of an SQA run (probe → segment → run), the verdict it produces (outcome, score, band, verdict), and the contract it judges against (client, claim, falsifier, gap). Scope note: SQA verifies whether a system delivers the value it claims to its users, including graded, LLM-judged quality - see ADR-0052, which broadened the original operational-only scope of ADR-0007.
How the layers fit together
flowchart TB
Run["run<br/>(one invocation, one trace, one exit code)"]
System["system / SUT<br/>(snappy)"]
Run --> System
subgraph SystemBox [" "]
direction TB
SegA["segment: preflight<br/>(parallel probes)"]
SegB["segment: domain-discovery-engine<br/>(synthetic transaction)"]
System --> SegA
System --> SegB
subgraph PreflightChildren ["component probes"]
direction LR
ReadyOwn["snappy.ready"]
ProbeS3["s3.ready"]
ProbeMongo["mongo.ready"]
ProbeEtc["…6 more"]
end
SegA --> ReadyOwn
SegA --> ProbeS3
SegA --> ProbeMongo
SegA --> ProbeEtc
subgraph ScenarioChildren ["one file: drive → observe → cleanup"]
direction TB
Step1["2.1 drive: POST /domains"]
Step2["2.2 observe: poll status"]
Step3["2.3 observe: history"]
Step4["2.4 cleanup: DELETE"]
Step1 --> Step2 --> Step3 --> Step4
end
SegB --> Step1
end
subgraph Components ["src/components/<name>/"]
direction LR
CompS3[s3]
CompMongo[mongodb]
CompLoki[loki]
CompMore["…"]
end
subgraph SUTAPI ["SUT public API"]
SUTEP["snappy HTTP endpoints"]
end
ProbeS3 -.HTTP/SDK.-> CompS3
ProbeMongo -.SDK.-> CompMongo
ReadyOwn -.HTTP.-> SUTEP
Step1 -.HTTP.-> SUTEP
Step2 -.HTTP.-> SUTEP
Step3 -.HTTP.-> SUTEP
Step4 -.HTTP.-> SUTEP
Outcome["Result envelope (seven-state)<br/>pass / warn / fail / unknown / error / skip / score<br/>(ADR-0020 + ADR-0048)"]
Step4 --> Outcome
ProbeEtc --> Outcome
ReadyOwn --> Outcome
Outcome --> ExitCode["process.exitCode<br/>0 (pass/warn/skip/unknown/score) / 1 (fail/error)"]
Two kinds of segment, two kinds of work. Preflight composes component probes (each asks one question of one dependency). The domain-discovery-engine segment wraps one scenario - a synthetic transaction: drive → observe → verify → cleanup, all in one file, returning one composite Result.
How to read the diagram:
- A run drives the SUT through one or more segments.
- Each segment composes children of one kind: a bundle of probes (preflight), or a single scenario (a synthetic transaction that drives the SUT and observes what it did). The run's segments collectively cover both kinds.
- A scenario is internally composed of steps (drive / observe / verify / cleanup), not of segments. Steps nest with dotted IDs (2.1, 2.1.3).
- Every leaf returns a Result with one of seven outcomes (six categorical, per ADR-0020, plus the continuous score, per ADR-0048). Outcomes aggregate up the tree (worst child wins, with the skip→warn promotion rule). The root's outcome maps to the process exit code.
- Probes talk to components over the wire (HTTP, SDK, gRPC). Scenarios talk to the SUT's public surface and may also re-query components to confirm side-effects landed where the SUT claimed.
How a run becomes a verdict
The structural layer answers "did it run?"; the verdict layer answers "did it deliver?" Both ship on every run:
flowchart LR
Scenario["Scenario"] -->|governed by| Contract["Contract"]
Contract -->|asserts| Claim["Claim (Strength · Status · Falsifier)"]
Run["Run"] -->|executes| Scenario
Run -->|judged by| Judge["Judge (LLM ensemble or rule)"]
Judge -->|produces| Score["Score ∈ [0,1]"]
Score -->|banded into| Band["Band (green / yellow / red)"]
Run -->|produces| Outcome["Outcome (7-state)"]
Outcome --> Verdict["Verdict (human summary)"]
Band --> Verdict
Claim -->|when falsified| Gap["Gap"]
Gap -->|cites| Evidence["Evidence"]
A run executes a scenario that a contract governs. A judge grades the scenario's evidence against the contract's claims, producing a score that bands into green/yellow/red. The machine reads the outcome ("did the gate pass?"); the human reads the verdict ("what happened?"), which packages outcome + score + band into one legible sentence. A falsified claim becomes a gap, which cites the evidence that refutes it.
Structure - the layers of a run
probe
The smallest unit of work that asks one question of a component of the system. One function in one file, returns one Result. Pure: no throws, no side effects beyond logging, same input → same output shape (per ADR-0012).
Lives at src/components/<name>/<verb>.ts (a component-of-the-SUT probe; the typical case) or at src/systems/<sut>/ready.ts (a probe against the SUT's own surface - the SUT is itself one of its components, observed from outside via its public API).
Examples: s3.ready, mongodb.ready, loki.ready, snappy.ready (the SUT's /health/ready).
The word matches Kubernetes (livenessProbe / readinessProbe), Prometheus Blackbox (/probe endpoint), and the Google SRE book. The IETF/Nagios noun "check" means the same thing in those ecosystems but is too broad in English (every conditional is a "check") - we don't use it as a layer-1 noun.
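The "no throws" rule can be sketched as follows. This is a minimal illustration, not the code in the repository: the simplified `Result` shape, `readyProbe`, `timeoutError`, and the injected `ask` callback are all hypothetical names, and `ask` is synchronous here only to keep the sketch short (a real probe would await an HTTP or SDK call).

```typescript
// Simplified Result shape for this sketch; the real one is pinned in src/lib/result.ts.
type Outcome = "pass" | "fail" | "unknown" | "error";

interface Result {
  name: string;
  outcome: Outcome;
  context: Record<string, unknown>;
}

// Hypothetical helper: builds the error a transport might raise on timeout.
function timeoutError(message: string): Error {
  const err = new Error(message);
  err.name = "TimeoutError";
  return err;
}

// A probe asks one question and always returns a Result; it never throws.
// `ask` stands in for the real call and returns an HTTP status code.
function readyProbe(name: string, ask: () => number): Result {
  try {
    const status = ask();
    if (status === 200) {
      return { name, outcome: "pass", context: { status } };
    }
    // The system answered, and the answer was "no".
    return { name, outcome: "fail", context: { status } };
  } catch (err) {
    if (err instanceof Error && err.name === "TimeoutError") {
      // Absence of information, not a probe-side bug.
      return { name, outcome: "unknown", context: { reason: err.message } };
    }
    // Anything else means the probe could not even ask correctly.
    return { name, outcome: "error", context: { reason: String(err) } };
  }
}
```

Every exit path returns a Result of the same shape, so a caller can aggregate probe outcomes without wrapping each probe in its own try/catch.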
scenario
A synthetic transaction that drives the SUT through its public API to exercise a business behavior, then observes the SUT's response and verifies the side-effects. One scenario = one file = one composite Result.
Lives at src/systems/<sut>/<scenario-name>.ts. Named after the SUT and the behavior it drives (domain-discovery-engine, web-crawl-engine, metaintro-chat), not after any component or phase.
A scenario is composed of three kinds of internal phases - these are roles (and steps) inside the scenario file, not separate layers, and not segments:
- drive - a SUT-mutating call (POST, DELETE).
- observe - a read against the SUT's own surface or its observability components (poll the API for state, query Loki for log lines, read Hatchet for workflow runs). The SUT narrating what it just did.
- verify - a read against a component the SUT was expected to write to (Mongo collection, ClickHouse table, S3 prefix). The side-effects actually landed.
A scenario typically runs drive → observe → (optional verify) → cleanup, with state threaded between phases (a domainId produced by drive becomes input to observe and cleanup). The file aggregates the phase Results into one composite Result; the scenario reports pass only when every phase passed.
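The threading of state between phases can be sketched like this. It is an illustration under stated assumptions: the `Sut` interface, `makeFakeSut`, and `runScenario` are hypothetical names, the SUT is an in-memory fake rather than an HTTP API, and the calls are synchronous for brevity.

```typescript
type Outcome = "pass" | "fail";

interface Result {
  name: string;
  outcome: Outcome;
  context: Record<string, unknown>;
}

// Hypothetical stand-in for the SUT's public API.
interface Sut {
  create(fqdn: string): string;
  status(id: string): string | undefined;
  remove(id: string): boolean;
}

// In-memory fake used only for this sketch.
function makeFakeSut(): Sut {
  const domains: Record<string, { fqdn: string; status: string }> = {};
  let nextId = 1;
  return {
    create(fqdn) {
      const id = `d${nextId++}`;
      domains[id] = { fqdn, status: "active" };
      return id;
    },
    status(id) {
      return id in domains ? domains[id].status : undefined;
    },
    remove(id) {
      const existed = id in domains;
      delete domains[id];
      return existed;
    },
  };
}

function runScenario(sut: Sut, fqdn: string): Result {
  const phases: Result[] = [];

  // drive: a SUT-mutating call; the domainId it yields is threaded onward.
  const domainId = sut.create(fqdn);
  phases.push({ name: "drive", outcome: "pass", context: { domainId } });

  // observe: read the SUT's own surface for the state it reports.
  const status = sut.status(domainId);
  phases.push({
    name: "observe",
    outcome: status === "active" ? "pass" : "fail",
    context: { status },
  });

  // cleanup: remove what drive created, using the same threaded id.
  const removed = sut.remove(domainId);
  phases.push({ name: "cleanup", outcome: removed ? "pass" : "fail", context: { removed } });

  // The composite passes only when every phase passed.
  const allPassed = phases.every((p) => p.outcome === "pass");
  return { name: "scenario", outcome: allPassed ? "pass" : "fail", context: { phases } };
}
```

Keeping all phases in one function is what makes `domainId` trivially available to observe and cleanup; split across files, it would have to be passed through some shared store.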
Why one file, not separate folders. The sources we follow agree: a synthetic transaction is one logical unit. Sam Newman's Building Microservices (2nd ed., ch. 10, "Semantic Monitoring") and Google's SRE book (ch. 10, "Black-Box Monitoring") both treat the phases as roles inside one transaction, not as a folder layer. Splitting drive/observe/verify into separate directories fragments the transaction and makes shared state harder to thread. We follow the books.
A scenario is not a probe: probes ask "is X up?" while scenarios ask "does the SUT do Y when poked?" A scenario can be slotted into a run as one segment.
segment
The composition layer. Composes one kind of child for one SUT and aggregates their Results into one composite Result. Pure: segments don't throw, they aggregate (ADR-0012). Each segment is either a bundle of probes or a single scenario - not both at once; a run's segments collectively span both kinds.
Lives at src/systems/<sut>/<segment>.ts. Today snappy has two segments:
- preflight.ts - composes component probes + the SUT's own ready probe in parallel (steps 1.1-1.8). "Are we ready to start testing?"
- domain-discovery-engine.ts - wraps one scenario in one file (steps 2.1-2.8) for the DomainDiscoveryEngine SUT. Phases: drive (2.1), observe (2.2), verify (2.3-2.7 against Mongo / S3 / ClickHouse / Loki), cleanup (2.8). One composite Result per glossary §scenario.
Segment file names are domain words (preflight, domain-discovery-engine) - not layer words.
A segment composes its children with parallelize(...) (default, ADR-0015) or sequentialize(...) when one child must finish before the next can start. Component probes are typically parallel (independent dependencies); scenario phases are typically sequential (each phase consumes the prior phase's state).
The community has not converged on a name for this layer (Prometheus calls it a module, Checkly a check group, Datadog steps - all different). "Segment" reads naturally as "a slice of one run, scoped to one SUT."
run
One top-level invocation of SQA. Produces one Result tree, one trace, one exit code. The whole thing. One run = one scenario × profile × commit when parameterized (e.g. a JSI trial).
Driven by src/index.ts. A run can be triggered by human (interactive make run), cron, ci, incident, or unknown - see ADR-0014. Persists to runs/<traceId>/ (and optionally s3://sqa/runs/<traceId>/).
The word is universal across the field (Datadog's "test run," Checkly's "check result," Nagios's "execution") so we use it without ceremony.
trace
The unique identifier assigned to one run. It threads through the Result tree, logs, emitted events, persisted artifacts, and the run directory (runs/<traceId>/) so every piece of evidence can be tied back to the exact invocation that produced it.
Trace IDs are produced by src/lib/trace.ts and follow ADR-0014. For parameterized JSI runs, the trace includes enough run context to distinguish profile / query / trial artifacts.
trial
One run, viewed as a member of a sweep. A synonym for run, used inside sweep contexts and dashboards. A sweep of 5 profiles × 4 queries spawns 20 trials; each trial is a normal run with the same artifact layout. Not a different shape - a different name for the same thing in a batch context. (Don't use "trial" to mean a flaky re-run; that's a retry.)
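The profile × query expansion can be sketched as a plain cartesian product. `Trial` and `expandSweep` are hypothetical names for illustration, and the query strings are placeholders, not entries from the real corpus.

```typescript
interface Trial {
  profile: string;
  query: string;
}

// Expand a sweep into one trial per (profile, query) pair.
function expandSweep(profiles: string[], queries: string[]): Trial[] {
  const trials: Trial[] = [];
  for (const profile of profiles) {
    for (const query of queries) {
      trials.push({ profile, query });
    }
  }
  return trials;
}

const exampleTrials = expandSweep(
  ["P1", "P2", "P3", "P4", "P5"],
  ["query-a", "query-b", "query-c", "query-d"], // placeholder queries
);
console.log(exampleTrials.length); // 20
```

Each element is then executed as a normal run, which is why a trial needs no artifact layout of its own.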
sweep
A batch of runs, aggregated into one artifact directory (runs/sweep-<timestamp>/) with summary.json, results.tsv, fails.tsv, and findings.md. Manual and point-in-time: a human launches it for diagnostic work, not as a guardrail. Two flavors ship:
- Domain sweep - src/runners/sweep/index.ts (PRD-21). N runs over a fixed domain list; each child is a normal SQA run against one FQDN (bun run src/index.ts). "How does snappy 0.X.Y behave across the F500 list?"
- JSI sweep - src/runners/jsi-sweep/ (PRD-18). N trials over a query × profile cartesian product; each trial runs runMetaintroChat() for one (profile, query) pair and produces one JSI score.
Distinct from continuous tail (which observes prod activations on a sample basis, continuously) and from probe (the leaf, asking one question of one component). The aggregate of a sweep is a scorecard.
continuous tail
A continuous, sampled observation of real prod activations, driven by src/runners/tail/index.ts (PRD-21 / ADR-0039). The tail subscribes to a fraction of activations as they land in snappy.domains_analytics (default 1%, hash-on-domainId for deterministic per-domain sampling), runs the existing C1-C7 verifier stack against each in post-hoc mode (no drive, no cleanup - the activation already happened), and persists each result to ClickHouse sqa.sqa_runs for the dashboard + three-ladder alerting (page / warn / info).
The tail's verifier surface is identical to the sweep's - both consume validateClaim and the claim set from @metaintro/snappy-contracts. The difference is provenance: sweep mode creates a probe domain and verifies its own writes; tail mode observes an activation snappy created and verifies its side-effects. Adding a new claim to the contracts package lights up both surfaces with zero edits in SQA.
Distinct from sweep (manual, point-in-time) and from scenario (which the tail composes from - it runs the verify phases of the DomainDiscoveryEngine scenario, with drive/observe/cleanup deliberately absent).
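Deterministic per-domain sampling can be sketched as below. The glossary only says the tail hashes on domainId at a default rate of 1%; the choice of FNV-1a and the names `fnv1a32` and `isSampled` are assumptions made for this illustration, not the tail's actual implementation.

```typescript
// FNV-1a, 32-bit, over UTF-16 code units. The hash choice is an assumption:
// the source does not say which hash the tail uses.
function fnv1a32(input: string): number {
  let hash = 0x811c9dc5; // FNV offset basis
  for (let i = 0; i < input.length; i++) {
    hash ^= input.charCodeAt(i);
    hash = Math.imul(hash, 0x01000193); // FNV prime, 32-bit multiply
  }
  return hash >>> 0; // force unsigned
}

// Keep an activation when its hash, mapped to [0, 1), falls under the rate.
// The same domainId therefore always gets the same decision.
function isSampled(domainId: string, rate: number = 0.01): boolean {
  return fnv1a32(domainId) / 0x100000000 < rate;
}
```

Because the decision depends only on the id, a domain that is in the sample stays in it across restarts, which keeps per-domain history continuous.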
step
The user-facing label on a probe, segment, or scenario phase in pretty output and log lines: 1, 1.1, 1.2.3. Hierarchical, declaration-ordered. Top-level (1, 2) = a segment; second level (1.1) = a child within it; third level (1.1.1) = a sub-step. Defined at src/lib/step.ts and ADR-0005.
A step is what humans read. Its purpose is "tell me which probe this is" without forcing the reader to scan span IDs.
span
The tracing primitive a step opens. Each step("1.7", "...", fn) call begins a span, propagated via AsyncLocalStorage, that stamps log lines with traceId / spanId / parentSpanId. Defined at src/lib/trace.ts and ADR-0014. W3C/OpenTelemetry-format IDs.
Step is the label; span is the trace primitive. They map 1:1 - every step opens exactly one span - but the words are not interchangeable. "Step 1.7" reads naturally; "span 1.7" does not.
SUT
A system (also called the "system under test," or SUT) is the discipline-level thing SQA evaluates: snappy, metaintro-chat, and future siblings. One folder per system under src/systems/. Layout (flat, no subdirectories):
- the system's own ready.ts (a probe against the SUT's public surface);
- one <scenario>.ts per scenario, named after the SUT (e.g. domain-discovery-engine.ts for the DomainDiscoveryEngine SUT) - see §scenario;
- one <segment>.ts per segment that composes more than one child (e.g. preflight.ts); single-child segments may be composed inline from index.ts;
- index.ts - the system's top-level flow, exporting runX();
- per ADR-0022, files for probe-owned fixtures the SUT needs (e.g. cohort.ts for snappy's auto-bootstrapped probe cohort).
SUT identifiers mirror the SUT's own engine names (ADR-0046); the folder is named after what is under test, never after a kind of testing.
component
A component is a service or dependency the system relies on - S3, MongoDB, Redis, ClickHouse, Loki, Hatchet, OpenRouter, the browser, and so on. The system has components; SQA verifies the system by probing each component plus driving the system itself.
One folder per component under src/components/<name>/. Each folder holds the probes against that component (ready.ts, count.ts, version.ts, …). The browser is a component family (ADR-0050).
The word component in this codebase is the systems-engineering sense ("a part of a system") - not the testing-literature sense ("a smaller code unit being tested in isolation"). Contributors from a testing background should map our "component" to "external dependency."
instrument
The means a probe or scenario uses to reach a component: the transport that carries the observation, independent of what is being asked. A closed vocabulary, pinned in src/lib/instrument.ts: browser | cli | fetch | model | composite | in-process.
The same question ("did the chat return good matches?") can be carried by different instruments - drive a real browser (Playwright), shell out to a CLI, issue a raw fetch, or query a model. The instrument decides how the SUT is exercised and what evidence a step can capture (a browser yields screenshots + video; a fetch yields headers + body; a model yields a scored judgment). A step MAY declare its carrier via the optional Result.instrument field; where it doesn't, inferInstrument(stepName) derives it at read time (PRD-30 §30.5.8). Browser instrument helpers live as a SUT-agnostic family at src/lib/browser/ (ADR-0050).
Not a component (the component is what is observed; the instrument is the means). Not a probe (a probe uses an instrument). Not evidence (the instrument is how a step ran; evidence is what it produced - provenance lives on the evidence, per ADR-0051).
Durable Execution Engine (DEE)
A Durable Execution Engine is a runtime that executes workflow steps against a durable, append-only log tracking workflow progress, parameters, return values, and durable promises. The engine can re-hydrate a workflow from the log after process death, worker outage, or restart - every step's effect is recoverable from the log.
Source: Bellemare, Building Event-Driven Microservices (2nd ed., 2024) Ch 10. The DEE category is the 2nd edition's framing for systems that previously got called "workflow engines" or "orchestrators"; the durability + append-only-log distinction is what makes them DEEs and not just schedulers.
Canonical instance for snappy: Hatchet. Snappy runs its workflows on Hatchet; SQA's verifier never reads Hatchet's internal log directly - it reads the durable rows the Hatchet workflows write (Mongo outbox_entries, snapshots, ClickHouse mirrors, S3 objects) and the Loki log markers those workflows emit. See ADR-0031 for the verifier pattern that reads the outbox the DEE produces.
Verdict - what a run produces
outcome
The closed-vocabulary verdict a Result carries. One of seven values - six categorical (per ADR-0020, superseding ADR-0012's original five) plus the continuous score (per ADR-0048), pinned at src/lib/result.ts:
| outcome | meaning |
|---|---|
| pass | system answered correctly |
| warn | system answered, answer is degraded but not blocking |
| fail | system answered wrongly (auth denied, wrong shape) |
| unknown | timeout / circuit-broken - absence of information, not a probe-side bug (ADR-0020, DDIA Ch. 9) |
| error | probe-side bug: could not even ask correctly (programming error, malformed config) |
| skip | deliberately not run (env not configured, prereq absent) |
| score | a continuous graded verdict - carries {value ∈ [0,1], band} in context (ADR-0048) |
fail vs unknown vs error is the load-bearing trichotomy: the system said no (fail) vs we don't know what the system would say (unknown) vs we made a mistake asking (error). Different remediations, all preserved.
score is a genuinely different kind of claim - a graded quality dimension, not a binary gate. A score result never escalates to fail: a RED band stays inside the score dimension for aggregation (ADR-0048 D6), so a low-but-honest quality reading does not look like an operational outage.
Severity ordering for aggregation: pass < skip < warn < unknown < fail < error, with the special rule that a composite whose worst child is skip promotes to warn. Exit-code mapping: pass / warn / skip / unknown / score → 0, fail / error → 1 (unknown is "run completed; no information" - not a CI failure, but flagged in dashboards).
The discriminant field is named outcome, not status, because HTTP status codes appear constantly in Result.context and a duplicate name there would be a readability landmine.
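The aggregation and exit-code rules above can be sketched as two small functions. `aggregate` and `exitCodeFor` are hypothetical names, and two details are assumptions of this sketch rather than documented behavior: score children are left out of the categorical ordering (they stay in their own dimension), and an empty child list aggregates to pass.

```typescript
type Categorical = "pass" | "skip" | "warn" | "unknown" | "fail" | "error";
type Outcome = Categorical | "score";

// Severity ordering from the glossary, least to most severe.
const SEVERITY: Categorical[] = ["pass", "skip", "warn", "unknown", "fail", "error"];

// Worst child wins; a composite whose worst child is skip promotes to warn.
function aggregate(children: Outcome[]): Categorical {
  let worst: Categorical = "pass";
  for (const child of children) {
    if (child === "score") continue; // assumption: scores never escalate the gate
    if (SEVERITY.indexOf(child) > SEVERITY.indexOf(worst)) {
      worst = child;
    }
  }
  return worst === "skip" ? "warn" : worst;
}

// Only fail and error break the build; everything else exits 0.
function exitCodeFor(root: Outcome): 0 | 1 {
  return root === "fail" || root === "error" ? 1 : 0;
}
```

For example, a segment with children pass, skip, and score aggregates to warn and exits 0, while adding a single fail child flips the exit code to 1.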
score
A continuous numeric verdict in [0,1], banded into green/yellow/red. The score outcome kind (ADR-0048), emitted via score(name, value, context) from src/lib/result-score.ts with value ∈ [0,1] (throws on out-of-range - a malformed score is a probe-side bug). Surfaced as 0-100 in the web/API layer; the underlying value stays [0,1] in the Result.
Use a score when you want a continuous, falsifiable quality measure ("how relevant were these jobs?") rather than a pass/fail. Operational facts (did the endpoint answer?) stay binary - pass/fail/unknown. Score maps to a band; band feeds the verdict.
band
A named region of the score line - green, yellow, or red (lowercase literals in code; UI prose may render them uppercase). Cutoffs are fixed across all scores by scoreBandFor() in src/lib/result-score.ts (ADR-0048):
| band | value ∈ [0,1] | 0-100 | reading |
|---|---|---|---|
| green | ≥ 0.75 | ≥ 75 | healthy / candidate-quality |
| yellow | ≥ 0.55 and < 0.75 | ≥ 55 and < 75 | borderline; investigate |
| red | < 0.55 | < 55 | bad outcome; performed poorly |
Bands are global - don't invent per-probe cutoffs, or dashboard aggregation breaks. Re-calibrating a cut-point is a contract change and requires an ADR (see the deep-dive on Verdict). Not a threshold (a threshold is a single cutoff; bands partition the whole line).
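The cutoffs can be sketched as a single function. This mirrors the table, not the actual scoreBandFor() source: `bandFor` and `toDisplay` are hypothetical names, and placing a value of exactly 0.75 in green and exactly 0.55 in yellow follows the "≥ 0.75" and "< 0.55" rows.

```typescript
type Band = "green" | "yellow" | "red";

// Map a score in [0, 1] to its band using the global cutoffs.
function bandFor(value: number): Band {
  // A malformed score is a probe-side bug, so reject it loudly (also catches NaN).
  if (!(value >= 0 && value <= 1)) {
    throw new RangeError(`score out of range: ${value}`);
  }
  if (value >= 0.75) return "green";
  if (value >= 0.55) return "yellow";
  return "red";
}

// The web/API layer shows 0-100; the underlying value stays in [0, 1].
function toDisplay(value: number): number {
  return Math.round(value * 100);
}
```

Banding is done on the underlying value, not the rounded display number, so a score that displays as 75 after rounding can still be yellow.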
verdict
The single operator-readable summary of a run - a sentence a non-engineer reads in a few seconds. It is not the same as outcome: the outcome tells the machine whether the gate passed; the verdict tells the human what happened. Both ship on every run.
For score-bearing scenarios (e.g. JSI), the verdict is the number plus a band word - "Mostly relevant - 73/100 · yellow" - with the per-axis components underneath for legibility. For operational scenarios, it is the worst outcome plus a one-line reason - "fail - s3.ready returned 403." A bare "67/100" is not a verdict (missing the word and band); a verdict without per-axis components is opaque. Full treatment: glossary-deep-dives.md §2.
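The two verdict shapes can be sketched as string formatters. `scoreVerdict`, `operationalVerdict`, and `verdictBand` are hypothetical names; the band word comes from the glossary's cutoffs, while the label ("Mostly relevant") is passed in because the source does not say how it is derived. The sketch covers only the headline sentence, not the per-axis components rendered underneath.

```typescript
type Band = "green" | "yellow" | "red";

// Same global cutoffs as the band table.
function verdictBand(value: number): Band {
  if (value >= 0.75) return "green";
  if (value >= 0.55) return "yellow";
  return "red";
}

// Score-bearing verdict: word, number out of 100, and band.
function scoreVerdict(label: string, value: number): string {
  return `${label} - ${Math.round(value * 100)}/100 · ${verdictBand(value)}`;
}

// Operational verdict: worst outcome plus a one-line reason.
function operationalVerdict(outcome: string, reason: string): string {
  return `${outcome} - ${reason}`;
}

console.log(scoreVerdict("Mostly relevant", 0.73)); // Mostly relevant - 73/100 · yellow
console.log(operationalVerdict("fail", "s3.ready returned 403.")); // fail - s3.ready returned 403.
```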
judge
A verifier - an LLM ensemble or a rule - that grades evidence against a rubric. Per ADR-0033, a judge must name its rubric. For JSI the judge is an LLM ensemble consuming prompts from src/lib/verify/scoring/relevancy-prompt.ts and coverage-prompt.ts, emitting a score with a per-job vote record.
Not a probe (a probe asks "is X up?"; a judge grades "how good was the answer?"). Judges produce scores; scores feed verdicts; judges cite rubrics; rubrics define the falsifiers for graded claims. Lives under src/lib/verify/ (ADR-0047).
axis
A measurement dimension of a graded probe - one orthogonal thing being scored, a named numeric ∈ [0,1] produced by one judge. The shipped JSI has two axes: Relevancy (weight 0.60) and Coverage (weight 0.40), per PRD-18 D8 and src/lib/verify/scoring/jsi.ts.
Not a score (an axis is the dimension; the score is the value on it). To add an axis, add a *-prompt.ts judge under src/lib/verify/scoring/ and extend the composition math.
JSI
The canonical graded probe today: a banded measure (0-1, shown 0-100) of how well the metaintro-chat job-search scenario delivers on its single client-facing claim - "the jobs returned are relevant to what the user asked for." The shipped composition (per src/lib/verify/scoring/jsi.ts, ADR-0048):
JSI_raw = 0.60·Relevancy + 0.40·Coverage (weighted mean)
JSI = min(JSI_raw, 0.30) when either axis < 0.30 (gate cap)
A JSI sweep of N profiles × M queries produces N×M scores; the scorecard reports mean / median / worst-case JSI per profile. JSI is a probe of metaintro-chat, not the SUT itself, and not a generic name for "any quality score" - a sibling graded probe (e.g. a future corpus-quality index for snappy) follows the same pattern under its own name.
Job Search Index is the expanded name of JSI.
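The composition can be worked through in code. This sketch follows the two formula lines above and is not the contents of src/lib/verify/scoring/jsi.ts; the function name `composeJsi` is hypothetical, and it assumes that when neither axis is below the gate, JSI equals JSI_raw.

```typescript
const RELEVANCY_WEIGHT = 0.60;
const COVERAGE_WEIGHT = 0.40;
const GATE = 0.30;

// Weighted mean of the two axes, capped at the gate when either axis is below it.
function composeJsi(relevancy: number, coverage: number): number {
  const raw = RELEVANCY_WEIGHT * relevancy + COVERAGE_WEIGHT * coverage;
  const gated = relevancy < GATE || coverage < GATE;
  // Assumption: with no axis under the gate, the raw weighted mean is the JSI.
  return gated ? Math.min(raw, GATE) : raw;
}
```

With Relevancy 0.9 and Coverage 0.8 the raw mean is 0.86 and nothing is capped. With Relevancy 1.0 and Coverage 0.2 the raw mean is 0.68, but the low coverage axis caps the JSI at 0.30, so one strong axis cannot mask a collapsed one.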
relevancy
The JSI gate axis that asks whether returned jobs answer the query. The relevancy judge scores each returned job against the query, profile, and rubric, then composes those per-job grades into a ranking-sensitive score.
coverage
The JSI gate axis that asks whether the result set covers the query's named aspects - seniority, technology, location, work mode, and similar constraints. A result set can contain one strong job and still lose coverage when other explicit aspects are absent.
nDCG
client
The party a system promises value to - the role on the receiving end of a contract. The system delivers to its client; the verdict says whether it kept that promise. A client is a role filled by one of three kinds: a user (a human), an agent (an LLM acting on someone's behalf), or a system (another system integrating against this one).
One contract serves one client kind. Many personas of one kind (P1-P5 are all user-kind) stay one contract; two kinds (a human and an integrating system) are two scenarios with two contracts, because the promise, the claims, and the falsifiers all differ. A user-kind client is characterized by a persona.
Not a persona (a persona specifies a user-kind client; it is not the client). Not the reader of the SQA report (that's a reader - the operator/LLM consuming the verdict, distinct from the client the system serves). Client + kind are declared in the contract doc; they become a typed field on the scenario only when code must branch on kind. (Emerging - glossary-pinned, no dedicated ADR yet.)
contract
The promise-to-clients document for one scenario, written per the Keep Your Contract standard. It describes value - who is served, what they get, how we can tell the promise was kept - not shape (an OpenAPI spec or a type describes shape). One file at docs/contracts/<sut>/<scenario>.md.
SQA writes contracts about the SUT, not for the SUT to honor on SQA's behalf - it is the witness, not the promiser (ADR-0032). A verifier that authors its own promise can never falsify it. The canonical examples are VALUE.md (SQA's own value contract) and docs/contracts/metaintro-chat/job-search.md (a per-SUT contract that collapses verifier-shaped assertions into one client-facing claim). Full treatment: glossary-deep-dives.md §1.
claim
A single load-bearing sentence in a contract. Has an ID (e.g. C4), a strength, a status, a verification method, and a falsifier. One promise per claim - if two would falsify independently, write two claims.
Good: "C1: the jobs returned in the chat thread MUST be relevant to the user's query, judged by an LLM ensemble against rubric R1." Bad: "the system should be reliable" - not falsifiable; no observable event refutes it. A claim, when falsified, produces a gap.
strength
The RFC 2119 modal of a claim: MUST / SHOULD / MAY. MUST = breakage is a fail; SHOULD = breakage is a warn (with documented exceptions); MAY = discretionary, breakage is acceptable. Grades how load-bearing each claim is. Distinct from status (strength is how strong; status is where in the lifecycle).
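The mapping from strength to consequence can be sketched as a lookup. `Strength` and `outcomeOnBreakage` are hypothetical names; returning null for MAY is this sketch's way of expressing "breakage is acceptable" and is an assumption, since the source does not say which outcome, if any, a broken MAY claim produces.

```typescript
type Strength = "MUST" | "SHOULD" | "MAY";

// What a falsified claim contributes to the run, by strength.
function outcomeOnBreakage(strength: Strength): "fail" | "warn" | null {
  switch (strength) {
    case "MUST":
      return "fail"; // load-bearing: breakage fails the run
    case "SHOULD":
      return "warn"; // degraded but not blocking
    case "MAY":
      return null; // discretionary: no outcome impact (assumption)
  }
}
```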
status
Where a claim is in its lifecycle, from a closed, append-only vocabulary: Hypothesized → Committed → Verified → Broken → Superseded → Retracted. Hypothesized when first written; Committed once the team stakes the product on it; Verified once a run demonstrates it holds in evidence; Broken when a run falsifies it. Never edit an accepted claim's text in place - supersede instead, so the run that verified it still refers to the same sentence.
verification method
How a claim is checked, from a closed set of six: the four methods named in IEEE 29148 - Inspection (read the code/doc) · Analysis (model it) · Demonstration (exercise and observe) · Test (automated assertion) - plus two that extend it: Judge (rubric-graded by LLM or human) · Field (observation in production). Each claim names its method in the contract's verification block; choose the lightest method that can falsify.
Test here is one of six verification methods - a specific, bounded term from IEEE 29148. It is not the banned layer-noun "test" (see "Words we deliberately don't use"); don't use "test" as a synonym for probe/segment.
falsifier
The observable event that, if seen, refutes a claim. Stated as a positive, concrete event observable from outside the SUT (ADR-0034 - negative claims are observational, not adversarial). Write the thing you would see that means the claim is wrong.
Good: "a returned job whose title matches none of the query synonyms, judged by ≥2 of 3 LLM voters." Bad: "the system fails" - not observable, not specific. A passive falsifier ("if it breaks, we'll notice") is not a falsifier. Falsifier triggered → gap recorded.
gap
An instance of a falsified claim, attributed to a specific run. A GapRow at runs/<traceId>/gaps.json (src/lib/gap-log.ts, ADR-0027), carrying runId, ts, scenario, claim (the claim ID), observed, expected, and evidence. Verifier code calls appendGap({...}) when a falsifier fires - one gap per falsified claim instance per run. A gap with no evidence block is untriageable.
A gap is a finding with a claim attribution - Gap ⊂ Finding.
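A gap row and the "no evidence means untriageable" rule can be sketched as follows. The field names come from the glossary; the field types, the in-memory `appendGapTo` helper, and `isTriageable` are assumptions for illustration and do not reproduce the real appendGap in src/lib/gap-log.ts, which persists to runs/<traceId>/gaps.json.

```typescript
// Field names follow the glossary; the types are assumptions for this sketch.
interface GapRow {
  runId: string;
  ts: string; // ISO-8601 timestamp
  scenario: string;
  claim: string; // the claim ID, e.g. "C1"
  observed: string;
  expected: string;
  evidence: string[]; // evidence ids
}

// In-memory stand-in for appending to the gap log: returns a new log.
function appendGapTo(log: GapRow[], row: GapRow): GapRow[] {
  return log.concat([row]);
}

// A gap with no evidence cannot be triaged.
function isTriageable(row: GapRow): boolean {
  return row.evidence.length > 0;
}

// Illustrative values only.
const exampleGap: GapRow = {
  runId: "run-0001",
  ts: "2026-01-01T00:00:00Z",
  scenario: "job-search",
  claim: "C1",
  observed: "returned job title matches none of the query synonyms",
  expected: "every returned job is relevant to the query",
  evidence: ["ev-1"],
};
```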
finding
Any noteworthy observation in a run - broader than a gap. A finding may or may not cite a claim: it includes gaps (falsified claims), surprises (unexpected SUT behavior not yet in a contract), and operational notes. Rendered into per-run reports and a sweep's findings.md. Findings are the superset; gaps are the claim-attributed subset.
evidence
One typed artifact a run produced that supports or refutes a claim - a screenshot, a log file, a scraped row, a score, a Loki link, a video. A discriminated union on kind (file | link | text | row | metric | media) at src/lib/evidence.ts, pinned by ADR-0051. Each piece carries an id, a caption, optional stepId provenance (which step/span produced it), optional claimIds (which claims it bears on), and a kind-discriminated payload; file/media kinds also carry court-grade provenance (sha256, sizeBytes, contentType).
Not a gap (a gap cites evidence via claimIds). Not the instrument (the instrument is the means of observing; evidence is what the observation captured - provenance lives on the evidence, per ADR-0051).
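The discriminated union can be sketched as a TypeScript type. The six kinds and the shared fields (id, caption, stepId, claimIds, sha256, sizeBytes, contentType) come from the description above; the per-kind payload field names (`path`, `url`, `text`, `row`, `name`, `value`) and the helper functions are assumptions and may differ from src/lib/evidence.ts.

```typescript
interface EvidenceBase {
  id: string;
  caption: string;
  stepId?: string; // which step/span produced it
  claimIds?: string[]; // which claims it bears on
}

// Extra provenance carried by file and media kinds.
interface FileProvenance {
  sha256: string;
  sizeBytes: number;
  contentType: string;
}

// Payload field names below are hypothetical.
type Evidence =
  | (EvidenceBase & FileProvenance & { kind: "file"; path: string })
  | (EvidenceBase & FileProvenance & { kind: "media"; path: string })
  | (EvidenceBase & { kind: "link"; url: string })
  | (EvidenceBase & { kind: "text"; text: string })
  | (EvidenceBase & { kind: "row"; row: Record<string, unknown> })
  | (EvidenceBase & { kind: "metric"; name: string; value: number });

// Switching on `kind` narrows the payload, so each branch is type-checked.
function describeEvidence(e: Evidence): string {
  switch (e.kind) {
    case "file":
    case "media":
      return `${e.caption} (${e.contentType}, ${e.sizeBytes} bytes)`;
    case "link":
      return `${e.caption} -> ${e.url}`;
    case "text":
      return `${e.caption}: ${e.text}`;
    case "row":
      return `${e.caption} (${Object.keys(e.row).length} columns)`;
    case "metric":
      return `${e.caption}: ${e.name}=${e.value}`;
  }
}

// Select the evidence that bears on one claim.
function evidenceForClaim(all: Evidence[], claimId: string): Evidence[] {
  return all.filter((e) => (e.claimIds || []).indexOf(claimId) !== -1);
}
```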
Inputs - how a run is parameterized
archetype
persona
The canonical human-readable name for a simulated user (e.g. "P1 - Mid-Senior IC Engineer (US)"). One entry in src/systems/metaintro-chat/corpus/profiles.json, carrying an id, label, detail, and onboarding block. Five personas today (P1-P5) spanning easy → hard with global coverage.
A persona characterizes a user-kind client - it pinpoints how that client uses the system. It is the external (UI/report/stakeholder) name for what the code calls a profile.
profile
The data shape backing a persona - the actual JSON object loaded from profiles.json (id, label, detail, onboarding). Conceptually identical to a persona; profile is the internal word, persona is the external word. Convention: in code and storage, say profile; in UI, reports, and stakeholder conversation, say persona; never both in one paragraph. A profile parameterizes a run.
onboarding
The pre-condition account state a run needs - the answers the SUT's onboarding flow expects (careerStage, targetRole, locationPreference, skills, …), held in each profile's onboarding block. Onboarding state must be true before a query can run; readiness is a first-class preflight step (ADR-0024). Account-level state, set once per profile - distinct from the per-run query.
query
The user's typed input into the SUT - the prompt that triggers the scenario's drive phase (e.g. "senior react engineer remote"). For sweeps, queries come from a corpus (docs/research/jsi-gold-set-v1.json). Load-bearing input: the verdict only makes sense relative to the query. Per-run - distinct from per-profile onboarding.
Diagnostics - sweep & activation metrics
scorecard
A single snapshot of the activation system's measured behavior, computed by computeScorecard(rows) in src/lib/scorecard.ts (PRD-23 §23.1.1). Carries the North Star metric (EAR) plus Tier 2 driver rates plus Tier 3 diagnostics (cohort breakdown, lever-firing rates and counts, terminal distribution). Pure function - same input rows → same output scorecard, regardless of surface (Markdown, JSON, ClickHouse row).
A scorecard is the output of one sweep run (written to sqa.sqa_sweep_summary) or of one ClickHouse aggregation over a tail window. Independent of surface and of provenance.
Explained Activation Rate (EAR)
The North Star metric for snappy's activation system:
EAR = (active + blocked + canonical-merged) / (total - transport-error)
The numerator is "snappy reached a state we can explain." The denominator excludes transport-error because that's an SQA-side fault, not a snappy-side outcome - counting it would penalise SQA for its own observability failures.
Rationale and Tier 2 driver tree pinned in docs/research/2026-05-13-activation-north-star-kpi.md §2. First baseline measured 2026-05-13 against snappy 0.16.0 on F500: 87.60% (438/500).
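The formula can be sketched as a pure function. `TerminalCounts` and `explainedActivationRate` are hypothetical names, not the shape used by computeScorecard, and returning NaN when nothing is measurable is an assumption of this sketch. The counts in the usage example are invented for illustration and are not measured data.

```typescript
interface TerminalCounts {
  active: number;
  blocked: number;
  canonicalMerged: number;
  transportError: number;
  total: number;
}

// EAR = explained terminals / (total minus SQA-side transport errors).
function explainedActivationRate(c: TerminalCounts): number {
  const denominator = c.total - c.transportError;
  if (denominator <= 0) return NaN; // nothing measurable (assumption)
  return (c.active + c.blocked + c.canonicalMerged) / denominator;
}

// Invented counts: 95 explained out of 110 total, 10 of which were transport errors.
const illustrativeEar = explainedActivationRate({
  active: 80,
  blocked: 10,
  canonicalMerged: 5,
  transportError: 10,
  total: 110,
});
console.log(illustrativeEar); // 0.95
```

Had the 10 transport errors stayed in the denominator, the same run would read 95/110, which is the distortion the exclusion is meant to avoid.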
Tier 2 driver
One of the four metrics that decompose EAR's gap from 1: silent-unresolved rate, probe-body-undelivered rate, transport-error rate, and cohort-other rate. Each answers a different "why didn't activation succeed?" question - silent-unresolved is "snappy gave up without telling us why," probe-body-undelivered is "snappy reached the site but couldn't read it," transport-error is "SQA couldn't ask snappy," and cohort-other is "snappy said something we haven't classified yet." They live on Scorecard.tier2 and propagate into sqa.sqa_sweep_summary columns for trend queries.
metric vs KPI
A metric is anything quantitative a run emits. A KPI is a metric promoted to load-bearing: KPI = metric × promise. Promotion needs three things - the contract names it, a band gives it a verdict story, and a regression in it changes an outcome. The same number can be a metric in one probe and a KPI in another. Full treatment, with the metrics catalog and drift causes: glossary-deep-dives.md §3.
Words we deliberately don't use
- check - too broad; every conditional is a "check" in English. Use "probe" for the unit, "outcome" for the verdict. (Exception: Test and the other IEEE-29148 verification methods are bounded contract-side terms, not layer nouns.)
- monitor - commercial-uptime-tooling vocabulary (New Relic, BetterStack, Uptime Kuma); not our register.
- suite - test-framework baggage; "segment" is what we mean.
- test - overloaded with code-test connotations; SQA is not the unit-test layer. (The Test verification method is the one bounded exception.)
- flow / preflight as a layer word - preflight is a kind of segment (the one that runs first); other segments won't be preflights. As a layer word, use "segment."
- target as a layer word - targets.ts already names the env → URL/credentials map; using it as a synonym for SUT would collide.
See also
- glossary-deep-dives.md - long-form treatments of Contract, Verdict, and Metrics & KPIs.
- docs/problem.md - why SQA exists.
- ADR-0052 - scope includes value verification (supersedes ADR-0007).
- ADR-0020 / ADR-0048 - the outcome model.
- ADR-0005 / ADR-0014 - steps & spans.