Glossary

Reference · Updated 2026-05-29 11:10 EDT · 27 min read

Diátaxis form: reference. The single canonical vocabulary for SQA. Every other doc - PRDs, ADRs, guides, README, hardening-run, the web portal - uses these words and only these words for the concepts they name. If a term feels missing or wrong, change this file first, then sweep. This file merges the three former glossary docs (the layer glossary, the glossary-and-IA doc, and the deep-dives) into one source of truth; the long-form treatments for Contract, Verdict, and Metrics live in glossary-deep-dives.md.

The terms below cover three layers: the structure of an SQA run (probe → segment → run), the verdict it produces (outcome, score, band, verdict), and the contract it judges against (client, claim, falsifier, gap). Scope note: SQA verifies whether a system delivers the value it claims to its users, including graded, LLM-judged quality - see ADR-0052, which broadened the original operational-only scope of ADR-0007.

How the layers fit together

flowchart TB
  Run["run<br/>(one invocation, one trace, one exit code)"]
  System["system / SUT<br/>(snappy)"]
  Run --> System

  subgraph SystemBox [" "]
    direction TB
    SegA["segment: preflight<br/>(parallel probes)"]
    SegB["segment: domain-discovery-engine<br/>(synthetic transaction)"]

    System --> SegA
    System --> SegB

    subgraph PreflightChildren ["component probes"]
      direction LR
      ReadyOwn["snappy.ready"]
      ProbeS3["s3.ready"]
      ProbeMongo["mongo.ready"]
      ProbeEtc["…6 more"]
    end
    SegA --> ReadyOwn
    SegA --> ProbeS3
    SegA --> ProbeMongo
    SegA --> ProbeEtc

    subgraph ScenarioChildren ["one file: drive → observe → cleanup"]
      direction TB
      Step1["2.1 drive: POST /domains"]
      Step2["2.2 observe: poll status"]
      Step3["2.3 observe: history"]
      Step4["2.4 cleanup: DELETE"]
      Step1 --> Step2 --> Step3 --> Step4
    end
    SegB --> Step1
  end

  subgraph Components ["src/components/&lt;name&gt;/"]
    direction LR
    CompS3[s3]
    CompMongo[mongodb]
    CompLoki[loki]
    CompMore["…"]
  end

  subgraph SUTAPI ["SUT public API"]
    SUTEP["snappy HTTP endpoints"]
  end

  ProbeS3 -.HTTP/SDK.-> CompS3
  ProbeMongo -.SDK.-> CompMongo
  ReadyOwn -.HTTP.-> SUTEP
  Step1 -.HTTP.-> SUTEP
  Step2 -.HTTP.-> SUTEP
  Step3 -.HTTP.-> SUTEP
  Step4 -.HTTP.-> SUTEP

  Outcome["Result envelope (seven-state)<br/>pass / warn / fail / unknown / error / skip / score<br/>(ADR-0020 + ADR-0048)"]
  Step4 --> Outcome
  ProbeEtc --> Outcome
  ReadyOwn --> Outcome
  Outcome --> ExitCode["process.exitCode<br/>0 (pass/warn/skip/unknown/score) / 1 (fail/error)"]

Two kinds of segment, two kinds of work. Preflight composes component probes (each asks one question of one dependency). The domain-discovery-engine segment wraps one scenario - a synthetic transaction: drive → observe → verify → cleanup, all in one file, returning one composite Result.

How to read the diagram:

A run drives the SUT through one or more segments.
Each segment composes children of one kind: a bundle of probes (preflight), or a single scenario (a synthetic transaction that drives the SUT and observes what it did). The run's segments collectively cover both kinds.
A scenario is internally composed of steps (drive / observe / verify / cleanup), not of segments. Steps nest with dotted IDs (2.1, 2.1.3).
Every leaf returns a Result with one of seven outcomes (six categorical, per ADR-0020, plus the continuous score, per ADR-0048). Outcomes aggregate up the tree (worst child wins, with the skip→warn promotion rule). The root's outcome maps to the process exit code.
Probes talk to components over the wire (HTTP, SDK, gRPC). Scenarios talk to the SUT's public surface and may also re-query components to confirm side-effects landed where the SUT claimed.

How a run becomes a verdict

The structural layer answers "did it run?"; the verdict layer answers "did it deliver?" Both ship on every run:

flowchart LR
  Scenario["Scenario"] -->|governed by| Contract["Contract"]
  Contract -->|asserts| Claim["Claim (Strength · Status · Falsifier)"]
  Run["Run"] -->|executes| Scenario
  Run -->|judged by| Judge["Judge (LLM ensemble or rule)"]
  Judge -->|produces| Score["Score ∈ [0,1]"]
  Score -->|banded into| Band["Band (green / yellow / red)"]
  Run -->|produces| Outcome["Outcome (7-state)"]
  Outcome --> Verdict["Verdict (human summary)"]
  Band --> Verdict
  Claim -->|when falsified| Gap["Gap"]
  Gap -->|cites| Evidence["Evidence"]

A run executes a scenario that a contract governs. A judge grades the scenario's evidence against the contract's claims, producing a score that bands into green/yellow/red. The machine reads the outcome ("did the gate pass?"); the human reads the verdict ("what happened?"), which packages outcome + score + band into one legible sentence. A falsified claim becomes a gap, which cites the evidence that refutes it.

Structure - the layers of a run

probe

The smallest unit of work that asks one question of a component of the system. One function in one file, returns one Result. Pure: no throws, no side effects beyond logging, same input → same output shape (per ADR-0012).
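A minimal sketch of the probe shape described above. The `Result` here is a simplified stand-in (the real one is pinned in src/lib/result.ts and is not reproduced in this glossary), and `readyProbe` and `StatusFetcher` are hypothetical names. The point is only that every failure mode is folded into a returned Result instead of a throw.

```typescript
// Simplified stand-in for the real Result in src/lib/result.ts (assumption).
type Outcome = "pass" | "warn" | "fail" | "unknown" | "error" | "skip";

interface Result {
  name: string;
  outcome: Outcome;
  context: Record<string, unknown>;
}

// Hypothetical transport: resolves to an HTTP status, rejects on timeout.
type StatusFetcher = () => Promise<number>;

// One question of one component. Nothing escapes as an exception.
async function readyProbe(name: string, fetchStatus: StatusFetcher): Promise<Result> {
  try {
    const status = await fetchStatus();
    const outcome: Outcome = status >= 200 && status < 300 ? "pass" : "fail";
    return { name, outcome, context: { status } };
  } catch (cause) {
    // Assumption: every rejection is treated as "no answer" (unknown).
    // Telling a probe-side bug (error) apart is left out of this sketch.
    return { name, outcome: "unknown", context: { cause: String(cause) } };
  }
}
```

A caller never needs a try/catch around `readyProbe`; it only inspects `outcome`.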

Lives at src/components/<system>/<verb>.ts (a component-of-the-SUT probe; the typical case) or at src/systems/<sut>/ready.ts (a probe against the SUT's own surface - the SUT is itself one of its components, observed from outside via its public API).

Examples: s3.ready, mongodb.ready, loki.ready, snappy.ready (the SUT's /health/ready).

The word matches Kubernetes (livenessProbe / readinessProbe), Prometheus Blackbox (/probe endpoint), and the Google SRE book. The IETF/Nagios noun "check" means the same thing in those ecosystems but is too broad in English (every conditional is a "check") - we don't use it as a layer-1 noun.

scenario

A synthetic transaction that drives the SUT through its public API to exercise a business behavior, then observes the SUT's response and verifies the side-effects. One scenario = one file = one composite Result.

Lives at src/systems/<sut>/<scenario-name>.ts. Named after the SUT and the behavior it drives (domain-discovery-engine, web-crawl-engine, metaintro-chat), not after any component or phase.

A scenario is composed of three kinds of internal phases - these are roles (and steps) inside the scenario file, not separate layers, and not segments:

drive - a SUT-mutating call (POST, DELETE).
observe - a read against the SUT's own surface or its observability components (poll the API for state, query Loki for log lines, read Hatchet for workflow runs). The SUT narrating what it just did.
verify - a read against a component the SUT was expected to write to (Mongo collection, ClickHouse table, S3 prefix). The side-effects actually landed.

A scenario typically runs drive → observe → (optional verify) → cleanup, with state threaded between phases (a domainId produced by drive becomes input to observe and cleanup). The file aggregates the phase Results into one composite Result; the scenario reports pass only when every phase passed.
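The phase order and state threading can be sketched as follows. `FakeSut` is a hypothetical in-memory stand-in, not the snappy API, and the result types are simplified; the sketch only illustrates one file threading a `domainId` from drive through cleanup and reporting pass when every phase passed.

```typescript
type Outcome = "pass" | "fail";

interface PhaseResult {
  phase: "drive" | "observe" | "verify" | "cleanup";
  outcome: Outcome;
}

interface ScenarioResult {
  outcome: Outcome;
  phases: PhaseResult[];
}

// Hypothetical stand-in for the SUT's public API plus one component it
// writes to. Nothing here mirrors the real snappy endpoints.
class FakeSut {
  private nextId = 1;
  private readonly domains = new Map<string, string>(); // domainId -> fqdn
  readonly written = new Set<string>(); // side-effect store

  create(fqdn: string): string {
    const id = `d${this.nextId++}`;
    this.domains.set(id, fqdn);
    this.written.add(id);
    return id;
  }
  lookup(id: string): string | undefined {
    return this.domains.get(id);
  }
  remove(id: string): boolean {
    this.written.delete(id);
    return this.domains.delete(id);
  }
}

function runScenario(sut: FakeSut, fqdn: string): ScenarioResult {
  const phases: PhaseResult[] = [];
  const record = (phase: PhaseResult["phase"], ok: boolean): void => {
    phases.push({ phase, outcome: ok ? "pass" : "fail" });
  };

  const domainId = sut.create(fqdn); // drive: produces the threaded state
  record("drive", domainId.length > 0);
  record("observe", sut.lookup(domainId) === fqdn); // observe: the SUT narrates
  record("verify", sut.written.has(domainId)); // verify: the side-effect landed
  record("cleanup", sut.remove(domainId)); // cleanup: consumes the same id

  const allPassed = phases.every((p) => p.outcome === "pass");
  return { outcome: allPassed ? "pass" : "fail", phases };
}
```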

Why one file, not separate folders. The literature is unanimous: a synthetic transaction is one logical unit. Sam Newman's Building Microservices (2nd ed., ch. 10, "Semantic Monitoring") and Google's SRE book (ch. 17, "Black-Box Monitoring") both treat the phases as roles inside one transaction, not as a folder layer. Splitting drive/observe/verify into separate directories fragments the transaction and makes shared state harder to thread. We follow the books.

A scenario is not a probe: probes ask "is X up?" while scenarios ask "does the SUT do Y when poked?" A scenario can be slotted into a run as one segment.

segment

The composition layer. Composes one kind of child for one SUT and aggregates their Results into one composite Result. Pure: segments don't throw, they aggregate (ADR-0012). Each segment is either a bundle of probes or a single scenario - not both at once; a run's segments collectively span both kinds.

Lives at src/systems/<sut>/<segment>.ts. Today snappy has two segments:

preflight.ts - composes component probes + the SUT's own ready probe in parallel (steps 1.1-1.8). "Are we ready to start testing?"
domain-discovery-engine.ts - wraps one scenario in one file (steps 2.1-2.8) for the DomainDiscoveryEngine SUT. Phases: drive (2.1), observe (2.2), verify (2.3-2.7 against Mongo / S3 / ClickHouse / Loki), cleanup (2.8). One composite Result per glossary §scenario.

Segment file names are domain words (preflight, domain-discovery-engine) - not layer words.

A segment composes its children with parallelize(...) (default, ADR-0015) or sequentialize(...) when one child must finish before the next can start. Component probes are typically parallel (independent dependencies); scenario phases are typically sequential (each phase consumes prior phase's state).
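The difference between the two composition modes can be sketched like this. The real `parallelize(...)` and `sequentialize(...)` signatures are not shown in this glossary, so `parallelizeSketch` and `sequentializeSketch` below are hypothetical stand-ins with an assumed child type.

```typescript
interface Result {
  name: string;
  outcome: "pass" | "fail";
}

// Assumed child shape: a function that runs one probe or phase.
type Child = () => Promise<Result>;

// Start every child at once; results keep declaration order.
async function parallelizeSketch(children: Child[]): Promise<Result[]> {
  return Promise.all(children.map((child) => child()));
}

// Start each child only after the previous one has finished.
async function sequentializeSketch(children: Child[]): Promise<Result[]> {
  const results: Result[] = [];
  for (const child of children) {
    results.push(await child());
  }
  return results;
}
```

Independent readiness probes suit the first form; scenario phases that consume a prior phase's state need the second.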

The community has not converged on a name for this layer (Prometheus calls it a module, Checkly a check group, Datadog steps - all different). "Segment" reads naturally as "a slice of one run, scoped to one SUT."

run

One top-level invocation of SQA. Produces one Result tree, one trace, one exit code. The whole thing. One run = one scenario × profile × commit when parameterized (e.g. a JSI trial).

Driven by src/index.ts. A run can be triggered by human (interactive make run), cron, ci, incident, or unknown - see ADR-0014. Persists to runs/<traceId>/ (and optionally s3://sqa/runs/<traceId>/).

The word is universal across the field (Datadog's "test run," Checkly's "check result," Nagios's "execution") so we use it without ceremony.

trace

The unique identifier assigned to one run. It threads through the Result tree, logs, emitted events, persisted artifacts, and the run directory (runs/<traceId>/) so every piece of evidence can be tied back to the exact invocation that produced it.

Trace IDs are produced by src/lib/trace.ts and follow ADR-0014. For parameterized JSI runs, the trace includes enough run context to distinguish profile / query / trial artifacts.

trial

One run, viewed as a member of a sweep. A synonym for run, used inside sweep contexts and dashboards. A sweep of 5 profiles × 4 queries spawns 20 trials; each trial is a normal run with the same artifact layout. Not a different shape - a different name for the same thing in a batch context. (Don't use "trial" to mean a flaky re-run; that's a retry.)

sweep

A batch of runs, aggregated into one artifact directory (runs/sweep-<timestamp>/) with summary.json, results.tsv, fails.tsv, and findings.md. Manual and point-in-time: a human launches it for diagnostic work, not as a guardrail. Two flavors ship:

Domain sweep - src/runners/sweep/index.ts (PRD-21). N runs over a fixed domain list; each child is a normal SQA run against one FQDN (bun run src/index.ts). "How does snappy 0.X.Y behave across the F500 list?"
JSI sweep - src/runners/jsi-sweep/ (PRD-18). N trials over a query × profile cartesian product; each trial runs runMetaintroChat() for one (profile, query) pair and produces one JSI score.

Distinct from continuous tail (which observes prod activations on a sample basis, continuously) and from probe (the leaf, asking one question of one component). The aggregate of a sweep is a scorecard.

continuous tail

A continuous, sampled observation of real prod activations, driven by src/runners/tail/index.ts (PRD-21 / ADR-0039). The tail subscribes to a fraction of activations as they land in snappy.domains_analytics (default 1%, hash-on-domainId for deterministic per-domain sampling), runs the existing C1-C7 verifier stack against each in post-hoc mode (no drive, no cleanup - the activation already happened), and persists each result to ClickHouse sqa.sqa_runs for the dashboard + three-ladder alerting (page / warn / info).
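Hash-on-domainId sampling can be sketched as below. The hash function the tail actually uses is not stated here; FNV-1a over UTF-16 code units is an assumption chosen because it needs no imports, and `isSampled` is a hypothetical name. What the sketch shows is the property that matters: the same domainId always lands on the same side of the sampling line.

```typescript
// 32-bit FNV-1a over UTF-16 code units (assumed hash, see lead-in).
function fnv1a32(input: string): number {
  let hash = 0x811c9dc5;
  for (let i = 0; i < input.length; i++) {
    hash ^= input.charCodeAt(i);
    hash = Math.imul(hash, 0x01000193) >>> 0;
  }
  return hash >>> 0;
}

// Deterministic per-domain sampling: bucket the hash into 10000 slots and
// keep the lowest `rate` fraction of them. Default rate is 1%.
function isSampled(domainId: string, rate: number = 0.01): boolean {
  const buckets = 10000;
  return fnv1a32(domainId) % buckets < Math.round(rate * buckets);
}
```

Because the decision depends only on the ID, re-processing the same activation never flips it in or out of the sample.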

The tail's verifier surface is identical to the sweep's - both consume validateClaim and the claim set from @metaintro/snappy-contracts. The difference is provenance: sweep mode creates a probe domain and verifies its own writes; tail mode observes an activation snappy created and verifies its side-effects. Adding a new claim to the contracts package lights up both surfaces with zero edits in SQA.

Distinct from sweep (manual, point-in-time) and from scenario (which the tail composes from - it runs the verify phases of the DomainDiscoveryEngine scenario, with drive/observe/cleanup deliberately absent).

step

The user-facing label on a probe, segment, or scenario phase in pretty output and log lines: 1, 1.1, 1.2.3. Hierarchical, declaration-ordered. Top-level (1, 2) = a segment; second level (1.1) = a child within it; third level (1.1.1) = a sub-step. Defined at src/lib/step.ts and ADR-0005.

A step is what humans read. Its purpose is "tell me which probe this is" without forcing the reader to scan span IDs.
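As an illustration of the dotted-ID scheme, here is a small sketch that parses step IDs, names their level, and orders them the way a reader expects (1.2 before 1.10). These helpers are hypothetical; the real implementation lives in src/lib/step.ts and may differ.

```typescript
type StepLevel = "segment" | "child" | "sub-step";

// "1.2.3" -> [1, 2, 3]; anything else is rejected.
function parseStepId(id: string): number[] {
  if (!/^\d+(\.\d+)*$/.test(id)) {
    throw new Error(`malformed step id: ${id}`);
  }
  return id.split(".").map(Number);
}

// Top level is a segment, second level a child, deeper levels sub-steps.
function levelOf(id: string): StepLevel {
  const depth = parseStepId(id).length;
  if (depth === 1) return "segment";
  if (depth === 2) return "child";
  return "sub-step";
}

// Numeric, part-by-part ordering; a parent sorts before its children.
function compareStepIds(a: string, b: string): number {
  const pa = parseStepId(a);
  const pb = parseStepId(b);
  const shared = Math.min(pa.length, pb.length);
  for (let i = 0; i < shared; i++) {
    const diff = (pa[i] ?? 0) - (pb[i] ?? 0);
    if (diff !== 0) return diff;
  }
  return pa.length - pb.length;
}
```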

span

The tracing primitive a step opens. Each step("1.7", "...", fn) call begins a span, propagated via AsyncLocalStorage, that stamps log lines with traceId / spanId / parentSpanId. Defined at src/lib/trace.ts and ADR-0014. W3C/OpenTelemetry-format IDs.

Step is the label; span is the trace primitive. They map 1:1 - every step opens exactly one span - but the words are not interchangeable. "Step 1.7" reads naturally; "span 1.7" does not.

SUT

A system, or SUT (system under test), is the discipline-level thing SQA evaluates (snappy, metaintro-chat, future siblings). One folder per system under src/systems/. Layout (flat, no subdirectories):

the system's own ready.ts (a probe against the SUT's public surface);
one <scenario>.ts per scenario, named after the SUT (e.g. domain-discovery-engine.ts for the DomainDiscoveryEngine SUT) - see §scenario;
one <segment>.ts per segment that composes more than one child (e.g. preflight.ts); single-child segments may be composed inline from index.ts;
index.ts - the system's top-level flow, exporting runX();
per ADR-0022, files for probe-owned fixtures the SUT needs (e.g. cohort.ts for snappy's auto-bootstrapped probe cohort).

SUT identifiers mirror the SUT's own engine names (ADR-0046); the folder is named after what is under test, never after a kind of testing.

component

A component is a service or dependency the system relies on - S3, MongoDB, Redis, ClickHouse, Loki, Hatchet, OpenRouter, the browser, and so on. The system has components; SQA verifies the system by probing each component plus driving the system itself.

One folder per component under src/components/<name>/. Each folder holds the probes against that component (ready.ts, count.ts, version.ts, …). The browser is a component family (ADR-0050).

The word component in this codebase is the systems-engineering sense ("a part of a system") - not the testing-literature sense ("a smaller code unit being tested in isolation"). Contributors from a testing background should map our "component" to "external dependency."

instrument

The same question ("did the chat return good matches?") can be carried by different instruments - drive a real browser (Playwright), shell out to a CLI, issue a raw fetch, or query a model. The instrument decides how the SUT is exercised and what evidence a step can capture (a browser yields screenshots + video; a fetch yields headers + body; a model yields a scored judgment). A step MAY declare its carrier via the optional Result.instrument field; where it doesn't, inferInstrument(stepName) derives it at read time (PRD-30 §30.5.8). Browser instrument helpers live as a SUT-agnostic family at src/lib/browser/ (ADR-0050).

Not a component (the component is what is observed; the instrument is the means). Not a probe (a probe uses an instrument). Not evidence (the instrument is how a step ran; evidence is what it produced - provenance lives on the evidence, per ADR-0051).

Durable Execution Engine (DEE)

A Durable Execution Engine is a runtime that executes workflow steps against a durable, append-only log tracking workflow progress, parameters, return values, and durable promises. The engine can re-hydrate a workflow from the log after process death, worker outage, or restart - every step's effect is recoverable from the log.

Source: Bellemare, Building Event-Driven Microservices (2nd ed., 2024) Ch 10. The DEE category is the 2nd edition's framing for systems that previously got called "workflow engines" or "orchestrators"; the durability + append-only-log distinction is what makes them DEEs and not just schedulers.

Canonical instance for snappy: Hatchet. Snappy runs its workflows on Hatchet; SQA's verifier never reads Hatchet's internal log directly - it reads the durable rows the Hatchet workflows write (Mongo outbox_entries, snapshots, ClickHouse mirrors, S3 objects) and the Loki log markers those workflows emit. See ADR-0031 for the verifier pattern that reads the outbox the DEE produces.

Verdict - what a run produces

outcome

The closed-vocabulary verdict a Result carries. One of seven values - six categorical (per ADR-0020, superseding ADR-0012's original five) plus the continuous score (per ADR-0048), pinned at src/lib/result.ts:

outcome	meaning
`pass`	system answered correctly
`warn`	system answered, answer is degraded but not blocking
`fail`	system answered wrongly (auth denied, wrong shape)
`unknown`	timeout / circuit-broken - absence of information, not a probe-side bug (ADR-0020, DDIA Ch. 9)
`error`	probe-side bug: could not even ask correctly (programming error, malformed config)
`skip`	deliberately not run (env not configured, prereq absent)
`score`	a continuous graded verdict - carries `{value ∈ [0,1], band}` in context (ADR-0048)

fail vs unknown vs error is the load-bearing trichotomy: the system said no (fail) vs we don't know what the system would say (unknown) vs we made a mistake asking (error). Different remediations, all preserved.

score is a genuinely different kind of claim - a graded quality dimension, not a binary gate. A score result never escalates to fail: a RED band stays inside the score dimension for aggregation (ADR-0048 D6), so a low-but-honest quality reading does not look like an operational outage.

Severity ordering for aggregation: pass < skip < warn < unknown < fail < error, with the special rule that a composite whose worst child is skip promotes to warn. Exit-code mapping: pass / warn / skip / unknown / score → 0, fail / error → 1 (unknown is "run completed; no information" - not a CI failure, but flagged in dashboards).
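The aggregation and exit-code rules above can be sketched as follows. Two points are assumptions, since the text does not pin them: score children are left out of the categorical roll-up (the text only says score stays in its own dimension), and a composite with no categorical children reads as pass. The names `aggregate` and `exitCodeFor` are hypothetical.

```typescript
type Categorical = "pass" | "skip" | "warn" | "unknown" | "fail" | "error";
type Outcome = Categorical | "score";

// pass < skip < warn < unknown < fail < error
const SEVERITY: Record<Categorical, number> = {
  pass: 0,
  skip: 1,
  warn: 2,
  unknown: 3,
  fail: 4,
  error: 5,
};

// Worst child wins; a composite whose worst child is skip promotes to warn.
function aggregate(children: Outcome[]): Categorical {
  let worst: Categorical = "pass";
  for (const child of children) {
    if (child === "score") continue; // assumption: score never escalates here
    if (SEVERITY[child] > SEVERITY[worst]) worst = child;
  }
  return worst === "skip" ? "warn" : worst;
}

// Only fail and error fail the process; everything else exits 0.
function exitCodeFor(outcome: Outcome): 0 | 1 {
  return outcome === "fail" || outcome === "error" ? 1 : 0;
}
```

For example, a preflight with seven passing probes and one skipped probe aggregates to warn and still exits 0.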

The discriminant field is named outcome, not status, because HTTP status codes appear constantly in Result.context and a duplicate name there would be a readability landmine.

score

A continuous numeric verdict in [0,1], banded into green/yellow/red. The score outcome kind (ADR-0048), emitted via score(name, value, context) from src/lib/result-score.ts with value ∈ [0,1] (throws on out-of-range - a malformed score is a probe-side bug). Surfaced as 0-100 in the web/API layer; the underlying value stays [0,1] in the Result.

Use a score when you want a continuous, falsifiable quality measure ("how relevant were these jobs?") rather than a pass/fail. Operational facts (did the endpoint answer?) stay binary - pass/fail/unknown. Score maps to a band; band feeds the verdict.

band

A named region of the score line - green, yellow, or red (lowercase literals in code; UI prose may render them uppercase). Cutoffs are fixed across all scores by scoreBandFor() in src/lib/result-score.ts (ADR-0048):

band	value ∈ [0,1]	0-100	reading
`green`	≥ 0.75	≥ 75	healthy / candidate-quality
`yellow`	≥ 0.55 and < 0.75	≥ 55 and < 75	borderline; investigate
`red`	< 0.55	< 55	bad outcome; performed poorly

Bands are global - don't invent per-probe cutoffs, or dashboard aggregation breaks. Re-calibrating a cut-point is a contract change and requires an ADR (see the deep-dive on Verdict). Not a threshold (a threshold is a single cutoff; bands partition the whole line).
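A minimal sketch of the banding rule, assuming the cutoffs in the table with each boundary value belonging to the higher band (0.75 is green, 0.55 is yellow). The real function is scoreBandFor() in src/lib/result-score.ts; `bandFor` and `toDisplay` are hypothetical names, and the out-of-range throw mirrors the behaviour described for score values.

```typescript
type Band = "green" | "yellow" | "red";

// Global cutoffs: one partition of [0,1] shared by every score.
function bandFor(value: number): Band {
  if (!Number.isFinite(value) || value < 0 || value > 1) {
    throw new RangeError(`score must be in [0,1], got ${value}`);
  }
  if (value >= 0.75) return "green";
  if (value >= 0.55) return "yellow";
  return "red";
}

// The web/API layer shows 0-100; the Result keeps the [0,1] value.
function toDisplay(value: number): number {
  return Math.round(value * 100);
}
```

A value of 0.73 displays as 73 and bands yellow, which matches the "73/100 · yellow" reading used elsewhere in this glossary.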

verdict

The single operator-readable summary of a run - a sentence a non-engineer reads in a few seconds. It is not the same as outcome: the outcome tells the machine whether the gate passed; the verdict tells the human what happened. Both ship on every run.

For score-bearing scenarios (e.g. JSI), the verdict is the number plus a band word - "Mostly relevant - 73/100 · yellow" - with the per-axis components underneath for legibility. For operational scenarios, it is the worst outcome plus a one-line reason - "fail - s3.ready returned 403." A bare "67/100" is not a verdict (missing the word and band); a verdict without per-axis components is opaque. Full treatment: glossary-deep-dives.md §2.

judge

A verifier - an LLM ensemble or a rule - that grades evidence against a rubric. Per ADR-0033, a judge must name its rubric. For JSI the judge is an LLM ensemble consuming prompts from src/lib/verify/scoring/relevancy-prompt.ts and coverage-prompt.ts, emitting a score with a per-job vote record.

Not a probe (a probe asks "is X up?"; a judge grades "how good was the answer?"). Judges produce scores; scores feed verdicts; judges cite rubrics; rubrics define the falsifiers for graded claims. Lives under src/lib/verify/ (ADR-0047).

axis

A measurement dimension of a graded probe - one orthogonal thing being scored, a named numeric ∈ [0,1] produced by one judge. The shipped JSI has two axes: Relevancy (weight 0.60) and Coverage (weight 0.40), per PRD-18 D8 and src/lib/verify/scoring/jsi.ts.

Note

Some older material refers to "8 JSI axes." That was the exploratory roundbook-v4 framing, never the shipped composition. The code computes two axes; treat anything claiming eight as stale.

Not a score (an axis is the dimension; the score is the value on it). To add an axis, add a *-prompt.ts judge under src/lib/verify/scoring/ and extend the composition math.

JSI

The canonical graded probe today: a banded measure (0-1, shown 0-100) of how well the metaintro-chat job-search scenario delivers on its single client-facing claim - "the jobs returned are relevant to what the user asked for." The shipped composition (per src/lib/verify/scoring/jsi.ts, ADR-0048):

JSI_raw = 0.60·Relevancy + 0.40·Coverage        (weighted mean)
JSI     = min(JSI_raw, 0.30)  when either axis < 0.30   (gate cap)

A JSI sweep of N profiles × M queries produces N×M scores; the scorecard reports mean / median / worst-case JSI per profile. JSI is a probe of metaintro-chat, not the SUT itself, and not a generic name for "any quality score" - a sibling graded probe (e.g. a future corpus-quality index for snappy) follows the same pattern under its own name.
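The composition formula can be read as the following sketch. It follows the two lines of the formula as written (weights 0.60 / 0.40, cap at 0.30 when either axis is below 0.30); the function name `composeJsi` is hypothetical and the shipped code in src/lib/verify/scoring/jsi.ts may structure this differently.

```typescript
const RELEVANCY_WEIGHT = 0.6;
const COVERAGE_WEIGHT = 0.4;
const GATE = 0.3;

// Weighted mean of the two axes, capped at the gate value when either
// axis falls below the gate.
function composeJsi(relevancy: number, coverage: number): number {
  const raw = RELEVANCY_WEIGHT * relevancy + COVERAGE_WEIGHT * coverage;
  const gated = relevancy < GATE || coverage < GATE;
  return gated ? Math.min(raw, GATE) : raw;
}
```

With relevancy 0.9 and coverage 0.2, the weighted mean is 0.62 but the gate caps the JSI at 0.30, so a strong axis cannot hide a collapsed one.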

Job Search Index is the expanded name of JSI.

relevancy

The JSI gate axis that asks whether returned jobs answer the query. The relevancy judge scores each returned job against the query, profile, and rubric, then composes those per-job grades into a ranking-sensitive score.

coverage

The JSI gate axis that asks whether the result set covers the query's named aspects - seniority, technology, location, work mode, and similar constraints. A result set can contain one strong job and still lose coverage when other explicit aspects are absent.

nDCG

Normalized Discounted Cumulative Gain at 10, the ranking metric used by the relevancy axis. It rewards relevant jobs near the top of the returned list more than equally relevant jobs buried lower down.
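For readers new to the metric, here is a sketch of nDCG at k. It assumes linear gain (the grade itself, discounted by log2 of position + 1) and an ideal ordering built from the returned list only; the exponential-gain variant and a gold-set ideal are also common, and this glossary does not say which the relevancy judge uses. Function names are hypothetical.

```typescript
// DCG at k: sum of grade_i / log2(position_i + 1), positions starting at 1.
function dcgAt(grades: number[], k: number): number {
  return grades
    .slice(0, k)
    .reduce((sum, grade, index) => sum + grade / Math.log2(index + 2), 0);
}

// Normalize by the DCG of the same grades sorted best-first.
function ndcgAt(grades: number[], k: number = 10): number {
  const ideal = [...grades].sort((a, b) => b - a);
  const idealDcg = dcgAt(ideal, k);
  return idealDcg === 0 ? 0 : dcgAt(grades, k) / idealDcg;
}
```

The same three grades score 1 when the best job is listed first and noticeably less when it is listed last, which is the ranking sensitivity the entry describes.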

Contract - what a run judges against

client

The party a system promises value to - the role on the receiving end of a contract. The system delivers to its client; the verdict says whether it kept that promise. A client is a role filled by one of three kinds: a user (a human), an agent (an LLM acting on someone's behalf), or a system (another system integrating against this one).

One contract serves one client kind. Many personas of one kind (P1-P5 are all user-kind) stay one contract; two kinds (a human and an integrating system) are two scenarios with two contracts, because the promise, the claims, and the falsifiers all differ. A user-kind client is characterized by a persona.

Not a persona (a persona specifies a user-kind client; it is not the client). Not the reader of the SQA report (that's a reader - the operator/LLM consuming the verdict, distinct from the client the system serves). Client + kind are declared in the contract doc; they become a typed field on the scenario only when code must branch on kind. (Emerging - glossary-pinned, no dedicated ADR yet.)

contract

The promise-to-clients document for one scenario, written per the Keep Your Contract standard. It describes value - who is served, what they get, how we can tell the promise was kept - not shape (an OpenAPI spec or a type describes shape). One file at docs/contracts/<sut>/<scenario>.md.

SQA writes contracts about the SUT, not for the SUT to honor on SQA's behalf - it is the witness, not the promiser (ADR-0032). A verifier that authors its own promise can never falsify it. The canonical examples are VALUE.md (SQA's own value contract) and docs/contracts/metaintro-chat/job-search.md (a per-SUT contract that collapses verifier-shaped assertions into one client-facing claim). Full treatment: glossary-deep-dives.md §1.

claim

A single load-bearing sentence in a contract. Has an ID (e.g. C4), a strength, a status, a verification method, and a falsifier. One promise per claim - if two would falsify independently, write two claims.

Good: "C1: the jobs returned in the chat thread MUST be relevant to the user's query, judged by an LLM ensemble against rubric R1." Bad: "the system should be reliable" - not falsifiable; no observable event refutes it. A claim, when falsified, produces a gap.

strength

The RFC 2119 modal of a claim: MUST / SHOULD / MAY. MUST = breakage is a fail; SHOULD = breakage is a warn (with documented exceptions); MAY = discretionary, breakage is acceptable. Grades how load-bearing each claim is. Distinct from status (strength is how strong; status is where in the lifecycle).

status

Where a claim is in its lifecycle, from a closed, append-only vocabulary: Hypothesized → Committed → Verified → Broken → Superseded → Retracted. Hypothesized when first written; Committed once the team stakes the product on it; Verified once a run demonstrates it holds in evidence; Broken when a run falsifies it. Never edit an accepted claim's text in place - supersede instead, so the run that verified it still refers to the same sentence.

verification method

How a claim is checked, from a closed set of six: the four IEEE 29148 methods plus two additions. Inspection (read the code/doc) · Analysis (model it) · Demonstration (exercise and observe) · Test (automated assertion) are the IEEE 29148 methods; Judge (rubric-graded by LLM or human) · Field (observation in production) are the additions. Each claim names its method in the contract's verification block; choose the lightest method that can falsify.

Note

Test here is one of six verification methods - a specific, bounded term from IEEE 29148. It is not the banned layer-noun "test" (see "Words we deliberately don't use"); don't use "test" as a synonym for probe/segment.

falsifier

The observable event that, if seen, refutes a claim. Stated as a positive, concrete event observable from outside the SUT (ADR-0034 - negative claims are observational, not adversarial). Write the thing you would see that means the claim is wrong.

Good: "a returned job whose title matches none of the query synonyms, judged by ≥2 of 3 LLM voters." Bad: "the system fails" - not observable, not specific. A passive falsifier ("if it breaks, we'll notice") is not a falsifier. Falsifier triggered → gap recorded.

gap

An instance of a falsified claim, attributed to a specific run. A GapRow at runs/<traceId>/gaps.json (src/lib/gap-log.ts, ADR-0027), carrying runId, ts, scenario, claim (the claim ID), observed, expected, and evidence. Verifier code calls appendGap({...}) when a falsifier fires - one gap per falsified claim instance per run. A gap with no evidence block is untriageable.

A gap is a finding with a claim attribution - Gap ⊂ Finding.
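A sketch of the row shape, using the field names listed above. The field types are assumptions (the real GapRow is defined in src/lib/gap-log.ts), and `isTriageable` is a hypothetical helper that only encodes the rule that a gap without evidence cannot be triaged.

```typescript
// Field names from the glossary entry; every type here is an assumption.
interface GapRow {
  runId: string;
  ts: string; // assumed: ISO-8601 timestamp
  scenario: string;
  claim: string; // claim ID, e.g. "C1"
  observed: string;
  expected: string;
  evidence: string[]; // assumed: references to evidence items
}

// A gap with no evidence is recorded but cannot be acted on.
function isTriageable(gap: GapRow): boolean {
  return gap.evidence.length > 0;
}

// Illustrative row with made-up values.
const exampleGap: GapRow = {
  runId: "run-example",
  ts: "2026-01-01T00:00:00Z",
  scenario: "job-search",
  claim: "C1",
  observed: "returned job title matched none of the query synonyms",
  expected: "every returned job is relevant to the query",
  evidence: ["evidence-1"],
};
```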

finding

Any noteworthy observation in a run - broader than a gap. A finding may or may not cite a claim: it includes gaps (falsified claims), surprises (unexpected SUT behavior not yet in a contract), and operational notes. Rendered into per-run reports and a sweep's findings.md. Findings are the superset; gaps are the claim-attributed subset.

evidence

One typed artifact a run produced that supports or refutes a claim - a screenshot, a log file, a scraped row, a score, a Loki link, a video. A discriminated union on kind (file | link | text | row | metric | media) at src/lib/evidence.ts, pinned by ADR-0051. Each piece carries an id, a caption, optional stepId provenance (which step/span produced it), optional claimIds (which claims it bears on), and a kind-discriminated payload; file/media kinds also carry court-grade provenance (sha256, sizeBytes, contentType).
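The discriminated-union shape can be sketched like this. The common fields and the six `kind` values come from the entry above; the payload field names (`path`, `url`, `body`, `row`, `value`) are assumptions made for illustration, and the authoritative definition is in src/lib/evidence.ts.

```typescript
interface EvidenceBase {
  id: string;
  caption: string;
  stepId?: string; // which step/span produced it
  claimIds?: string[]; // which claims it bears on
}

// Extra provenance carried by file and media kinds.
interface Provenance {
  sha256: string;
  sizeBytes: number;
  contentType: string;
}

// Payload field names below are assumed, not taken from the real type.
type Evidence =
  | (EvidenceBase & Provenance & { kind: "file"; path: string })
  | (EvidenceBase & Provenance & { kind: "media"; path: string })
  | (EvidenceBase & { kind: "link"; url: string })
  | (EvidenceBase & { kind: "text"; body: string })
  | (EvidenceBase & { kind: "row"; row: Record<string, unknown> })
  | (EvidenceBase & { kind: "metric"; value: number });

// Switching on `kind` narrows to the matching payload.
function describeEvidence(e: Evidence): string {
  switch (e.kind) {
    case "file":
    case "media":
      return `${e.caption} (${e.contentType}, ${e.sizeBytes} bytes)`;
    case "link":
      return `${e.caption} -> ${e.url}`;
    case "text":
      return `${e.caption}: ${e.body}`;
    case "row":
      return `${e.caption} (${Object.keys(e.row).length} columns)`;
    case "metric":
      return `${e.caption} = ${e.value}`;
  }
}
```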

Not a gap (a gap cites evidence via claimIds). Not the instrument (the instrument is the means of observing; evidence is what the observation captured - provenance lives on the evidence, per ADR-0051).

Inputs - how a run is parameterized

archetype

A query family used by the JSI corpus to exercise the same job-seeker intent across multiple phrasings. Archetypes are the outer grouping; each concrete query is one typed prompt inside that family.

persona

The canonical human-readable name for a simulated user (e.g. "P1 - Mid-Senior IC Engineer (US)"). One entry in src/systems/metaintro-chat/corpus/profiles.json, carrying an id, label, detail, and onboarding block. Five personas today (P1-P5) spanning easy → hard with global coverage.

A persona characterizes a user-kind client - it pinpoints how that client uses the system. It is the external (UI/report/stakeholder) name for what the code calls a profile.

profile

The data shape backing a persona - the actual JSON object loaded from profiles.json (id, label, detail, onboarding). Conceptually identical to a persona; profile is the internal word, persona is the external word. Convention: in code and storage, say profile; in UI, reports, and stakeholder conversation, say persona; never both in one paragraph. A profile parameterizes a run.

onboarding

The pre-condition account state a run needs - the answers the SUT's onboarding flow expects (careerStage, targetRole, locationPreference, skills, …), held in each profile's onboarding block. Onboarding state must be true before a query can run; readiness is a first-class preflight step (ADR-0024). Account-level state, set once per profile - distinct from the per-run query.

query

The user's typed input into the SUT - the prompt that triggers the scenario's drive phase (e.g. "senior react engineer remote"). For sweeps, queries come from a corpus (docs/research/jsi-gold-set-v1.json). Load-bearing input: the verdict only makes sense relative to the query. Per-run - distinct from per-profile onboarding.

Diagnostics - sweep & activation metrics

scorecard

A single snapshot of the activation system's measured behavior, computed by computeScorecard(rows) in src/lib/scorecard.ts (PRD-23 §23.1.1). Carries the North Star metric (EAR) plus Tier 2 driver rates plus Tier 3 diagnostics (cohort breakdown, lever-firing rates and counts, terminal distribution). Pure function - same input rows → same output scorecard, regardless of surface (Markdown, JSON, ClickHouse row).

A scorecard is the output of one sweep run (written to sqa.sqa_sweep_summary) or of one ClickHouse aggregation over a tail window. Independent of surface and of provenance.

Explained Activation Rate (EAR)

The North Star metric for snappy's activation system:

EAR = (active + blocked + canonical-merged) / (total - transport-error)

The numerator is "snappy reached a state we can explain." The denominator excludes transport-error because that's an SQA-side fault, not a snappy-side outcome - counting it would penalise SQA for its own observability failures.
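The formula can be sketched as a small pure function. The `TerminalCounts` shape and the function name are hypothetical (the shipped computation is part of computeScorecard in src/lib/scorecard.ts), returning NaN for an empty denominator is an assumption, and the counts in the usage note are illustrative, not measured.

```typescript
// Hypothetical input shape: terminal-state counts for one sweep or window.
interface TerminalCounts {
  active: number;
  blocked: number;
  canonicalMerged: number;
  transportError: number;
  total: number;
}

// EAR = (active + blocked + canonical-merged) / (total - transport-error)
function explainedActivationRate(counts: TerminalCounts): number {
  const denominator = counts.total - counts.transportError;
  if (denominator <= 0) return NaN; // assumption: nothing snappy-side to explain
  return (counts.active + counts.blocked + counts.canonicalMerged) / denominator;
}
```

With 100 rows of which 4 are transport errors, 80 active, 6 blocked and 2 canonical-merged, the rate is 88 / 96, about 91.7%; counting the 4 transport errors in the denominator would have pulled it down to 88%.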

Rationale and Tier 2 driver tree pinned in docs/research/2026-05-13-activation-north-star-kpi.md §2. First baseline measured 2026-05-13 against snappy 0.16.0 on F500: 87.60% (438/500).

Tier 2 driver

One of the four metrics that decompose EAR's gap from 1: silent-unresolved rate, probe-body-undelivered rate, transport-error rate, and cohort-other rate. Each answers a different "why didn't activation succeed?" question - silent-unresolved is "snappy gave up without telling us why," probe-body-undelivered is "snappy reached the site but couldn't read it," transport-error is "SQA couldn't ask snappy," and cohort-other is "snappy said something we haven't classified yet." They live on Scorecard.tier2 and propagate into sqa.sqa_sweep_summary columns for trend queries.

metric vs KPI

A metric is anything quantitative a run emits. A KPI is a metric promoted to load-bearing: KPI = metric × promise. Promotion needs three things - the contract names it, a band gives it a verdict story, and a regression in it changes an outcome. The same number can be a metric in one probe and a KPI in another. Full treatment, with the metrics catalog and drift causes: glossary-deep-dives.md §3.

Words we deliberately don't use

check - too broad; every conditional is a "check" in English. Use "probe" for the unit, "outcome" for the verdict. (Exception: Test and the other IEEE-29148 verification methods are bounded contract-side terms, not layer nouns.)
monitor - commercial-uptime-tooling vocabulary (New Relic, BetterStack, Uptime Kuma); not our register.
suite - test-framework baggage; "segment" is what we mean.
test - overloaded with code-test connotations; SQA is not the unit-test layer. (The Test verification method is the one bounded exception.)
flow / preflight as a layer word - preflight is a kind of segment (the one that runs first); other segments won't be preflights. As a layer word, use "segment."
target as a layer word - targets.ts already names the env → URL/credentials map; using it as a synonym for SUT would collide.
