SQA Cockpit

Why SQA exists


Problem statement

This page states the bedrock problem this project exists to address. Sources: the owner's private value-precision research notes (not shipped), specifically case study 001-metaintro-sqa-tool, synthetic drift, and the insight depth ladder. The relevant findings are restated below; the original notes are not needed to follow the rest of this document.

What the name means

SQA = System Quality Assurance. Not code QA (unit tests, type checks), not workflow QA (E2E sequences) - system QA. The system is the highest container: the running application plus its network, storage, queues, deploys, and the business outcome each is supposed to produce.

The reason for picking "system" is empirical: the failures we ship to production are rarely "this function returned the wrong number". They look like "every test was green and the crawler still hasn't ingested anything for six hours". Catching those needs probes that judge across layers - application, network, storage, scheduler - against the business outcome, not against a passing assertion.
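
The crawler example above can be sketched as a probe that judges the outcome (data is still arriving) rather than a unit-level assertion. This is a minimal illustration only: the function name, the tuple return shape, and the six-hour threshold (borrowed from the example sentence) are assumptions, not the project's actual probe API.

```python
from datetime import datetime, timedelta, timezone


def ingestion_freshness(last_ingested_at, now, max_age=timedelta(hours=6)):
    """Hypothetical outcome probe: is the crawler still producing data?

    Returns a (status, detail) tuple. The threshold is an assumed default.
    """
    if last_ingested_at is None:
        return "fail", "no ingestion has ever been recorded"
    age = now - last_ingested_at
    if age > max_age:
        return "fail", f"last ingestion was {age} ago (limit {max_age})"
    return "pass", f"last ingestion was {age} ago"


# Fixed timestamps keep the example deterministic.
now = datetime(2026, 4, 1, 12, 0, tzinfo=timezone.utc)
stale = ingestion_freshness(now - timedelta(hours=7), now)
fresh = ingestion_freshness(now - timedelta(minutes=20), now)
print(stale)
print(fresh)
```

In this sketch every unit test in the crawler could pass while `stale` still reports a failure, which is the gap a system-level probe is meant to close.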

That sets the container. The next section sets the kind of judgment we make inside it.

The failure mode

In April 2026 we built an SQA tool to verify Metaintro was investor-ready before a fundraising campaign. The tool ran 13 stages of automated tests, returned scores in the 75-86 GREEN range, and we still weren't confident.

The case study found the L4 root cause:

"We built a measuring system before defining what success looks like to the audience that matters. The tool measures system health, but the problem is audience impression - and those are fundamentally different things."

A GREEN score told us the plumbing worked. It told us nothing about whether a first-time visitor would walk away thinking the product was valuable. Risk mitigation is necessary but insufficient; a product that doesn't crash but doesn't impress is the worst outcome.

What SQA is - and isn't

SQA is ongoing operational verification of whichever software systems and business workflows we point it at - the layer the case study identified as a real, ongoing need (Snappy was the first system wired up; more will follow):

"The SQA tool IS useful for ongoing operational monitoring (is the Chat Engine down? is auth broken?) … But operational monitoring ≠ investor readiness testing."

The "What the name means" section above sets the container (system, not code or workflow). This section narrows the kind of judgment we make inside that container - operational outcomes only:

| SQA's job | NOT SQA's job |
| --- | --- |
| "Is the API responding?" | "Is the AI's career advice good?" |
| "Are S3 / Mongo / CH reachable?" | "Are the first 3 jobs shown impressive?" |
| "Did the deploy roll out?" | "Will an investor invest after using it?" |
| "Are the CRON jobs running?" | "Does the conversation flow create pull?" |

Audience-impression questions belong in a separate tool (golden-path scenarios with manual or LLM-judged quality scoring). Conflating the two is the original drift. Don't add audience-quality probes to SQA - start a new project.

Note
Scope update (2026-05-28, ADR-0052). The line above has moved. SQA now also verifies value - graded, LLM-judged quality of a system's output - when that quality is written as a falsifiable claim in a contract (e.g. the metaintro-chat job-search relevancy score). The guardrail against meaningless GREEN scores survives, relocated from a blanket ban on quality scoring to the falsifier requirement: a score is only legitimate if a claim names what would refute it. What stays out of scope is ungrounded audience impression - quality judgments with no claim, no client, and no falsifier. The "is the AI's advice good?" examples above are out of scope only when written that way; as a contract claim with a rubric and falsifier, the relevancy of returned jobs is exactly what SQA verifies today.
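
The falsifier requirement in the note above can be sketched as a simple admission check. The `ValueClaim` shape, its field names, and the example values are hypothetical; the real contract format is whatever ADR-0052 and the contracts themselves define.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class ValueClaim:
    """Hypothetical shape of a quality claim (field names are illustrative)."""
    statement: str
    client: Optional[str]     # who the value is for
    rubric: Optional[str]     # how the LLM judge grades the output
    falsifier: Optional[str]  # what observation would refute the claim


def in_scope(claim: ValueClaim) -> bool:
    """Admit a graded score only if the claim names a client, a rubric,
    and what would refute it. Missing or blank fields mean out of scope."""
    return all(
        field is not None and field.strip() != ""
        for field in (claim.client, claim.rubric, claim.falsifier)
    )


# Example values are invented for illustration.
grounded = ValueClaim(
    statement="Jobs returned by metaintro-chat are relevant to the query",
    client="job seeker",
    rubric="LLM judge scores each returned job for relevance to the query",
    falsifier="relevancy score falls below the threshold named in the contract",
)
ungrounded = ValueClaim("Is the AI's career advice good?", None, None, None)

print(in_scope(grounded), in_scope(ungrounded))
```

Under these assumptions the same question about output quality is rejected as ungrounded impression and accepted once it is restated as a claim that can be refuted.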

Falsifiability

This problem statement is L4 if and only if:

  • (a) Adding more infrastructure checks to SQA never moves the needle on "will an investor invest?" - predicted in the case study, holds.

  • (b) The two concerns (operational health vs. audience impression) do not converge into one tool over time. If you find yourself adding "is the AI helpful?" to SQA, the drift is back.

Revisit this document if either prediction breaks.

How this shapes the design

Every architectural decision below cascades from the scope above:

| Decision | Why |
| --- | --- |
| Three layers (lib/components/systems) | Probes (under components/) are system checks, not experience checks. The split makes the boundary structural. - ADR-0001 · ADR-0009 |
| Result-envelope returns from probes | An ops probe says pass/warn/fail/error/skip. Quality scoring would lie about its precision here. - ADR-0007 · ADR-0012 |
| Numbered step IDs (1.1, 1.2) | Postmortem-friendly. The case study showed the value of stable references. - ADR-0005 |
| make gate LLM doc-sync hook | Prevents the next drift: docs claiming the tool does something it doesn't. |
| One folder per system under test, runner stays generic | Adding a system = adding a folder under src/systems/, never expanding the runner. - ADR-0008 |
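
Three of the decisions in the table (result envelopes, numbered step IDs, a generic runner) can be sketched together. Everything named here is hypothetical: the `Result` fields, the `run_system` function, and the two example probes are illustrations of the idea, not the shapes defined in ADR-0005, ADR-0007, ADR-0008, or ADR-0012.

```python
from dataclasses import dataclass
from enum import Enum
from typing import Callable, Dict, List


class Status(Enum):
    # The five outcomes named in the table; no graded score here.
    PASS = "pass"
    WARN = "warn"
    FAIL = "fail"
    ERROR = "error"
    SKIP = "skip"


@dataclass(frozen=True)
class Result:
    """Hypothetical result envelope: a stable step ID plus a coarse status."""
    step_id: str
    status: Status
    detail: str = ""


Probe = Callable[[], Status]


def run_system(steps: Dict[str, Probe]) -> List[Result]:
    """Generic runner: knows nothing about any particular system.
    A probe that raises is reported as ERROR rather than crashing the run."""
    results = []
    for step_id, probe in steps.items():
        try:
            results.append(Result(step_id, probe()))
        except Exception as exc:
            results.append(Result(step_id, Status.ERROR, repr(exc)))
    return results


# Stand-ins for probes a system folder might supply.
def api_responding() -> Status:
    return Status.PASS


def cron_running() -> Status:
    raise TimeoutError("scheduler did not answer")


example_system = {"1.1": api_responding, "1.2": cron_running}
results = run_system(example_system)
for r in results:
    print(r.step_id, r.status.value, r.detail)
```

In this sketch, adding a system means supplying another mapping of step IDs to probes; `run_system` itself does not change, and a postmortem can refer to "step 1.2" without ambiguity.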

The remaining ADRs (env validation, runtime targets, log-format axis, component-signature shape) cover plumbing that follows from the same scope. See the ADR index for the full list.


Side note - strategic framing (Natan, 2026-05-28)

Not yet ratified into VALUE.md or an ADR. Parked here to revisit.

SQA reframed for metaintro: the product/business-manager's compass. It takes the hard problems we're solving (metaintro-chat, the Job Search Index, etc.), produces metrics, runs benchmarks against the market (LinkedIn / Indeed / Google), and gives the team a verdict they can steer by - turning "is our chat any good?" into "here's where we stand vs the competition, here's the trend, here's the gap costing us seekers."

The bet behind investing here: in a near future where LLMs write code almost perfectly, writing code stops being the differentiator. What matters is product judgment - understanding your clients, knowing how good your product is versus the market, measuring system and client-experience performance overall. Software that helps you manage that (like SQA) is worth more than software that helps you write code. SQA is a bet on the measurement-and-judgment layer over the code-authoring layer.

Implication if adopted: primary reader shifts from eng → PM/business decision-maker; benchmarks become the spine, not a feature; scope widens from one SUT to the whole product portfolio under one consistent lens.
