- C0 MUST PASS: Headline promise (relevancy score = 67/100; pass-band ≥ 60)
- C1 MUST UNKNOWN: User can sign in (no 'login' step in this run)
- C2 MUST UNKNOWN: User can open a new thread (no 'open-thread' step in this run)
- C3 SHOULD UNKNOWN: Onboarding gate completes (no 'onboarding' step in this run)
- C4 SHOULD UNKNOWN: Filters from onboarding don't bias the query (no 'clear-filters' step in this run)
- C5 MUST UNKNOWN: User-typed query is what the engine sees (no 'submit-query' step in this run)
- C10 SHOULD UNKNOWN: Score holds across reruns (needs a sweep; a single run cannot witness this clause)
- C12 MAY UNKNOWN: Run completes within budget (needs a sweep; a single run cannot witness this clause)
- KAI (SUT) ran 1 run on profile longmem-phase-a.
- Result: Memory Recall Index 67/100. Partial recall; see “Why this verdict” (each gap maps to a claim in the Contract).
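The clause statuses in the contract list can be read as a small decision rule: a score clause passes when the score is inside its pass-band, and a step clause is UNKNOWN when the run contains no step that could witness it. The following is a minimal sketch of that reading, not SQA's actual logic. The function names, the `step_ok` flag, and the assumption that a score below the band yields FAIL are all hypothetical.

```python
PASS_BAND = 60  # taken from the C0 line: pass-band ≥ 60


def score_status(score, pass_band=PASS_BAND):
    """PASS when the score is inside the pass-band (FAIL below it is an assumption)."""
    return "PASS" if score >= pass_band else "FAIL"


def step_clause_status(required_step, steps_in_run, step_ok=None):
    """UNKNOWN when the run has no step that could witness the clause."""
    if required_step not in steps_in_run:
        return "UNKNOWN"
    # Hypothetical: step_ok is the recorded outcome of the witnessing step.
    return "PASS" if step_ok else "FAIL"


# This run recorded none of the named steps, so C1 to C5 stay UNKNOWN.
steps_in_run = set()
print(score_status(67))                           # C0
print(step_clause_status("login", steps_in_run))  # C1
```

Under this reading, an UNKNOWN clause is neither a pass nor a fail; it only says the run produced nothing that could confirm or refute the claim.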
1 ·THE VERDICT
the answer in one number
Partial recall.
Run #19 of kai on profile longmem-phase-a for the query "MRI sweep — memory-recall". Memory Recall Index 67/100.
Verdict FAIL: C1 Recall dropped to 0.68 on multi-hop queries — chains broke.
The run failed with a Memory Recall Index (MRI) of 67/100. The primary cause was recall dropping to 0.68 on multi-hop queries, which indicates that the chains of information broke. A secondary warning concerns abstention precision: the system answered questions it should have declined.
2 ·WHY THIS VERDICT
ranked by severity
Abstention precision fell: kai answered when it should have declined
3 ·THE STORY
what went in, what came out
Input
what the probe sent in
5 ·SESSION RECORDING
watch what the probe saw
Session recording
watch what the probe actually saw
6 ·RUN MECHANICS
provenance & reproducibility
demo-seed
Evidence by step
every artifact, link, excerpt, row, metric & recording, grouped by the step that produced it
No evidence recorded for this run.
Evidence integrity
each artifact is SHA-256 hashed at capture, as proof it is unmodified
No integrity manifest recorded for this run.
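For readers unfamiliar with hash-at-capture integrity checks, here is a minimal sketch of how such a manifest could be built and re-verified, assuming artifacts are files on disk. The function names and the path-to-digest manifest shape are hypothetical and do not describe SQA's actual manifest format.

```python
import hashlib
from pathlib import Path


def sha256_of(path, chunk_size=65536):
    """Return the hex SHA-256 digest of a file, reading it in chunks."""
    digest = hashlib.sha256()
    with Path(path).open("rb") as fh:
        for chunk in iter(lambda: fh.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def build_manifest(paths):
    """Record one digest per artifact at capture time (hypothetical shape)."""
    return {str(p): sha256_of(p) for p in paths}


def verify_manifest(manifest):
    """Return the artifacts whose current digest differs from the recorded one."""
    return [p for p, recorded in manifest.items() if sha256_of(p) != recorded]
```

An empty list from `verify_manifest` means every artifact still matches its capture-time digest; any path in the list has changed since capture.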
7 ·SYSTEM ANATOMY
which component drove the verdict
Verdict driven by Unattributed. The ringed, pulsing nodes are the components SQA attributes the failure to.