The contract
kai’s promise: store what matters and recall it on demand, surfacing the right memory for each request, ranked, with no relevant memory missed.
The claim SQA tests (claim C0): the jobs the metaintro-chat search engine returned in the chat thread are relevant to what the user asked for.
This run tested the system against its contract, clause by clause. A single run can witness only some clauses; the rest stay UNKNOWN instead of being reported as a pass.
1 pass · 0 fail · 7 unknown
- C0 · MUST · PASS · Headline promise: relevancy score = 75/100 (pass band ≥ 60)
- C1 · MUST · UNKNOWN · User can sign in: no 'login' step in this run
- C2 · MUST · UNKNOWN · User can open a new thread: no 'open-thread' step in this run
- C3 · SHOULD · UNKNOWN · Onboarding gate completes: no 'onboarding' step in this run
- C4 · SHOULD · UNKNOWN · Filters from onboarding don't bias the query: no 'clear-filters' step in this run
- C5 · MUST · UNKNOWN · User-typed query is what the engine sees: no 'submit-query' step in this run
- C10 · SHOULD · UNKNOWN · Score holds across reruns: needs a sweep; a single run cannot witness this clause
- C12 · MAY · UNKNOWN · Run completes within budget: needs a sweep; a single run cannot witness this clause
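The clause-by-clause evaluation can be sketched as a tri-state check: a clause is UNKNOWN unless some step in the run can witness it, and only then does it PASS or FAIL. This is a minimal sketch, not the SQA harness itself. The `Clause` type, the witness-step mapping, the `memory-recall` step name used for C0, and the `step_ok` predicate are hypothetical; the clause IDs, levels, step names, the 75/100 score and the ≥ 60 pass band come from the list.

```python
from dataclasses import dataclass
from enum import Enum
from typing import Callable, Dict, List, Optional


class Verdict(Enum):
    PASS = "pass"
    FAIL = "fail"
    UNKNOWN = "unknown"


@dataclass
class Clause:
    cid: str
    level: str                       # MUST / SHOULD / MAY
    title: str
    witness_step: Optional[str]      # step that can witness the clause; None = needs a sweep
    check: Callable[[dict], bool]    # predicate over that step's observations


def evaluate(clause: Clause, steps: Dict[str, dict]) -> Verdict:
    # No witnessing step in this run: report UNKNOWN, never a default PASS.
    if clause.witness_step is None or clause.witness_step not in steps:
        return Verdict.UNKNOWN
    return Verdict.PASS if clause.check(steps[clause.witness_step]) else Verdict.FAIL


def tally(verdicts: List[Verdict]) -> str:
    counts = {v: 0 for v in Verdict}
    for v in verdicts:
        counts[v] += 1
    return (f"{counts[Verdict.PASS]} pass · {counts[Verdict.FAIL]} fail · "
            f"{counts[Verdict.UNKNOWN]} unknown")


def step_ok(obs: dict) -> bool:
    # Hypothetical: assumes a step records an "ok" flag in its observations.
    return bool(obs.get("ok", False))


clauses = [
    Clause("C0", "MUST", "Headline promise", "memory-recall", lambda obs: obs["score"] >= 60),
    Clause("C1", "MUST", "User can sign in", "login", step_ok),
    Clause("C2", "MUST", "User can open a new thread", "open-thread", step_ok),
    Clause("C3", "SHOULD", "Onboarding gate completes", "onboarding", step_ok),
    Clause("C4", "SHOULD", "Filters from onboarding don't bias the query", "clear-filters", step_ok),
    Clause("C5", "MUST", "User-typed query is what the engine sees", "submit-query", step_ok),
    Clause("C10", "SHOULD", "Score holds across reruns", None, step_ok),
    Clause("C12", "MAY", "Run completes within budget", None, step_ok),
]

# This run only produced the scoring step (step name assumed).
run_steps = {"memory-recall": {"score": 75}}
verdicts = [evaluate(c, run_steps) for c in clauses]
print(tally(verdicts))  # prints: 1 pass · 0 fail · 7 unknown
```

Reporting UNKNOWN for an absent step, rather than defaulting to PASS, is what keeps the tally at 1 pass and 7 unknown instead of 8 passes.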
TL;DR · 30-second primer
- KAI (system under test, SUT) ran 1 run on profile longmem-phase-a.
- Result: Memory Recall Index 75/100 (strong recall). See “Why this verdict”; each gap maps to a claim in the Contract.
1 · THE VERDICT
the answer in one number
KAI · MEMORY RECALL · RUN #26
Strong recall.
Run #26 of kai on profile longmem-phase-a for the query "MRI sweep — memory-recall (real, recall@5)". Memory Recall Index 75/100.
Verdict PASS: every step completed cleanly; nothing pulled the verdict down.
AI synthesis · openai/gpt-4o-mini
The system completed its task and passed, with a Memory Recall Index of 75 out of 100 in the memory-recall scenario. The outcome was driven by the memory-recall step, in which it correctly recalled 3 of 4 items. However, the snappy deployment step failed because no matching memory appeared in the top-k results.
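The 3-of-4 result lines up with the 75/100 index if the Memory Recall Index is read as a per-probe hit rate at k = 5 (the query label says recall@5), scaled to 100. The report does not define the index, so treat that reading as an assumption. The probe and memory IDs below are invented; the fourth probe stands in for a miss like the snappy deployment step, where the expected memory falls just outside the top 5.

```python
from typing import List, Sequence, Tuple

# Each probe pairs the ranked memory IDs the engine returned with the one memory it should surface.
Probe = Tuple[Sequence[str], str]


def hit_at_k(ranked_ids: Sequence[str], expected_id: str, k: int = 5) -> bool:
    """True if the expected memory appears among the top-k retrieved results."""
    return expected_id in ranked_ids[:k]


def recall_index(probes: List[Probe], k: int = 5) -> int:
    """Share of probes with a top-k hit, scaled to 0..100 (assumed definition)."""
    if not probes:
        raise ValueError("need at least one probe")
    hits = sum(hit_at_k(ranked, expected, k) for ranked, expected in probes)
    return round(100 * hits / len(probes))


# Hypothetical probes: the expected memory ranks 1st, 2nd, 5th, then 6th (a miss at k = 5).
probes: List[Probe] = [
    (["m1", "m7", "m3"], "m1"),
    (["m4", "m2", "m9"], "m2"),
    (["m5", "m6", "m8", "m10", "m11"], "m11"),
    (["m12", "m13", "m14", "m15", "m16", "m17"], "m17"),
]
print(recall_index(probes))  # prints: 75
```

With exactly one expected memory per probe, this hit rate coincides with recall@k; probes with several relevant memories would instead divide the relevant memories retrieved by the total relevant.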
2 · WHY THIS VERDICT
ranked by severity
3 · THE STORY
what went in, what came out
Input
what the probe sent in
Query
5 · SESSION RECORDING
watch what the probe saw
No recording available for Metaintro.
6 · RUN MECHANICS
provenance & reproducibility
- Duration: 4.00 s
- Steps: 5
- Judges: none
- Commit: kai-recall
- Started: 2026-05-28 20:24 UTC
Trace
Evidence by step
every artifact, link, excerpt, row, metric & recording, grouped by the step that produced it
No evidence recorded for this run.
Evidence integrity
each artifact is SHA-256 hashed at capture, so any later modification can be detected
No integrity manifest recorded for this run.
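Capture-time hashing can be sketched with Python's standard `hashlib`: record a SHA-256 digest per artifact when it is captured, then recompute and compare later. This is an illustrative sketch, assuming artifacts are held as bytes keyed by a name; the artifact names and contents are hypothetical, and this run recorded no manifest at all.

```python
import hashlib
from typing import Dict, List


def digest(data: bytes) -> str:
    return hashlib.sha256(data).hexdigest()


def build_manifest(artifacts: Dict[str, bytes]) -> Dict[str, str]:
    """Record a SHA-256 digest per artifact at capture time."""
    return {name: digest(data) for name, data in artifacts.items()}


def verify(manifest: Dict[str, str], artifacts: Dict[str, bytes]) -> List[str]:
    """Names of artifacts that are missing or whose bytes changed since capture."""
    return [name for name, expected in manifest.items()
            if name not in artifacts or digest(artifacts[name]) != expected]


# Hypothetical artifacts captured by two steps.
captured = {
    "step-1/query.txt": b"MRI sweep - memory-recall (real, recall@5)",
    "step-2/top5.json": b'["m4", "m2", "m9"]',
}
manifest = build_manifest(captured)
print(verify(manifest, captured))  # prints: []

tampered = {**captured, "step-2/top5.json": b'["m4"]'}
print(verify(manifest, tampered))  # prints: ['step-2/top5.json']
```

A matching digest shows only that the bytes are unchanged relative to the manifest; the manifest itself must be stored where it cannot be edited alongside the artifacts.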
7 · SYSTEM ANATOMY
which component drove the verdict
Every component held; no failure was attributed.