kai’s promise: store what matters and recall it on demand — surface the right memory for the ask, ranked, with no relevant memory missed.
The claim SQA tests: the jobs the metaintro-chat search engine returned, in the chat thread, are relevant to what the user asked for (claim C0).
This run tested the system against its contract, clause by clause. A single run can only witness some clauses; the rest stay UNKNOWN — never a faked pass.
1 pass · 0 fail · 7 unknown
C0 MUST · Headline promise · relevancy score = 83/100 (pass-band ≥ 60) · PASS
C1 MUST · User can sign in · no 'login' step in this run · UNKNOWN
C2 MUST · User can open a new thread · no 'open-thread' step in this run · UNKNOWN
C3 SHOULD · Onboarding gate completes · no 'onboarding' step in this run · UNKNOWN
C4 SHOULD · Filters from onboarding don't bias the query · no 'clear-filters' step in this run · UNKNOWN
C5 MUST · User-typed query is what the engine sees · no 'submit-query' step in this run · UNKNOWN
C10 SHOULD · Score holds across reruns · needs a sweep; a single run cannot witness this clause · UNKNOWN
C12 MAY · Run completes within budget · needs a sweep; a single run cannot witness this clause · UNKNOWN
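The three-state tally above (a clause is PASS or FAIL only when this run contains the step that witnesses it, and UNKNOWN otherwise) can be sketched as follows. This is an illustrative sketch, not the harness's actual code: the `Clause` type, the `judge` and `tally` helpers, and the step name `relevancy` for C0 are assumptions; the other step names, the clause levels, and the pass-band come from the table.

```python
from collections import Counter
from dataclasses import dataclass
from typing import Dict, List, Optional

PASS_BAND = 60  # from the report: pass-band >= 60


@dataclass
class Clause:
    clause_id: str
    level: str            # MUST / SHOULD / MAY
    step: Optional[str]   # step that witnesses the clause; None means "needs a sweep"


def judge(clause: Clause, observed: Dict[str, bool]) -> str:
    """Return PASS/FAIL only if the witnessing step ran; never fake a pass."""
    if clause.step is None or clause.step not in observed:
        return "UNKNOWN"
    return "PASS" if observed[clause.step] else "FAIL"


def tally(clauses: List[Clause], observed: Dict[str, bool]) -> Counter:
    """Count verdicts across the whole contract."""
    return Counter(judge(c, observed) for c in clauses)


CLAUSES = [
    Clause("C0", "MUST", "relevancy"),      # step name is hypothetical
    Clause("C1", "MUST", "login"),
    Clause("C2", "MUST", "open-thread"),
    Clause("C3", "SHOULD", "onboarding"),
    Clause("C4", "SHOULD", "clear-filters"),
    Clause("C5", "MUST", "submit-query"),
    Clause("C10", "SHOULD", None),          # only a sweep can witness this
    Clause("C12", "MAY", None),             # only a sweep can witness this
]

# This run witnessed a single step: the relevancy score of 83.
observed = {"relevancy": 83 >= PASS_BAND}
counts = tally(CLAUSES, observed)
print(f"{counts['PASS']} pass · {counts['FAIL']} fail · {counts['UNKNOWN']} unknown")
# prints "1 pass · 0 fail · 7 unknown"
```

Treating a missing step as UNKNOWN rather than PASS keeps the summary honest: a clause only changes state when a run actually exercises it.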
TL;DR · 30-second primer
·KAI (SUT) ran 1 run on profile longmem-phase-a.
·Result: Memory Recall Index 83/100. Strong recall; see “Why this verdict” (each gap maps to a claim in the Contract).
1 ·THE VERDICT
the answer in one number
30-day MRI history
KAI · MEMORY RECALL · RUN #15
Strong recall.
Run #15 of kai on profile longmem-phase-a for the query "MRI sweep — memory-recall". Memory Recall Index 83/100.
Verdict PASS: every step completed cleanly; nothing pulled the verdict down.
AI synthesis · openai/gpt-4o-mini
The system completed the memory-recall run with a Memory Recall Index (MRI) of 83 out of 100. Multi-hop recall trailed single-hop recall by 14 percentage points, which leaves room for improvement. The run took 41.2 seconds.
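The 14 percentage point gap quoted above follows from the recall@10 figures this report lists under "Why this verdict" (single-hop 0.88, multi-hop 0.74), checked against the expected tolerance of 8pp. A minimal sketch of that arithmetic, assuming the standard definition of recall@k (share of relevant items found in the top k results); the function names are illustrative and not part of the harness:

```python
from typing import Iterable, Sequence


def recall_at_k(ranked_ids: Sequence[str], relevant_ids: Iterable[str], k: int = 10) -> float:
    """Fraction of the relevant items that appear in the top-k results."""
    relevant = set(relevant_ids)
    if not relevant:
        return 0.0
    return len(set(ranked_ids[:k]) & relevant) / len(relevant)


def recall_gap_pp(single_hop: float, multi_hop: float) -> float:
    """Gap between two recall values, in percentage points."""
    return round((single_hop - multi_hop) * 100, 1)


def within_tolerance(single_hop: float, multi_hop: float, tolerance_pp: float = 8.0) -> bool:
    """True when multi-hop recall is within tolerance_pp of single-hop recall."""
    return recall_gap_pp(single_hop, multi_hop) <= tolerance_pp


gap = recall_gap_pp(0.88, 0.74)        # figures reported for this run
ok = within_tolerance(0.88, 0.74)      # expected: within 8pp
print(f"gap = {gap}pp, within tolerance: {ok}")
# prints "gap = 14.0pp, within tolerance: False"
```

The gap exceeds the tolerance by 6pp, which is why the finding is raised even though the headline verdict is PASS.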
2 ·WHY THIS VERDICT
ranked by severity
SOFT
Multi-hop recall trails single-hop by 14pp
Expected
Multi-hop recall@10 within 8pp of single-hop
Observed
Single-hop recall@10 = 0.88, multi-hop = 0.74 on the LongMemEval-style corpus
Why it matters
Multi-hop is where institutional memory earns its keep — chaining a decision to its superseding decision and the session that drove it.
Recommended action · 1 sprint
Raise hybrid prefetch K for multi-hop intents; tune RRF k-constant.
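For readers unfamiliar with the two knobs named in the recommendation, here is a generic sketch of reciprocal rank fusion (RRF) over two retriever rankings. It is not kai's implementation, which this report does not show. The `prefetch_k` parameter, the function name, and the document IDs are assumptions for illustration; `k=60` is a commonly used default for the RRF constant, not a value taken from this run.

```python
from collections import defaultdict
from typing import Dict, List, Optional, Sequence


def rrf_fuse(
    rankings: Sequence[Sequence[str]],
    k: int = 60,
    prefetch_k: Optional[int] = None,
) -> List[str]:
    """Fuse ranked lists: score(d) = sum over lists of 1 / (k + rank of d).

    prefetch_k (hypothetical knob) caps how many candidates each retriever
    contributes before fusion; None keeps every candidate.
    """
    scores: Dict[str, float] = defaultdict(float)
    for ranking in rankings:
        candidates = ranking if prefetch_k is None else ranking[:prefetch_k]
        for rank, doc_id in enumerate(candidates, start=1):
            scores[doc_id] += 1.0 / (k + rank)
    # Highest fused score first; ties broken by ID for determinism.
    return sorted(scores, key=lambda d: (-scores[d], d))


lexical = ["d1", "d2", "d3"]   # e.g. keyword retriever output (made-up IDs)
dense = ["d3", "d1", "d4"]     # e.g. embedding retriever output (made-up IDs)

print(rrf_fuse([lexical, dense]))                 # ['d1', 'd3', 'd2', 'd4']
print(rrf_fuse([lexical, dense], prefetch_k=2))   # ['d1', 'd3', 'd2']
```

A smaller prefetch K drops candidates that only appear deep in one list (here `d4`), which is one plausible reason multi-hop queries lose evidence. A larger RRF constant flattens the difference between ranks, so agreement across retrievers counts for more than a single top position.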