SQA Cockpit
The contract
kai’s promise: store what matters and recall it on demand — surface the right memory for the ask, ranked, with no relevant memory missed.
The claim SQA tests (claim C0): the jobs the metaintro-chat search engine returned in the chat thread are relevant to what the user asked for.
This run tested the system against its contract, clause by clause. A single run can only witness some clauses; the rest stay UNKNOWN — never a faked pass.
1 pass · 0 fail · 7 unknown
  • C0 MUST
    Headline promise
    relevancy score = 67/100 (pass-band ≥ 60)
    PASS
  • C1 MUST
    User can sign in
    no 'login' step in this run
    UNKNOWN
  • C2 MUST
    User can open a new thread
    no 'open-thread' step in this run
    UNKNOWN
  • C3 SHOULD
    Onboarding gate completes
    no 'onboarding' step in this run
    UNKNOWN
  • C4 SHOULD
    Filters from onboarding don't bias the query
    no 'clear-filters' step in this run
    UNKNOWN
  • C5 MUST
    User-typed query is what the engine sees
    no 'submit-query' step in this run
    UNKNOWN
  • C10 SHOULD
    Score holds across reruns
    needs a sweep: a single run cannot witness this clause
    UNKNOWN
  • C12 MAY
    Run completes within budget
    needs a sweep: a single run cannot witness this clause
    UNKNOWN
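
The clause-by-clause rule above (a clause is PASS or FAIL only when the run executed the step that can witness it, and UNKNOWN otherwise) can be sketched as follows. This is a minimal illustration, not SQA's implementation: the `Clause` type, the `witness_step` field, and the step name `relevancy` for C0 are assumptions made for the example; the other step names, the levels, and the pass-band of 60 come from the list above.

```python
from collections import Counter
from dataclasses import dataclass
from typing import Dict, List, Optional


@dataclass(frozen=True)
class Clause:
    cid: str
    level: str                   # "MUST", "SHOULD", or "MAY"
    witness_step: Optional[str]  # step that can witness the clause; None = needs a sweep


def clause_status(clause: Clause, step_results: Dict[str, bool]) -> str:
    """Return PASS/FAIL only if this run executed the witnessing step."""
    if clause.witness_step is None or clause.witness_step not in step_results:
        return "UNKNOWN"  # unwitnessed clauses are never reported as a pass
    return "PASS" if step_results[clause.witness_step] else "FAIL"


def tally(clauses: List[Clause], step_results: Dict[str, bool]) -> Counter:
    """Count clause statuses for one run."""
    return Counter(clause_status(c, step_results) for c in clauses)


CLAUSES = [
    Clause("C0", "MUST", "relevancy"),       # hypothetical step name
    Clause("C1", "MUST", "login"),
    Clause("C2", "MUST", "open-thread"),
    Clause("C3", "SHOULD", "onboarding"),
    Clause("C4", "SHOULD", "clear-filters"),
    Clause("C5", "MUST", "submit-query"),
    Clause("C10", "SHOULD", None),           # needs a sweep
    Clause("C12", "MAY", None),              # needs a sweep
]

PASS_BAND = 60  # from the C0 line: pass-band >= 60
step_results = {"relevancy": 67 >= PASS_BAND}  # the only step this run witnessed

counts = tally(CLAUSES, step_results)
print(f"{counts['PASS']} pass · {counts['FAIL']} fail · {counts['UNKNOWN']} unknown")
# prints: 1 pass · 0 fail · 7 unknown
```

Treating a missing step as UNKNOWN rather than PASS keeps the tally honest: a clause moves out of UNKNOWN only when a later run or sweep actually exercises it.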
TL;DR · 30-second primer
  • KAI (SUT) ran 1 run on profile longmem-phase-a.
  • Result: Memory Recall Index 67/100. Partial recall; see “Why this verdict” (each gap maps to a claim in the Contract).

1 ·THE VERDICT

the answer in one number
30-day MRI history
KAI · MEMORY RECALL · RUN #19

Partial recall.

Run #19 of kai on profile longmem-phase-a for the query "MRI sweep — memory-recall". Memory Recall Index 67/100.

Verdict FAIL: C1 Recall dropped to 0.68 on multi-hop queries — chains broke.

AI synthesis · openai/gpt-4o-mini

The system did not perform its job effectively, resulting in a failure with a Memory Recall Index (MRI) score of 67 out of 100. The primary issue was that recall dropped to 0.68 on multi-hop queries, indicating that the chains of information broke down. Additionally, there was a warning regarding abstention precision, as the system answered questions when it should have declined. Overall, the performance was inadequate, leading to a failed outcome.

2 ·WHY THIS VERDICT

ranked by severity
HARD

Recall dropped to 0.68 on multi-hop queries — chains broke

Expected
Multi-hop recall@10 ≥ 0.78
Observed
Multi-hop recall@10 collapsed to 0.54; the reranker rescore change pruned bridge documents before the second hop could resolve
Why it matters
Broken chains mean kai silently answers from a single document instead of synthesizing across the decision → supersession → session graph. Per VALUE.md, kai must never silently fill the gap with lower-level recall.
Recommended action · 2 sprints
Revert reranker rescore weight; raise prefetch K to 80 for multi-hop intents; gate the change behind the Phase-A benchmark.
✓ verified · judge: recall@k + multi-hop + rerank
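
For readers unfamiliar with the metric, recall@10 is commonly computed as the share of a query's relevant documents that appear in the top 10 retrieved results, averaged over queries. The report does not say how its judge computes the multi-hop figure, so the sketch below is one plausible reading. The document ids and query data are made up for illustration; only the 0.78 floor comes from the finding above.

```python
from typing import Iterable, List, Sequence, Set, Tuple

MULTI_HOP_FLOOR = 0.78  # expected multi-hop recall@10, from the finding above


def recall_at_k(retrieved: Sequence[str], relevant: Iterable[str], k: int = 10) -> float:
    """Share of relevant documents that appear in the top-k retrieved list."""
    relevant_set: Set[str] = set(relevant)
    if not relevant_set:
        raise ValueError("recall is undefined when there are no relevant documents")
    hits = relevant_set & set(retrieved[:k])
    return len(hits) / len(relevant_set)


def mean_recall_at_k(queries: List[Tuple[Sequence[str], Iterable[str]]], k: int = 10) -> float:
    """Average recall@k over (retrieved, relevant) pairs."""
    scores = [recall_at_k(retrieved, relevant, k) for retrieved, relevant in queries]
    return sum(scores) / len(scores)


# Hypothetical two-hop queries: each needs both of its documents to resolve the chain.
queries = [
    (["d1", "d2", "d3"], {"d1", "d2"}),  # both hops retrieved
    (["d4", "d9"], {"d4", "d5"}),        # bridge document d5 was pruned
]

score = mean_recall_at_k(queries, k=10)
print(f"multi-hop recall@10 = {score:.2f}, meets floor: {score >= MULTI_HOP_FLOOR}")
# prints: multi-hop recall@10 = 0.75, meets floor: False
```

The second query shows the failure mode described under "Observed": once a bridge document is dropped before the second hop, the chain cannot resolve even though the first hop was retrieved.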
SOFT

Abstention precision fell — kai answered when it should have declined

Expected
Abstention precision ≥ 0.80 on out-of-corpus queries
Observed
Precision = 0.66; kai confabulated on 1-in-3 out-of-corpus probes
Why it matters
A memory layer that answers when it shouldn't is worse than one that abstains — confident wrong recall poisons downstream agents.
Recommended action · 1 sprint
Restore the top-score < 0.5 abstention gate that the rescore change bypassed.
✓ verified · judge: abstention
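
The abstention gate named in the recommended action (decline when the top retrieval score is below 0.5) can be sketched as below. The report does not define how "abstention precision" is computed; this sketch assumes it is the share of out-of-corpus probes on which the system declined, which is consistent with the "1-in-3" description. The function names and the probe scores are hypothetical.

```python
from typing import List, Sequence

ABSTAIN_BELOW = 0.5  # gate threshold from the recommended action above


def should_abstain(scores: Sequence[float], threshold: float = ABSTAIN_BELOW) -> bool:
    """Decline to answer when nothing was retrieved or the best score is under the threshold."""
    return len(scores) == 0 or max(scores) < threshold


def declined_share(out_of_corpus_probes: List[Sequence[float]],
                   threshold: float = ABSTAIN_BELOW) -> float:
    """Share of out-of-corpus probes on which the gate declines (assumed metric)."""
    declined = sum(should_abstain(scores, threshold) for scores in out_of_corpus_probes)
    return declined / len(out_of_corpus_probes)


# Hypothetical retrieval scores for three out-of-corpus probes.
probes = [
    [0.20, 0.10],  # declined
    [0.45],        # declined
    [0.70, 0.30],  # answered: a confident but wrong recall slips through
]

share = declined_share(probes)
print(f"declined on {share:.2f} of out-of-corpus probes")
# prints: declined on 0.67 of out-of-corpus probes
```

A score exactly at the threshold is answered, not declined, because the gate is a strict "less than" comparison; whether the original gate was strict is not stated in the report.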

3 ·THE STORY

what went in, what came out

Input

what the probe sent in
Query:
Skills: (no skill inferred) · ESCO
Industry: Computer Systems Design · NAICS 541512
Location: United States · ISO US
Education: Bachelor or equivalent · ISCED 6

5 ·SESSION RECORDING

watch what the probe saw

No recording available for Metaintro.

6 ·RUN MECHANICS

provenance & reproducibility
Duration: 51.87s
Steps: 3
Judges:
Commit: demo-seed
Started: 2026-05-14 13:00 UTC
Trace:

Evidence by step

every artifact, link, excerpt, row, metric & recording — grouped by the step that produced it

No evidence recorded for this run.

Evidence integrity

each artifact is SHA-256 hashed at capture — proof it is unmodified

No integrity manifest recorded for this run.
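
To illustrate what an integrity manifest of this kind could look like, here is a minimal sketch that hashes each artifact with SHA-256 at capture time and later reports any artifact whose bytes no longer match. The manifest layout (artifact name mapped to hex digest) and the function names are assumptions; the report only states that artifacts are SHA-256 hashed at capture.

```python
import hashlib
from typing import Dict, List


def sha256_hex(data: bytes) -> str:
    """Hex-encoded SHA-256 digest of the artifact bytes."""
    return hashlib.sha256(data).hexdigest()


def build_manifest(artifacts: Dict[str, bytes]) -> Dict[str, str]:
    """Record one digest per artifact at capture time."""
    return {name: sha256_hex(blob) for name, blob in artifacts.items()}


def find_modified(artifacts: Dict[str, bytes], manifest: Dict[str, str]) -> List[str]:
    """Return names of artifacts that are missing or whose digest changed."""
    return sorted(
        name
        for name, digest in manifest.items()
        if name not in artifacts or sha256_hex(artifacts[name]) != digest
    )


# Hypothetical artifacts captured during a run.
captured = {"step-1/response.json": b'{"score": 67}', "step-2/log.txt": b"ok\n"}
manifest = build_manifest(captured)

tampered = dict(captured)
tampered["step-1/response.json"] = b'{"score": 99}'

print(find_modified(captured, manifest))  # prints: []
print(find_modified(tampered, manifest))  # prints: ['step-1/response.json']
```

A matching digest shows the artifact is byte-identical to what was captured, provided the manifest itself is stored where it cannot be edited alongside the artifacts.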

7 ·SYSTEM ANATOMY

which component drove the verdict

Verdict driven by Unattributed. The ringed, pulsing nodes are the components SQA attributes the failure to.
