
The contract

kai’s promise: store what matters and recall it on demand — surface the right memory for the ask, ranked, with no relevant memory missed.

The claim SQA tests (claim C0): the jobs the metaintro-chat search engine returned in the chat thread are relevant to what the user asked for.

This run tested the system against its contract, clause by clause. A single run can only witness some clauses; the rest stay UNKNOWN — never a faked pass.

1 pass · 0 fail · 7 unknown

C0 · MUST · Headline promise · relevancy score = 67/100 (pass-band ≥ 60) · PASS
C1 · MUST · User can sign in · no 'login' step in this run · UNKNOWN
C2 · MUST · User can open a new thread · no 'open-thread' step in this run · UNKNOWN
C3 · SHOULD · Onboarding gate completes · no 'onboarding' step in this run · UNKNOWN
C4 · SHOULD · Filters from onboarding don't bias the query · no 'clear-filters' step in this run · UNKNOWN
C5 · MUST · User-typed query is what the engine sees · no 'submit-query' step in this run · UNKNOWN
C10 · SHOULD · Score holds across reruns · needs a sweep: a single run cannot witness this clause · UNKNOWN
C12 · MAY · Run completes within budget · needs a sweep: a single run cannot witness this clause · UNKNOWN
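
The clause-by-clause rule above (a clause is only judged if the run contains the step that witnesses it, otherwise it stays UNKNOWN) can be sketched as follows. This is a minimal illustration, not SQA's actual implementation: the `Clause` type, the `evaluate` and `tally` functions, and the `relevancy` step name are all hypothetical, and the step names for C1 to C5 are taken from the reasons listed in the table.

```python
from collections import Counter
from dataclasses import dataclass
from enum import Enum
from typing import Optional


class Status(Enum):
    PASS = "pass"
    FAIL = "fail"
    UNKNOWN = "unknown"


@dataclass(frozen=True)
class Clause:
    clause_id: str                 # e.g. "C0"
    level: str                     # "MUST", "SHOULD" or "MAY"
    required_step: Optional[str]   # step that witnesses the clause; None = needs a sweep


def evaluate(clause, observed):
    """Return (status, reason); `observed` maps step name -> bool outcome."""
    if clause.required_step is None:
        return Status.UNKNOWN, "needs a sweep"
    if clause.required_step not in observed:
        # Never a faked pass: a missing step yields UNKNOWN, not PASS.
        return Status.UNKNOWN, f"no '{clause.required_step}' step in this run"
    ok = observed[clause.required_step]
    return (Status.PASS if ok else Status.FAIL), "witnessed"


def tally(clauses, observed):
    counts = Counter(evaluate(c, observed)[0] for c in clauses)
    return (f"{counts[Status.PASS]} pass · {counts[Status.FAIL]} fail · "
            f"{counts[Status.UNKNOWN]} unknown")


# Hypothetical encoding of the contract shown in this report.
CLAUSES = [
    Clause("C0", "MUST", "relevancy"),
    Clause("C1", "MUST", "login"),
    Clause("C2", "MUST", "open-thread"),
    Clause("C3", "SHOULD", "onboarding"),
    Clause("C4", "SHOULD", "clear-filters"),
    Clause("C5", "MUST", "submit-query"),
    Clause("C10", "SHOULD", None),
    Clause("C12", "MAY", None),
]

# The only step this run witnessed: score 67 against a pass-band of >= 60.
observed = {"relevancy": 67 >= 60}
print(tally(CLAUSES, observed))
```

Keeping UNKNOWN as a third state, rather than defaulting to PASS or FAIL, is what lets a single run report honestly on a contract it can only partly witness.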

TL;DR · 30-second primer

·KAI (SUT) ran 1 run on profile longmem-phase-a.
·Result: Memory Recall Index 67/100. Partial recall; see “Why this verdict” (each gap maps to a claim in the Contract).

1 ·THE VERDICT

the answer in one number

30-day MRI history

KAI · MEMORY RECALL · RUN #19

Partial recall.

Run #19 of kai on profile longmem-phase-a for the query "MRI sweep — memory-recall". Memory Recall Index 67/100.

Verdict FAIL: C1 Recall dropped to 0.68 on multi-hop queries — chains broke.

AI synthesis · openai/gpt-4o-mini

The run failed with a Memory Recall Index (MRI) score of 67 out of 100. The primary issue was that recall dropped to 0.68 on multi-hop queries, indicating that the chains of information broke down. A secondary warning concerned abstention precision: the system answered questions when it should have declined.

2 ·WHY THIS VERDICT

ranked by severity

HARD

Recall dropped to 0.68 on multi-hop queries — chains broke

Expected

Multi-hop recall@10 ≥ 0.78

Observed

Multi-hop recall@10 collapsed to 0.54; the reranker rescore change pruned bridge documents before the second hop could resolve

Why it matters

Broken chains mean kai silently answers from a single document instead of synthesizing across the decision → supersession → session graph. Per VALUE.md, kai must never silently fill the gap with lower-level recall.

Recommended action · 2 sprints

Revert reranker rescore weight; raise prefetch K to 80 for multi-hop intents; gate the change behind the Phase-A benchmark.
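
The report does not state how multi-hop recall@10 is computed. A common definition, assumed here, is the share of a query's required documents that appear in the top 10 results, averaged over queries. The sketch below uses hypothetical function names and document IDs, and it shows why a pruned bridge document lowers the score: for a multi-hop query, every link in the chain counts as a required document.

```python
def recall_at_k(ranked_ids, relevant_ids, k=10):
    """Fraction of the relevant documents found in the top-k ranked results."""
    relevant = set(relevant_ids)
    if not relevant:
        raise ValueError("recall is undefined without relevant documents")
    hits = relevant.intersection(ranked_ids[:k])
    return len(hits) / len(relevant)


def mean_recall_at_k(queries, k=10):
    """Average recall@k over (ranked_ids, relevant_ids) pairs."""
    if not queries:
        raise ValueError("need at least one query")
    scores = [recall_at_k(ranked, relevant, k) for ranked, relevant in queries]
    return sum(scores) / len(scores)


# Hypothetical multi-hop query: the answer needs all three linked documents.
chain = ["decision-1", "supersession-2", "session-3"]

# Only the first hop survives in the ranking; both bridge documents are missing.
ranked = ["decision-1", "note-7", "note-9"]

score = recall_at_k(ranked, chain, k=10)   # one of three required documents found
meets_target = score >= 0.78               # threshold taken from "Expected" above
```

Under this definition an answer built from `decision-1` alone can still look fluent, which is why the gap shows up only in the recall metric and not in the response text.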

SOFT

Abstention precision fell — kai answered when it should have declined

Expected

Abstention precision ≥ 0.80 on out-of-corpus queries

Observed

Precision = 0.66; kai confabulated on 1-in-3 out-of-corpus probes

Why it matters

A memory layer that answers when it shouldn't is worse than one that abstains — confident wrong recall poisons downstream agents.

Recommended action · 1 sprint

Restore the top-score < 0.5 abstention gate that the rescore change bypassed.
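
As a rough sketch of the gate named in the recommended action, the code below declines to answer when the best retrieval score is under 0.5. Only the 0.5 threshold comes from the report. The function names, the assumption that scores lie between 0 and 1, and the reading of the metric as "share of out-of-corpus probes that were declined" are illustrative assumptions, since the report does not define how it computes abstention precision.

```python
ABSTAIN_BELOW = 0.5  # threshold named in the recommended action


def should_abstain(scores, threshold=ABSTAIN_BELOW):
    """Decline when nothing was retrieved or the top score is under the threshold."""
    return not scores or max(scores) < threshold


def declined_share(out_of_corpus_probes, threshold=ABSTAIN_BELOW):
    """Share of out-of-corpus probes on which the gate declines to answer."""
    if not out_of_corpus_probes:
        raise ValueError("need at least one probe")
    declined = sum(should_abstain(s, threshold) for s in out_of_corpus_probes)
    return declined / len(out_of_corpus_probes)


# Hypothetical retrieval scores for three out-of-corpus probes.
probes = [
    [0.31, 0.20],   # weak matches only: decline
    [0.44],         # weak match: decline
    [0.72, 0.40],   # a strong but spurious match slips past the gate
]

share = declined_share(probes)   # two of the three probes are declined
```

The third probe illustrates the limit of a score gate: it only protects against low-confidence retrieval, so a spurious high-scoring match still produces an answer.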

3 ·THE STORY

what went in, what came out

Input

what the probe sent in

Query

Skills

(no skill inferred)

ESCO —

Industry

Computer Systems Design

NAICS 541512

Location

United States

ISO US

Education

Bachelor or equivalent

ISCED 6

5 ·SESSION RECORDING

watch what the probe saw


No recording available for Metaintro.

6 ·RUN MECHANICS

provenance & reproducibility

Duration

51.87s

Steps

Judges

—

Commit

demo-seed

Started

2026-05-14 13:00 UTC

Trace

Evidence by step

every artifact, link, excerpt, row, metric & recording — grouped by the step that produced it

No evidence recorded for this run.

Evidence integrity

each artifact is SHA-256 hashed at capture — proof it is unmodified

No integrity manifest recorded for this run.
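
For readers unfamiliar with the integrity check described above, here is a minimal sketch of hashing artifacts at capture and re-checking them later, using Python's standard `hashlib`. The manifest layout (a mapping from path to hex digest) and the function names are assumptions made for illustration; the report does not describe the format SQA uses.

```python
import hashlib
from pathlib import Path


def sha256_of(path, chunk_size=65536):
    """Hex SHA-256 digest of a file, read in chunks to bound memory use."""
    digest = hashlib.sha256()
    with open(path, "rb") as fh:
        for chunk in iter(lambda: fh.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def build_manifest(paths):
    """Record a digest per artifact at capture time."""
    return {str(p): sha256_of(p) for p in paths}


def modified_artifacts(manifest):
    """Return paths that are missing or whose current digest differs."""
    return [
        path
        for path, expected in manifest.items()
        if not Path(path).is_file() or sha256_of(path) != expected
    ]
```

An empty result from `modified_artifacts` means every artifact still matches its digest from capture time. The guarantee holds only if the manifest itself is stored where the artifacts' editors cannot rewrite it.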

7 ·SYSTEM ANATOMY

which component drove the verdict

Verdict driven by Unattributed. The ringed, pulsing nodes are the components SQA attributes the failure to.
