metaintro-chat · job-search
- contractVersion: 2.0.0
Contract - metaintro-chat job-search journey #
The previous contract version (1.x) modeled this scenario as a sequence of verifier-shaped claims (C1 shape, C2 relevancy, C3 coverage, JSI composite). That vocabulary served the runner but failed the reader: a person opening a report saw `metaintro-chat.c1.shape: warn - no job cards returned` and had to translate it into a user-facing question. v2.0.0 rewrites the contract from the client's point of view. There is one claim. The steps are how we get into a position to verify it. The verdict is a single number.
The single claim #
The jobs the metaintro-chat search engine returned, in the chat thread, are relevant to what the user asked for.
That's it. Everything else in this document is in service of verifying that one claim.
The verdict - relevancy score (0-100) #
The report's headline is a single number from 0 to 100.
- 100 - every returned job is a relevant answer to the user's query, judged from the user's POV.
- 0 - no returned job matches the query.
- In between - partial relevance. The score is the LLM-judged mean per-job relevancy across the result set (see the step 4 verifier in the verification map).
The verdict is not pass/warn/fail. It is the number. A reader who sees "Relevancy: 73 / 100" knows immediately what happened. Compare against the prior trial's relevancy for trend.
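The aggregation described above can be sketched as follows. This is a minimal illustration, not the contract's verifier: the `JudgedJob` shape and the `headlineScore` name are hypothetical, and rounding the mean to an integer is an assumption (the contract only says "mean").

```typescript
// Hypothetical shape: one judged job, scored on the contract's 0-100 scale.
interface JudgedJob {
  title: string;
  score: number; // per-job relevancy from the LLM judge, 0-100
}

// Mean per-job relevancy across the result set.
// Rounding to an integer is an assumption made for display purposes.
function headlineScore(jobs: JudgedJob[]): number {
  // The contract defines an empty result set as 0, not null.
  if (jobs.length === 0) return 0;
  const total = jobs.reduce((sum, job) => sum + job.score, 0);
  return Math.round(total / jobs.length);
}

const example: JudgedJob[] = [
  { title: "Senior React Engineer", score: 95 },
  { title: "Frontend Developer", score: 80 },
  { title: "Java Backend Engineer", score: 20 },
];

console.log(`Relevancy: ${headlineScore(example)} / 100`); // Relevancy: 65 / 100
```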
Trigger #
User opens https://www.metaintro.com/login, signs in with valid credentials, completes any required onboarding, navigates to a new thread, and submits a job-search query as their first message.
The journey (4 steps) #
1. login - User signs in with email + password.
2. onboarding - User completes the onboarding flow if the account is fresh (career stage, target role, location preference). Already-onboarded accounts skip cleanly.
3. search - User starts a new thread and submits a query ("senior react engineer remote").
4. evaluate - Job cards render in the assistant's response. We capture them, score per-job relevancy, and aggregate to the headline score.
Each step is observable from the user's POV, not the runner's. Step 4 produces the verdict.
What we capture (evidence allowlist) #
The previous contract version recorded everything Playwright emitted - including a 28MB HAR per trial. That's noise. v2.0.0 narrows to evidence that supports the single claim:
Always captured:
- One continuous video of the entire journey (video.webm, ~500 KB per trial).
- Screenshots at landmark moments: login page, post-login redirect, onboarding completion, thread creation, query submission, first job card visible.
- Console errors and warnings (filtered: drop info/debug).
- The thread id and the search query as plain strings.
- The job result set: array of `{ title, company, location, url, posted_on, … }` extracted from the chat DOM.
- The per-job relevancy scores + reasons from the LLM judge.
Filtered network capture (only):
- `POST /api/auth/*` (login flow)
- `POST /api/threads*` (thread creation)
- `POST /api/thread-chat*` (the search query)
- `GET/POST /api/jobs*` or any URL containing `/job-matching/` or `/jme/` (job search engine calls)
Dropped entirely:
- All third-party requests (Stripe, Vercel analytics, Sentry, Mux, etc.)
- All `_next/static/*` chunk fetches
- Fonts, images, prefetches.
- Full HAR snapshots. We keep the filtered JSONL only.
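One way to read the capture rules is as an allowlist with a default of "drop". The sketch below assumes that reading. The `CapturedRequest` shape and `shouldKeep` name are hypothetical, the glob patterns are interpreted as path prefixes, and the host is deliberately not checked; a real implementation might also restrict to first-party hosts.

```typescript
// Hypothetical record for one observed request.
interface CapturedRequest {
  method: string;
  url: string; // absolute URL
}

// Allowlist filter: anything not matched here is dropped, which covers
// third-party calls, _next/static chunks, fonts, images and prefetches.
function shouldKeep(req: CapturedRequest): boolean {
  const { pathname } = new URL(req.url);
  const method = req.method.toUpperCase();
  const isPost = method === "POST";
  const isGetOrPost = isPost || method === "GET";

  if (isPost && pathname.startsWith("/api/auth/")) return true; // login flow
  if (isPost && pathname.startsWith("/api/threads")) return true; // thread creation
  if (isPost && pathname.startsWith("/api/thread-chat")) return true; // search query
  if (isGetOrPost && pathname.startsWith("/api/jobs")) return true; // job search

  // "any URL containing /job-matching/ or /jme/" (method is not constrained here)
  return req.url.includes("/job-matching/") || req.url.includes("/jme/");
}

// Illustrative URLs only; the exact endpoints below are assumptions.
const kept = [
  { method: "POST", url: "https://www.metaintro.com/api/auth/login" },
  { method: "GET", url: "https://www.metaintro.com/_next/static/chunks/main.js" },
  { method: "GET", url: "https://www.metaintro.com/api/jobs?q=react" },
].filter(shouldKeep);

console.log(kept.length); // 2
```

Defaulting to "drop" keeps the filtered JSONL small even when the page starts loading assets the contract never anticipated.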
Verification map #
| Step | Claim contribution | Verifier |
|---|---|---|
| 1 login | User can reach the post-login state. If login fails the report cannot judge relevancy at all - score is null + outcome is "could-not-evaluate". | verifiers/login-reached.ts |
| 2 onboarding | User can reach a thread-ready state. Same null-score semantics on failure. | verifiers/onboarding-completed.ts |
| 3 search | The query was submitted and the chat returned an assistant turn containing job cards. If zero cards: relevancy is defined as 0 (the search produced no answers), not null. | verifiers/jobs-returned.ts |
| 4 evaluate | The single load-bearing claim. LLM judge scores per-job relevancy 0-100 against the query string, aggregates to mean, names which jobs scored below threshold and why. | verifiers/relevancy.ts |
The processed-not-surfaced rule #
The report MUST go beyond "here's the data." If relevancy is below 100, the report MUST name which specific jobs were judged irrelevant and why. The reader should not have to scan the result set themselves to find the bad matches.
Surfaced fields per irrelevant job (in the gap card / report body):
- Job title + company + location (so the reader recognizes it)
- Relevancy score for this specific job (0-100)
- Reason the judge marked it down (e.g. "location mismatch: query asked for remote, job is on-site in Bangalore")
- Link to the job posting (so the reader can verify)
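As a rough sketch of this rule, the function below picks the jobs that fall under a threshold and keeps only the surfaced fields, worst first. The `ScoredJob` and `GapEntry` shapes are hypothetical, the contract does not state a threshold value so it is passed in as a parameter, and the sort order is an assumption.

```typescript
// Hypothetical shape of one judged job from the result set.
interface ScoredJob {
  title: string;
  company: string;
  location: string;
  url: string;
  score: number; // per-job relevancy, 0-100
  reason: string; // judge's explanation for the score
}

// The fields the report surfaces for each irrelevant job.
type GapEntry = Pick<ScoredJob, "title" | "company" | "location" | "score" | "reason" | "url">;

// Returns the below-threshold jobs, lowest score first, without mutating the input.
function surfaceIrrelevant(jobs: ScoredJob[], threshold: number): GapEntry[] {
  return jobs
    .filter((job) => job.score < threshold)
    .sort((a, b) => a.score - b.score)
    .map(({ title, company, location, score, reason, url }) => ({
      title, company, location, score, reason, url,
    }));
}

// Illustrative data; companies and URLs are placeholders.
const judged: ScoredJob[] = [
  { title: "Senior React Engineer", company: "Acme", location: "Remote",
    url: "https://example.com/jobs/1", score: 95, reason: "matches role, seniority and location" },
  { title: "React Engineer", company: "Globex", location: "Bangalore",
    url: "https://example.com/jobs/2", score: 30,
    reason: "location mismatch: query asked for remote, job is on-site in Bangalore" },
];

const gaps = surfaceIrrelevant(judged, 50); // 50 is an illustrative threshold
console.log(gaps.map((g) => `${g.title} @ ${g.company}: ${g.score} (${g.reason})`));
```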
Outcome semantics #
- `score: number` (0-100) when steps 1-3 succeeded and the judge ran. This is the headline.
- `score: null` when steps 1-3 did not reach the search-result stage. The report says "could not evaluate" with the reason. Distinct from `score: 0`.
- `score: 0` when search succeeded but no jobs came back, OR every job was judged irrelevant. Distinct from `null`.
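The null-versus-zero distinction can be made explicit with a discriminated union. The sketch below is illustrative only: `JourneyState`, `Verdict`, `resolveVerdict` and the `"evaluated"` outcome label are hypothetical names, the reason strings are placeholders, and only `"could-not-evaluate"` comes from the contract.

```typescript
// Hypothetical summary of how far the journey got.
interface JourneyState {
  loginReached: boolean;
  onboardingCompleted: boolean;
  searchReached: boolean; // the chat returned an assistant turn for the query
  jobScores: number[]; // per-job relevancy, empty when no cards came back
}

type Verdict =
  | { outcome: "evaluated"; score: number } // "evaluated" is an assumed label
  | { outcome: "could-not-evaluate"; score: null; reason: string };

function resolveVerdict(state: JourneyState): Verdict {
  const blocked = (reason: string): Verdict => ({
    outcome: "could-not-evaluate",
    score: null,
    reason,
  });

  // Steps 1-3 failing means relevancy cannot be judged at all: null, not 0.
  if (!state.loginReached) return blocked("login failed");
  if (!state.onboardingCompleted) return blocked("onboarding did not complete");
  if (!state.searchReached) return blocked("search did not reach the result stage");

  // Search worked but produced no answers: the score is defined as 0.
  if (state.jobScores.length === 0) return { outcome: "evaluated", score: 0 };

  const total = state.jobScores.reduce((sum, s) => sum + s, 0);
  return { outcome: "evaluated", score: Math.round(total / state.jobScores.length) };
}

const base = { loginReached: true, onboardingCompleted: true, searchReached: true };
console.log(resolveVerdict({ ...base, jobScores: [] }).score); // 0
console.log(resolveVerdict({ ...base, loginReached: false, jobScores: [] }).score); // null
```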
Trend #
The verdict's full meaning includes how this run compares to prior runs of similar queries. The report renders a sparkline / delta for the same query string + profile across the last N runs, so the reader sees regression at a glance.
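A minimal sketch of the delta behind that trend view, assuming prior scores for the same query string + profile are available oldest-first. The `trendDelta` name is hypothetical, and skipping prior runs that could not be evaluated (`null`) is an assumption rather than something the contract specifies.

```typescript
// Difference between the current score and the most recent prior numeric score.
// `history` is ordered oldest to newest; null marks a run that could not be evaluated.
function trendDelta(history: Array<number | null>, current: number): number | null {
  for (let i = history.length - 1; i >= 0; i--) {
    const previous = history[i];
    if (typeof previous === "number") return current - previous;
  }
  return null; // no comparable prior run
}

console.log(trendDelta([81, 78], 73)); // -5
console.log(trendDelta([81, null], 73)); // -8
console.log(trendDelta([], 73)); // null
```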
What this contract deliberately does NOT promise #
- Latency, availability, error rates of the underlying APIs. Those are observability concerns, not claim verifications. The filtered network log carries the data; the report just doesn't grade it.
- Job freshness, diversity, deduplication. These could be future claims. v2.0.0 covers relevancy only.
- Multi-turn behavior. This contract scores the FIRST job-search result. A follow-up "show me only remote ones" is a different scenario.
Source #
This contract was rewritten on 2026-05-26 after the first production W11 sweep surfaced that the verifier-shaped vocabulary (drive / observe / c1.shape / c2.relevancy / c3.coverage / jsi) did not communicate value to a reader. The single-claim model collapses six derived signals into one user-language verdict.
The previous v1.x contract remains in the git history but is superseded. The unified `sqa.sqa_runs` schema accepts both versions: `outcome` stays the categorical floor, and `result_json` carries the score as `result.context.relevancyScore`. The viewer reads the score field directly and renders it as the headline.