Add catalog discovery evals derived from SQL benchmark

Problem

Adapt the dbt SQL benchmark into search/catalog discovery eval cases by extracting expected tables from gold SQL and generating one or more search queries per case. Reuse the proven scoring model of recall@k, hit rate@k, and MRR, but wire it to Dataface search/catalog retrieval instead of ContextCatalog-specific runners.

Context

Repo boundary: this lives in Dataface

Like all eval tasks, this lives in the Dataface repo under apps/evals/catalog/. The eval prep step (apps/evals/catalog/prep.py — extract expected tables from gold SQL, generate search queries), the retrieval runner (apps/evals/catalog/runner.py), and the IR scorer (apps/evals/catalog/scorer.py) all live there. Results go to apps/evals/output/catalog/. The cleaned benchmark input comes from apps/evals/data/ (task 2).
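The prep step's output could be a list of per-question eval cases. A minimal sketch of that record, assuming illustrative field names (the actual schema emitted by apps/evals/catalog/prep.py may differ):

```python
from dataclasses import dataclass, field, asdict
import json

@dataclass
class CatalogEvalCase:
    """One catalog-discovery eval case derived from a benchmark question.

    Field names are illustrative, not a fixed schema.
    """
    case_id: str
    question: str                 # natural-language question from the benchmark
    expected_tables: list         # tables extracted from the gold SQL
    search_queries: list = field(default_factory=list)  # generated queries

    def to_json(self) -> str:
        # Stable key order keeps diffs of apps/evals/output/catalog/ readable.
        return json.dumps(asdict(self), sort_keys=True)

case = CatalogEvalCase(
    case_id="bench-001",
    question="What was total revenue by region last quarter?",
    expected_tables=["fct_orders", "dim_region"],
    search_queries=["revenue by region", "orders fact table"],
)
```

Serializing one case per line (JSONL) into apps/evals/output/catalog/ would keep runs easy to diff and spot-check.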

The unified apps/evals/ CLI runs catalog evals via python -m apps.evals catalog ....

Scoring model: information retrieval metrics

This uses a completely different scoring model from the text-to-SQL eval (task 3):

  • recall@k — what fraction of expected tables appear in the top-k results?
  • hit rate@k — does at least one expected table appear in top-k?
  • MRR (mean reciprocal rank) — how high does the first correct result rank?
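The three metrics above are standard IR measures and small to implement. A sketch of what apps/evals/catalog/scorer.py could contain (function names are assumptions, not an existing API):

```python
def recall_at_k(expected, retrieved, k):
    """Fraction of expected tables that appear in the top-k retrieved."""
    if not expected:
        return 0.0
    top = set(retrieved[:k])
    return sum(1 for t in expected if t in top) / len(expected)

def hit_rate_at_k(expected, retrieved, k):
    """1.0 if at least one expected table appears in the top-k, else 0.0."""
    top = set(retrieved[:k])
    return 1.0 if any(t in top for t in expected) else 0.0

def reciprocal_rank(expected, retrieved):
    """1 / rank of the first expected table in retrieved (0.0 if none found)."""
    wanted = set(expected)
    for rank, t in enumerate(retrieved, start=1):
        if t in wanted:
            return 1.0 / rank
    return 0.0

def mrr(cases):
    """Mean reciprocal rank over (expected, retrieved) pairs."""
    if not cases:
        return 0.0
    return sum(reciprocal_rank(e, r) for e, r in cases) / len(cases)
```

Hit rate@k is the coarse "did we find anything" signal; recall@k penalizes multi-table questions where only some joins are discoverable; MRR rewards ranking the right table near the top.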

Existing prior art

cto-research/context_catalog/evals/search_eval/prepare_dataset.py already does the core transformation: extract expected tables from gold SQL, generate search queries per table. Port this approach.
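The core transformation can be sketched as below. This is a naive regex pass to show the shape of the prep step; the ported prep.py should use a real SQL parser (e.g. sqlglot) to handle subqueries, quoting, and dialect quirks correctly:

```python
import re

# Naive FROM/JOIN extraction; assumes unquoted identifiers.
TABLE_RE = re.compile(r"\b(?:from|join)\s+([a-zA-Z_][\w.]*)", re.IGNORECASE)

def extract_expected_tables(gold_sql: str) -> list:
    """Pull candidate table names out of gold SQL, skipping CTE aliases."""
    cte_names = set(
        re.findall(r"\b([a-zA-Z_]\w*)\s+as\s*\(", gold_sql, re.IGNORECASE)
    )
    tables = []
    for name in TABLE_RE.findall(gold_sql):
        base = name.split(".")[-1]          # drop schema/database qualifiers
        if base not in cte_names and base not in tables:
            tables.append(base)
    return tables

def generate_queries(question: str, tables: list) -> list:
    """One query from the question itself, plus one per expected table."""
    return [question] + [t.replace("_", " ") for t in tables]
```

Query generation here is deliberately dumb (question plus de-underscored table names); the prior-art script's per-table query generation is the thing to port.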

What this evaluates

The retrieval step before generation — can Dataface's catalog/search surface the right tables given a natural-language question? Text-to-SQL pipelines typically fail at retrieval rather than generation: if catalog search can't find the right table, the SQL generator never had a chance.

The Dataface search surface is search_dashboards in dataface/ai/mcp/search.py and the catalog tool in dataface/ai/mcp/tools.py. This eval tests whether those tools (or future dedicated table-search tools) return the expected tables.
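The runner's per-case loop could be kept tool-agnostic by taking the search surface as a parameter. A sketch, where `search_fn` stands in for whichever surface is under test (search_dashboards, the catalog tool, or a future table-search tool); its signature here is an assumption of this sketch, not the real tool API:

```python
from typing import Callable

def run_case(search_fn: Callable[[str], list], queries: list, k: int) -> list:
    """Run each generated query, merge ranked results, dedupe by first rank.

    Returns at most k table names, in the order they were first retrieved,
    ready to hand to the IR scorer alongside the case's expected tables.
    """
    merged = []
    for query in queries:
        for table in search_fn(query)[:k]:
            if table not in merged:
                merged.append(table)
    return merged[:k]
```

First-seen merging is the simplest fusion policy; if per-query scores turn out to matter, reciprocal-rank fusion would be a natural upgrade without changing the scorer.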

Dependencies

  • Input comes from task 2 (cleaned benchmark in apps/evals/data/).
  • Does not depend on task 3's runner framework — this is a separate, simpler eval loop with its own retrieval-specific scorer.

Possible Solutions

Plan

Implementation Progress

QA Exploration

  • QA exploration completed (or N/A for non-UI tasks)

N/A for browser QA. Validation: run against a sample of benchmark cases, verify recall@k and MRR outputs match manual spot-checks.

Review Feedback

  • Review cleared