Add catalog discovery evals derived from SQL benchmark¶
Problem¶
Adapt the dbt SQL benchmark into search/catalog discovery eval cases by extracting expected tables from gold SQL and generating one or more search queries per case. Reuse the proven scoring model of recall@k, hit rate@k, and MRR, but wire it to Dataface search/catalog retrieval instead of ContextCatalog-specific runners.
Context¶
Repo boundary: this lives in Dataface¶
Like all eval tasks, this lives in the Dataface repo under apps/evals/catalog/. The eval prep step (apps/evals/catalog/prep.py — extract expected tables from gold SQL, generate search queries), the retrieval runner (apps/evals/catalog/runner.py), and the IR scorer (apps/evals/catalog/scorer.py) all live there. Results go to apps/evals/output/catalog/. The cleaned benchmark input comes from apps/evals/data/ (task 2).
The unified apps/evals/ CLI runs catalog evals via python -m apps.evals catalog ....
Scoring model: information retrieval metrics¶
This uses a different scoring model from the text-to-SQL eval (task 3):
- recall@k — what fraction of expected tables appear in the top-k results?
- hit rate@k — does at least one expected table appear in top-k?
- MRR (mean reciprocal rank) — how high does the first correct result rank?
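The three metrics above are standard IR measures and can be sketched as pure functions. A minimal illustration (function names are mine, not taken from the repo's `scorer.py`):

```python
from typing import List, Set

def recall_at_k(expected: Set[str], results: List[str], k: int) -> float:
    # Fraction of expected tables that appear in the top-k results.
    top = set(results[:k])
    return len(expected & top) / len(expected) if expected else 0.0

def hit_rate_at_k(expected: Set[str], results: List[str], k: int) -> float:
    # 1.0 if at least one expected table appears in top-k, else 0.0.
    return 1.0 if expected & set(results[:k]) else 0.0

def reciprocal_rank(expected: Set[str], results: List[str]) -> float:
    # 1 / rank of the first correct result; 0.0 if none found.
    # MRR is the mean of this value across all eval cases.
    for rank, name in enumerate(results, start=1):
        if name in expected:
            return 1.0 / rank
    return 0.0
```

Per-case values would then be averaged across the benchmark to produce the reported recall@k, hit rate@k, and MRR numbers.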
Existing prior art¶
cto-research/context_catalog/evals/search_eval/prepare_dataset.py already does the core transformation: extract expected tables from gold SQL, generate search queries per table. Port this approach.
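The extraction half of that transformation can be approximated with a naive regex pass, shown here only to make the prep step concrete. A real implementation should use a SQL parser (e.g. sqlglot) to handle subqueries, aliases, and quoting correctly; the regexes below are an assumption-laden sketch:

```python
import re

# Naive: grab identifiers after FROM/JOIN. Misses comma-joins,
# quoted names, and derived tables; a parser handles those.
TABLE_RE = re.compile(r"\b(?:FROM|JOIN)\s+([A-Za-z_][\w.]*)", re.IGNORECASE)
CTE_RE = re.compile(r"\b([A-Za-z_]\w*)\s+AS\s*\(", re.IGNORECASE)

def expected_tables(sql: str) -> set[str]:
    """Extract the set of expected tables from a gold SQL query."""
    names = set(TABLE_RE.findall(sql))
    # CTE names defined in WITH clauses are not catalog tables.
    return names - set(CTE_RE.findall(sql))
```

Query generation (one or more natural-language search queries per case) would sit alongside this in `prep.py`, e.g. derived from the benchmark question text or per-table descriptions.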
What this evaluates¶
The retrieval step before generation — can Dataface's catalog/search surface the right tables given a natural-language question? Text-to-SQL usually fails at retrieval, not generation. If catalog search can't find the right table, the SQL generator never had a chance.
The Dataface search surface is search_dashboards in dataface/ai/mcp/search.py and the catalog tool in dataface/ai/mcp/tools.py. This eval tests whether those tools (or future dedicated table-search tools) return the expected tables.
Dependencies¶
- Input comes from task 2 (cleaned benchmark in apps/evals/data/).
- Does not depend on task 3's runner framework — this is a separate, simpler eval loop with its own retrieval-specific scorer.
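That simpler eval loop is essentially: for each case, issue its search queries, merge the ranked results, and score against the expected tables. A sketch of the per-case retrieval step, where `search_tables` is a hypothetical wrapper over Dataface's search tools (not a real API in the repo):

```python
from typing import Callable, Dict, List

def run_case(
    case: Dict,
    search_tables: Callable[[str, int], List[str]],  # hypothetical wrapper
    k: int = 10,
) -> List[str]:
    """Run every search query for a case and merge ranked results.

    Results are deduplicated in first-seen order, so earlier queries
    and higher-ranked hits keep their positions, then truncated to k.
    """
    merged: List[str] = []
    for query in case["queries"]:
        for table in search_tables(query, k):
            if table not in merged:
                merged.append(table)
    return merged[:k]
```

The merged ranking would then be fed to the retrieval scorer, and per-case metrics averaged over the benchmark.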
Possible Solutions¶
Plan¶
Implementation Progress¶
QA Exploration¶
- QA exploration completed (or N/A for non-UI tasks)
N/A for browser QA. Validation: run against a sample of benchmark cases, verify recall@k and MRR outputs match manual spot-checks.
Review Feedback¶
- Review cleared