Set up eval leaderboard dft project and dashboards¶
Problem¶
Create a dft project inside the eval output directory with dashboard faces that visualize eval results as a leaderboard. Compare models, prompt versions, and context configurations side-by-side. Serve via dft serve so results are browsable locally after any eval run. This is the concrete deliverable for task 5 (persist eval outputs) — not a persistence layer, but actual Dataface dashboards over eval JSONL.
Context¶
What this replaces¶
Task 5 (persist eval outputs) was originally a separate task framed as a persistence layer. That task is now cancelled and merged here. The "persistence" is trivial — JSONL files on disk. The real deliverable is dashboards.
Unified eval directory structure¶
All eval code lives under apps/evals/ — one home for all eval types, one CLI, one leaderboard:
apps/evals/
├── __main__.py # unified CLI: python -m apps.evals {sql,catalog,agent}
├── dataface.yml # DuckDB source over output/ JSONL
├── faces/ # leaderboard dashboards (all eval types)
│ ├── overview.yml # cross-type summary (pass rates per eval type)
│ ├── sql-leaderboard.yml
│ ├── catalog-leaderboard.yml
│ ├── agent-leaderboard.yml
│ └── failure-analysis.yml
├── data/ # benchmark artifacts (checked in)
│ ├── benchmark.jsonl # cleaned dbt SQL benchmark
│ └── canary.jsonl # stratified subset for fast runs
├── output/ # eval results (gitignored)
│ ├── sql/ # text-to-SQL results
│ ├── catalog/ # catalog discovery results
│ └── agent/ # agent eval results (screenshots, scores)
├── sql/ # text-to-SQL eval code
│ ├── runner.py
│ ├── scorer.py
│ ├── backends.py # factory functions (make_raw_llm_backend, etc.)
│ └── types.py # BenchmarkCase, GenerationResult, GenerateFn
├── catalog/ # catalog discovery eval code
│ ├── runner.py
│ ├── scorer.py # IR metrics (recall@k, MRR)
│ └── prep.py # extract expected tables from gold SQL
├── agent/ # agent/dashboard eval code (migrated from apps/a_lie/)
│ ├── runner.py # generate dashboard from prompt
│ ├── screenshotter.py # capture rendered output
│ ├── reviewer.py # vision LLM scoring
│ ├── rubric.md
│ └── prompts/ # curated eval prompts
└── shared/ # shared across eval types
├── prep.py # benchmark cleaning script
└── reporting.py # breakdown/aggregation helpers
The agent eval code (runner.py, screenshotter.py, reviewer.py, rubric.md) migrates from apps/a_lie/ — the A lIe app stays where it is as the demo app, but the eval infrastructure moves to its proper home in apps/evals/agent/.
Data flow¶
- Eval runners write results to apps/evals/output/{sql,catalog,agent}/ (gitignored). apps/evals/dataface.yml declares a DuckDB source. Dashboard queries use read_json_auto() to read JSONL from the output subdirs. This should work the same way file-backed CSV workflows already do. If a small adapter/helper gap shows up around JSONL, close it inside this task rather than treating it as a separate blocker.
- Dashboard faces in apps/evals/faces/ query across all eval types. dft serve from apps/evals/ renders the unified leaderboard.
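The write/read round trip above is small enough to sketch end to end. A minimal sketch, assuming a flat JSONL record with a boolean passed field — the field names are illustrative, not the task-3 output schema:

```python
import json
from pathlib import Path
from tempfile import TemporaryDirectory

# The dashboard side would express the same aggregation in DuckDB SQL,
# roughly: SELECT avg(CAST(passed AS INT)) FROM read_json_auto('output/sql/*.jsonl')
# (query shape is an assumption, not the final face definition).

def append_result(out_dir: Path, record: dict) -> None:
    """Append one eval result as a JSON line — the entire 'persistence layer'."""
    out_dir.mkdir(parents=True, exist_ok=True)
    with (out_dir / "results.jsonl").open("a") as f:
        f.write(json.dumps(record) + "\n")

def pass_rate(out_dir: Path) -> float:
    """Pure-Python stand-in for what a leaderboard query computes."""
    rows = [json.loads(line) for line in (out_dir / "results.jsonl").open()]
    return sum(r["passed"] for r in rows) / len(rows)

with TemporaryDirectory() as tmp:
    sql_out = Path(tmp) / "output" / "sql"
    append_result(sql_out, {"case_id": "c1", "backend": "raw_llm", "passed": True})
    append_result(sql_out, {"case_id": "c2", "backend": "raw_llm", "passed": False})
    print(pass_rate(sql_out))  # → 0.5
```

Because runners only ever append lines, re-running an eval never corrupts earlier results; the dashboards always aggregate over everything on disk.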
If eval results need to be shared beyond the local machine, that's M2 scope.
Unified CLI¶
python -m apps.evals sql --backend raw_llm --model gpt-4o ...
python -m apps.evals catalog --limit 100 ...
python -m apps.evals agent --prompts apps/evals/agent/prompts/ ...
python -m apps.evals agent --prompt "show me revenue by region" ...
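One way the unified entry point could dispatch these invocations — a sketch using argparse subcommands, with flag names taken from the examples above (the actual __main__.py may differ):

```python
import argparse

def build_parser() -> argparse.ArgumentParser:
    """Unified CLI: one parser, one subcommand per eval type."""
    parser = argparse.ArgumentParser(prog="python -m apps.evals")
    sub = parser.add_subparsers(dest="eval_type", required=True)

    sql = sub.add_parser("sql", help="text-to-SQL eval")
    sql.add_argument("--backend", default="raw_llm")
    sql.add_argument("--model")

    catalog = sub.add_parser("catalog", help="catalog discovery eval")
    catalog.add_argument("--limit", type=int)

    agent = sub.add_parser("agent", help="agent/dashboard eval")
    agent.add_argument("--prompts")  # directory of curated prompts
    agent.add_argument("--prompt")   # or a single ad-hoc prompt

    return parser

args = build_parser().parse_args(["sql", "--backend", "raw_llm", "--model", "gpt-4o"])
print(args.eval_type, args.model)  # sql gpt-4o
```

Each subcommand would hand off to its runner (sql/runner.py, catalog/runner.py, agent/runner.py), keeping the three eval types independent behind one front door.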
Reporting dimensions¶
SQL evals: backend, backend_metadata (model, provider, context level), schema, complexity, category.
Catalog evals: retrieval method, k value, schema.
Agent evals: model, prompt, overall/narrative/yaml/visual scores.
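A breakdown over these dimensions is the kind of helper shared/reporting.py could provide. A hedged sketch — the function name and record shape are assumptions, not the existing API:

```python
from collections import defaultdict
from typing import Iterable

def breakdown(rows: Iterable[dict], *dims: str) -> dict:
    """Pass rate grouped by one or more reporting dimensions."""
    groups = defaultdict(list)
    for row in rows:
        groups[tuple(row[d] for d in dims)].append(row["passed"])
    return {key: sum(vals) / len(vals) for key, vals in groups.items()}

rows = [
    {"backend": "raw_llm", "schema": "sales", "passed": True},
    {"backend": "raw_llm", "schema": "sales", "passed": False},
    {"backend": "agent", "schema": "sales", "passed": True},
]
print(breakdown(rows, "backend"))
# {('raw_llm',): 0.5, ('agent',): 1.0}
```

The same helper serves any leaderboard cut: breakdown(rows, "backend", "schema") gives the per-schema drill-down, breakdown(rows, "complexity") the difficulty curve.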
Dashboard ideas¶
- Overview — cross-type summary: SQL pass rate, catalog recall@10, agent average score. One page to see overall AI quality.
- SQL leaderboard — pass rate by backend/model, sortable by dimension.
- SQL failure analysis — table of failed cases with question, gold SQL, generated SQL, failure reason.
- Catalog leaderboard — recall@k and MRR by retrieval method.
- Agent leaderboard — overall score by model, with per-prompt drill-down.
- Context impact — lift per context level across SQL and agent evals.
- Schema difficulty — which schemas are hardest? Pass rate by schema, sorted.
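The "context impact" face above reduces to a simple computation: pass rate per context level minus the baseline level's pass rate. A sketch, with hypothetical level names — the real levels come from backend_metadata:

```python
def context_lift(rates: dict, baseline: str) -> dict:
    """Lift of each context level's pass rate over a baseline level."""
    base = rates[baseline]
    return {level: round(rate - base, 2) for level, rate in rates.items() if level != baseline}

# Hypothetical pass rates per context level (illustrative numbers only).
rates = {"none": 0.40, "schema": 0.55, "schema+docs": 0.70}
print(context_lift(rates, baseline="none"))
# {'schema': 0.15, 'schema+docs': 0.3}
```

Computed once over SQL results and once over agent results, this yields the cross-eval context-impact comparison in a single small table.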
Dependencies¶
- Depends on task 3 (SQL eval runner) for the SQL output schema.
- Does not need to wait on the analytics repo bootstrap flow.
apps/evals/ can be created directly in this repo; the analytics repo remains the canonical dft init proving ground.
- DuckDB/file-backed querying already exists in Dataface. If JSONL needs one small missing piece, add it here.
- Agent eval dashboards depend on the A lIe eval migration (M2 agent eval task).
Possible Solutions¶
Plan¶
Implementation Progress¶
QA Exploration¶
- QA exploration completed — verify dashboards render via dft serve with sample eval output
Review Feedback¶
- Review cleared