Set up eval leaderboard dft project and dashboards

Problem

Create a dft project inside the eval output directory with dashboard faces that visualize eval results as a leaderboard. Compare models, prompt versions, and context configurations side-by-side. Serve via dft serve so results are browsable locally after any eval run. This is the concrete deliverable for task 5 (persist eval outputs) — not a persistence layer, but actual Dataface dashboards over eval JSONL.

Context

What this replaces

Task 5 (persist eval outputs) was originally a separate task framed as a persistence layer. That task is now cancelled and merged here. The "persistence" is trivial — JSONL files on disk. The real deliverable is dashboards.

Unified eval directory structure

All eval code lives under apps/evals/ — one home for all eval types, one CLI, one leaderboard:

apps/evals/
├── __main__.py           # unified CLI: python -m apps.evals {sql,catalog,agent}
├── dataface.yml          # DuckDB source over output/ JSONL
├── faces/                # leaderboard dashboards (all eval types)
│   ├── overview.yml      # cross-type summary (pass rates per eval type)
│   ├── sql-leaderboard.yml
│   ├── catalog-leaderboard.yml
│   ├── agent-leaderboard.yml
│   └── failure-analysis.yml
├── data/                 # benchmark artifacts (checked in)
│   ├── benchmark.jsonl   # cleaned dbt SQL benchmark
│   └── canary.jsonl      # stratified subset for fast runs
├── output/               # eval results (gitignored)
│   ├── sql/              # text-to-SQL results
│   ├── catalog/          # catalog discovery results
│   └── agent/            # agent eval results (screenshots, scores)
├── sql/                  # text-to-SQL eval code
│   ├── runner.py
│   ├── scorer.py
│   ├── backends.py       # factory functions (make_raw_llm_backend, etc.)
│   └── types.py          # BenchmarkCase, GenerationResult, GenerateFn
├── catalog/              # catalog discovery eval code
│   ├── runner.py
│   ├── scorer.py         # IR metrics (recall@k, MRR)
│   └── prep.py           # extract expected tables from gold SQL
├── agent/                # agent/dashboard eval code (migrated from apps/a_lie/)
│   ├── runner.py         # generate dashboard from prompt
│   ├── screenshotter.py  # capture rendered output
│   ├── reviewer.py       # vision LLM scoring
│   ├── rubric.md
│   └── prompts/          # curated eval prompts
└── shared/               # shared across eval types
    ├── prep.py           # benchmark cleaning script
    └── reporting.py      # breakdown/aggregation helpers

The agent eval code (runner.py, screenshotter.py, reviewer.py, rubric.md) migrates from apps/a_lie/ — the A lIe app stays where it is as the demo app, but the eval infrastructure moves to its proper home in apps/evals/agent/.
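The IR metrics named for catalog/scorer.py in the tree above (recall@k, MRR) are simple to state. A minimal sketch, assuming the scorer receives a gold table set and a ranked retrieval list; the function names and signatures are illustrative, not the actual scorer.py API:

```python
from typing import Sequence


def recall_at_k(expected: set, retrieved: Sequence, k: int) -> float:
    """Fraction of expected (gold) tables that appear in the top-k retrieved."""
    if not expected:
        return 0.0
    return len(expected & set(retrieved[:k])) / len(expected)


def mrr(expected: set, retrieved: Sequence) -> float:
    """Reciprocal rank of the first gold table in the ranked list (0 if none)."""
    for rank, table in enumerate(retrieved, start=1):
        if table in expected:
            return 1.0 / rank
    return 0.0


# e.g. gold tables {"orders", "customers"} against a ranked retrieval:
print(recall_at_k({"orders", "customers"}, ["orders", "products", "customers"], 2))  # 0.5
print(mrr({"customers"}, ["orders", "products", "customers"]))  # 0.3333333333333333
```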

Data flow

  1. Eval runners write results to apps/evals/output/{sql,catalog,agent}/ (gitignored).
  2. apps/evals/dataface.yml declares a DuckDB source. Dashboard queries use read_json_auto() to read JSONL from the output subdirs. This should work the same way file-backed CSV workflows already do. If a small adapter/helper gap shows up around JSONL, close it inside this task rather than treating it as a separate blocker.
  3. Dashboard faces in apps/evals/faces/ query across all eval types.
  4. dft serve from apps/evals/ renders the unified leaderboard.

If eval results need to be shared beyond the local machine, that's M2 scope.

Unified CLI

python -m apps.evals sql --backend raw_llm --model gpt-4o ...
python -m apps.evals catalog --limit 100 ...
python -m apps.evals agent --prompts apps/evals/agent/prompts/ ...
python -m apps.evals agent --prompt "show me revenue by region" ...
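The subcommand routing for __main__.py can be a thin argparse layer. A sketch covering only the flags shown in the examples above; the dispatch to each runner is omitted, and defaults are assumptions:

```python
import argparse


def build_parser() -> argparse.ArgumentParser:
    """One CLI, one subparser per eval type (flags here are illustrative)."""
    parser = argparse.ArgumentParser(prog="python -m apps.evals")
    sub = parser.add_subparsers(dest="eval_type", required=True)

    sql = sub.add_parser("sql", help="text-to-SQL eval")
    sql.add_argument("--backend", default="raw_llm")
    sql.add_argument("--model")

    catalog = sub.add_parser("catalog", help="catalog discovery eval")
    catalog.add_argument("--limit", type=int)

    agent = sub.add_parser("agent", help="agent/dashboard eval")
    agent.add_argument("--prompts")
    agent.add_argument("--prompt")

    return parser


args = build_parser().parse_args(["sql", "--backend", "raw_llm", "--model", "gpt-4o"])
print(args.eval_type, args.model)  # sql gpt-4o
```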

Reporting dimensions

  • SQL evals: backend, backend_metadata (model, provider, context level), schema, complexity, category.
  • Catalog evals: retrieval method, k value, schema.
  • Agent evals: model, prompt, overall/narrative/yaml/visual scores.
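A shared breakdown helper (the kind of thing shared/reporting.py might hold) can slice pass rate by any of these dimensions. A sketch with illustrative field names, not the actual reporting.py API:

```python
from collections import defaultdict


def breakdown(results: list, dims: list) -> dict:
    """Pass rate keyed by a tuple of the requested dimensions."""
    buckets = defaultdict(lambda: [0, 0])  # key -> [passed, total]
    for row in results:
        key = tuple(row[d] for d in dims)
        buckets[key][0] += int(row["passed"])
        buckets[key][1] += 1
    return {key: passed / total for key, (passed, total) in buckets.items()}


rows = [
    {"backend": "raw_llm", "schema": "shop", "passed": True},
    {"backend": "raw_llm", "schema": "shop", "passed": False},
    {"backend": "agentic", "schema": "shop", "passed": True},
]
print(breakdown(rows, ["backend"]))  # {('raw_llm',): 0.5, ('agentic',): 1.0}
```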

Dashboard ideas

  • Overview — cross-type summary: SQL pass rate, catalog recall@10, agent average score. One page to see overall AI quality.
  • SQL leaderboard — pass rate by backend/model, sortable by dimension.
  • SQL failure analysis — table of failed cases with question, gold SQL, generated SQL, failure reason.
  • Catalog leaderboard — recall@k and MRR by retrieval method.
  • Agent leaderboard — overall score by model, with per-prompt drill-down.
  • Context impact — lift per context level across SQL and agent evals.
  • Schema difficulty — which schemas are hardest? Pass rate by schema, sorted.
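The "context impact" idea reduces to a delta over a baseline context level. A sketch; the level names, the baseline label, and the pass-rate figures are all assumptions:

```python
def context_lift(pass_rates: dict, baseline: str = "none") -> dict:
    """Pass-rate lift of each context level over the baseline level."""
    base = pass_rates[baseline]
    return {
        level: round(rate - base, 4)
        for level, rate in pass_rates.items()
        if level != baseline
    }


# Hypothetical per-level pass rates aggregated from SQL eval output:
print(context_lift({"none": 0.42, "schema": 0.55, "schema+docs": 0.63}))
# {'schema': 0.13, 'schema+docs': 0.21}
```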

Dependencies

  • Depends on task 3 (SQL eval runner) for the SQL output schema.
  • Does not need to wait on the analytics repo bootstrap flow. apps/evals/ can be created directly in this repo; the analytics repo remains the canonical dft init proving ground.
  • DuckDB/file-backed querying already exists in Dataface. If JSONL needs one small missing piece, add it here.
  • Agent eval dashboards depend on the A lIe eval migration (M2 agent eval task).

Possible Solutions

Plan

Implementation Progress

QA Exploration

  • QA exploration completed — verify dashboards render via dft serve with sample eval output

Review Feedback

  • Review cleared