# Run context and model ablation experiments
## Problem
Define and execute the initial experiment matrix using the eval system. Compare models (GPT-4o, GPT-5, Claude Sonnet, etc.), schema-tool strategies, context levels (table/column names only, +types, +descriptions, +profile stats, +sample values), and with/without catalog tool access. Measure which context fields actually improve SQL quality and which are noise. Capture results in the eval leaderboard dashboards. This is where the eval system proves its value — the experiments are the point.
## Context
### This is a planning task, not a giant one-off execution task
This task defines the experiment matrix and spawns individual experiment tasks. Each experiment gets its own task file using the experiment worksheet from the run-experiment skill (.codex/skills/run-experiment/SKILL.md). This keeps a clean log of hypothesis, method, results, and conclusions per experiment.
It also owns the decision to run schema-scope curation together with context/tool ablations, not as a separate disconnected track.
### Per-experiment task pattern
When ready to run an experiment:
- Create a task:
  `just plan task create --workstream mcp-analyst-agent --title "Experiment: <description>" --initiative ai-quality-experimentation-and-context-optimization`
- Replace the generated body with the experiment worksheet template from the `run-experiment` skill.
- Fill in hypothesis and method before running.
- Execute, analyze, conclude.
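The spawning step above can be scripted so every planned experiment gets its own task with a consistent title. A minimal sketch, assuming the `just plan task create` invocation shown above; the helper name and experiment list are illustrative, not part of the real tooling:

```python
# Hypothetical helper: emit one `just plan task create` command per planned
# experiment so each gets its own worksheet-backed task file.
import shlex

# Illustrative subset of the planned experiment matrix.
EXPERIMENTS = [
    "Model comparison (GPT-4o vs Claude Sonnet)",
    "Context ablation (L0 vs L1 vs L3 vs L5)",
    "Layer scope (all vs gold-only)",
]

def task_create_command(description: str) -> str:
    # Mirrors the command documented above; shlex.quote keeps the
    # parenthesised titles shell-safe.
    title = f"Experiment: {description}"
    return (
        "just plan task create "
        "--workstream mcp-analyst-agent "
        f"--title {shlex.quote(title)} "
        "--initiative ai-quality-experimentation-and-context-optimization"
    )

for exp in EXPERIMENTS:
    print(task_create_command(exp))
```

Printing rather than executing keeps the script safe to dry-run before committing to a batch of task files.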
### Experiment matrix (planned)
Model comparison:
- GPT-4o vs GPT-5 vs Claude Sonnet — same context, same prompt, same benchmark subset
- One task per model pair comparison
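Because each pair runs on the same benchmark subset, the comparison can be paired per item rather than just comparing aggregate pass rates. A sketch under that assumption (the record shape is illustrative, not the eval runner's actual output):

```python
# Paired model comparison: the same benchmark items scored pass/fail
# for two models. Per-item wins show *where* the models differ, which
# aggregate pass rates alone hide.
def compare(results_a: list[bool], results_b: list[bool]) -> dict:
    n = len(results_a)
    assert n == len(results_b), "paired comparison needs the same benchmark subset"
    wins_a = sum(a and not b for a, b in zip(results_a, results_b))
    wins_b = sum(b and not a for a, b in zip(results_a, results_b))
    return {
        "pass_rate_a": sum(results_a) / n,
        "pass_rate_b": sum(results_b) / n,
        "a_only_wins": wins_a,  # items only model A passed
        "b_only_wins": wins_b,  # items only model B passed
    }
```

The `a_only_wins` / `b_only_wins` split is also the input you would need for a paired significance test later, if the canary set grows large enough to support one.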
Context field ablation:
- L0: no schema tool / no preloaded context
- L1: table + column names only
- L2: L1 + types
- L3: L2 + descriptions (from dbt schema.yml)
- L4: L3 + profile stats (row counts, distributions, nulls)
- L5: L4 + sample/top values

Each level is a run against the canary set; start with L0 vs L1 vs L3 vs L5 to find the big jumps.
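Since the levels are cumulative, the ablation can be driven by one context renderer parameterised on the level. A minimal sketch (data model and field names are assumptions, not the real schema tool):

```python
# Sketch: render preloaded schema context at each ablation level L0-L5.
# Levels are cumulative: L1 names, L2 +types, L3 +descriptions,
# L4 +profile stats, L5 +sample values. L0 means no context at all.
from dataclasses import dataclass, field

@dataclass
class Column:
    name: str
    type: str = ""
    description: str = ""
    profile: str = ""                       # e.g. "nulls=0.1%, distinct=42"
    samples: list = field(default_factory=list)

def render_context(level: int, table: str, columns: list) -> str:
    if level == 0:
        return ""                           # L0: agent gets nothing
    lines = [f"table: {table}"]
    for c in columns:
        parts = [c.name]
        if level >= 2 and c.type:
            parts.append(c.type)
        if level >= 3 and c.description:
            parts.append(c.description)
        if level >= 4 and c.profile:
            parts.append(c.profile)
        if level >= 5 and c.samples:
            parts.append("samples: " + ", ".join(map(str, c.samples)))
        lines.append("  " + " | ".join(parts))
    return "\n".join(lines)
```

One renderer per level (rather than five hand-built prompts) keeps the ablation honest: the only variable between runs is the level number.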
Schema tool strategy:
- Profiled catalog tool (full fields)
- Profiled catalog tool with filtered fields
- Live INFORMATION_SCHEMA path
- Checked-in memory file / snapshot
- No tool
This is a core question, not an implementation detail. The eval system should tell us whether richer profiling helps enough to justify its cost and complexity.
Catalog tool access:
- With vs without the catalog/schema tool entirely
- Preloaded context only vs on-demand tool use
Layer scope (run together with schema curation task):
- All tables vs gold-only vs gold+silver — does seeing raw/staging tables help or hurt?
- Measure both quality and latency/token cost, not just pass rate
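The layer-scope runs therefore need three numbers per configuration, not one. A sketch of the aggregation, assuming a simple per-run record shape (the keys are illustrative, not the eval runner's actual output):

```python
# Sketch: collapse per-run records into the three numbers each
# layer-scope configuration should report on the leaderboard:
# pass rate, latency, and token cost.
from statistics import mean

def summarize(runs: list[dict]) -> dict:
    return {
        "pass_rate": sum(r["passed"] for r in runs) / len(runs),
        "mean_latency_s": mean(r["latency_s"] for r in runs),
        "total_tokens": sum(r["tokens"] for r in runs),
    }
```

Reporting tokens and latency alongside pass rate makes regressions visible even when quality ties, e.g. gold-only matching all-tables on pass rate at a fraction of the token cost.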
### What not to do yet
- Do not turn M1 into a full regression-suite project.
- Broad automated eval gates can wait until M4 once the benchmark, scorer, and experiment matrix stabilize.
- For now, prioritize canary runs and targeted experiment tasks that answer product questions.
## Dependencies
- Depends on eval runner (task 3), cleaned benchmark (task 2), extracted `generate_sql` function, and leaderboard dashboards being operational.
- The schema curation task is closely related — layer scope experiments feed into curation decisions.
## Possible Solutions
## Plan
## Implementation Progress
Experiments will be logged as individual tasks. Link them here as they're created:
- Experiment: Model comparison (GPT-4o vs Claude Sonnet)
- Experiment: Context ablation (L0 vs L1 vs L3 vs L5)
- Experiment: Schema tool strategy (profiled vs filtered vs information_schema vs none)
- Experiment: With vs without catalog tool
- Experiment: Layer scope (all vs gold-only)
## QA Exploration
- QA exploration completed (or N/A for non-UI tasks)
N/A — this is a planning and coordination task.
## Review Feedback
- Review cleared