# Curate schema and table scope for eval benchmark
## Problem
Decide which schemas, tables, and data layers (raw, silver/staging, gold/marts) to include in the eval scope and catalog context. The analytics warehouse has raw, staging, and transform layers: including everything adds noise, but restricting to only the gold layer may hurt agent performance by hiding useful context. Run experiments comparing agent quality across different table-scope configurations, then produce a curated allowlist and document the reasoning. This feeds back into both the eval benchmark filtering and the catalog context the agent sees in production.
## Context
### The problem with "just use the gold layer"
A typical dbt warehouse has three layers:
- Raw/staging (`stg_*`) — direct copies from sources, minimally transformed
- Silver/intermediate (`int_*`) — cleaned, joined, business logic applied
- Gold/marts (`fct_*`, `dim_*`) — curated analytical models, the "right" tables for analysts
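For the inventory deliverable, these layers can be tagged mechanically from dbt naming conventions. A minimal sketch (the prefix-to-layer map is an assumption based on the conventions listed above; adjust it to the warehouse's actual naming):

```python
# Hypothetical layer tagger based on common dbt table-name prefixes.
# The mapping below is an assumption, not the warehouse's real convention.
LAYER_PREFIXES = {
    "stg_": "raw",     # staging: direct copies from sources
    "int_": "silver",  # intermediate: cleaned/joined models
    "fct_": "gold",    # fact marts
    "dim_": "gold",    # dimension marts
}

def tag_layer(table_name: str) -> str:
    """Return the layer tag for a table, or 'unknown' if no prefix matches."""
    for prefix, layer in LAYER_PREFIXES.items():
        if table_name.startswith(prefix):
            return layer
    return "unknown"
```

Tables that fall through to `"unknown"` are worth flagging explicitly in the inventory, since they are exactly the ones the layer conventions do not cover.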
The instinct is to expose only gold to the agent. But this may hurt more than help:
- Analysts often query staging tables for data that hasn't been modeled yet
- Raw tables contain columns and values that gold tables abstract away
- Some questions genuinely need raw data (debugging, freshness checks, data quality)
- Restricting to gold reduces the table count the agent sees (less noise) but also reduces its ability to answer questions that don't fit the gold model
The benchmark dataset (from cto-research) uses schemas from open-source dbt packages. These have their own layer conventions. The Fivetran analytics warehouse has its own. Both need curation.
### This should be done together with the ablation work
The "right" scope is an empirical question, not a design decision. Do not make this a separate upfront decision that later gets "validated." Run it together with the context/schema-tool ablation work:
- layer scope (gold-only vs gold+silver vs all)
- schema tool choice (profiled catalog vs filtered catalog vs live `INFORMATION_SCHEMA` vs no tool)
- field filtering within a tool (names only vs +types vs +descriptions vs +profile stats vs +sample values)
Then produce the allowlist based on measured quality and latency, not intuition.
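The three axes above form a full factorial grid. A sketch of enumerating it, assuming the axis values mirror the bullets (the labels and the `run_eval` hook are placeholders, not real APIs):

```python
from itertools import product

# Axis values taken from the ablation bullets; labels are illustrative.
LAYER_SCOPES = ["gold", "gold+silver", "all"]
SCHEMA_TOOLS = ["profiled_catalog", "filtered_catalog",
                "live_information_schema", "no_tool"]
FIELD_FILTERS = ["names", "+types", "+descriptions",
                 "+profile_stats", "+sample_values"]

def ablation_grid():
    """Yield every (layer_scope, schema_tool, field_filter) configuration."""
    yield from product(LAYER_SCOPES, SCHEMA_TOOLS, FIELD_FILTERS)

# 3 * 4 * 5 = 60 configurations; each would be passed to the eval runner
# (task 3) to measure quality and latency, e.g.:
#   for config in ablation_grid():
#       run_eval(config)  # hypothetical eval-runner entry point
```

Enumerating the grid up front also makes it easy to prune: if 60 runs are too expensive, hold one axis at its default while sweeping the other two.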
## Deliverables
- Inventory the analytics warehouse schemas/tables with layer tags
- Run layer-scope + schema-tool experiments using the eval system
- Produce a curated allowlist with documented reasoning
- Apply the allowlist to both the eval benchmark (filter cases) and the catalog context/tool defaults (filter tables and context fields)
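The last deliverable means one allowlist feeds two consumers. A minimal sketch of both filters, assuming hypothetical shapes for catalog entries and eval cases (the real eval schema may differ):

```python
# Hypothetical allowlist; the real one comes out of the experiments above.
ALLOWLIST = {"fct_orders", "dim_customers", "stg_payments"}

def filter_catalog(tables: list[dict]) -> list[dict]:
    """Keep only catalog entries whose table name is on the allowlist."""
    return [t for t in tables if t["name"] in ALLOWLIST]

def filter_eval_cases(cases: list[dict]) -> list[dict]:
    """Keep only eval cases whose referenced tables are all in scope.

    Cases touching any out-of-scope table are dropped, since the agent
    could not answer them with the filtered catalog anyway.
    """
    return [c for c in cases if set(c["tables"]) <= ALLOWLIST]
```

Driving both filters from the same allowlist keeps the benchmark and the production catalog context from drifting apart.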
## Dependencies
- Depends on having a working analytics warehouse inspection path (the consolidated analytics repo + BigQuery bootstrap work), so we can inspect the real internal table landscape
- Depends on the eval runner (task 3) to measure impact of different scopes
- Results feed back into the ablation experiments task
## Possible Solutions
## Plan
## Implementation Progress
## QA Exploration
- QA exploration completed (or N/A for non-UI tasks)
N/A — this is analysis and curation work, not UI.
## Review Feedback
- Review cleared