Add eval loop for dashboard search and variable-scoped navigation¶
Problem¶
We are about to make dashboard search more powerful by returning variable-scoped deep links, but we have no disciplined way to measure whether those changes actually improve outcomes.
Basic retrieval metrics are not enough here. A search result can be "correct" at the dashboard title level and still fail the real task if:
- it picks a dashboard that cannot express the requested scope through its variables
- it resolves the wrong variable values
- it silently drops a requested filter
- it generates a broken or lossy deep link
- it fails to explain partial or unsupported bindings
For this feature area, the real user task is:
"Given a natural-language request, can the system send me to the right dashboard already scoped to the right state?"
That means we need an eval loop that measures the full navigation handoff:
- dashboard retrieval quality
- variable-binding accuracy
- deep-link generation correctness
- partial-match behavior
- regression trends over time
Without this, improvements to search_dashboards, ranking, binding inference, or URL serialization will be judged anecdotally and are likely to regress quietly.
Context¶
Relevant existing work:
- master_plans/workstreams/mcp-analyst-agent/tasks/expand-dashboard-search-to-return-variable-scoped-deep-links.md — feature task this eval should guard
- master_plans/workstreams/mcp-analyst-agent/tasks/add-catalog-discovery-evals-derived-from-sql-benchmark.md — existing retrieval-eval pattern
- master_plans/workstreams/mcp-analyst-agent/tasks/task-m2-agent-eval-loop-v1.md — broader end-to-end agent eval framing
- apps/evals/ — unified home for eval infrastructure and leaderboard dashboards
Relationship to the feature task:
- the deep-link task defines the product behavior we want
- this task exists to make that behavior measurable before and after ranking, binding, or URL-contract changes
Why the catalog retrieval eval is not sufficient:
- catalog eval asks "did we retrieve the right tables?"
- this task asks "did we retrieve the right dashboard and encode the right dashboard state?"
Those are related but distinct problems. Dashboard search includes a retrieval layer plus a parameter-resolution layer.
Likely product boundary under test:
- dataface/ai/mcp/search.py — dashboard search and ranking
- dataface/ai/mcp/tools.py — tool-level result shape
- dashboard variable metadata extraction / indexing
- URL or route-state serialization for variable-scoped navigation
- receiving dashboard route behavior in Cloud
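The URL or route-state serialization boundary above is what the deep-link validity metric will exercise. A minimal sketch of a lossless round trip, assuming a hypothetical `var.`-prefixed query-parameter convention (the real contract lives in the route-state serialization code, not here):

```python
from urllib.parse import urlencode, urlsplit, parse_qsl

# Hypothetical convention: each dashboard variable binding becomes a
# "var.<name>=<value>" query parameter. The eval can then assert that
# encode -> decode is lossless for every case.
VAR_PREFIX = "var."

def encode_deep_link(base_url: str, bindings: dict[str, str]) -> str:
    query = urlencode({VAR_PREFIX + k: v for k, v in sorted(bindings.items())})
    return f"{base_url}?{query}" if query else base_url

def decode_bindings(url: str) -> dict[str, str]:
    pairs = parse_qsl(urlsplit(url).query)
    return {k[len(VAR_PREFIX):]: v for k, v in pairs if k.startswith(VAR_PREFIX)}
```

A round-trip check like `decode_bindings(encode_deep_link(base, b)) == b` is exactly the kind of deterministic assertion the scorer can run without a browser.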
Eval case shape we need. Each case should include:
- user request text
- expected dashboard slug or allowed dashboard set
- expected variable bindings
- optional allowed partial bindings
- expected failure mode when full mapping is impossible
Example:
```json
{
  "prompt": "Open the renewal dashboard for Acme for Q4 2025",
  "expected_dashboard": "renewal-risk",
  "expected_bindings": {
    "customer": "Acme",
    "quarter": "2025-Q4"
  },
  "allowed_partial": [],
  "notes": "Should not choose account-overview even though it mentions renewal"
}
```
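The case shape above maps naturally onto a small typed loader, which the runner and scorer can share. A sketch, assuming the field names stay as in the example (the final schema may add fields such as an allowed dashboard set):

```python
import json
from dataclasses import dataclass, field

# Mirrors the JSON example above; field names are assumptions until the
# schema is finalized in apps/evals/dashboard_search/types.py.
@dataclass
class EvalCase:
    prompt: str
    expected_dashboard: str
    expected_bindings: dict[str, str]
    allowed_partial: list[str] = field(default_factory=list)
    notes: str = ""

def load_cases(jsonl_text: str) -> list[EvalCase]:
    # One JSON object per line; blank lines are ignored.
    return [EvalCase(**json.loads(line)) for line in jsonl_text.splitlines() if line.strip()]
```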
Scoring needs to be multi-part:
- top-k dashboard retrieval
- exact dashboard hit
- variable binding exact match
- per-variable precision/recall
- deep-link validity / round-trip decode
- explanation quality for partial or unsupported cases
Constraints:
- no hidden magic: the eval should reward explicit, declared variable mappings only
- support deterministic scoring where possible
- do not require full browser QA for every run; the core loop should stay cheap enough for regular regression runs
- reserve slower browser validation for a small smoke subset if needed
Possible Solutions¶
Option A: Standalone dashboard-search eval suite under apps/evals/ [Recommended]¶
Create a dedicated eval package, likely apps/evals/dashboard_search/, with:
- curated JSONL dataset
- runner that calls dashboard search / deep-link resolution
- deterministic scorer for retrieval + binding correctness
- per-run summaries and leaderboard-ready outputs
Pros: Clean boundary, repeatable, cheap to run, directly comparable across search/ranking changes.
Cons: Requires creating a new eval corpus rather than reusing the SQL benchmark wholesale.
Option B: Fold into existing catalog retrieval eval¶
Extend catalog eval to also evaluate dashboard search and variable bindings.
Pros: Fewer top-level eval packages.
Cons: Wrong abstraction. Table retrieval and dashboard navigation have different ground truth, different scorers, and different failure modes.
Option C: Only use end-to-end agent evals¶
Rely on broad agent eval prompts and judge whether the final user experience feels right.
Pros: Closest to real usage.
Cons: Too slow and too noisy. Hard to pinpoint whether a regression came from retrieval, binding inference, link serialization, or dashboard rendering.
Plan¶
Recommended approach: add a dedicated dashboard-search eval loop under apps/evals/ with deterministic scoring for retrieval and variable resolution.
Dataset¶
Create a small but high-signal benchmark corpus for dashboard navigation requests:
- start with 25-50 hand-authored cases
- cover direct dashboard lookup, customer/entity scoping, time scoping, multi-variable requests, ambiguous prompts, and unsupported-filter cases
- include negative/edge cases where the correct behavior is a partial match or an explicit failure
Store as JSONL under something like:
- apps/evals/data/dashboard_search_cases.jsonl
Runner¶
Add a runner, likely:
- apps/evals/dashboard_search/runner.py
Responsibilities:
- load eval cases
- call the dashboard search / deep-link resolution surface under test
- capture returned dashboard candidates, bindings, reasons, and open_url
- write per-case JSONL output under apps/evals/output/dashboard_search/
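The runner's core loop can stay trivially small if the search surface is injected as a callable, which also keeps it testable without the real MCP stack. A sketch; the result keys (`candidates`, `bindings`, `open_url`) are assumptions about the tool-level result shape, not the confirmed contract:

```python
import json
from typing import Callable

def run_cases(cases: list[dict], search: Callable[[str], dict]) -> list[str]:
    # `search` stands in for the dashboard search / deep-link resolution
    # surface in dataface/ai/mcp/search.py; injecting it keeps the loop
    # runnable against fakes in tests.
    lines = []
    for case in cases:
        result = search(case["prompt"])
        record = {
            "prompt": case["prompt"],
            "expected_dashboard": case["expected_dashboard"],
            "candidates": result.get("candidates", []),
            "bindings": result.get("bindings", {}),
            "open_url": result.get("open_url"),
        }
        lines.append(json.dumps(record))
    # Caller writes these lines to apps/evals/output/dashboard_search/.
    return lines
```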
Scorer¶
Add deterministic scorer logic, likely:
- apps/evals/dashboard_search/scorer.py
Suggested metrics:
- dashboard_hit_at_k
- dashboard_mrr
- exact_dashboard_match_rate
- binding_exact_match_rate
- binding_field_precision
- binding_field_recall
- deep_link_valid_rate
- partial_match_explanation_rate
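Most of these metrics are a few lines each once the runner output is in hand. A sketch of the retrieval and binding metrics, assuming candidates are an ordered list of dashboard slugs and bindings are flat string-to-string dicts:

```python
def hit_at_k(candidates: list[str], expected: str, k: int = 5) -> bool:
    # dashboard_hit_at_k: expected slug appears in the top k candidates.
    return expected in candidates[:k]

def reciprocal_rank(candidates: list[str], expected: str) -> float:
    # Per-case contribution to dashboard_mrr (1-indexed rank).
    return 1.0 / (candidates.index(expected) + 1) if expected in candidates else 0.0

def binding_precision_recall(predicted: dict[str, str], expected: dict[str, str]) -> tuple[float, float]:
    # binding_field_precision / binding_field_recall over individual
    # variable assignments; empty sides score 1.0 by convention.
    correct = sum(1 for k, v in predicted.items() if expected.get(k) == v)
    precision = correct / len(predicted) if predicted else 1.0
    recall = correct / len(expected) if expected else 1.0
    return precision, recall
```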
For unsupported or impossible cases:
- score success when the system explicitly reports the unresolved constraint instead of pretending it succeeded
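That rule can be made deterministic too. A sketch, assuming the result carries a `reasons` list and omits `open_url` when it declines to navigate (both assumptions about the result shape):

```python
def score_unsupported(result: dict, unresolved: str) -> bool:
    # Success only if the system declined to emit a deep link AND named
    # the constraint it could not bind, instead of silently dropping it.
    reasons = " ".join(result.get("reasons", []))
    return result.get("open_url") is None and unresolved in reasons
```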
Shared reporting¶
Emit leaderboard-ready summaries:
- overall metrics
- breakdown by case type: entity filter, time filter, multi-variable, ambiguity, unsupported
- trendable run metadata so search/ranking changes can be compared over time
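The per-case-type breakdown falls out of a simple grouping pass over the per-case rows. A sketch, assuming each row carries a `case_type` tag and a boolean `hit` (field names are placeholders for whatever the runner emits):

```python
from collections import defaultdict

def summarize_by_case_type(rows: list[dict]) -> dict[str, float]:
    # Groups per-case results by their type tag and reports a hit rate
    # per group, ready for a leaderboard table.
    buckets: dict[str, list[bool]] = defaultdict(list)
    for row in rows:
        buckets[row["case_type"]].append(row["hit"])
    return {case_type: sum(hits) / len(hits) for case_type, hits in buckets.items()}
```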
Optional browser smoke slice¶
For a tiny subset of cases, add an optional smoke check that:
- opens the returned URL
- verifies the dashboard route loads
- verifies applied variables are visible in the UI
This should be a small follow-on or optional mode, not required for the main cheap loop.
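The check logic itself can be written against a minimal page interface so it stays testable without a real browser; only the smoke mode would plug in an actual driver (e.g. a Playwright page behind a thin wrapper). A sketch with an assumed `goto()`/`text()` interface:

```python
class SmokeFailure(Exception):
    pass

def smoke_check(page, url: str, expected_bindings: dict[str, str]) -> None:
    # `page` is assumed to expose goto() and text(); in the real smoke
    # mode this would wrap a browser driver, here it is just duck-typed.
    page.goto(url)
    body = page.text()
    for name, value in expected_bindings.items():
        if value not in body:
            raise SmokeFailure(f"variable {name}={value!r} not visible on page")
```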
Files likely involved¶
- apps/evals/dashboard_search/runner.py
- apps/evals/dashboard_search/scorer.py
- apps/evals/dashboard_search/types.py
- apps/evals/data/dashboard_search_cases.jsonl
- apps/evals/output/dashboard_search/ (gitignored output)
- tests/evals/dashboard_search/
- unified CLI wiring in apps/evals/__main__.py
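The unified CLI wiring could hang a subcommand off the existing entrypoint. A hypothetical sketch; the subcommand and flag names are assumptions, not the existing `apps/evals/__main__.py` interface:

```python
import argparse

def build_parser() -> argparse.ArgumentParser:
    # Sketch of a "dashboard-search" subcommand alongside other eval suites.
    parser = argparse.ArgumentParser(prog="evals")
    sub = parser.add_subparsers(dest="suite", required=True)
    ds = sub.add_parser("dashboard-search", help="run the dashboard-search eval loop")
    ds.add_argument("--cases", default="apps/evals/data/dashboard_search_cases.jsonl")
    ds.add_argument("--top-k", type=int, default=5)
    ds.add_argument("--smoke", action="store_true", help="also run the browser smoke slice")
    return parser
```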
Implementation steps¶
- Define the eval case schema and write the initial benchmark corpus.
- Build a runner around the current dashboard search surface.
- Build deterministic scoring for dashboard retrieval, variable binding, and deep-link validity.
- Add CLI entrypoint and output summaries.
- Add focused tests for scorer edge cases and case parsing.
- Optionally add a small browser-smoke mode once the receiving dashboard route behavior stabilizes.
Relationship to the feature task¶
This task is the regression harness for:
- expand-dashboard-search-to-return-variable-scoped-deep-links.md
The feature task changes behavior. This eval task makes that behavior measurable and keeps it from drifting.
Implementation Progress¶
Not started.
QA Exploration¶
- QA exploration completed (or N/A for non-UI tasks)
N/A for primary implementation. Main validation should be deterministic runner/scorer tests plus manual spot-checks of a few representative cases. If a browser smoke slice is added later, validate only a small curated subset.
Review Feedback¶
- Review cleared