Add eval loop for dashboard search and variable-scoped navigation¶
Problem¶
We are about to make dashboard search more powerful by returning variable-scoped deep links, but we have no disciplined way to measure whether those changes actually improve outcomes.
Basic retrieval metrics are not enough here. A search result can be "correct" at the dashboard title level and still fail the real task if:
- it picks a dashboard that cannot express the requested scope through its variables
- it resolves the wrong variable values
- it silently drops a requested filter
- it generates a broken or lossy deep link
- it fails to explain partial or unsupported bindings
For this feature area, the real user task is:
"Given a natural-language request, can the system send me to the right dashboard already scoped to the right state?"
That means we need an eval loop that measures the full navigation handoff:
- dashboard retrieval quality
- variable-binding accuracy
- deep-link generation correctness
- partial-match behavior
- regression trends over time
Without this, improvements to search_dashboards, ranking, binding inference, or URL serialization will be judged anecdotally and are likely to regress quietly.
Context¶
Relevant existing work:
- master_plans/workstreams/mcp-analyst-agent/tasks/expand-dashboard-search-to-return-variable-scoped-deep-links.md — feature task this eval should guard
- master_plans/workstreams/mcp-analyst-agent/tasks/add-catalog-discovery-evals-derived-from-sql-benchmark.md — existing retrieval-eval pattern
- master_plans/workstreams/mcp-analyst-agent/tasks/task-m2-agent-eval-loop-v1.md — broader end-to-end agent eval framing
- apps/evals/ — unified home for eval infrastructure and leaderboard dashboards
Relationship to the feature task:
- the deep-link task defines the product behavior we want
- this task exists to make that behavior measurable before and after ranking, binding, or URL-contract changes
Why the catalog retrieval eval is not sufficient:
- catalog eval asks "did we retrieve the right tables?"
- this task asks "did we retrieve the right dashboard and encode the right dashboard state?"
Those are related but distinct problems. Dashboard search includes a retrieval layer plus a parameter-resolution layer.
Likely product boundary under test:
- dataface/ai/mcp/search.py — dashboard search and ranking
- dataface/ai/mcp/tools.py — tool-level result shape
- dashboard variable metadata extraction / indexing
- URL or route-state serialization for variable-scoped navigation
- receiving dashboard route behavior in Cloud
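The URL or route-state serialization boundary above is what the deep-link validity metric will exercise. A minimal sketch of a lossless round trip, assuming a hypothetical `var.`-prefixed query-parameter convention (the real contract lives in the route-state serialization code, not here):

```python
from urllib.parse import urlencode, urlsplit, parse_qsl

# Hypothetical convention: each dashboard variable binding becomes a
# "var.<name>=<value>" query parameter. The eval can then assert that
# encode -> decode is lossless for every case.
VAR_PREFIX = "var."

def encode_deep_link(base_url: str, bindings: dict[str, str]) -> str:
    query = urlencode({VAR_PREFIX + k: v for k, v in sorted(bindings.items())})
    return f"{base_url}?{query}" if query else base_url

def decode_bindings(url: str) -> dict[str, str]:
    pairs = parse_qsl(urlsplit(url).query)
    return {k[len(VAR_PREFIX):]: v for k, v in pairs if k.startswith(VAR_PREFIX)}
```

A round-trip check like `decode_bindings(encode_deep_link(base, b)) == b` is exactly the kind of deterministic assertion the scorer can run without a browser.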
Eval case shape we need. Each case should include:
- user request text
- expected dashboard slug or allowed dashboard set
- expected variable bindings
- optional allowed partial bindings
- expected failure mode when full mapping is impossible
Example:
```json
{
  "prompt": "Open the renewal dashboard for Acme for Q4 2025",
  "expected_dashboard": "renewal-risk",
  "expected_bindings": {
    "customer": "Acme",
    "quarter": "2025-Q4"
  },
  "allowed_partial": [],
  "notes": "Should not choose account-overview even though it mentions renewal"
}
```
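The case shape above maps naturally onto a small typed loader, which the runner and scorer can share. A sketch, assuming the field names stay as in the example (the final schema may add fields such as an allowed dashboard set):

```python
import json
from dataclasses import dataclass, field

# Mirrors the JSON example above; field names are assumptions until the
# schema is finalized in apps/evals/dashboard_search/types.py.
@dataclass
class EvalCase:
    prompt: str
    expected_dashboard: str
    expected_bindings: dict[str, str]
    allowed_partial: list[str] = field(default_factory=list)
    notes: str = ""

def load_cases(jsonl_text: str) -> list[EvalCase]:
    # One JSON object per line; blank lines are ignored.
    return [EvalCase(**json.loads(line)) for line in jsonl_text.splitlines() if line.strip()]
```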
Scoring needs to be multi-part:
- top-k dashboard retrieval
- exact dashboard hit
- variable binding exact match
- per-variable precision/recall
- deep-link validity / round-trip decode
- explanation quality for partial or unsupported cases
Constraints:
- no hidden magic: the eval should reward explicit, declared variable mappings only
- support deterministic scoring where possible
- do not require full browser QA for every run; the core loop should stay cheap enough for regular regression runs
- reserve slower browser validation for a small smoke subset if needed
Possible Solutions¶
Option A: Standalone dashboard-search eval suite under apps/evals/ [Recommended]¶
Create a dedicated eval package, likely apps/evals/dashboard_search/, with:
- curated JSONL dataset
- runner that calls dashboard search / deep-link resolution
- deterministic scorer for retrieval + binding correctness
- per-run summaries and leaderboard-ready outputs
Pros: Clean boundary, repeatable, cheap to run, directly comparable across search/ranking changes.
Cons: Requires creating a new eval corpus rather than reusing the SQL benchmark wholesale.
Option B: Fold into existing catalog retrieval eval¶
Extend catalog eval to also evaluate dashboard search and variable bindings.
Pros: Fewer top-level eval packages.
Cons: Wrong abstraction. Table retrieval and dashboard navigation have different ground truth, different scorers, and different failure modes.
Option C: Only use end-to-end agent evals¶
Rely on broad agent eval prompts and judge whether the final user experience feels right.
Pros: Closest to real usage.
Cons: Too slow and too noisy. Hard to pinpoint whether a regression came from retrieval, binding inference, link serialization, or dashboard rendering.
Plan¶
Recommended approach: add a dedicated dashboard-search eval loop under apps/evals/ with deterministic scoring for retrieval and variable resolution.
Dataset¶
Create a small but high-signal benchmark corpus for dashboard navigation requests:
- start with 25-50 hand-authored cases
- cover direct dashboard lookup, customer/entity scoping, time scoping, multi-variable requests, ambiguous prompts, and unsupported-filter cases
- include negative/edge cases where the correct behavior is a partial match or an explicit failure
Store as JSONL under something like:
- apps/evals/data/dashboard_search_cases.jsonl
Runner¶
Add a runner, likely:
- apps/evals/dashboard_search/runner.py
Responsibilities:
- load eval cases
- call the dashboard search / deep-link resolution surface under test
- capture returned dashboard candidates, bindings, reasons, and open_url
- write per-case JSONL output under apps/evals/output/dashboard_search/
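The runner's core loop can stay trivially small if the search surface is injected as a callable, which also keeps it testable without the real MCP stack. A sketch; the result keys (`candidates`, `bindings`, `open_url`) are assumptions about the tool-level result shape, not the confirmed contract:

```python
import json
from typing import Callable

def run_cases(cases: list[dict], search: Callable[[str], dict]) -> list[str]:
    # `search` stands in for the dashboard search / deep-link resolution
    # surface in dataface/ai/mcp/search.py; injecting it keeps the loop
    # runnable against fakes in tests.
    lines = []
    for case in cases:
        result = search(case["prompt"])
        record = {
            "prompt": case["prompt"],
            "expected_dashboard": case["expected_dashboard"],
            "candidates": result.get("candidates", []),
            "bindings": result.get("bindings", {}),
            "open_url": result.get("open_url"),
        }
        lines.append(json.dumps(record))
    # Caller writes these lines to apps/evals/output/dashboard_search/.
    return lines
```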
Scorer¶
Add deterministic scorer logic, likely:
- apps/evals/dashboard_search/scorer.py
Suggested metrics:
- dashboard_hit_at_k
- dashboard_mrr
- exact_dashboard_match_rate
- binding_exact_match_rate
- binding_field_precision
- binding_field_recall
- deep_link_valid_rate
- partial_match_explanation_rate
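Most of these metrics are a few lines each once the runner output is in hand. A sketch of the retrieval and binding metrics, assuming candidates are an ordered list of dashboard slugs and bindings are flat string-to-string dicts:

```python
def hit_at_k(candidates: list[str], expected: str, k: int = 5) -> bool:
    # dashboard_hit_at_k: expected slug appears in the top k candidates.
    return expected in candidates[:k]

def reciprocal_rank(candidates: list[str], expected: str) -> float:
    # Per-case contribution to dashboard_mrr (1-indexed rank).
    return 1.0 / (candidates.index(expected) + 1) if expected in candidates else 0.0

def binding_precision_recall(predicted: dict[str, str], expected: dict[str, str]) -> tuple[float, float]:
    # binding_field_precision / binding_field_recall over individual
    # variable assignments; empty sides score 1.0 by convention.
    correct = sum(1 for k, v in predicted.items() if expected.get(k) == v)
    precision = correct / len(predicted) if predicted else 1.0
    recall = correct / len(expected) if expected else 1.0
    return precision, recall
```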
For unsupported or impossible cases:
- score success when the system explicitly reports the unresolved constraint instead of pretending it succeeded
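That rule can be made deterministic too. A sketch, assuming the result carries a `reasons` list and omits `open_url` when it declines to navigate (both assumptions about the result shape):

```python
def score_unsupported(result: dict, unresolved: str) -> bool:
    # Success only if the system declined to emit a deep link AND named
    # the constraint it could not bind, instead of silently dropping it.
    reasons = " ".join(result.get("reasons", []))
    return result.get("open_url") is None and unresolved in reasons
```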
Shared reporting¶
Emit leaderboard-ready summaries:
- overall metrics
- breakdown by case type: entity filter, time filter, multi-variable, ambiguity, unsupported
- trendable run metadata so search/ranking changes can be compared over time
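The per-case-type breakdown falls out of a simple grouping pass over the per-case rows. A sketch, assuming each row carries a `case_type` tag and a boolean `hit` (field names are placeholders for whatever the runner emits):

```python
from collections import defaultdict

def summarize_by_case_type(rows: list[dict]) -> dict[str, float]:
    # Groups per-case results by their type tag and reports a hit rate
    # per group, ready for a leaderboard table.
    buckets: dict[str, list[bool]] = defaultdict(list)
    for row in rows:
        buckets[row["case_type"]].append(row["hit"])
    return {case_type: sum(hits) / len(hits) for case_type, hits in buckets.items()}
```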
Optional browser smoke slice¶
For a tiny subset of cases, add an optional smoke check that:
- opens the returned URL
- verifies the dashboard route loads
- verifies applied variables are visible in the UI
This should be a small follow-on or optional mode, not required for the main cheap loop.
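The check logic itself can be written against a minimal page interface so it stays testable without a real browser; only the smoke mode would plug in an actual driver (e.g. a Playwright page behind a thin wrapper). A sketch with an assumed `goto()`/`text()` interface:

```python
class SmokeFailure(Exception):
    pass

def smoke_check(page, url: str, expected_bindings: dict[str, str]) -> None:
    # `page` is assumed to expose goto() and text(); in the real smoke
    # mode this would wrap a browser driver, here it is just duck-typed.
    page.goto(url)
    body = page.text()
    for name, value in expected_bindings.items():
        if value not in body:
            raise SmokeFailure(f"variable {name}={value!r} not visible on page")
```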
Files likely involved¶
- apps/evals/dashboard_search/runner.py
- apps/evals/dashboard_search/scorer.py
- apps/evals/dashboard_search/types.py
- apps/evals/data/dashboard_search_cases.jsonl
- apps/evals/output/dashboard_search/ (gitignored output)
- tests/evals/dashboard_search/
- unified CLI wiring in apps/evals/__main__.py
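The unified CLI wiring could hang a subcommand off the existing entrypoint. A hypothetical sketch; the subcommand and flag names are assumptions, not the existing `apps/evals/__main__.py` interface:

```python
import argparse

def build_parser() -> argparse.ArgumentParser:
    # Sketch of a "dashboard-search" subcommand alongside other eval suites.
    parser = argparse.ArgumentParser(prog="evals")
    sub = parser.add_subparsers(dest="suite", required=True)
    ds = sub.add_parser("dashboard-search", help="run the dashboard-search eval loop")
    ds.add_argument("--cases", default="apps/evals/data/dashboard_search_cases.jsonl")
    ds.add_argument("--top-k", type=int, default=5)
    ds.add_argument("--smoke", action="store_true", help="also run the browser smoke slice")
    return parser
```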
Implementation steps¶
- Define the eval case schema and write the initial benchmark corpus.
- Build a runner around the current dashboard search surface.
- Build deterministic scoring for dashboard retrieval, variable binding, and deep-link validity.
- Add CLI entrypoint and output summaries.
- Add focused tests for scorer edge cases and case parsing.
- Optionally add a small browser-smoke mode once the receiving dashboard route behavior stabilizes.
Relationship to the feature task¶
This task is the regression harness for:
- expand-dashboard-search-to-return-variable-scoped-deep-links.md
The feature task changes behavior. This eval task makes that behavior measurable and keeps it from drifting.
Implementation Progress¶
Not started.
QA Exploration¶
- QA exploration completed (or N/A for non-UI tasks)
N/A for primary implementation. Main validation should be deterministic runner/scorer tests plus manual spot-checks of a few representative cases. If a browser smoke slice is added later, validate only a small curated subset.
Review Feedback¶
- Review cleared