
Add persistent analyst memories and learned context

Problem

Design and implement a memories file that accumulates knowledge from analyst queries — table quirks, column semantics, common join patterns, gotchas, and domain-specific conventions. Currently the Dataface AI system prompt is fully static (schema context + skills + YAML cheatsheet); there is no mechanism to learn from past interactions. Evaluate different memory strategies (append-only log, structured knowledge base, embedding-indexed retrieval) and measure their impact on SQL quality using the eval system. The goal is that the agent gets better at a specific warehouse over time, not just within a single conversation.

Context

Current state: fully static system prompt

The Dataface AI system prompt is assembled in dataface/ai/agent.py:build_agent_system_prompt() from:

  • Workflow skill (prompts/workflow.md)
  • Dashboard design skill (prompts/dashboard_design.md)
  • Dashboard review skill
  • YAML cheatsheet
  • Schema context from get_schema_context()
  • Tool guidance
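To make the injection point concrete, here is a hedged sketch of that assembly with a memories section appended. The placeholder section strings and the memories_text parameter are assumptions for illustration; the real assembly lives in dataface/ai/agent.py:build_agent_system_prompt().

```python
# Sketch only: placeholder strings stand in for the real skill files.
def build_agent_system_prompt(schema_context: str, memories_text: str = "") -> str:
    sections = [
        "<workflow skill>",           # prompts/workflow.md
        "<dashboard design skill>",   # prompts/dashboard_design.md
        "<dashboard review skill>",
        "<yaml cheatsheet>",
        schema_context,               # from get_schema_context()
        "<tool guidance>",
    ]
    if memories_text:
        # Proposed addition: learned context injected alongside schema context.
        sections.append("## Warehouse memories\n" + memories_text)
    return "\n\n".join(sections)

prompt = build_agent_system_prompt("<schema>", "- revenue means ARR, not MRR")
```

The memories section is just one more string in the concatenation, which is what makes the starting approach (dump the file verbatim) cheap to try.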

There is zero per-project or per-warehouse learned context. Every conversation starts from scratch. The agent has no memory of:

  • Table quirks ("the created_at column in stg_salesforce__accounts is actually modified_at")
  • Column semantics ("revenue means ARR, not MRR, in this warehouse")
  • Common join patterns ("always join opportunities through accounts, not directly to contacts")
  • Gotchas ("the is_deleted soft-delete flag must be filtered in staging tables")
  • Domain conventions ("fiscal year starts in February")

Why this matters

An analyst who uses the same warehouse daily builds up this knowledge implicitly. An AI agent should too. The memories file is the mechanism — a persistent, growing knowledge base that gets injected into the system prompt alongside schema context.

Research: how production systems handle agent memory

The industry has converged on a multi-tier memory architecture modeled on human cognition:

Tier 1: Working memory (context window) — the current conversation. Already exists.

Tier 2: Semantic/procedural memory (knowledge base) — generalized facts, rules, and learned skills. Persists across sessions. This is what we need.

Tier 3: Episodic memory (event logs) — timestamped records of specific interactions. Useful for "what did we do last time?" questions. Optional for now.

Key systems to study before implementing:

  • Mem0 (mem0.ai) — pluggable memory layer that integrates with any agent. Extracts structured facts from conversations, stores them as embeddings plus an optional knowledge graph. 48K GitHub stars, $24M Series A. Key insight: extract and structure facts rather than storing raw conversations; newer information is timestamp-weighted ahead of older duplicates.

  • Letta/MemGPT — full agent runtime with three-tier memory: Core (in-context RAM), Recall (conversation history), Archival (cold storage). Agents actively self-edit their own memory blocks. 83% on LoCoMo benchmark. Key insight: the agent should manage its own memory, not have memory passively extracted.

  • Cursor's approach — uses .cursor/rules/ directory with structured .mdc files organized by domain. Project-level, not conversation-level. This is closest to what we'd want: a per-project knowledge base that's part of the repo.

  • Claude-mem — auto-captures tool usage, generates semantic summaries. 95% token savings across sessions. Key insight: selective forgetting matters — don't try to retain everything.

What the research says matters most:

  1. Extract structured facts, not raw transcripts
  2. Selective retrieval — inject only relevant memories, not everything
  3. Timestamp and freshness — newer knowledge overrides older
  4. The agent should be able to write/update its own memories
  5. Human memory works by selective forgetting — production systems should too

Strategy recommendation: structured YAML, graduating to retrieval

Start with the structured-knowledge-base strategy: a memories.yml file in the dft project (alongside dataface.yml), with typed, tagged entries:

memories:
  - type: column_semantic
    table: bi_core.accounts
    column: revenue
    note: "This is ARR, not MRR"
    created: 2026-03-17

  - type: join_pattern
    tables: [bi_core.opportunities, bi_core.accounts]
    note: "Always join opportunities through accounts, not directly to contacts"
    created: 2026-03-17

  - type: gotcha
    table: staging.stg_salesforce__accounts
    note: "created_at is actually modified_at — use _fivetran_synced for true creation"
    created: 2026-03-17

  - type: domain
    note: "Fiscal year starts in February"
    created: 2026-03-17

Why YAML over markdown: it is barely harder to author, but the typed entries give us structure for filtering. As a starting approach the file can still be dumped verbatim into the prompt, and the structure makes it trivial to auto-generate memories from inspector output.
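To illustrate the filtering that typed entries enable, here is a sketch with the YAML entries above mirrored as Python dicts (in practice they would be loaded from memories.yml with a YAML parser). The relevant_memories helper is a hypothetical name, not existing code.

```python
# Entries mirror the memories.yml example above.
memories = [
    {"type": "column_semantic", "table": "bi_core.accounts",
     "column": "revenue", "note": "This is ARR, not MRR"},
    {"type": "join_pattern",
     "tables": ["bi_core.opportunities", "bi_core.accounts"],
     "note": "Always join opportunities through accounts"},
    {"type": "gotcha", "table": "staging.stg_salesforce__accounts",
     "note": "created_at is actually modified_at"},
    {"type": "domain", "note": "Fiscal year starts in February"},
]

def relevant_memories(entries, query_tables):
    """Keep entries scoped to the query's tables, plus unscoped
    domain-wide entries (those with no table/tables key)."""
    out = []
    for m in entries:
        scope = m.get("tables") or ([m["table"]] if "table" in m else [])
        if not scope or any(t in query_tables for t in scope):
            out.append(m)
    return out

hits = relevant_memories(memories, {"bi_core.accounts"})
```

A question touching only bi_core.accounts keeps the column semantic, the join pattern, and the domain convention, while the staging gotcha is filtered out — exactly the kind of selection a flat markdown file cannot support.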

Graduate to retrieval when the file gets large (>100 entries or >4K tokens). At that point, embed the entries and retrieve the top-k most relevant ones per question, using Mem0 or a simple local embedding index. But don't build this until eval experiments have proven that memories help at all.
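The graduation step can be sketched as follows. A bag-of-words cosine similarity stands in for real embeddings (Mem0 or a local embedding index) so the example stays dependency-free; only the top_k_memories interface is the proposal here.

```python
import math
import re
from collections import Counter

def _toks(text: str) -> Counter:
    # Crude tokenizer standing in for an embedding model.
    return Counter(re.findall(r"[a-z_]+", text.lower()))

def _cosine(a: Counter, b: Counter) -> float:
    dot = sum(a[t] * b[t] for t in a)
    na = math.sqrt(sum(v * v for v in a.values()))
    nb = math.sqrt(sum(v * v for v in b.values()))
    return dot / (na * nb) if na and nb else 0.0

def top_k_memories(question: str, notes: list[str], k: int = 3) -> list[str]:
    q = _toks(question)
    return sorted(notes, key=lambda n: _cosine(q, _toks(n)), reverse=True)[:k]

notes = [
    "revenue means ARR, not MRR",
    "Fiscal year starts in February",
    "created_at is actually modified_at in stg_salesforce__accounts",
]
hits = top_k_memories("what is total revenue by fiscal year?", notes, k=2)
```

Swapping _toks/_cosine for an embedding model changes the scoring, not the interface, which is what makes the graduation incremental.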

How memories get created

  • Manually by analysts ("remember that fiscal year starts in February" → agent writes to memories.yml)
  • Automatically from self-corrections (agent corrects a query → captures the correction as a memory)
  • From inspector/profile output (semantic types, distributions that are non-obvious)
  • The agent should have a tool to read/write memories — a remember MCP tool
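A hypothetical sketch of the remember tool's write path: append one typed entry to memories.yml. Serialization is hand-rolled here to stay dependency-free; a real implementation would use a YAML library and sit behind the MCP tool interface. All names are assumptions.

```python
import os
import tempfile
from datetime import date
from pathlib import Path

def remember(note: str, mem_type: str = "domain",
             path: str = "memories.yml", **fields) -> str:
    """Append one memory entry; create the file with its root key if missing."""
    lines = [f"  - type: {mem_type}"]
    for key, value in fields.items():   # optional scoping, e.g. table=..., column=...
        lines.append(f"    {key}: {value}")
    lines.append(f'    note: "{note}"')
    lines.append(f"    created: {date.today().isoformat()}")
    entry = "\n".join(lines) + "\n"

    p = Path(path)
    if not p.exists():
        p.write_text("memories:\n")
    with p.open("a") as f:
        f.write(entry)
    return entry

# Usage against a throwaway file:
tmp = os.path.join(tempfile.mkdtemp(), "memories.yml")
entry = remember("Fiscal year starts in February", path=tmp)
content = Path(tmp).read_text()
```

Append-only writes keep the tool simple; dedup and freshness-weighting (the Mem0 insight above) can be layered on later without changing the tool's signature.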

Dependencies

  • Depends on the eval system being operational (runner + benchmark) to measure impact of different memory strategies
  • The memory injection point is build_agent_system_prompt() in dataface/ai/agent.py

Possible Solutions

Plan

Implementation Progress

QA Exploration

  • QA exploration completed (or N/A for non-UI tasks)

N/A — this is core AI infrastructure, not UI. Validation is via eval experiments measuring memory impact.

Review Feedback

  • Review cleared