Declarative schema definition outside Python code

Problem

The Dataface YAML spec is implicitly defined by Python Pydantic models spread across types.py, compiled_types.py, and query_types.py. There is no single, language-agnostic document that says "this is what valid Dataface YAML looks like." This creates multiple problems:

  1. No versioning anchor: Without a formal schema, there's no way to diff schema versions or write automated migrations (see task-m2-yaml-version-migrations).
  2. Hand-maintained AI prompts: schema.py contains hand-written markdown summaries that drift from the actual types.
  3. No editor integration: VS Code/Cursor can't provide YAML autocompletion or validation without a JSON Schema or equivalent.
  4. No extensibility contract: Users can't register custom chart types because there's no schema to extend.
  5. No external tooling: Non-Python tools (IDE extensions, CI validators, documentation generators) can't consume the spec definition.

json-render solved this with defineCatalog — a single typed definition that powers validation, AI prompts, editor tooling, and the renderer simultaneously.

Context

Possible Solutions

A. JSON Schema generated from Pydantic models

Use Pydantic's built-in model_json_schema() to export JSON Schema, then treat the exported schema as the canonical artifact. Enrich with json_schema_extra for AI descriptions and editor hints.

Pros: Minimal new code. Pydantic does the heavy lifting. JSON Schema is widely supported (VS Code YAML extension, ajv, etc.). Can start immediately.

Cons: JSON Schema is verbose and hard to read. The "source of truth" is still the Python code; the schema is a derived artifact. Some Dataface concepts (layout unions, chart type polymorphism) are awkward in JSON Schema.
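A minimal sketch of what option A could look like, assuming Pydantic v2. The Chart model below is a hypothetical stand-in, not the actual model from types.py, and the "examples" hint is only an illustration of what json_schema_extra can carry.

```python
import json
from typing import Literal, Optional

from pydantic import BaseModel, Field


class Chart(BaseModel):
    """A data visualization (hypothetical stand-in for the real Dataface model)."""

    type: Literal["bar", "line", "area"] = Field(
        description="Chart visualization type",
    )
    query: str = Field(
        description="Query name or inline query definition",
    )
    x: Optional[str] = Field(
        default=None,
        description="Field for x-axis encoding",
        # Extra keys are merged into the exported JSON Schema for this field.
        json_schema_extra={"examples": ["order_date"]},
    )


# Export the JSON Schema that editors and non-Python tools would consume.
schema = Chart.model_json_schema()
schema_text = json.dumps(schema, indent=2)
print(schema_text)
```

If the exported schema is written to a file, the Red Hat YAML extension for VS Code can associate a dashboard file with it through a first-line comment of the form `# yaml-language-server: $schema=<path to schema>`. Where that file lives in the Dataface repository is left open here.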

B. Dataface-native YAML schema as source of truth

Define a Dataface-native schema format in YAML that describes all element types, their fields, types, defaults, descriptions, and constraints. Generate Pydantic models and JSON Schema from this definition.

version: "1.0"
elements:
  chart:
    description: "A data visualization"
    fields:
      type:
        type: enum
        values: [bar, line, area, scatter, pie, donut, kpi, table, ...]
        description: "Chart visualization type"
      query:
        type: string | inline_query
        required: true
        description: "Query name or inline query definition"
      x:
        type: field_name
        description: "Field for x-axis encoding"
      y:
        type: field_name
        description: "Field for y-axis encoding"
      # ...

Pros: Human-readable. Easy to version/diff. Can include AI-specific metadata (descriptions, examples, common mistakes). Single source of truth. Generates both Python models and JSON Schema.

Cons: Requires building a schema-to-code generator. New abstraction to maintain. Generated code must be regenerated whenever the schema changes, though that is the point: the schema IS the source and the code is derived.
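To give a feel for the size of the generator, here is a rough sketch that turns a spec shaped like the YAML example above into JSON Schema. It works on an already-parsed dict so it needs only the standard library. The TYPE_MAP entries, the handling of `|` as a union, and the function names are all assumptions made for illustration, not an agreed design.

```python
from typing import Any

# Hypothetical spec, shaped like the YAML example after parsing (chart types truncated).
SPEC: dict[str, Any] = {
    "version": "1.0",
    "elements": {
        "chart": {
            "description": "A data visualization",
            "fields": {
                "type": {
                    "type": "enum",
                    "values": ["bar", "line", "area"],
                    "description": "Chart visualization type",
                },
                "query": {
                    "type": "string | inline_query",
                    "required": True,
                    "description": "Query name or inline query definition",
                },
                "x": {"type": "field_name", "description": "Field for x-axis encoding"},
            },
        }
    },
}

# Assumed mapping from spec type names to JSON Schema fragments.
TYPE_MAP: dict[str, dict[str, Any]] = {
    "string": {"type": "string"},
    "field_name": {"type": "string"},
    "inline_query": {"type": "object"},
}


def field_to_json_schema(field: dict[str, Any]) -> dict[str, Any]:
    """Convert one field definition into a JSON Schema fragment."""
    if field["type"] == "enum":
        out: dict[str, Any] = {"enum": list(field["values"])}
    else:
        # "a | b" is treated as a union of the named types.
        names = [part.strip() for part in field["type"].split("|")]
        variants = [dict(TYPE_MAP[name]) for name in names]
        out = variants[0] if len(variants) == 1 else {"anyOf": variants}
    if "description" in field:
        out["description"] = field["description"]
    return out


def element_to_json_schema(element: dict[str, Any]) -> dict[str, Any]:
    """Convert one element definition into a JSON Schema object."""
    fields = element["fields"]
    return {
        "type": "object",
        "description": element.get("description", ""),
        "properties": {name: field_to_json_schema(f) for name, f in fields.items()},
        "required": [name for name, f in fields.items() if f.get("required")],
        "additionalProperties": False,
    }


chart_schema = element_to_json_schema(SPEC["elements"]["chart"])
```

Generating Pydantic models from the same dict would be a second, larger generator. This sketch covers only the JSON Schema half.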

C. Pydantic as source of truth, with enriched metadata

Keep Pydantic models as the canonical definition, but add rich Field(description=..., json_schema_extra=...) metadata. Export JSON Schema for editors, export AI prompts via introspection.

Pros: No new format. Incremental improvement. Metadata lives next to the code it describes.

Cons: Python remains the source of truth; non-Python tools must parse exported JSON Schema. Extensibility is harder (users would need to write Python to extend).
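As an illustration of "export AI prompts via introspection", the sketch below walks a model's field metadata and renders a plain-text summary, which could replace the hand-written markdown in schema.py. It assumes Pydantic v2. The Chart model and the render_prompt helper are hypothetical, and the output format is arbitrary.

```python
from typing import Literal, Optional

from pydantic import BaseModel, Field


class Chart(BaseModel):
    """A data visualization (hypothetical stand-in for the real Dataface model)."""

    type: Literal["bar", "line", "area"] = Field(description="Chart visualization type")
    query: str = Field(description="Query name or inline query definition")
    x: Optional[str] = Field(default=None, description="Field for x-axis encoding")


def render_prompt(model: type[BaseModel]) -> str:
    """Render a field summary from the model's own metadata."""
    lines = [f"{model.__name__} fields:"]
    for name, info in model.model_fields.items():
        flag = "required" if info.is_required() else "optional"
        lines.append(f"- {name} ({flag}): {info.description}")
    return "\n".join(lines)


prompt = render_prompt(Chart)
print(prompt)
```

Because the summary is derived from the same Field metadata that feeds the exported JSON Schema, the prompt text cannot drift from the types on its own.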

Plan

Implementation Progress

Review Feedback

  • Review cleared