Limit the schema sent to the LLM (schema.yaml)

When datasight answers a question, it puts your database schema into the LLM's system prompt — every table, every column, every data type. On small projects this is fine. On a database with hundreds of tables or tables with hundreds of columns, it becomes a problem:

  • The prompt blows past the model's context window (especially on free tiers — see the GitHub Models warning below).
  • Even when it fits, large prompts are slower and more expensive per turn.
  • The LLM has a harder time picking the right table when 130 irrelevant ones are competing for attention.

schema.yaml is a project-level file that tells datasight which tables (and optionally which columns within each table) to expose to the LLM. Everything not listed is hidden from prompts and from the web UI's table list.

Quick start

Create a schema.yaml in your project directory alongside queries.yaml and time_series.yaml:

tables:
  - name: orders
  - name: customers
  - name: line_items

Reload the project (or restart datasight web) and only those three tables will be visible to the LLM and the UI. Tables in the database that aren't listed are dropped — no warnings, no errors, they simply don't exist as far as datasight is concerned for this session.

Column filtering

Two optional fields; pick whichever fits the table:

excluded_columns — glob denylist (drift-friendly)

Most tables only need a few columns hidden. Add excluded_columns with fnmatch-style patterns (*, ?, [abc]):

tables:
  - name: orders
    excluded_columns: [internal_notes, debug_flag]
  - name: sensor_readings
    excluded_columns:
      - "debug_*"           # any column starting with debug_
      - "sensor_[1-9][0-9]*" # sensors 10 and up
      - "_internal_*"

Every column the DB reports is exposed except those matching a pattern. When the DB gains a new column, it appears automatically — no action needed unless you want to hide it.
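
The denylist semantics can be sketched in a few lines, assuming datasight applies Python's fnmatch-style matching described above (`exposed_columns` is an illustrative helper, not datasight's API):

```python
from fnmatch import fnmatch

def exposed_columns(db_columns, excluded_patterns):
    """Denylist semantics: keep every column that matches no pattern."""
    return [c for c in db_columns
            if not any(fnmatch(c, pat) for pat in excluded_patterns)]

cols = ["timestamp", "sensor_1", "sensor_12", "debug_trace", "_internal_id", "value"]
patterns = ["debug_*", "sensor_[1-9][0-9]*", "_internal_*"]  # two-digit sensors and up
print(exposed_columns(cols, patterns))
# ['timestamp', 'sensor_1', 'value']
```

Note that a column the patterns don't anticipate (like a newly added `value_v2`) sails straight through, which is exactly the drift-friendly behavior described above.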

datasight generate writes excluded_columns: [] per table as a placeholder so you can see where to add patterns.

columns — exact allowlist (strict)

For tables where the schema is wide and stable and you want exactly a known subset, use columns:

tables:
  - name: wide_measurements
    columns: [timestamp, sensor_id, value]

Only the listed columns are exposed, in the order given. Put the most important columns first — LLMs pay more attention to what comes earlier. Unknown names are dropped with a warning. New DB columns do not appear until you add them to the list — this is the point of allowlist mode.

Rules

  • Neither field → every column is exposed.
  • columns only → exact allowlist.
  • excluded_columns only → denylist with globs.
  • Both set (non-empty) → columns wins and excluded_columns is ignored for that table, with a warning. One mode per table; if you want "strict minus one", just remove that column from columns.
  • Empty list (columns: [] or excluded_columns: []) → behaves the same as omitting the field.

Column filtering is done in-memory after introspection — the underlying SQL runner can still access any column, so queries the LLM emits against hidden columns would still execute. The point is to prevent the LLM from knowing those columns exist so it doesn't try.
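
The precedence rules above can be summarized as a small resolver; this is a sketch of the behavior, not datasight's actual code, and it assumes fnmatch-style globbing for the denylist:

```python
from fnmatch import fnmatch

def resolve(db_columns, columns=None, excluded_columns=None):
    """One mode per table: a non-empty allowlist beats the denylist."""
    if columns:  # non-empty allowlist wins (datasight also warns if both are set)
        # Keep the listed order; names the DB does not report are dropped.
        return [c for c in columns if c in db_columns]
    if excluded_columns:  # glob denylist
        return [c for c in db_columns
                if not any(fnmatch(c, pat) for pat in excluded_columns)]
    return list(db_columns)  # neither field, or an empty list: expose everything
```

For example, `resolve(["a", "b", "c"], columns=["c", "a", "x"])` returns `["c", "a"]`: listed order is preserved and the unknown `x` is dropped.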

Detecting schema drift

Every time a project loads, datasight logs one line per table whose filter hides something in the live DB:

schema.yaml: orders exposing 4 of 7 columns; 3 hidden: created_at, promo_code, loyalty_tier

For excluded_columns, watch the "exposing X of Y" counts: a new DB column that matches no pattern is exposed automatically (both counts grow), while one that matches a pattern grows the hidden list. Either way, this line tells you the table changed. For columns (strict allowlist), drift shows up as new DB columns appearing in the "hidden" preview; list them in columns if you want them exposed.

Tables whose filter hides nothing log no line — no noise when there's no news.

Auto-generated by datasight generate

datasight generate writes schema.yaml listing the selected tables, each with an empty excluded_columns: [] placeholder:

tables:
  - name: orders
    excluded_columns: []
  - name: customers
    excluded_columns: []

The workflow: open the file and either fill in globs under excluded_columns: to hide columns, or (for a few wide tables) convert that line to columns: [...] with an explicit allowlist.

  • datasight generate --table orders --table customers emits those two tables.
  • datasight generate (no --table) emits every table in the database.
  • datasight generate --compact-schema drops the excluded_columns placeholder, emitting just table names. Use this when you expect to do table-level filtering only and don't want the placeholder clutter.

Re-running generate --overwrite rewrites schema.yaml from scratch, discarding any hand-edits. Same caveat as with queries.yaml — version your project directory.

When to use this

| Situation | What to do |
| --- | --- |
| Database has < 20 small tables | You don't need schema.yaml. |
| Database has 50+ tables, but only a handful are relevant | List just those tables. |
| A few tables have 50+ columns each | Add column filters for those tables. |
| Using GitHub Models or another ~8k-token-limited backend | You almost certainly need this. See the warning below. |
| Prompt works but responses are slow or hallucinate tables | Try narrowing first — often a bigger quality win than a better model. |

GitHub Models and the 8,000-token limit

The free GitHub Models tier caps input+output tokens at 8,000 per request. The default datasight system prompt (instructions + tools + conversation history) already takes 1,500–2,500 tokens before your schema is added. That leaves roughly 5,500 tokens for schema + question + SQL + results.

A table with 40 columns and reasonable names uses ~200–400 schema tokens. Twenty such tables are enough to break GitHub Models, with no room left for the actual conversation. 131 tables will fail every time.
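
A back-of-envelope check with the figures above (all rough estimates from this page, not measured API values) shows how little headroom there is:

```python
# All figures are the rough estimates quoted above, not measured values.
TOKEN_LIMIT = 8000        # GitHub Models free tier: input + output per request
base_prompt = 2500        # upper end of the instructions + tools + history estimate
tokens_per_table = 300    # midpoint of the 200-400 per-table estimate

schema_budget = TOKEN_LIMIT - base_prompt  # roughly 5,500 tokens for everything else

def tables_that_fit(per_table=tokens_per_table):
    """How many average tables fit before the schema alone eats the budget."""
    return schema_budget // per_table

print(tables_that_fit())  # 18 -- and that leaves nothing for question/SQL/results
```

Eighteen average tables already consume the whole non-prompt budget, which is why twenty break the tier and 131 fail every time.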

If you see errors like maximum context length exceeded or request too large for gpt-4o-mini on tier free, you have two options:

  1. Use schema.yaml to limit the schema (this page).
  2. Switch to a backend with a larger context window — OpenAI (128k), Anthropic (200k), or a local Ollama model (typically 8k–32k, check the specific model). See Choosing an LLM.

Option 1 often improves answer quality regardless of model, so try it first.

Warning: wide tables and the long-format alternative

If a single table has dozens or hundreds of columns — especially if most of those columns are variants of the same measurement (e.g. sales_2020, sales_2021, sales_2022, ... or temp_sensor_01, temp_sensor_02, ...) — the schema alone can exceed the token budget even with table filtering.

This is called wide format, and LLMs struggle with it:

  • Column names carry semantic information the LLM has to re-parse every turn (sales_q3_2024 = "quarter 3 of 2024 sales").
  • Questions like "which year had the highest sales?" require the LLM to generate a UNION ALL or a VALUES unpivot — error-prone and verbose.
  • Adding a new year means the LLM has to relearn the schema.

Prefer long (narrow) format — one column per kind of thing, plus a row per observation:

Wide (avoid):

| product | sales_2022 | sales_2023 | sales_2024 |
| --- | --- | --- | --- |
| A | 100 | 150 | 180 |
| B | 80 | 90 | 120 |

Long (prefer):

| product | year | sales |
| --- | --- | --- |
| A | 2022 | 100 |
| A | 2023 | 150 |
| A | 2024 | 180 |
| B | 2022 | 80 |
| B | 2023 | 90 |
| B | 2024 | 120 |

Long format gives you:

  • Tiny schema footprint — three columns the LLM has to remember, not three hundred.
  • Natural SQL — GROUP BY year, WHERE product = 'A', AVG(sales).
  • Stable schema as data grows — new years are new rows, not new columns.

If you control the ETL that produces the data, reshape at load time (pandas melt, SQL UNPIVOT/VALUES, DuckDB's unpivot, Polars melt). If you don't control it, create a view in your warehouse that pivots wide to long and point datasight at the view. Use schema.yaml to hide the original wide table.
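
As an example of the reshape step, a pandas melt on the wide table from above might look like this (hypothetical data and column names from the example; not datasight code):

```python
import pandas as pd

# The example wide table from above (hypothetical data).
wide = pd.DataFrame({
    "product": ["A", "B"],
    "sales_2022": [100, 80],
    "sales_2023": [150, 90],
    "sales_2024": [180, 120],
})

# melt: one row per (product, year) observation.
long = wide.melt(id_vars="product", var_name="year", value_name="sales")
# The melted "year" column holds the old column names; strip the prefix.
long["year"] = long["year"].str.replace("sales_", "", regex=False).astype(int)
long = long.sort_values(["product", "year"]).reset_index(drop=True)
```

Loading `long` instead of `wide` gives the LLM a three-column schema that never changes shape as new years arrive.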

How schema.yaml differs from schema_description.md

They're complementary:

  • schema.yaml controls which tables and columns the LLM sees. It's a filter on structure.
  • schema_description.md adds meaning to what the LLM sees — domain context, column definitions, join hints. It's a narrative overlay.

A typical project uses both: schema.yaml to keep the prompt small, schema_description.md to make the remaining tables legible.

Reference

  • File location: <project>/schema.yaml
  • Behavior when missing: all introspected tables are exposed (backward-compatible default).
  • Validation: unknown table names are logged as warnings; unknown column names are dropped with a warning. Neither blocks startup.
  • Performance: when schema.yaml is present, datasight skips column and row-count probes for excluded tables at startup — on a 131-table Postgres DB filtered to 7 tables, this cuts introspection from ~26 seconds to under a second.