Limit the schema sent to the LLM (schema.yaml)¶
When datasight answers a question, it puts your database schema into the LLM's system prompt — every table, every column, every data type. On small projects this is fine. On a database with hundreds of tables or tables with hundreds of columns, it becomes a problem:
- The prompt blows past the model's context window (especially on free tiers — see the GitHub Models warning below).
- Even when it fits, large prompts are slower and more expensive per turn.
- The LLM has a harder time picking the right table when 130 irrelevant ones are competing for attention.
schema.yaml is a project-level file that tells datasight which tables
(and optionally which columns within each table) to expose to the LLM. Everything
not listed is hidden from prompts and from the web UI's table list.
Quick start¶
Create a schema.yaml in your project directory alongside queries.yaml
and time_series.yaml:
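A minimal example — these three table names are placeholders for your own:

```yaml
tables:
  - name: orders
  - name: customers
  - name: products
```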
Reload the project (or restart datasight web) and only those three tables
will be visible to the LLM and the UI. Tables in the database that aren't
listed are dropped — no warnings, no errors, they simply don't exist as far
as datasight is concerned for this session.
Column filtering¶
Two optional fields; pick whichever fits the table:
excluded_columns — glob denylist (drift-friendly)¶
Most tables only need a few columns hidden. Add excluded_columns with
fnmatch-style patterns (*, ?, [abc]):
```yaml
tables:
  - name: orders
    excluded_columns: [internal_notes, debug_flag]
  - name: sensor_readings
    excluded_columns:
      - "debug_*"        # any column starting with debug_
      - "sensor_[1-9]*"  # sensor_10 and up (zero-padded sensor_01–09 don't match)
      - "_internal_*"
```
Every column the DB reports is exposed except those matching a pattern. When the DB gains a new column, it appears automatically — no action needed unless you want to hide it.
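The patterns use Python's `fnmatch` semantics. A sketch of the denylist behavior above (the `exposed` helper is illustrative, not datasight's actual code):

```python
from fnmatch import fnmatch

def exposed(db_columns, excluded_patterns):
    """Expose every column that matches none of the deny patterns."""
    return [c for c in db_columns
            if not any(fnmatch(c, pat) for pat in excluded_patterns)]

cols = ["reading_ts", "debug_raw", "sensor_01", "sensor_12", "_internal_flag"]
print(exposed(cols, ["debug_*", "sensor_[1-9]*", "_internal_*"]))
# → ['reading_ts', 'sensor_01'] — sensor_01 survives (leading zero), sensor_12 is hidden
```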
datasight generate writes excluded_columns: [] per table as a
placeholder so you can see where to add patterns.
columns — exact allowlist (strict)¶
For tables where the schema is wide and stable and you want exactly a
known subset, use columns:
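For example (table and column names are illustrative):

```yaml
tables:
  - name: orders
    columns: [order_id, customer_id, order_date, total]
```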
Only the listed columns are exposed, in the order given. Put the most important columns first — LLMs pay more attention to what comes earlier. Unknown names are dropped with a warning. New DB columns do not appear until you add them to the list — this is the point of allowlist mode.
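A sketch of those allowlist semantics — listed order preserved, unknown names dropped with a warning (the `exposed` helper is illustrative, not datasight's actual code):

```python
def exposed(db_columns, allowlist):
    """Keep allowlisted columns only, in allowlist order; drop unknown names."""
    present = set(db_columns)
    unknown = [c for c in allowlist if c not in present]
    if unknown:
        print(f"warning: not in DB, dropped: {unknown}")
    return [c for c in allowlist if c in present]

print(exposed(["id", "name", "email", "created_at"], ["name", "id", "typo_col"]))
# → warning: not in DB, dropped: ['typo_col']
# → ['name', 'id']
```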
Rules¶
- Neither field → every column is exposed.
- `columns` only → exact allowlist.
- `excluded_columns` only → denylist with globs.
- Both set (non-empty) → `columns` wins and `excluded_columns` is ignored for that table, with a warning. One mode per table; if you want "strict minus one", just take that one out of `columns`.
- Empty list (`columns: []` or `excluded_columns: []`) → behaves the same as omitting the field.
Column filtering is done in-memory after introspection — the underlying SQL runner can still access any column, so queries the LLM emits against hidden columns would still execute. The point is to prevent the LLM from knowing those columns exist so it doesn't try.
Detecting schema drift¶
Every time a project loads, datasight logs one line per table whose filter hides something in the live DB.
For excluded_columns, drift shows up as a shrinking "exposed" count
relative to "total" — a new column the DB added isn't hidden (since it
doesn't match the patterns), but this line tells you the table grew. For
columns (strict allowlist), drift shows up as new DB columns appearing
in the "hidden" preview — list them in columns if you want them
exposed.
Tables whose filter hides nothing log no line — no noise when there's no news.
Auto-generated by datasight generate¶
datasight generate writes schema.yaml listing the selected tables,
each with an empty excluded_columns: [] placeholder:
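For a run like `datasight generate --table orders --table customers`, the emitted file looks roughly like:

```yaml
tables:
  - name: orders
    excluded_columns: []
  - name: customers
    excluded_columns: []
```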
The workflow: open the file and either fill in globs under
excluded_columns: to hide columns, or (for a few wide tables) convert
that line to columns: [...] with an explicit allowlist.
- `datasight generate --table orders --table customers` emits those two tables.
- `datasight generate` (no `--table`) emits every table in the database.
- `datasight generate --compact-schema` drops the `excluded_columns` placeholder, emitting just table names. Use this when you expect to do table-level filtering only and don't want the placeholder clutter.
Re-running generate --overwrite rewrites schema.yaml from scratch,
discarding any hand-edits. Same caveat as with queries.yaml — version
your project directory.
When to use this¶
| Situation | What to do |
|---|---|
| Database has < 20 small tables | You don't need schema.yaml. |
| Database has 50+ tables, but only a handful are relevant | List just those tables. |
| A few tables have 50+ columns each | Add column filters for those tables. |
| Using GitHub Models or another ~8k-token-limited backend | You almost certainly need this. See the warning below. |
| Prompt works but responses are slow or hallucinate tables | Try narrowing first — often a bigger quality win than a better model. |
GitHub Models and the 8,000-token limit
The free GitHub Models tier caps input+output tokens at 8,000 per request. The default datasight system prompt (instructions + tools + conversation history) already takes 1,500–2,500 tokens before your schema is added. That leaves roughly 5,500 tokens for schema + question + SQL + results.
A table with 40 columns and reasonable names uses ~200–400 schema tokens. Twenty such tables are enough to break GitHub Models, with no room left for the actual conversation. 131 tables will fail every time.
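Back-of-envelope, using the estimates above (the per-table cost is the midpoint of the 200–400 range):

```python
# GitHub Models free tier: 8,000 tokens per request, input + output combined.
budget = 8000
base_prompt = 2500   # upper end of the 1,500–2,500 estimate for instructions/tools/history
per_table = 300      # midpoint of the ~200–400 tokens a 40-column table costs

schema_room = budget - base_prompt         # tokens left for schema + question + SQL + results
tables_that_fit = schema_room // per_table
print(schema_room, tables_that_fit)        # 5500 18
```

Roughly eighteen mid-sized tables consume the entire budget before the conversation itself uses a single token.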
If you see errors like `maximum context length exceeded` or `request too large for gpt-4o-mini on tier free`, you have two options:
- Use `schema.yaml` to limit the schema (this page).
- Switch to a backend with a larger context window — OpenAI (128k), Anthropic (200k), or a local Ollama model (typically 8k–32k, check the specific model). See Choosing an LLM.
Option 1 often improves answer quality regardless of model, so try it first.
Warning: wide tables and the long-format alternative¶
If a single table has dozens or hundreds of columns — especially if most
of those columns are variants of the same measurement (e.g. sales_2020,
sales_2021, sales_2022, ... or temp_sensor_01, temp_sensor_02, ...)
— the schema alone can exceed the token budget even with table filtering.
This is called wide format, and LLMs struggle with it:
- Column names carry semantic information the LLM has to re-parse every turn (`sales_q3_2024` = "quarter 3 of 2024 sales").
- Questions like "which year had the highest sales?" require the LLM to generate a `UNION ALL` or a `VALUES` unpivot — error-prone and verbose.
- Adding a new year means the LLM has to relearn the schema.
Prefer long (narrow) format — one column per kind of thing, plus a row per observation. Wide format:
| product | sales_2022 | sales_2023 | sales_2024 |
|---|---|---|---|
| A | 100 | 150 | 180 |
| B | 80 | 90 | 120 |
The same data in long format:

| product | year | sales |
|---|---|---|
| A | 2022 | 100 |
| A | 2023 | 150 |
| A | 2024 | 180 |
| B | 2022 | 80 |
| B | 2023 | 90 |
| B | 2024 | 120 |
Long format gives you:
- Tiny schema footprint — three columns the LLM has to remember, not three hundred.
- Natural SQL — `GROUP BY year`, `WHERE product = 'A'`, `AVG(sales)`.
- Stable schema as data grows — new years are new rows, not new columns.
If you control the ETL that produces the data, reshape at load time (pandas
melt, SQL UNPIVOT/VALUES, DuckDB's unpivot, Polars melt). If you
don't control it, create a view in your warehouse that pivots wide to long
and point datasight at the view. Use schema.yaml to hide the original
wide table.
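If your warehouse lacks a native `UNPIVOT`, a `UNION ALL` view works everywhere. A minimal sketch using SQLite (table, view, and column names are illustrative):

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.execute(
    "CREATE TABLE sales_wide (product TEXT, sales_2022 INT, sales_2023 INT, sales_2024 INT)"
)
con.executemany(
    "INSERT INTO sales_wide VALUES (?, ?, ?, ?)",
    [("A", 100, 150, 180), ("B", 80, 90, 120)],
)

# Manual unpivot: one SELECT per year column, stitched with UNION ALL.
con.execute("""
    CREATE VIEW sales_long AS
    SELECT product, 2022 AS year, sales_2022 AS sales FROM sales_wide
    UNION ALL SELECT product, 2023, sales_2023 FROM sales_wide
    UNION ALL SELECT product, 2024, sales_2024 FROM sales_wide
""")

rows = con.execute(
    "SELECT product, year, sales FROM sales_long ORDER BY product, year"
).fetchall()
print(rows)
```

Point datasight at `sales_long` and hide `sales_wide` in `schema.yaml`; adding a new year means adding one `UNION ALL` arm to the view, not a new column the LLM must relearn.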
How schema.yaml differs from schema_description.md¶
They're complementary:
- `schema.yaml` controls which tables and columns the LLM sees. It's a filter on structure.
- `schema_description.md` adds meaning to what the LLM sees — domain context, column definitions, join hints. It's a narrative overlay.
A typical project uses both: schema.yaml to keep the prompt small,
schema_description.md to make the remaining tables legible.
Reference¶
- File location: `<project>/schema.yaml`
- Behavior when missing: all introspected tables are exposed (backward-compatible default).
- Validation: unknown table names are logged as warnings; unknown column names are dropped with a warning. Neither blocks startup.
- Performance: when `schema.yaml` is present, datasight skips column and row-count probes for excluded tables at startup — on a 131-table Postgres DB filtered to 7 tables, this cuts introspection from ~26 seconds to under a second.