Verify queries across models¶

datasight generates SQL from natural language, which means the same question could produce different queries depending on the model, provider, or even the run. The datasight verify command helps you validate that your example queries produce correct results regardless of which LLM is used.

The verification workflow¶

flowchart LR
    A[queries.yaml<br>+ expected] --> B[datasight verify]
    B --> C[Ambiguity<br>analysis]
    B --> D[SQL generation<br>+ result checks]
    C --> E[Warnings:<br>ambiguous questions]
    D --> F[Pass/fail<br>report]

    style A fill:#8a7d55,stroke:#6b6040,color:#fff
    style B fill:#15a8a8,stroke:#023d60,color:#fff
    style C fill:#fe5d26,stroke:#c44a1e,color:#fff
    style D fill:#fe5d26,stroke:#c44a1e,color:#fff
    style E fill:#2e7ebb,stroke:#1a5c8a,color:#fff
    style F fill:#2e7ebb,stroke:#1a5c8a,color:#fff

datasight verify runs in two phases:

Ambiguity analysis -- checks each question for structural ambiguities that could produce different SQL (temporal granularity, missing counts, etc.)
SQL verification -- sends each question through the LLM, executes the generated SQL, and compares results against the reference SQL or explicit expected values

Add expected results to queries.yaml¶

Add an expected block to any entry in queries.yaml to define what correct output looks like:

- question: What are the top 5 states by solar generation?
  sql: |
    SELECT p.state, SUM(g.net_generation_mwh) AS solar_mwh
    FROM generation_fuel g
    JOIN plants p USING (plant_id_eia)
    WHERE g.energy_source_code = 'SUN'
    GROUP BY p.state
    ORDER BY solar_mwh DESC
    LIMIT 5
  expected:
    row_count: 5
    columns: [state, solar_mwh]
    contains: ["CA", "TX"]

Available checks¶

Field	Description
`row_count`	Exact number of rows expected
`min_row_count`	Minimum number of rows
`max_row_count`	Maximum number of rows
`columns`	Exact column names in order
`contains`	Values that must appear somewhere in the results
`not_contains`	Values that must NOT appear in the results

When no expected block is present, datasight verify falls back to comparing the LLM-generated SQL output against the reference SQL output (row count and column names).

Run verification¶

datasight verify

This runs all queries in queries.yaml against the configured model and database. Output includes an ambiguity analysis table (if any questions are flagged) followed by a pass/fail verification table:

           Ambiguity Analysis (2 warnings)
┃ # ┃ Question                          ┃ Ambiguities              ┃ Suggested Revision          ┃
│ 4 │ Compare coal vs gas over time     │ - "over time" does not   │ Compare coal vs gas monthly │
│   │                                   │   specify granularity    │ generation from 2020-2024   │

                        Verification Results
┃ # ┃ Question                          ┃ Status ┃ Checks               ┃
│ 1 │ Top 10 plants by generation       │ PASS   │ ✓ row_count: 10      │
│ 4 │ Compare coal vs gas over time     │ FAIL   │ ✗ row_count: ...     │

8/9 passed (1 failed), 2 ambiguous

The exit code is 0 if all queries pass, 1 if any fail -- making it suitable for CI pipelines.

Options¶

Flag	Default	Description
`--model`	from `.env`	Override the model for this run
`--queries`	`queries.yaml`	Path to queries file
`--project-dir`	`.`	Project directory
`-v` / `--verbose`	off	Enable debug logging

Test across models¶

Run the same suite against different models to compare reliability:

datasight verify --model claude-sonnet-4-6
datasight verify --model claude-haiku-4-5-20251001
LLM_PROVIDER=github datasight verify --model gpt-4o
LLM_PROVIDER=ollama datasight verify --model qwen2.5:7b

Writing deterministic queries¶

The ambiguity analysis will flag common issues, but here are guidelines for writing questions that produce consistent results across models:

Specify temporal granularity. "Monthly generation over time" not "generation over time."
Include counts. "Top 10 states" not "top states."
Name the metric. "Largest plants by total MWh" not "largest plants."
Use explicit date ranges. "From 2020 to 2024" not "recent years."
Name the grouping column. "By state" not "by region" when multiple geographic columns exist.