Audit data quality¶
datasight provides deterministic commands that inspect your data directly — no LLM needed. Use them to find nulls, suspicious ranges, date gaps, dimension breakdowns, and trend candidates before asking AI-powered questions.
These commands work with both configured projects and files passed directly
to datasight inspect.
Quick audit: datasight inspect¶
For a single-command overview that includes quality checks alongside everything else:
This runs profile, quality, measures, dimensions, trends, and recipes in one pass. See Explore files without a project for details.
The sections below describe how to run each analysis individually for more focused work from a project directory.
Profile the dataset¶
datasight profile gives you a high-level map of the data:
# Whole dataset
datasight profile
# One table
datasight profile --table generation_fuel
# One column
datasight profile --column generation_fuel.report_date
Use it to answer:
- which tables are largest
- which columns look like dates or measures
- which text columns look like useful grouping dimensions
- what representative values appear in important columns
Check data quality¶
datasight quality surfaces lightweight QA issues:
datasight quality
datasight quality --table generation_fuel
datasight quality --format markdown -o quality.md
Use it to spot:
- null-heavy columns (10%+ nulls)
- suspicious numeric ranges (constant values, averages at boundaries)
- date coverage gaps
- quick notes worth turning into follow-up questions
- temporal completeness issues when
time_series.yamlis present
Find measures¶
datasight measures infers likely metrics and their aggregation semantics:
The output includes:
- a semantic role such as
energy,power,capacity,rate, orratio - the default aggregation datasight will prefer
- forbidden aggregations such as avoiding
SUMon non-additive metrics - weighted-average guidance when a valid denominator is available
- a suggested SQL rollup shape
Generate a starting measures.yaml template:
See Configure semantic measures for the full override workflow.
Find dimensions¶
datasight dimensions identifies likely grouping columns:
Use it to discover:
- candidate grouping columns with distinct counts and sample values
- suggested category breakdowns with reasons
- join hints between tables
Find trends¶
datasight trends surfaces date/measure pairs for time-series analysis:
It reports:
- candidate date/measure pairs
- date ranges
- suggested aggregations
- lightweight chart recommendations
Generate prompt recipes¶
datasight recipes list builds reusable prompts from the schema and
profiling output:
datasight recipes list
datasight recipes list --table generation_fuel
datasight recipes list --format markdown -o recipes.md
Each recipe includes an id. Execute one directly through the AI:
Check referential integrity¶
datasight integrity looks across tables for primary-key and foreign-key
issues:
datasight integrity
datasight integrity --table generation_fuel
datasight integrity --format markdown -o integrity.md
It reports:
- Primary keys — columns named
idor{table}_idwhose values are 100% distinct, so they can be safely used as PKs - Duplicate keys — inferred PKs that actually have duplicates
- Orphan foreign keys — columns ending in
_idwhose values don't exist in the referenced parent table - Join explosion risks — FK relationships where joining multiplies the row count (a one-to-many where you expected one-to-one)
Declaring joins explicitly¶
datasight infers foreign keys by naming convention, but you can declare them
explicitly in a joins.yaml file in your project directory:
- child_table: generation_fuel
child_column: plant_id
parent_table: plants
parent_column: id # optional, defaults to "id"
- child_table: generation_fuel
child_column: energy_source_code
parent_table: energy_sources
parent_column: code
Profile distributions¶
datasight distribution computes percentile summaries for each numeric
non-identifier column:
datasight distribution
datasight distribution --table generation_fuel
datasight distribution --column generation_fuel.net_generation_mwh
datasight distribution --format markdown -o distribution.md
For each column it reports:
- Percentiles: p1, p5, p50, p95, p99
- Spread: Q1, Q3, IQR, mean, stddev, coefficient of variation
- Rates: zero rate and negative rate
- IQR outliers: count of values outside 1.5 × IQR of the quartiles
- Temporal spikes: per-month averages more than 3σ from the mean
Energy-domain flags¶
When a column is recognized as an energy-domain measure (via semantic measure inference), the command raises targeted warnings:
- Negative generation — an
energy-role column with negative values - Capacity factor > 1 — a
ratiocolumn whose p99 exceeds 1.0 - Implausible heat rate — a
ratecolumn in MMBtu/MWh outside 3–30 - Zero values in rate — rate/ratio/price columns containing zeros that may indicate zero-denominator bugs upstream
Declarative validation rules¶
datasight validate runs declarative rules defined in validation.yaml.
Use this to lock in expectations about your data that should hold across
every refresh.
Generate a starting template, then edit it:
This writes validation.yaml into the project directory. Example:
- table: generation_fuel
rules:
- type: required_columns
columns: [plant_id, report_date, energy_source_code, net_generation_mwh]
- type: max_null_rate
column: net_generation_mwh
threshold: 0.05
- type: numeric_range
column: net_generation_mwh
min: 0
- type: allowed_values
column: energy_source_code
values: [NG, COL, NUC, SUN, WND, WAT, OTH, PET, GEO, BIO, WDS, OOG]
- type: uniqueness
columns: [plant_id, report_date, energy_source_code]
- type: row_count
min: 100
- type: freshness
column: report_date
max_age_days: 90
Supported rule types¶
| Rule | Purpose |
|---|---|
required_columns |
Columns that must exist in the schema |
max_null_rate |
Maximum allowed fraction of NULL values (0.0–1.0) |
numeric_range |
min and/or max bounds for a numeric column |
allowed_values |
Whitelist of accepted category values |
regex |
Pattern that all non-null values must match |
uniqueness |
Composite key that must be unique across rows |
monotonic |
Column must be non-decreasing or strictly increasing |
row_count |
Table row count must fall within min/max bounds |
freshness |
Latest date must be within max_age_days of today |
Run the rules:
datasight validate
datasight validate --table generation_fuel
datasight validate --format markdown -o validation.md
Each result is reported as pass, fail, or warn, with a short detail
string explaining the violation.
Combined audit report¶
datasight audit-report runs the dataset overview, quality, integrity,
distribution, and (if configured) validation checks together and writes a
single self-contained file:
# HTML (default output: report.html)
datasight audit-report
# Choose a different output
datasight audit-report -o audit.md
datasight audit-report -o audit.json --format json
# Scope to a single table
datasight audit-report --table generation_fuel -o plant_audit.html
The HTML report is self-contained (inline CSS, no external assets) so you can email it, attach it to a PR, or drop it into a wiki. The project directory name appears in the report title.
Recommended workflow¶
For a thorough data quality audit, run the commands in this order:
- Profile — understand the shape of the data
- Quality — find nulls, range issues, and date gaps
- Integrity — verify primary keys, foreign keys, and join behavior
- Distribution — inspect percentiles, outliers, and temporal spikes
- Measures — identify metrics and verify aggregation defaults
- Dimensions — find grouping columns for breakdowns
- Trends — discover time-series candidates
- Validate — codify expectations in
validation.yamland run them on every refresh
Or skip straight to datasight inspect for profile/quality/measures in one
pass, or datasight audit-report to roll all of the audit checks into a
single shareable document.
Once you know what questions to ask, move into batch mode:
See Ask questions from the CLI for batch workflows.
Output formats¶
All commands support --format and --output:
datasight quality --format json
datasight quality --format markdown -o quality.md
datasight profile --format json -o profile.json
Available formats: table (default), json, markdown.