Registration Checks

When you register a dataset with the `dsgrid registry datasets register` command, dsgrid runs a series of validation checks on your configuration and data. This page describes each check, the order in which they run, and what to do when one fails.

Check Sequence

Registration proceeds in this order:

1. Configuration validation: Pydantic model validators fire when your configuration file is loaded.
2. Duplicate registration: dsgrid verifies the dataset is not already in the registry.
3. Time consistency: timestamps in the data are checked against the time dimension definition.
4. Write to registry: the dataset is written (in unpivoted format) to the registry store. This happens before the remaining checks so that they can operate on the canonical unpivoted form. If a later check fails, the written data is removed.
5. Required dimensions: dsgrid confirms every required dimension type is present (unless the project’s DatasetDimensionRequirements opts out via require_all_dimension_types).
6. Schema checks: column-level validation of the data tables (value columns, data types, NULLs, and table-format-specific consistency).
7. Dimension association completeness: the cross-join of all non-time dimension records must be present in the data (minus any explicitly declared missing associations).

Configuration Validation

These checks run automatically when your JSON/JSON5 configuration file is parsed.

dataset_id format: Must be a lowercase identifier containing only letters, digits, hyphens, and underscores. Leading digits and leading hyphens are not allowed.
Unique dimension filenames: Every dimension file listed in the configuration must have a unique filename.
Unique dimension names: Every dimension’s name field must be unique within the dataset.
Time dimension is not trivial: The time dimension type may not be “trivial” (the NoOp time type).
Layout field mutual exclusivity: You may set data_layout (for initial registration) or registry_data_layout (internal), but not both.
Pivoted-format fields: If data_layout is set to a pivoted format, pivoted_dimension_type must also be set.
Two Table format fields: If you are using the Two Table format, the load_data_lookup field must point to the lookup file.
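
The first three rules above can be sketched in a few lines of Python. This is an illustration only, not dsgrid's validator code: the regular expression, the function names, and the decision to allow a leading underscore are all assumptions made for the example.

```python
import re

# Hypothetical pattern: lowercase letters, digits, hyphens, and underscores,
# with no leading digit or hyphen. Allowing a leading underscore is an
# assumption; the rule as stated only forbids leading digits and hyphens.
DATASET_ID_RE = re.compile(r"[a-z_][a-z0-9_-]*")


def is_valid_dataset_id(dataset_id: str) -> bool:
    """Return True if the whole string matches the assumed dataset_id pattern."""
    return DATASET_ID_RE.fullmatch(dataset_id) is not None


def find_duplicates(values):
    """Return the values that occur more than once, in first-repeat order."""
    seen, duplicates = set(), []
    for value in values:
        if value in seen and value not in duplicates:
            duplicates.append(value)
        seen.add(value)
    return duplicates
```

The same `find_duplicates` helper would apply to both the dimension filenames and the dimension name fields, since both rules only require uniqueness within one dataset configuration.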

Time Consistency

These checks run on the data in its original format (before unpivoting), because unpivoting a table with many value columns multiplies the number of rows and, with it, the cost of checking timestamps.

Timestamp range: The time range in the data must match the range declared in the time dimension’s ranges field.
Uniform time arrays: Every combination of non-time dimensions must have the same set of timestamps. dsgrid checks both the count and the content of each group’s time array to ensure uniformity.
Model-year consistency (annual + historical): For datasets with annual time resolution and a data_classification of historical, every row’s timestamp year must equal its model_year value.
Chronify-based checks: When the time dimension supports the chronify library, dsgrid delegates validation to chronify, which performs its own range and completeness checks.
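
As a rough illustration of the uniform time array and model-year rules, the sketch below works on plain Python dictionaries rather than Spark DataFrames. The function names and the default column names (`time_year`, `model_year`) are hypothetical, and comparing sets of timestamps is a simplification that covers both the distinct count and the content of each group.

```python
from collections import defaultdict


def find_nonuniform_groups(rows, time_col, dim_cols):
    """Return dimension combinations whose timestamp set differs from the reference.

    rows is an iterable of dicts. The reference is the timestamp set of the
    first dimension combination encountered.
    """
    groups = defaultdict(set)
    for row in rows:
        key = tuple(row[col] for col in dim_cols)
        groups[key].add(row[time_col])
    if not groups:
        return []
    reference = next(iter(groups.values()))
    return [key for key, stamps in groups.items() if stamps != reference]


def find_model_year_mismatches(rows, time_col="time_year", model_year_col="model_year"):
    """Return rows whose timestamp year differs from model_year.

    Values are compared as strings so that 2020 and "2020" are treated as equal.
    """
    return [row for row in rows if str(row[time_col]) != str(row[model_year_col])]
```

An empty result from either function corresponds to a passing check; a non-empty result lists the groups or rows to inspect.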

Skipping time checks

Set the environment variable __DSGRID_SKIP_CHECK_DATASET_TIME_CONSISTENCY__ to any value, or set check_time_consistency: false in the project’s DatasetDimensionRequirements for this dataset.

Schema Checks

Schema checks verify the structure of the Parquet data tables. The exact checks depend on whether you are using the One Table or Two Table format.

One Table

Value column present: The value_column declared in the configuration must exist in the load data table.
No unexpected columns: The load data table may only contain dimension columns and the value column. Any extra columns cause an error.
Dimension columns are strings: All dimension columns must have StringType.
No NULL dimension values: Dimension columns may not contain NULL values.
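
The four One Table checks can be approximated as follows. This is a minimal sketch under stated assumptions: the schema is represented as a plain dict of column name to type name (with "string" standing in for Spark's StringType), rows are plain dicts, and `check_one_table` is a hypothetical helper, not a dsgrid function.

```python
def check_one_table(schema, rows, dimension_columns, value_column):
    """Return a list of error strings; an empty list means every check passed.

    schema maps column name -> type name, for example "string" or "double".
    rows is a list of dicts keyed by column name.
    """
    errors = []
    # Value column present.
    if value_column not in schema:
        errors.append(f"value column {value_column!r} not found")
    # No unexpected columns.
    allowed = set(dimension_columns) | {value_column}
    unexpected = sorted(set(schema) - allowed)
    if unexpected:
        errors.append(f"unexpected columns: {unexpected}")
    for col in dimension_columns:
        if col not in schema:
            continue  # a missing dimension is the concern of other checks
        # Dimension columns are strings.
        if schema[col] != "string":
            errors.append(f"dimension column {col!r} has type {schema[col]}, expected string")
        # No NULL dimension values.
        if any(row.get(col) is None for row in rows):
            errors.append(f"dimension column {col!r} contains NULL values")
    return errors
```

Collecting all errors before returning, rather than stopping at the first one, is a design choice of this sketch; it lets a single run report every problem in the table.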

Two Table

All One Table checks apply to the joined table (load data joined with the lookup table on the id column). In addition:

Lookup id column: The lookup table must contain a column named id.
Lookup dimension columns are strings: All dimension columns in the lookup table must have StringType.
No NULL values in lookup: The lookup table may not contain NULL dimension values.
ID set consistency: The set of id values in the load data table must exactly match the set of id values in the lookup table. dsgrid logs the specific differences when they do not match.
Missing dimensions warning: If any expected dimension columns are absent from the lookup table, dsgrid logs a warning (but does not fail).
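
The ID set consistency rule amounts to a two-way set difference. The helper below is a hypothetical sketch of that comparison on plain Python collections; it is not how dsgrid computes it on Spark tables.

```python
def diff_id_sets(load_data_ids, lookup_ids):
    """Return (ids only in load data, ids only in lookup), each sorted.

    Both results are empty when the two id sets match exactly.
    """
    load_set, lookup_set = set(load_data_ids), set(lookup_ids)
    return sorted(load_set - lookup_set), sorted(lookup_set - load_set)
```

If either returned list is non-empty, the check fails, and the two lists correspond to the differences that dsgrid logs.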

Skipping schema checks

Set the environment variable __DSGRID_SKIP_CHECK_DATASET_CONSISTENCY__ to any value. This skips both the schema checks and the dimension association check described next.

Dimension Association Completeness

This check verifies that the data contains every required combination of non-time dimension records.

How it works

Per-column check — For each dimension type, dsgrid compares the distinct values in the data against the declared dimension records. This catches simple cases (a missing geography ID, for example) and produces clear error messages.
Full cross-join check — dsgrid computes the expected cross-join of all non-time dimension records (or uses the dataset’s expected_associations if provided), subtracts any rows listed in the dataset’s missing-dimension-associations tables, and compares the result to the distinct dimension combinations actually present in the data.
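
Both steps can be sketched with the standard library. The example assumes dimension records are given as a dict of dimension type to record IDs and that each combination in the data is a tuple in the same key order. The function names are hypothetical and the real check runs on Spark tables.

```python
from itertools import product


def find_absent_records(dimension_records, present):
    """Per-column check: report declared records that never appear in the data."""
    report = {}
    for index, dim_type in enumerate(dimension_records):
        seen = {combo[index] for combo in present}
        absent = sorted(set(dimension_records[dim_type]) - seen)
        if absent:
            report[dim_type] = absent
    return report


def find_missing_combinations(dimension_records, present, declared_missing=(), expected=None):
    """Full check: required combinations that are not present in the data.

    expected, if given, replaces the full cross-join of dimension_records.
    declared_missing is subtracted before the comparison.
    """
    if expected is None:
        expected = product(*dimension_records.values())
    required = set(expected) - set(declared_missing)
    return sorted(required - set(present))
```

For two geographies and two sectors, the cross-join has four combinations; if the data holds three of them and the fourth is not declared missing, `find_missing_combinations` returns that one tuple.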

When it fails

If the data is missing required dimension combinations, dsgrid:

Writes the missing rows to a Parquet file named {dataset_id}__missing_dimension_record_combinations.parquet in the current working directory.
Runs the Rust-based find_minimal_patterns analysis on the missing rows to identify the smallest sets of dimension values that explain the gaps. The top 10 patterns are logged.
Raises DSGInvalidDataset with a pointer to the log file for details.

Tip

The patterns output is the fastest way to diagnose the problem. A pattern like county = 06037 (500 missing rows) tells you that every combination involving county 06037 is absent — likely the county ID is wrong or the county was omitted from your data.
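
To get a feel for what such a pattern summarizes, the sketch below counts how many missing rows each single (column, value) pair accounts for. It is a deliberately crude stand-in and not the find_minimal_patterns algorithm, which also considers combinations of values. The function name and the tuple-based row format are assumptions for the example.

```python
from collections import Counter


def single_value_patterns(missing_rows, columns, top=10):
    """Count missing rows per (column, value) pair and return the most frequent.

    missing_rows is an iterable of tuples aligned with columns.
    """
    counts = Counter()
    for row in missing_rows:
        for column, value in zip(columns, row):
            counts[(column, value)] += 1
    return counts.most_common(top)
```

A pair whose count equals the number of expected combinations involving that value indicates that the value is entirely absent from the data, which matches the county example in the tip above.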

Declaring expected and missing associations

If your dataset intentionally omits certain dimension combinations (for example, a technology that does not apply in certain states), you can declare them as missing dimension associations so that dsgrid subtracts them before checking.

If the dataset is inherently sparse, you can instead (or additionally) provide expected dimension associations that replace the full cross-join with only the combinations that should be present.

See How to Handle Dimension Associations for details on both approaches.

Skipping this check

Set check_dimension_associations: false in the project’s DatasetDimensionRequirements for this dataset. Alternatively, set __DSGRID_SKIP_CHECK_DATASET_CONSISTENCY__, which skips both this check and the schema checks above.

Environment Variable Reference

These environment variables are only allowed in offline mode. dsgrid will refuse to start an online (cloud) registration if any of them are set.

Variable	Effect
`__DSGRID_SKIP_CHECK_DATASET_CONSISTENCY__`	Skips schema checks and dimension association completeness.
`__DSGRID_SKIP_CHECK_DATASET_TIME_CONSISTENCY__`	Skips time consistency checks.
`__DSGRID_SKIP_CHECK_NULL_DIMENSION__`	Skips the NULL-value check on dimension columns after mapping application.

Warning

These variables exist as escape hatches for situations where a check is failing due to a known issue (such as intermittent Spark GC timeouts on very large datasets). Skipping checks means invalid data can enter the registry.
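
The "any value" semantics and the offline-only rule can be pictured with the sketch below. It assumes that only the presence of a variable matters (so even an empty value counts) and uses a hypothetical `RuntimeError` to stand in for whatever dsgrid actually raises; neither detail is confirmed by this page.

```python
import os

SKIP_VARIABLES = (
    "__DSGRID_SKIP_CHECK_DATASET_CONSISTENCY__",
    "__DSGRID_SKIP_CHECK_DATASET_TIME_CONSISTENCY__",
    "__DSGRID_SKIP_CHECK_NULL_DIMENSION__",
)


def active_skip_variables(environ=None):
    """Return the skip variables that are set, regardless of their value."""
    environ = os.environ if environ is None else environ
    return [name for name in SKIP_VARIABLES if name in environ]


def ensure_skips_allowed(offline_mode, environ=None):
    """Raise if any skip variable is set while not in offline mode."""
    active = active_skip_variables(environ)
    if active and not offline_mode:
        # Hypothetical error type, chosen for illustration only.
        raise RuntimeError(f"skip variables are only allowed in offline mode: {active}")
    return active
```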