Registration Checks¶
When you register a dataset with dsgrid registry datasets register, dsgrid
runs a series of validation checks on your configuration and data. This page
describes each check, the order in which they run, and what to do when one
fails.
Check Sequence¶
Registration proceeds in this order:
1. Configuration validation — Pydantic model validators fire when your configuration file is loaded.
2. Duplicate registration — dsgrid verifies the dataset is not already in the registry.
3. Time consistency — timestamps in the data are checked against the time dimension definition.
4. Write to registry — the dataset is written (in unpivoted format) to the registry store. This happens before the remaining checks so that they can operate on the canonical unpivoted form. If a later check fails, the written data is removed.
5. Required dimensions — dsgrid confirms every required dimension type is present (unless the project’s DatasetDimensionRequirements opts out via require_all_dimension_types).
6. Schema checks — column-level validation of the data tables (value columns, data types, NULLs, and table-format-specific consistency).
7. Dimension association completeness — the cross-join of all non-time dimension records must be present in the data (minus any explicitly declared missing associations).
Configuration Validation¶
These checks run automatically when your JSON/JSON5 configuration file is parsed.
- dataset_id format
Must be a lowercase identifier containing only letters, digits, hyphens, and underscores. Leading digits and leading hyphens are not allowed.
- Unique dimension filenames
Every dimension file listed in the configuration must have a unique filename.
- Unique dimension names
Every dimension’s name field must be unique within the dataset.
- Time dimension is not trivial
The time dimension type may not be “trivial” (the NoOp time type).
- Layout field mutual exclusivity
You may set data_layout (for initial registration) or registry_data_layout (internal), but not both.
- Pivoted-format fields
If data_layout is set to a pivoted format, pivoted_dimension_type must also be set.
- Two Table format fields
If you are using the Two Table format, the load_data_lookup field must point to the lookup file.
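For illustration, the dataset_id rule can be stated as a regular expression. This is a sketch of the documented rule, not dsgrid’s actual validator, and the function and pattern names are invented here:

```python
import re

# Sketch of the documented dataset_id rule (not dsgrid's actual code):
# lowercase letters, digits, hyphens, and underscores, with no leading
# digit and no leading hyphen.
_DATASET_ID_RE = re.compile(r"^[a-z_][a-z0-9_-]*$")

def is_valid_dataset_id(dataset_id: str) -> bool:
    """Return True if dataset_id matches the documented format rule."""
    return _DATASET_ID_RE.fullmatch(dataset_id) is not None
```

For example, comstock_reference_2024 passes, while 2024_comstock (leading digit) and ComStock (uppercase) do not.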
Time Consistency¶
These checks run on the data in its original format (before unpivoting), because checking timestamps on an unpivoted table with many value columns would multiply the work.
- Timestamp range
The time range in the data must match the range declared in the time dimension’s ranges field.
- Uniform time arrays
Every combination of non-time dimensions must have the same set of timestamps. dsgrid checks both the count and the content of each group’s time array to ensure uniformity.
- Model-year consistency (annual + historical)
For datasets with annual time resolution and a data_classification of historical, every row’s timestamp year must equal its model_year value.
- Chronify-based checks
When the time dimension supports the chronify library, dsgrid delegates validation to chronify, which performs its own range and completeness checks.
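The uniform-time-array rule can be sketched in pure Python. This is illustrative only: dsgrid performs the equivalent check on Spark tables, and the function and column names here are assumptions.

```python
from collections import defaultdict

def find_nonuniform_time_arrays(rows, dim_cols, time_col="timestamp"):
    """Flag dimension combinations whose timestamp set differs from the rest.

    Illustrative sketch of the documented check; dsgrid compares both the
    count and the content of each group's time array on Spark tables.
    """
    time_sets = defaultdict(set)
    for row in rows:
        # Group rows by their non-time dimension values.
        key = tuple(row[c] for c in dim_cols)
        time_sets[key].add(row[time_col])
    # Take the first group's timestamp set as the expected time array.
    expected = next(iter(time_sets.values()), set())
    return [key for key, stamps in time_sets.items() if stamps != expected]
```

A group with a missing or extra timestamp is returned so it can be reported.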
Skipping time checks¶
Set the environment variable
__DSGRID_SKIP_CHECK_DATASET_TIME_CONSISTENCY__ to any value, or set
check_time_consistency: false in the project’s
DatasetDimensionRequirements for this dataset.
Schema Checks¶
Schema checks verify the structure of the Parquet data tables. The exact checks depend on whether you are using the One Table or Two Table format.
One Table¶
- Value column present
The value_column declared in the configuration must exist in the load data table.
- No unexpected columns
The load data table may only contain dimension columns and the value column. Any extra columns cause an error.
- Dimension columns are strings
All dimension columns must have StringType.
- No NULL dimension values
Dimension columns may not contain NULL values.
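The One Table rules can be restated as a small checking function. Treat this as a sketch under assumptions: dsgrid runs these checks on Spark DataFrames, whereas this version inspects plain column lists and rows represented as dicts.

```python
def check_one_table_schema(columns, dimension_columns, value_column, rows):
    """Illustrative restatement of the One Table checks (not dsgrid's code)."""
    errors = []
    # Value column present.
    if value_column not in columns:
        errors.append(f"value column {value_column!r} is missing")
    # No unexpected columns beyond dimensions and the value column.
    extra = set(columns) - set(dimension_columns) - {value_column}
    if extra:
        errors.append(f"unexpected columns: {sorted(extra)}")
    # Dimension values must be non-NULL strings.
    for row in rows:
        for col in dimension_columns:
            val = row.get(col)
            if val is None:
                errors.append(f"NULL in dimension column {col!r}")
            elif not isinstance(val, str):
                errors.append(f"non-string value in {col!r}: {val!r}")
    return errors
```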
Two Table¶
All One Table checks apply to the joined table (load data joined with the
lookup table on the id column). In addition:
- Lookup id column
The lookup table must contain a column named id.
- Lookup dimension columns are strings
All dimension columns in the lookup table must have StringType.
- No NULL values in lookup
The lookup table may not contain NULL dimension values.
- ID set consistency
The set of id values in the load data table must exactly match the set of id values in the lookup table. dsgrid logs the specific differences when they do not match.
- Missing dimensions warning
If any expected dimension columns are absent from the lookup table, dsgrid logs a warning (but does not fail).
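The ID set consistency check amounts to a symmetric set difference. A minimal sketch (the function name is invented, and dsgrid computes this on Spark tables rather than Python sets):

```python
def diff_id_sets(load_data_ids, lookup_ids):
    """Report ids present in one table but not the other, in both directions.

    Mirrors the documented behavior of logging the specific differences.
    """
    load_data_ids, lookup_ids = set(load_data_ids), set(lookup_ids)
    return {
        "only_in_load_data": sorted(load_data_ids - lookup_ids),
        "only_in_lookup": sorted(lookup_ids - load_data_ids),
    }
```

Both directions matter: an id in the load data with no lookup row is as much an error as a lookup row that matches no data.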
Skipping schema checks¶
Set the environment variable __DSGRID_SKIP_CHECK_DATASET_CONSISTENCY__ to
any value. This skips both the schema checks and the dimension association
check described next.
Dimension Association Completeness¶
This check verifies that the data contains every required combination of non-time dimension records.
How it works¶
Per-column check — For each dimension type, dsgrid compares the distinct values in the data against the declared dimension records. This catches simple cases (a missing geography ID, for example) and produces clear error messages.
Full cross-join check — dsgrid computes the expected cross-join of all non-time dimension records, subtracts any rows listed in the dataset’s missing-dimension-associations tables, and compares the result to the distinct dimension combinations actually present in the data.
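The cross-join check can be sketched with itertools.product. This is a pure-Python illustration of the documented logic, not dsgrid’s implementation, which operates on Spark tables:

```python
from itertools import product

def find_missing_combinations(records_by_dimension, present_rows,
                              declared_missing=frozenset()):
    """Return expected dimension combinations that are absent from the data.

    Expected = cross-join of all non-time dimension records, minus any
    declared missing associations. Illustrative sketch only.
    """
    dims = sorted(records_by_dimension)
    expected = set(product(*(records_by_dimension[d] for d in dims)))
    return sorted(expected - set(declared_missing) - set(present_rows))
```

Note that the expected set grows multiplicatively with each dimension, which is why the per-column check runs first to catch simple cases cheaply.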
When it fails¶
If the data is missing required dimension combinations, dsgrid:
1. Writes the missing rows to a Parquet file named {dataset_id}__missing_dimension_record_combinations.parquet in the current working directory.
2. Runs the Rust-based find_minimal_patterns analysis on the missing rows to identify the smallest sets of dimension values that explain the gaps. The top 10 patterns are logged.
3. Raises DSGInvalidDataset with a pointer to the log file for details.
Tip
The patterns output is the fastest way to diagnose the problem. A pattern like
county = 06037 (500 missing rows) tells you that every combination involving
county 06037 is absent — likely the county ID is wrong or the county was
omitted from your data.
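The idea behind the pattern output can be sketched in a few lines: count how many missing rows each single dimension value appears in. This is a much-simplified stand-in for find_minimal_patterns, which also searches for minimal multi-dimension patterns; the function name here is invented.

```python
from collections import Counter

def single_value_patterns(missing_rows, dim_names, top_n=10):
    """Count how many missing rows each (dimension, value) pair explains.

    Simplified stand-in for the Rust-based find_minimal_patterns analysis.
    """
    counts = Counter()
    for row in missing_rows:
        for dim, value in zip(dim_names, row):
            counts[(dim, value)] += 1
    # The pairs covering the most missing rows are the best diagnostic leads.
    return counts.most_common(top_n)
```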
Declaring expected missing associations¶
If your dataset intentionally omits certain dimension combinations (for example, a technology that does not apply in certain states), you can declare them as missing dimension associations so that dsgrid subtracts them before checking. See How to Handle Missing Dimension Associations for details.
Skipping this check¶
Set check_dimension_associations: false in the project’s
DatasetDimensionRequirements for this dataset.
Alternatively, set __DSGRID_SKIP_CHECK_DATASET_CONSISTENCY__, which skips
both this check and the schema checks above.
Environment Variable Reference¶
These environment variables are only allowed in offline mode. dsgrid will refuse to start an online (cloud) registration if any of them are set.
| Variable | Effect |
|---|---|
| __DSGRID_SKIP_CHECK_DATASET_CONSISTENCY__ | Skips schema checks and dimension association completeness. |
| __DSGRID_SKIP_CHECK_DATASET_TIME_CONSISTENCY__ | Skips time consistency checks. |
| | Skips the NULL-value check on dimension columns after mapping application. |
Warning
These variables exist as escape hatches for situations where a check is failing due to a known issue (such as intermittent Spark GC timeouts on very large datasets). Skipping checks means invalid data can enter the registry.