Registration Checks

When you register a dataset with the dsgrid registry datasets register command, dsgrid runs a series of validation checks on your configuration and data. This page describes each check, the order in which the checks run, and what to do when one fails.

Check Sequence

Registration proceeds in this order:

  1. Configuration validation — Pydantic model validators fire when your configuration file is loaded.

  2. Duplicate registration — dsgrid verifies the dataset is not already in the registry.

  3. Time consistency — timestamps in the data are checked against the time dimension definition.

  4. Write to registry — the dataset is written (in unpivoted format) to the registry store. This happens before the remaining checks so that they can operate on the canonical unpivoted form. If a later check fails, the written data is removed.

  5. Required dimensions — dsgrid confirms every required dimension type is present (unless the project’s DatasetDimensionRequirements opts out via require_all_dimension_types).

  6. Schema checks — column-level validation of the data tables (value columns, data types, NULLs, and table-format-specific consistency).

  7. Dimension association completeness — the cross-join of all non-time dimension records must be present in the data (minus any explicitly declared missing associations).
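The write-then-validate ordering in steps 4 through 7 can be sketched as follows. This is a minimal illustration, not dsgrid's implementation: the register function, its callable arguments, and the InvalidDataset class are hypothetical names introduced here.

```python
from typing import Callable


class InvalidDataset(Exception):
    """Hypothetical stand-in for dsgrid's DSGInvalidDataset."""


def register(
    pre_write_checks: list[Callable[[], None]],
    write: Callable[[], None],
    remove: Callable[[], None],
    post_write_checks: list[Callable[[], None]],
) -> None:
    """Run checks in order and roll back the write if a later check fails."""
    # Steps 1-3: a failure here leaves the registry untouched.
    for check in pre_write_checks:
        check()
    # Step 4: write the unpivoted data so the remaining checks can read it.
    write()
    try:
        # Steps 5-7 operate on the written data.
        for check in post_write_checks:
            check()
    except Exception:
        # Remove the written data, then re-raise the original error.
        remove()
        raise
```

The point of the sketch is the try/except around the later checks: a failure after the write triggers cleanup before the error reaches the caller.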

Configuration Validation

These checks run automatically when your JSON/JSON5 configuration file is parsed.

dataset_id format

The dataset_id must be a lowercase identifier containing only letters, digits, hyphens, and underscores. It may not start with a digit or a hyphen.
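As a rough illustration of the rule, the following sketch encodes it as a regular expression. The pattern is inferred from the prose above and is an assumption; the pattern dsgrid actually applies may differ, for example in how it treats a leading underscore. The names DATASET_ID_PATTERN and is_valid_dataset_id are hypothetical.

```python
import re

# Assumed pattern: first character is a lowercase letter or underscore,
# the rest are lowercase letters, digits, underscores, or hyphens.
DATASET_ID_PATTERN = re.compile(r"[a-z_][a-z0-9_-]*")


def is_valid_dataset_id(dataset_id: str) -> bool:
    """Return True if the whole string matches the assumed pattern."""
    return DATASET_ID_PATTERN.fullmatch(dataset_id) is not None
```

Under this reading, an ID such as comstock_2022 passes, while 2022_comstock, -comstock, and ComStock are rejected.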

Unique dimension filenames

Every dimension file listed in the configuration must have a unique filename.

Unique dimension names

Every dimension’s name field must be unique within the dataset.

Time dimension is not trivial

The time dimension type may not be “trivial” (the NoOp time type).

Layout field mutual exclusivity

You may set data_layout (for initial registration) or registry_data_layout (internal), but not both.

Pivoted-format fields

If data_layout is set to a pivoted format, pivoted_dimension_type must also be set.

Two Table format fields

If you are using the Two Table format, the load_data_lookup field must point to the lookup file.

Time Consistency

These checks run on the data in its original format (before unpivoting). Unpivoting a table with many value columns multiplies the number of rows, so checking timestamps on the unpivoted table would multiply the work.

Timestamp range

The time range in the data must match the range declared in the time dimension’s ranges field.

Uniform time arrays

Every combination of non-time dimensions must have the same set of timestamps. dsgrid checks both the count and the content of each group’s time array to ensure uniformity.
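The uniformity rule can be sketched in plain Python. This is an illustration only: dsgrid runs the check on Spark tables, and the function name, the row-as-dict representation, and the column names are assumptions made for the example.

```python
from collections import defaultdict


def find_nonuniform_groups(rows, time_column, dimension_columns):
    """Return dimension combinations whose time array differs from the first group's."""
    groups = defaultdict(list)
    for row in rows:
        key = tuple(row[column] for column in dimension_columns)
        groups[key].append(row[time_column])
    if not groups:
        return []
    items = list(groups.items())
    # Use the first group as the reference. Comparing sorted lists checks
    # both the count and the content of each time array.
    reference = sorted(items[0][1])
    return [key for key, times in items[1:] if sorted(times) != reference]
```

An empty result means every group shares the same timestamps. A non-empty result lists the combinations to inspect.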

Model-year consistency (annual + historical)

For datasets with annual time resolution and a data_classification of historical, every row’s timestamp year must equal its model_year value.
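A minimal sketch of this rule, assuming rows are dictionaries with a datetime in a timestamp column and the model year stored as a string in a model_year column. Both column names and the function name are assumptions for the example, not dsgrid identifiers.

```python
from datetime import datetime


def find_model_year_mismatches(rows, time_column="timestamp", model_year_column="model_year"):
    """Return the rows whose timestamp year differs from their model year."""
    return [
        row
        for row in rows
        if row[time_column].year != int(row[model_year_column])
    ]
```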

Chronify-based checks

When the time dimension supports the chronify library, dsgrid delegates validation to chronify, which performs its own range and completeness checks.

Skipping time checks

Set the environment variable __DSGRID_SKIP_CHECK_DATASET_TIME_CONSISTENCY__ to any value, or set check_time_consistency: false in the project’s DatasetDimensionRequirements for this dataset.

Schema Checks

Schema checks verify the structure of the Parquet data tables. The exact checks depend on whether you are using the One Table or Two Table format.

One Table

Value column present

The value_column declared in the configuration must exist in the load data table.

No unexpected columns

The load data table may only contain dimension columns and the value column. Any extra columns cause an error.

Dimension columns are strings

All dimension columns must have StringType.

No NULL dimension values

Dimension columns may not contain NULL values.
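The four One Table rules can be sketched against a simplified table representation: a schema dictionary mapping column names to type names, plus a list of row dictionaries. This is an assumption-laden illustration. dsgrid inspects Spark schemas rather than the "string" labels used here, the function name is hypothetical, and time columns are left out of the sketch.

```python
def check_one_table(schema, rows, dimension_columns, value_column):
    """Return a list of error messages; an empty list means the table passed."""
    errors = []
    # Value column present.
    if value_column not in schema:
        errors.append(f"value column {value_column!r} is missing")
    # No unexpected columns.
    allowed = set(dimension_columns) | {value_column}
    for column in schema:
        if column not in allowed:
            errors.append(f"unexpected column {column!r}")
    for column in dimension_columns:
        if column not in schema:
            continue
        # Dimension columns are strings.
        if schema[column] != "string":
            errors.append(f"dimension column {column!r} has type {schema[column]!r}")
        # No NULL dimension values (None stands in for NULL here).
        if any(row.get(column) is None for row in rows):
            errors.append(f"dimension column {column!r} contains NULL values")
    return errors
```

Collecting every error instead of stopping at the first one is a design choice of the sketch; it says nothing about whether dsgrid fails fast.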

Two Table

All One Table checks apply to the joined table (load data joined with the lookup table on the id column). In addition:

Lookup id column

The lookup table must contain a column named id.

Lookup dimension columns are strings

All dimension columns in the lookup table must have StringType.

No NULL values in lookup

The lookup table may not contain NULL dimension values.

ID set consistency

The set of id values in the load data table must exactly match the set of id values in the lookup table. dsgrid logs the specific differences when they do not match.
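The comparison amounts to two set differences. The sketch below shows the idea with plain Python sets; the function name is hypothetical and the real check runs on Spark tables.

```python
def compare_id_sets(load_data_ids, lookup_ids):
    """Return (ids only in load data, ids only in lookup), each sorted."""
    load_data_ids = set(load_data_ids)
    lookup_ids = set(lookup_ids)
    only_in_load_data = sorted(load_data_ids - lookup_ids)
    only_in_lookup = sorted(lookup_ids - load_data_ids)
    return only_in_load_data, only_in_lookup
```

The check passes only when both returned lists are empty.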

Missing dimensions warning

If any expected dimension columns are absent from the lookup table, dsgrid logs a warning (but does not fail).

Skipping schema checks

Set the environment variable __DSGRID_SKIP_CHECK_DATASET_CONSISTENCY__ to any value. This skips both the schema checks and the dimension association check described next.

Dimension Association Completeness

This check verifies that the data contains every required combination of non-time dimension records.

How it works

  1. Per-column check — For each dimension type, dsgrid compares the distinct values in the data against the declared dimension records. This catches simple cases (a missing geography ID, for example) and produces clear error messages.

  2. Full cross-join check — dsgrid computes the expected cross-join of all non-time dimension records, subtracts any rows listed in the dataset’s missing-dimension-associations tables, and compares the result to the distinct dimension combinations actually present in the data.
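The two steps above can be sketched with in-memory data. This is a simplified illustration under stated assumptions: dimension records are lists of IDs keyed by column name, rows are dictionaries, and the function names are hypothetical. dsgrid performs the equivalent work on Spark tables.

```python
from itertools import product


def per_column_gaps(dimension_records, actual_rows):
    """Step 1: declared record IDs that never appear in the data, per column."""
    gaps = {}
    for column, records in dimension_records.items():
        present = {row[column] for row in actual_rows}
        missing = sorted(set(records) - present)
        if missing:
            gaps[column] = missing
    return gaps


def find_missing_associations(dimension_records, declared_missing, actual_rows):
    """Step 2: expected cross-join, minus declared gaps, minus what the data has."""
    columns = list(dimension_records)
    expected = set(product(*(dimension_records[c] for c in columns)))
    expected -= {tuple(row[c] for c in columns) for row in declared_missing}
    actual = {tuple(row[c] for c in columns) for row in actual_rows}
    return columns, sorted(expected - actual)
```

Step 1 is cheap and names the offending column directly. Step 2 catches gaps that only show up in combinations, where every individual value is present somewhere in the data.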

When it fails

If the data is missing required dimension combinations, dsgrid:

  • Writes the missing rows to a Parquet file named {dataset_id}__missing_dimension_record_combinations.parquet in the current working directory.

  • Runs the Rust-based find_minimal_patterns analysis on the missing rows to identify the smallest sets of dimension values that explain the gaps. The top 10 patterns are logged.

  • Raises DSGInvalidDataset with a pointer to the log file for details.

Tip

The patterns output is the fastest way to diagnose the problem. A pattern like county = 06037 (500 missing rows) tells you that every combination involving county 06037 is absent — likely the county ID is wrong or the county was omitted from your data.
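To show what a single-value pattern means, here is a deliberately simplified stand-in. It is not the Rust find_minimal_patterns algorithm and handles only patterns made of one dimension value. It also ignores declared missing associations. The function name and argument shapes are assumptions for the example.

```python
from collections import Counter
from math import prod


def single_value_patterns(columns, missing_rows, dimension_sizes):
    """Find (column, value) pairs for which every expected combination is missing."""
    counts = Counter()
    for row in missing_rows:
        for column, value in zip(columns, row):
            counts[(column, value)] += 1
    total = prod(dimension_sizes[c] for c in columns)
    patterns = []
    for (column, value), count in counts.items():
        # A value fully explains its gaps when the missing-row count equals
        # the number of cross-join rows that contain that value.
        if count == total // dimension_sizes[column]:
            patterns.append((column, value, count))
    # Largest patterns first, then alphabetical for a stable order.
    return sorted(patterns, key=lambda p: (-p[2], p[0], p[1]))
```

With three counties, two sectors, and two subsectors, a county that is entirely absent accounts for four missing rows, so it surfaces as a single pattern rather than four separate rows.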

Declaring expected missing associations

If your dataset intentionally omits certain dimension combinations (for example, a technology that does not apply in certain states), you can declare them as missing dimension associations so that dsgrid subtracts them before checking. See How to Handle Missing Dimension Associations for details.

Skipping this check

Set check_dimension_associations: false in the project’s DatasetDimensionRequirements for this dataset. Alternatively, set __DSGRID_SKIP_CHECK_DATASET_CONSISTENCY__, which skips both this check and the schema checks above.

Environment Variable Reference

These environment variables are only allowed in offline mode. dsgrid will refuse to start an online (cloud) registration if any of them are set.

Variable                                          Effect

__DSGRID_SKIP_CHECK_DATASET_CONSISTENCY__         Skips schema checks and dimension association completeness.

__DSGRID_SKIP_CHECK_DATASET_TIME_CONSISTENCY__    Skips time consistency checks.

__DSGRID_SKIP_CHECK_NULL_DIMENSION__              Skips the NULL-value check on dimension columns after mapping application.
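The offline-only rule can be pictured with the sketch below. It is an illustration, not dsgrid code: the function names are hypothetical, and reading "set to any value" as a presence test (so that even an empty string counts) is an assumption.

```python
import os

SKIP_VARIABLES = (
    "__DSGRID_SKIP_CHECK_DATASET_CONSISTENCY__",
    "__DSGRID_SKIP_CHECK_DATASET_TIME_CONSISTENCY__",
    "__DSGRID_SKIP_CHECK_NULL_DIMENSION__",
)


def active_skip_variables(environ=None):
    """Return the skip variables present in the environment."""
    environ = os.environ if environ is None else environ
    # Assumption: presence alone counts, regardless of the value.
    return [name for name in SKIP_VARIABLES if name in environ]


def ensure_allowed(offline_mode, environ=None):
    """Raise if any skip variable is set outside offline mode."""
    active = active_skip_variables(environ)
    if active and not offline_mode:
        raise RuntimeError(
            "skip variables are only allowed in offline mode: " + ", ".join(active)
        )
    return active
```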

Warning

These variables exist as escape hatches for situations where a check is failing due to a known issue (such as intermittent Spark GC timeouts on very large datasets). Skipping checks means invalid data can enter the registry.