# Registration Checks

When you register a dataset with `dsgrid registry datasets register`, dsgrid runs a series of validation checks on your configuration and data. This page describes each check, the order in which they run, and what to do when one fails.

## Check Sequence

Registration proceeds in this order:

1. **Configuration validation**: Pydantic model validators fire when your configuration file is loaded.
2. **Duplicate registration**: dsgrid verifies the dataset is not already in the registry.
3. **Time consistency**: timestamps in the data are checked against the time dimension definition.
4. **Write to registry**: the dataset is written (in unpivoted format) to the registry store. This happens *before* the remaining checks so that they can operate on the canonical unpivoted form. If a later check fails, the written data is removed.
5. **Required dimensions**: dsgrid confirms every required dimension type is present (unless the project's `DatasetDimensionRequirements` opts out via `require_all_dimension_types`).
6. **Schema checks**: column-level validation of the data tables (value columns, data types, NULLs, and table-format-specific consistency).
7. **Dimension association completeness**: the cross-join of all non-time dimension records must be present in the data (minus any explicitly declared missing associations).

## Configuration Validation

These checks run automatically when your JSON/JSON5 configuration file is parsed.

`dataset_id` format
: Must be a lowercase identifier containing only letters, digits, hyphens, and underscores. Leading digits and leading hyphens are not allowed.

Unique dimension filenames
: Every dimension file listed in the configuration must have a unique filename.

Unique dimension names
: Every dimension's `name` field must be unique within the dataset.

Time dimension is not trivial
: The time dimension type may not be "trivial" (the `NoOp` time type).

Layout field mutual exclusivity
: You may set `data_layout` (for initial registration) *or* `registry_data_layout` (internal), but not both.

Pivoted-format fields
: If `data_layout` is set to a pivoted format, `pivoted_dimension_type` must also be set.

Two Table format fields
: If you are using the Two Table format, the `load_data_lookup` field must point to the lookup file.

## Time Consistency

These checks run on the data in its original format (before unpivoting), because unpivoting a table with many value columns multiplies the row count, and with it the work of checking timestamps.

Timestamp range
: The time range in the data must match the range declared in the time dimension's `ranges` field.

Uniform time arrays
: Every combination of non-time dimensions must have the same set of timestamps. dsgrid checks both the count and the content of each group's time array to ensure uniformity.

Model-year consistency (annual + historical)
: For datasets with annual time resolution and a `data_classification` of `historical`, every row's timestamp year must equal its `model_year` value.

Chronify-based checks
: When the time dimension supports the chronify library, dsgrid delegates validation to chronify, which performs its own range and completeness checks.

### Skipping time checks

Set the environment variable `__DSGRID_SKIP_CHECK_DATASET_TIME_CONSISTENCY__` to any value, or set `check_time_consistency: false` in the project's `DatasetDimensionRequirements` for this dataset.

## Schema Checks

Schema checks verify the structure of the Parquet data tables. The exact checks depend on whether you are using the One Table or Two Table format.

### One Table

Value column present
: The `value_column` declared in the configuration must exist in the load data table.

No unexpected columns
: The load data table may only contain dimension columns and the value column. Any extra columns cause an error.

Dimension columns are strings
: All dimension columns must have `StringType`.

No NULL dimension values
: Dimension columns may not contain NULL values.

### Two Table

All One Table checks apply to the joined table (load data joined with the lookup table on the `id` column). In addition:

Lookup `id` column
: The lookup table must contain a column named `id`.

Lookup dimension columns are strings
: All dimension columns in the lookup table must have `StringType`.

No NULL values in lookup
: The lookup table may not contain NULL dimension values.

ID set consistency
: The set of `id` values in the load data table must exactly match the set of `id` values in the lookup table. dsgrid logs the specific differences when they do not match.

Missing dimensions warning
: If any expected dimension columns are absent from the lookup table, dsgrid logs a warning (but does not fail).

### Skipping schema checks

Set the environment variable `__DSGRID_SKIP_CHECK_DATASET_CONSISTENCY__` to any value. This skips both the schema checks and the dimension association check described next.

## Dimension Association Completeness

This check verifies that the data contains every required combination of non-time dimension records.

### How it works

1. **Per-column check**: For each dimension type, dsgrid compares the distinct values in the data against the declared dimension records. This catches simple cases (a missing geography ID, for example) and produces clear error messages.
2. **Full cross-join check**: dsgrid computes the expected cross-join of all non-time dimension records, subtracts any rows listed in the dataset's missing-dimension-associations tables, and compares the result to the distinct dimension combinations actually present in the data.

### When it fails

If the data is missing required dimension combinations, dsgrid:

- Writes the missing rows to a Parquet file named `{dataset_id}__missing_dimension_record_combinations.parquet` in the current working directory.
- Runs the Rust-based `find_minimal_patterns` analysis on the missing rows to identify the smallest sets of dimension values that explain the gaps. The top 10 patterns are logged.
- Raises `DSGInvalidDataset` with a pointer to the log file for details.

```{tip}
The patterns output is the fastest way to diagnose the problem. A pattern like
`county = 06037 (500 missing rows)` tells you that every combination involving
county 06037 is absent, most likely because the county ID is wrong or the county
was omitted from your data.
```

### Declaring expected missing associations

If your dataset intentionally omits certain dimension combinations (for example, a technology that does not apply in certain states), you can declare them as *missing dimension associations* so that dsgrid subtracts them before checking. See {doc}`../how_tos/how_to_missing_associations` for details.

### Skipping this check

Set `check_dimension_associations: false` in the project's `DatasetDimensionRequirements` for this dataset. Alternatively, set `__DSGRID_SKIP_CHECK_DATASET_CONSISTENCY__`, which skips both this check and the schema checks above.

## Environment Variable Reference

These environment variables are **only allowed in offline mode**. dsgrid will refuse to start an online (cloud) registration if any of them are set.

```{list-table}
:header-rows: 1

* - Variable
  - Effect
* - `__DSGRID_SKIP_CHECK_DATASET_CONSISTENCY__`
  - Skips schema checks and dimension association completeness.
* - `__DSGRID_SKIP_CHECK_DATASET_TIME_CONSISTENCY__`
  - Skips time consistency checks.
* - `__DSGRID_SKIP_CHECK_NULL_DIMENSION__`
  - Skips the NULL-value check on dimension columns after mapping application.
```

```{warning}
These variables exist as escape hatches for situations where a check is failing
due to a known issue (such as intermittent Spark GC timeouts on very large
datasets). Skipping checks means invalid data can enter the registry.
```
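
Because these variables are read from the process environment, one way to limit their reach is to set them for a single registration run instead of exporting them in your shell. The following is a minimal sketch and not part of dsgrid: the helper names `env_with_skips` and `active_skips` are hypothetical, and the value `"1"` is arbitrary, on the assumption (suggested by "to any value" above) that only the presence of a variable matters.

```python
import os

# Variable names are copied from the reference table on this page.
# The short keys ("consistency", "time", ...) are labels invented for this sketch.
SKIP_VARIABLES = {
    "consistency": "__DSGRID_SKIP_CHECK_DATASET_CONSISTENCY__",
    "time": "__DSGRID_SKIP_CHECK_DATASET_TIME_CONSISTENCY__",
    "null_dimension": "__DSGRID_SKIP_CHECK_NULL_DIMENSION__",
}


def env_with_skips(*checks, base_env=None):
    """Return a copy of an environment mapping with the requested skip variables set.

    The original mapping is never modified, so the skip applies only to a
    process that is explicitly started with the returned mapping.
    """
    env = dict(os.environ if base_env is None else base_env)
    for check in checks:
        # Assumption: any value enables the skip; "1" is an arbitrary choice.
        env[SKIP_VARIABLES[check]] = "1"
    return env


def active_skips(env=None):
    """List the skip variables that are present in an environment mapping."""
    env = os.environ if env is None else env
    return sorted(name for name in SKIP_VARIABLES.values() if name in env)
```

The returned mapping can be passed as the `env` argument of `subprocess.run` when invoking `dsgrid registry datasets register` with your usual arguments. Logging `active_skips()` before a run is one way to make sure a skipped check never goes unnoticed.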