Data Validation with dsgrid

STRIDE uses dsgrid to validate and register datasets. Validation checks that datasets are consistent across dimensions such as time, geography, and sector before STRIDE computes energy projections.

What is dsgrid?

dsgrid is a framework for managing demand-side grid data. STRIDE leverages dsgrid’s registry system to:

  • Validate dimensions - Ensure datasets have consistent time periods, geographies, and sectors

  • Map dimensions - Transform dataset dimensions to match project requirements

  • Query data - Extract and combine data from multiple registered datasets

The Validation Process

When you create a STRIDE project, the following validation steps occur:

1. Registry Creation

STRIDE creates a local dsgrid registry backed by DuckDB:

<project>/registry_data/data.duckdb

This registry stores metadata about registered datasets and their dimensions.
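
As a minimal sketch of the layout described above, the registry file location can be derived from the project directory. The helper name `registry_db_path` and the project name `my_project` are hypothetical and used only for illustration; they are not part of STRIDE's API.

```python
from pathlib import Path


def registry_db_path(project_dir: str) -> Path:
    """Return the expected location of the DuckDB registry file for a project."""
    # Layout taken from the text above: <project>/registry_data/data.duckdb
    return Path(project_dir) / "registry_data" / "data.duckdb"


# "my_project" is a placeholder directory name.
path = registry_db_path("my_project")
print(path.as_posix())  # my_project/registry_data/data.duckdb
print(path.exists())    # True only if the project has already been created
```

If the `duckdb` Python package is installed, a file at this path can be opened for inspection with `duckdb.connect(str(path), read_only=True)`. Which tables the registry contains is not specified here, so list them rather than assuming names.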

2. Bulk Registration

STRIDE registers the datasets in the data directory with dsgrid using bulk registration. This process:

  • Parses dataset configurations

  • Validates dimension consistency

  • Records dimension mappings

3. Dimension Mapping

STRIDE reads dimension_mappings.json5 from the dataset directory to understand how to map dimensions between datasets. Common mapping types include:

  • many_to_one_aggregation - Combine multiple source values into one target value

  • one_to_one - Direct mapping between source and target dimensions
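
To illustrate what a many_to_one_aggregation mapping does conceptually, here is a small sketch in plain Python. It is not dsgrid's implementation: the county and state identifiers, the function name `aggregate_many_to_one`, and the choice to sum values are all assumptions made for the example.

```python
from collections import defaultdict

# Hypothetical many-to-one mapping: several source county IDs roll up to one state ID.
county_to_state = {"06001": "CA", "06003": "CA", "36001": "NY"}


def aggregate_many_to_one(records, mapping):
    """Sum the values of all source IDs that map to the same target ID."""
    totals = defaultdict(float)
    for source_id, value in records:
        if source_id not in mapping:
            # A source value with no mapping entry cannot be placed in the target dimension.
            raise KeyError(f"no mapping for source value {source_id!r}")
        totals[mapping[source_id]] += value
    return dict(totals)


records = [("06001", 10.0), ("06003", 5.0), ("36001", 7.5)]
print(aggregate_many_to_one(records, county_to_state))
# {'CA': 15.0, 'NY': 7.5}
```

A one_to_one mapping is the special case in which every target value has exactly one source value, so no values are combined.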

4. Query and Table Creation

After validation, STRIDE queries the dsgrid registry and creates DuckDB tables for each dataset:

dsgrid_data.baseline__energy_intensity__1_0_0
dsgrid_data.baseline__gdp__1_0_0
dsgrid_data.baseline__load_shapes__1_0_0
...
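
The table names above appear to follow a `<scenario>__<dataset>__<version>` pattern. The sketch below rebuilds a name under that assumption. The pattern is inferred from the listed examples rather than confirmed, and `dsgrid_table_name` is a hypothetical helper, not a STRIDE function.

```python
def dsgrid_table_name(scenario: str, dataset: str, version: str) -> str:
    """Build a qualified table name such as dsgrid_data.baseline__gdp__1_0_0."""
    # Assumption: the trailing 1_0_0 is a version string with dots replaced by underscores.
    version_part = version.replace(".", "_")
    return f"dsgrid_data.{scenario}__{dataset}__{version_part}"


print(dsgrid_table_name("baseline", "gdp", "1.0.0"))
# dsgrid_data.baseline__gdp__1_0_0
```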

What Gets Validated

Time Consistency

By default, STRIDE checks that time dimensions are consistent across datasets. The check confirms that:

  • All datasets cover the same time periods

  • Timestamps align properly for joining
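
The idea behind a time consistency check can be sketched as a set comparison. This is an illustration only, not the check dsgrid performs: the dataset names, the model years, and the function `find_time_mismatches` are assumptions for the example.

```python
def find_time_mismatches(datasets):
    """Compare each dataset's time periods against those of the first dataset."""
    names = list(datasets)
    if not names:
        return {}
    reference = set(datasets[names[0]])
    mismatches = {}
    for name in names[1:]:
        periods = set(datasets[name])
        missing = sorted(reference - periods)  # in the reference but not in this dataset
        extra = sorted(periods - reference)    # in this dataset but not in the reference
        if missing or extra:
            mismatches[name] = {"missing": missing, "extra": extra}
    return mismatches


# Hypothetical model years per dataset.
datasets = {
    "gdp": [2025, 2030, 2035],
    "energy_intensity": [2025, 2030, 2035],
    "load_shapes": [2025, 2030],
}
print(find_time_mismatches(datasets))
# {'load_shapes': {'missing': [2035], 'extra': []}}
```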

Dimension Associations

STRIDE can optionally validate that dimension associations are consistent. This validation checks that:

  • Geographic identifiers match across datasets

  • Sector definitions are compatible

  • All required dimension combinations exist
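
As a rough illustration of the "required dimension combinations" check, the sketch below compares the combinations present in a dataset with the full cross product of two dimensions. The column names `geography` and `sector`, the sample values, and the function `missing_combinations` are hypothetical. The real check may also require only a subset of the cross product.

```python
from itertools import product


def missing_combinations(rows, geographies, sectors):
    """Return the (geography, sector) pairs that are required but absent from rows."""
    required = set(product(geographies, sectors))
    present = {(row["geography"], row["sector"]) for row in rows}
    return sorted(required - present)


# Hypothetical dataset rows; only the dimension columns matter here.
rows = [
    {"geography": "CA", "sector": "residential"},
    {"geography": "CA", "sector": "commercial"},
    {"geography": "NY", "sector": "residential"},
]
print(missing_combinations(rows, ["CA", "NY"], ["residential", "commercial"]))
# [('NY', 'commercial')]
```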

Validation Errors

If validation fails, you’ll see errors indicating:

  • Missing dimensions - A required dimension is not present in the dataset

  • Inconsistent time periods - Datasets have mismatched time ranges

  • Invalid mappings - Dimension mappings reference non-existent values
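
To make the "invalid mappings" case concrete, here is a small sketch that flags mapping entries whose source or target value does not exist in the corresponding dimension. The identifiers and the function `invalid_mapping_entries` are assumptions for illustration and do not reproduce dsgrid's actual error output.

```python
def invalid_mapping_entries(mapping, source_ids, target_ids):
    """Return the (source, target) pairs that reference a non-existent value."""
    bad = []
    for source, target in mapping.items():
        if source not in source_ids or target not in target_ids:
            bad.append((source, target))
    return bad


# Hypothetical dimension values and mapping.
source_ids = {"06001", "36001"}
target_ids = {"CA", "NY"}
mapping = {"06001": "CA", "99999": "CA", "36001": "ZZ"}

print(invalid_mapping_entries(mapping, source_ids, target_ids))
# [('99999', 'CA'), ('36001', 'ZZ')]
```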

Scenarios and Alternative Datasets

STRIDE supports multiple scenarios, each potentially using different input datasets:

{
  "scenarios": [
    {"name": "baseline"},
    {"name": "high_growth", "gdp": "path/to/alternative_gdp.parquet"}
  ]
}

When a scenario specifies an alternative dataset:

  1. The alternative is registered as a separate dataset

  2. A view is created pointing to the alternative data

  3. dbt uses the scenario-specific data for calculations

For datasets not overridden in a scenario, STRIDE creates views pointing to the baseline data to avoid redundant processing.
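
The override behavior described above can be sketched as a lookup per scenario and dataset. This is an illustration under assumptions, not STRIDE's code: the dataset list is taken from the table names shown earlier, and `resolve_sources` is a hypothetical helper.

```python
import json

# Same configuration as in the example above.
config = json.loads("""
{
  "scenarios": [
    {"name": "baseline"},
    {"name": "high_growth", "gdp": "path/to/alternative_gdp.parquet"}
  ]
}
""")

# Assumed dataset names, based on the tables listed earlier in this page.
DATASETS = ["energy_intensity", "gdp", "load_shapes"]


def resolve_sources(config, datasets):
    """For each scenario, record whether a dataset is overridden or falls back to baseline."""
    plan = {}
    for scenario in config["scenarios"]:
        sources = {}
        for dataset in datasets:
            if dataset in scenario:
                # The scenario names an alternative file for this dataset.
                sources[dataset] = ("alternative", scenario[dataset])
            else:
                # No override: reuse the baseline data.
                sources[dataset] = ("baseline", None)
        plan[scenario["name"]] = sources
    return plan


plan = resolve_sources(config, DATASETS)
print(plan["high_growth"]["gdp"])          # ('alternative', 'path/to/alternative_gdp.parquet')
print(plan["high_growth"]["load_shapes"])  # ('baseline', None)
```

In this sketch, only the `gdp` dataset of the `high_growth` scenario would need a separate registration. Every other entry corresponds to a view over baseline data.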