Data Validation with dsgrid¶
STRIDE uses dsgrid to validate and register datasets. This ensures data consistency across dimensions like time, geography, and sector before computing energy projections.
What is dsgrid?¶
dsgrid is a framework for managing demand-side grid data. STRIDE leverages dsgrid’s registry system to:
Validate dimensions - Ensure datasets have consistent time periods, geographies, and sectors
Map dimensions - Transform dataset dimensions to match project requirements
Query data - Extract and combine data from multiple registered datasets
The Validation Process¶
When you create a STRIDE project, the following validation steps occur:
1. Registry Creation¶
STRIDE creates a local dsgrid registry backed by DuckDB:
<project>/registry_data/data.duckdb
This registry stores metadata about registered datasets and their dimensions.
2. Bulk Registration¶
Datasets from the data directory are registered with dsgrid using bulk registration. This process:
Parses dataset configurations
Validates dimension consistency
Records dimension mappings
3. Dimension Mapping¶
STRIDE reads dimension_mappings.json5 from the dataset directory to understand how to map dimensions between datasets. Common mapping types include:
many_to_one_aggregation - Combine multiple source values into one target value
one_to_one - Direct mapping between source and target dimensions
4. Query and Table Creation¶
After validation, STRIDE queries the dsgrid registry and creates DuckDB tables for each dataset:
dsgrid_data.baseline__energy_intensity__1_0_0
dsgrid_data.baseline__gdp__1_0_0
dsgrid_data.baseline__load_shapes__1_0_0
...
What Gets Validated¶
Time Consistency¶
By default, STRIDE checks that time dimensions are consistent across datasets. This ensures:
All datasets cover the same time periods
Timestamps align properly for joining
Dimension Associations¶
Optionally, STRIDE can validate that dimension associations are consistent. This checks that:
Geographic identifiers match across datasets
Sector definitions are compatible
All required dimension combinations exist
Validation Errors¶
If validation fails, you’ll see errors indicating:
Missing dimensions - A required dimension is not present in the dataset
Inconsistent time periods - Datasets have mismatched time ranges
Invalid mappings - Dimension mappings reference non-existent values
Scenarios and Alternative Datasets¶
STRIDE supports multiple scenarios, each potentially using different input datasets:
{
"scenarios": [
{"name": "baseline"},
{"name": "high_growth", "gdp": "path/to/alternative_gdp.parquet"}
]
}
When a scenario specifies an alternative dataset:
The alternative is registered as a separate dataset
A view is created pointing to the alternative data
dbt uses the scenario-specific data for calculations
For datasets not overridden in a scenario, STRIDE creates views pointing to the baseline data to avoid redundant processing.