(data-validation)=
# Data Validation with dsgrid

STRIDE uses [dsgrid](https://github.com/dsgrid/dsgrid) to validate and register datasets. This ensures data consistency across dimensions like time, geography, and sector before computing energy projections.

## What is dsgrid?

dsgrid is a framework for managing demand-side grid data. STRIDE leverages dsgrid's registry system to:

- **Validate dimensions** - Ensure datasets have consistent time periods, geographies, and sectors
- **Map dimensions** - Transform dataset dimensions to match project requirements
- **Query data** - Extract and combine data from multiple registered datasets

## The Validation Process

When you create a STRIDE project, the following validation steps occur:

### 1. Registry Creation

STRIDE creates a local dsgrid registry backed by DuckDB:

```
/registry_data/data.duckdb
```

This registry stores metadata about registered datasets and their dimensions.

### 2. Bulk Registration

Datasets from the data directory are registered with dsgrid using bulk registration. This process:

- Parses dataset configurations
- Validates dimension consistency
- Records dimension mappings

### 3. Dimension Mapping

STRIDE reads `dimension_mappings.json5` from the dataset directory to understand how to map dimensions between datasets. Common mapping types include:

- **many_to_one_aggregation** - Combine multiple source values into one target value
- **one_to_one** - Direct mapping between source and target dimensions

### 4. Query and Table Creation

After validation, STRIDE queries the dsgrid registry and creates DuckDB tables for each dataset:

```
dsgrid_data.baseline__energy_intensity__1_0_0
dsgrid_data.baseline__gdp__1_0_0
dsgrid_data.baseline__load_shapes__1_0_0
...
```

## What Gets Validated

### Time Consistency

By default, STRIDE checks that time dimensions are consistent across datasets.
This ensures:

- All datasets cover the same time periods
- Timestamps align properly for joining

### Dimension Associations

Optionally, STRIDE can validate that dimension associations are consistent. This checks that:

- Geographic identifiers match across datasets
- Sector definitions are compatible
- All required dimension combinations exist

## Validation Errors

If validation fails, you'll see errors indicating:

- **Missing dimensions** - A required dimension is not present in the dataset
- **Inconsistent time periods** - Datasets have mismatched time ranges
- **Invalid mappings** - Dimension mappings reference non-existent values

## Scenarios and Alternative Datasets

STRIDE supports multiple scenarios, each potentially using different input datasets:

```json5
{
  "scenarios": [
    {"name": "baseline"},
    {"name": "high_growth", "gdp": "path/to/alternative_gdp.parquet"}
  ]
}
```

When a scenario specifies an alternative dataset:

1. The alternative is registered as a separate dataset
2. A view is created pointing to the alternative data
3. dbt uses the scenario-specific data for calculations

For datasets not overridden in a scenario, STRIDE creates views pointing to the baseline data to avoid redundant processing.

## Related Topics

- {ref}`customizing-checks` - How to enable or disable specific validation checks
- {ref}`dbt-computation` - How validated data flows into dbt calculations
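
As an illustration of the baseline fallback described under Scenarios and Alternative Datasets, the following is a minimal sketch of how each scenario's datasets might resolve to a source table. It is not STRIDE's implementation: the helper `plan_scenario_sources`, the dataset list, and the assumption that an alternative dataset is stored under a `<scenario>__<dataset>__<version>` name (modeled on the baseline table names shown earlier) are all hypothetical.

```python
# Hypothetical helper, not part of STRIDE or dsgrid. The table-naming scheme
# for alternative datasets is an assumption based on the baseline table names.


def plan_scenario_sources(scenarios, dataset_names, version="1_0_0", schema="dsgrid_data"):
    """Map each scenario's datasets to the table its view would point to."""
    plan = {}
    for scenario in scenarios:
        name = scenario["name"]
        sources = {}
        for dataset in dataset_names:
            # A scenario overrides a dataset when its config has a key with
            # that dataset's name; otherwise it falls back to the baseline.
            owner = name if dataset in scenario else "baseline"
            sources[dataset] = f"{schema}.{owner}__{dataset}__{version}"
        plan[name] = sources
    return plan


# Same scenario configuration as the json5 example, written as Python dicts.
scenarios = [
    {"name": "baseline"},
    {"name": "high_growth", "gdp": "path/to/alternative_gdp.parquet"},
]
datasets = ["energy_intensity", "gdp", "load_shapes"]

plan = plan_scenario_sources(scenarios, datasets)
for scenario_name, sources in plan.items():
    for dataset, table in sources.items():
        print(f"{scenario_name}.{dataset} -> {table}")
```

In this sketch only `high_growth.gdp` resolves to a scenario-specific table; every other dataset in `high_growth` points back at the baseline table, which is the behavior that lets unchanged data be reused rather than processed again.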