Define Dimension Associations
In this tutorial you will learn how to define expected associations and missing associations for a dsgrid dataset, using a commercial building energy dataset (ComStock) as a worked example.
By the end you will understand:
How to decide whether to use expected associations, missing associations, or both.
How to create the association files.
How to reference them in a dataset config.
Note
This tutorial walks through the reasoning behind each step. For a concise reference, see How to Handle Dimension Associations.
Prerequisites
A dataset config file ready for registration (see Dataset Concepts and Data File Formats)
Familiarity with Dimension Concepts
Background: The Example Dataset
The example dataset (drawn from https://github.com/dsgrid/dsgrid-test-data) models a subset of commercial building energy loads for a subset of US counties. Its non-time dimensions are:
| Dimension | Records |
|---|---|
| geography | 8 counties (e.g., Denver County CO, Harris County TX) |
| sector | 2 sectors: com, res |
| subsector | 14 commercial building types |
| metric | 2 end uses |
| model_year | 1 value |
| weather_year | 1 value |
| scenario | 1 value |
dsgrid validates associations across all non-time dimensions that appear as rows in the data. Pivoted dimensions (here, metric — whose record IDs are data-table column names) and trivial dimensions (single-valued dimensions not stored in the data) are excluded. For this example, that leaves three dimensions for association validation: geography, sector, and subsector.
If every combination of these three dimensions were present, the full cross-join would produce 8 × 2 × 14 = 224 combinations. But not every combination makes sense — and the actual data are sparser than that. Our job is to tell dsgrid exactly which combinations are valid and which are legitimately absent.
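The 224 figure is simply the size of a Cartesian product. As a minimal sketch in Python, assuming placeholder record IDs (only the counts 8, 2, and 14 come from this tutorial; the real IDs live in the dataset's dimension records):

```python
from itertools import product

# Placeholder record IDs for illustration; only the list lengths matter here.
geographies = [f"county_{i}" for i in range(8)]
sectors = ["com", "res"]
subsectors = [f"subsector_{i}" for i in range(14)]

# Every (geography, sector, subsector) triple in the unconstrained cross-join.
full_cross_join = list(product(geographies, sectors, subsectors))
print(len(full_cross_join))  # 224
```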
Step 1: Identify Which Combinations Are Valid
Before writing any association or config files, think about the relationships between dimensions in the data. Ask:
Which dimensions are not fully independent? Which combinations of records do not exist in the data?
For ComStock, two relationships stand out:
Sector–subsector relationship. Of the 14 building types, 13 are commercial (com) and one, com__MidriseApartment, belongs to the res (residential) sector. The 13 commercial (com) subsectors have no data under res, and com__MidriseApartment has no data under com. This means 13 + 1 = 14 sector–subsector combinations are invalid out of the 28 in the full cross-join.
Geography–subsector relationship. Not every building type exists in every county. For example, San Francisco County (06075) has no quick-service restaurants in this test dataset, and Denver County (08031) is missing both quick-service restaurants and small hotels.
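The sector–subsector count can be checked with a short Python sketch. The subsector IDs are the ones used in this tutorial's association file; the `owning_sector` helper is a hypothetical illustration of the "one sector per subsector" rule, not part of dsgrid:

```python
from itertools import product

sectors = ["com", "res"]
subsectors = [
    "com__LargeHotel", "com__StripMall", "com__QuickServiceRestaurant",
    "com__MediumOffice", "com__StandaloneRetail", "com__FullServiceRestaurant",
    "com__LargeOffice", "com__SmallHotel", "com__Warehouse", "com__Hospital",
    "com__SmallOffice", "com__PrimarySchool", "com__Outpatient",
    "com__MidriseApartment",
]

def owning_sector(subsector):
    """Hypothetical helper: midrise apartments report under res, the rest under com."""
    return "res" if subsector == "com__MidriseApartment" else "com"

all_pairs = list(product(sectors, subsectors))
valid = [(sec, sub) for sec, sub in all_pairs if owning_sector(sub) == sec]
invalid = [(sec, sub) for sec, sub in all_pairs if owning_sector(sub) != sec]

print(len(all_pairs), len(valid), len(invalid))  # 28 14 14
```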
These two relationships are different in character — the first is structural (sector–subsector pairing is fixed) and the second is data-driven (which buildings happen to exist in a county). This affects how we declare them.
Step 2: Choose Expected vs. Missing Associations
dsgrid gives you two complementary tools:
Expected associations — declare which combinations should have data. Everything not listed is assumed missing.
Missing associations — declare which combinations from the expected set don’t have data. Everything not listed is assumed present.
Guidelines for choosing:
| Situation | Best tool |
|---|---|
| Most combinations are invalid (dataset is inherently sparse) | Expected associations |
| Most combinations are valid, with a few gaps | Missing associations |
| A structural relationship eliminates many combinations, plus a few edge-case gaps | Expected associations for the structure, then missing associations for the edge cases |
For ComStock, the sector–subsector relationship eliminates half the cross-join, so expected associations are the right choice for that relationship. The geography–subsector gaps are a handful of edge cases on top of that, so missing associations handle those.
Step 3: Create the Expected Associations File
We need a file that lists every valid (sector, subsector) combination. Since these are the only two dimensions involved in this relationship, the file contains just those two columns. dsgrid will cross-join these combinations with the full set of records for every other validated dimension (here, geography).
Create a directory called expected_associations/ alongside your data files and add a CSV:
expected_associations/sector__subsector.csv
sector,subsector
com,com__LargeHotel
com,com__StripMall
com,com__QuickServiceRestaurant
com,com__MediumOffice
com,com__StandaloneRetail
com,com__FullServiceRestaurant
com,com__LargeOffice
com,com__SmallHotel
com,com__Warehouse
com,com__Hospital
com,com__SmallOffice
com,com__PrimarySchool
com,com__Outpatient
res,com__MidriseApartment
Notice the last row: res,com__MidriseApartment. Midrise apartments are modeled in ComStock (a commercial building stock tool) but reported under the residential sector. This is the kind of domain-specific detail that expected associations capture clearly.
Tip
Name the file after the dimensions it contains (e.g., sector__subsector.csv). This is not required by dsgrid but makes the purpose self-documenting.
Tip
Datasets are not limited to a single expected associations file. If there are multiple structural relationships, feel free to express those in separate files. See the how-to for how to reference multiple files in the dataset config.
What this achieves
Without expected associations, dsgrid would expect all 2 × 14 = 28 sector–subsector combinations. With this file, dsgrid expects only the 14 valid combinations, reducing the total from 8 × 28 = 224 to 8 × 14 = 112.
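The reduction to 112 can be reproduced outside dsgrid. The sketch below parses the same CSV content with Python's standard library and cross-joins it with eight geography IDs. Four of those IDs appear in this tutorial; the other four are placeholders, since the full county list is not shown here. This illustrates the arithmetic only and is not dsgrid's implementation.

```python
import csv
import io
from itertools import product

EXPECTED_CSV = """sector,subsector
com,com__LargeHotel
com,com__StripMall
com,com__QuickServiceRestaurant
com,com__MediumOffice
com,com__StandaloneRetail
com,com__FullServiceRestaurant
com,com__LargeOffice
com,com__SmallHotel
com,com__Warehouse
com,com__Hospital
com,com__SmallOffice
com,com__PrimarySchool
com,com__Outpatient
res,com__MidriseApartment
"""

# Valid (sector, subsector) pairs, read exactly as written in the file.
pairs = [
    (row["sector"], row["subsector"])
    for row in csv.DictReader(io.StringIO(EXPECTED_CSV))
]

# Four county IDs from the tutorial plus four placeholder IDs (assumed).
geographies = ["06075", "08031", "36001", "36119",
               "geo_a", "geo_b", "geo_c", "geo_d"]

# Geography is absent from the file, so every pair applies to every county.
expected = {(geo, sec, sub) for geo, (sec, sub) in product(geographies, pairs)}
print(len(pairs), len(expected))  # 14 112
```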
Step 4: Attempt Registration with Only Expected Associations
At this point your dataset config references expected_associations but not missing_associations. Run registration:
dsgrid registry datasets register dataset.json5 -l "Register ComStock dataset"
Registration will fail because the data is missing some expected combinations. dsgrid writes two sets of output files:
A single Parquet file (<dataset_id>__missing_dimension_record_combinations.parquet) containing every missing combination. This is always generated, but can be very large.
A directory of CSV files (./missing_associations/) containing minimal patterns that characterize the gaps, for example geography__subsector.csv. This structural format is produced by pattern analysis code written in Rust and is typically easier to review.
Step 5: Review and Reference the Missing Associations
Look at the generated missing_associations/geography__subsector.csv:
geography,subsector
06075,com__QuickServiceRestaurant
08031,com__QuickServiceRestaurant
08031,com__SmallHotel
36001,com__QuickServiceRestaurant
36119,com__SmallHotel
36119,com__Hospital
Each row says: “this building type legitimately has no data in this county.” For example, the modeled building stock contains no quick-service restaurants in San Francisco County (06075), Denver County (08031), or Albany County (36001).
Confirm that the gaps are legitimate and not data bugs. If anything looks wrong, fix the underlying data and retry from Step 4.
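When reviewing a longer file, a quick tally per subsector and per county can make suspicious clusters stand out. A minimal sketch using only Python's standard library, with the six rows above embedded as a string (in practice you would open the generated CSV instead):

```python
import csv
import io
from collections import Counter

MISSING_CSV = """geography,subsector
06075,com__QuickServiceRestaurant
08031,com__QuickServiceRestaurant
08031,com__SmallHotel
36001,com__QuickServiceRestaurant
36119,com__SmallHotel
36119,com__Hospital
"""

# csv.DictReader keeps every value as a string, so "06075" keeps its leading zero.
rows = list(csv.DictReader(io.StringIO(MISSING_CSV)))

by_subsector = Counter(row["subsector"] for row in rows)
by_geography = Counter(row["geography"] for row in rows)

for subsector, count in by_subsector.most_common():
    print(f"{subsector}: missing in {count} counties")
for geography, count in sorted(by_geography.items()):
    print(f"{geography}: {count} missing building types")
```

If you load the file with a dataframe library instead, read the geography column as text; parsing it as an integer would turn 06075 into 6075.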
Once you are satisfied the missing combinations are correct, move the generated files to a permanent location (or leave them in place) and add a missing_associations entry to your dataset config pointing to them:
data_layout: {
  table_format: "two_table",
  value_format: "pivoted",
  pivoted_dimension_type: "metric",
  data_file: {
    path: "load_data.csv",
  },
  lookup_data_file: {
    path: "load_data_lookup.json",
  },
  expected_associations: [
    "expected_associations",
  ],
  missing_associations: [
    "missing_associations",
  ],
}
Both fields accept a list of paths. Each path can be a file or a directory. When a directory is given, dsgrid reads all CSV and Parquet files in it.
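To preview which files a given list of paths would pick up, something like the following can help. This is a hypothetical helper written for illustration, not dsgrid's own resolution logic; in particular, it assumes directories are scanned non-recursively and that files are matched by extension.

```python
import tempfile
from pathlib import Path

ASSOCIATION_SUFFIXES = {".csv", ".parquet"}

def expand_association_paths(paths):
    """Hypothetical helper: expand a mix of files and directories into files.

    Assumes a non-recursive scan and matching on file extension only.
    """
    files = []
    for path in map(Path, paths):
        if path.is_dir():
            files.extend(
                sorted(p for p in path.iterdir() if p.suffix in ASSOCIATION_SUFFIXES)
            )
        else:
            files.append(path)
    return files

# Demonstration against a throwaway directory.
with tempfile.TemporaryDirectory() as tmp:
    directory = Path(tmp) / "missing_associations"
    directory.mkdir()
    (directory / "geography__subsector.csv").write_text("geography,subsector\n")
    (directory / "notes.txt").write_text("not an association file\n")
    found = [p.name for p in expand_association_paths([directory])]

print(found)  # ['geography__subsector.csv']
```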
The missing associations file contains only the dimensions involved in the relationship. dsgrid applies these as filters across all other validated dimensions — since sector is the only other validated dimension and each subsector maps to exactly one sector, these 6 rows each remove exactly one combination from the expected set.
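The one-row-per-entry claim can be checked by joining the missing rows against the expected sector–subsector pairs on subsector. The sketch below is plain Python for illustration and does not reflect how dsgrid performs the join internally:

```python
COMMERCIAL = [
    "LargeHotel", "StripMall", "QuickServiceRestaurant", "MediumOffice",
    "StandaloneRetail", "FullServiceRestaurant", "LargeOffice", "SmallHotel",
    "Warehouse", "Hospital", "SmallOffice", "PrimarySchool", "Outpatient",
]
# (sector, subsector) pairs, as in expected_associations/sector__subsector.csv
expected_pairs = [("com", f"com__{name}") for name in COMMERCIAL]
expected_pairs.append(("res", "com__MidriseApartment"))

# (geography, subsector) rows, as in missing_associations/geography__subsector.csv
missing_rows = [
    ("06075", "com__QuickServiceRestaurant"),
    ("08031", "com__QuickServiceRestaurant"),
    ("08031", "com__SmallHotel"),
    ("36001", "com__QuickServiceRestaurant"),
    ("36119", "com__SmallHotel"),
    ("36119", "com__Hospital"),
]

# Expand each missing row to every sector that the expected set pairs with
# its subsector. Each subsector has exactly one sector, so nothing fans out.
removed = [
    (geo, sector, sub)
    for geo, sub in missing_rows
    for sector, expected_sub in expected_pairs
    if expected_sub == sub
]

print(len(missing_rows), len(removed))  # 6 6
```

Had a subsector been valid under both sectors, a single missing row would have removed two combinations instead of one.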
Step 6: Re-run Registration
Run registration again:
dsgrid registry datasets register dataset.json5 -l "Register ComStock dataset"
dsgrid validates that the data contains exactly the expected combinations minus the declared missing combinations. If validation passes, both association declarations are stored in the registry alongside the dataset.
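The rule "expected minus declared missing, matched exactly" amounts to a set comparison. The sketch below models it in plain Python under the same assumptions as the rest of this tutorial's example; the `check` function and the four placeholder geography IDs are hypothetical, and dsgrid's actual validation code is not shown here.

```python
from itertools import product

COMMERCIAL = [
    "LargeHotel", "StripMall", "QuickServiceRestaurant", "MediumOffice",
    "StandaloneRetail", "FullServiceRestaurant", "LargeOffice", "SmallHotel",
    "Warehouse", "Hospital", "SmallOffice", "PrimarySchool", "Outpatient",
]
expected_pairs = [("com", f"com__{name}") for name in COMMERCIAL]
expected_pairs.append(("res", "com__MidriseApartment"))

# Four county IDs from the tutorial plus four placeholder IDs (assumed).
geographies = ["06075", "08031", "36001", "36119",
               "geo_a", "geo_b", "geo_c", "geo_d"]

missing_pairs = {
    ("06075", "com__QuickServiceRestaurant"),
    ("08031", "com__QuickServiceRestaurant"),
    ("08031", "com__SmallHotel"),
    ("36001", "com__QuickServiceRestaurant"),
    ("36119", "com__SmallHotel"),
    ("36119", "com__Hospital"),
}

def required_combinations():
    """Expected (geography, sector, subsector) triples minus declared gaps."""
    expected = {(g, s, sub) for g, (s, sub) in product(geographies, expected_pairs)}
    return {c for c in expected if (c[0], c[2]) not in missing_pairs}

def check(actual):
    """Hypothetical check: return (absent, unexpected); both empty means a match."""
    required = required_combinations()
    return sorted(required - actual), sorted(actual - required)

required = required_combinations()
print(len(required))  # 106

# Data that lacks one required row is reported as having an undeclared gap.
incomplete = required - {("36001", "com", "com__Hospital")}
absent, unexpected = check(incomplete)
print(absent)      # [('36001', 'com', 'com__Hospital')]
print(unexpected)  # []
```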
How the Math Works
Here is a summary of how dsgrid computes the valid set of dimension combinations for this example. Only non-time, non-trivial, non-pivoted dimensions participate — in this case geography, sector, and subsector.
Start from expected associations: The sector__subsector.csv file defines 14 valid sector–subsector combinations. Because expected associations are provided, dsgrid uses these directly instead of building the full 2 × 14 = 28 cross-join of sector and subsector.
Cross-join with remaining dimensions: Geography is not covered by the expected associations file, so dsgrid cross-joins the 14 combinations with all 8 geographies: 14 × 8 = 112 expected combinations.
Subtract missing associations: The geography__subsector.csv file lists 6 geography–subsector combinations. Each removes exactly one row from the expected set, so 6 combinations are subtracted.
Final expected count: 112 − 6 = 106 valid combinations. The data must contain exactly these 106 non-time dimension combinations.
Key Takeaways
Think about dimension relationships first. Before writing association files, identify which dimensions are not fully independent. This determines whether to use expected associations, missing associations, or both.
Use column subsets. Association files don’t need every dimension column. Include only the constrained dimensions — dsgrid cross-joins with the full set of records for omitted dimensions.
Expected associations define structure; missing associations handle edge cases. Use expected associations when large portions of the cross-join are invalid. Layer missing associations on top for the remaining gaps.
The iterative workflow is your friend. If you don’t know what’s missing, attempt registration and let dsgrid tell you. Review the output, fix any data bugs, and declare the legitimate gaps.