Dataset Concepts

A dsgrid dataset is a self-contained collection of metric data, typically the output of a single model or data source, together with a dataset config. The config describes the dataset's metadata: data provenance, dimensionality, file format, and other attributes. Datasets typically cover a specific domain (e.g., buildings, electric vehicles, distributed generation, or historical utility sales) and contribute to the larger energy picture assembled by a dsgrid project.

Kinds of Datasets

By Type

The dataset_type field classifies the origin of the data:

Modeled — data generated by an energy model (e.g., ComStock building energy simulations, TEMPO EV charging profiles, dGen distributed PV capacity projections).
Historical — real-world observed data from past measurements (e.g., EIA 861 utility customer sales).
Benchmark — reference datasets used for calibration or comparison (e.g., AEO energy by end-use projections).

By Domain

dsgrid projects typically assemble datasets from multiple domains to build a comprehensive picture of energy demand. Common examples include:

Sector-level modeled energy use — hourly building energy simulations covering end uses like heating, cooling, and lighting for residential or commercial building types (e.g., ResStock, ComStock); industrial energy demand projected by an integrated assessment model (e.g., GCAM) and downscaled using other data sources.
Distributed energy resources — capacity and hourly generation profiles for technologies such as rooftop solar PV (e.g., dGen). A single technology may produce two linked datasets: one for installed capacity and one for normalized generation profiles.
Transportation electrification — EV charging load profiles by vehicle type and charging level (e.g., TEMPO).
Historical electricity data — observed utility sales or grid load, often at annual or monthly resolution (e.g., EIA 861).
Benchmark projections and growth factors — reference-case energy projections or annual growth rates used to scale or calibrate modeled data (e.g., AEO end-use projections).

By Data Qualifier

The dataset_qualifier_metadata field describes the nature of the values:

Quantity (default) — absolute values such as energy consumption (kWh), capacity (kW), or count.
Growth rate — multiplicative factors applied to scale other datasets over time. Growth-rate datasets include additional metadata such as growth_rate_type (e.g., exponential_annual).
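To make the difference between a quantity and a growth rate concrete, the sketch below scales a base-year quantity by a compounding annual rate. The function name, the base-year convention, and the formula `base * (1 + rate) ** years` are assumptions made for illustration; they are not taken from dsgrid's handling of exponential_annual.

```python
def apply_exponential_annual_growth(base_value, annual_rate, base_year, target_year):
    """Scale a base-year quantity by a compounding annual growth rate.

    Assumption for this sketch: "exponential annual" growth means
    value = base_value * (1 + annual_rate) ** (target_year - base_year).
    """
    years = target_year - base_year
    return base_value * (1.0 + annual_rate) ** years


# Hypothetical numbers: 1000 kWh in 2020, growing 2% per year through 2030.
projected_2030 = apply_exponential_annual_growth(1000.0, 0.02, 2020, 2030)
print(round(projected_2030, 2))
```

Under this assumption the growth-rate dataset carries only the unitless rate; the units come from the quantity dataset it is applied to.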

Dataset Dimensions

Each dataset defines values over up to eight dimension types. A dataset config either defines dimensions inline (with records files) or references already-registered dimensions by ID.

Shared vs. Dataset-Specific Dimensions

In practice, some dimensions are often shared across a project because all contributing datasets need to align on the same elements:

Geography — spatial units (counties, states, census regions). Often project-defined, though datasets may use finer resolution (e.g., census tracts) and map to the project level.
Scenario — modeling scenarios (e.g., reference, high electrification). Usually project-defined.
Model year and weather year — typically project-defined.

Other dimensions are more dataset-specific because they reflect the internal structure of a particular model:

Subsector — building types, industries, vehicle classes, etc. Each model has its own categorization.
Metric — measured quantities (energy end uses, capacity, charging profiles, population). Different models measure different things.
Sector — while the project defines the overall sectors (residential, commercial, industrial, transportation), individual datasets usually cover only one.

Multiple Metric Types in a Project

Different datasets in the same project may measure fundamentally different things, for example energy consumption (kWh), installed capacity (kW), or vehicle counts. In dsgrid, each metric type is delivered as a separate dataset with its own metric dimension records and record class.

A project can define multiple base dimensions of the same type to accommodate this. For example, a project might have three metric base dimensions: one for energy end uses, one for DPV generation profiles, and one for DPV capacity. Each dataset is assigned to the appropriate metric dimension through the project’s required_dimensions configuration.

Trivial Dimensions

Not every dimension of a dataset carries information. For example, historical data generally has a one-element scenario dimension, and a single-sector dataset has a one-element sector dimension. These one-element dimensions are called trivial dimensions.

Trivial dimensions do not need to appear as columns in the data files — they are declared in the dataset config and added by dsgrid at runtime. This saves storage space and simplifies the data files. See Trivial Dimensions for details.
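The idea of adding trivial dimensions at runtime can be sketched in a few lines. This is a minimal illustration using plain dictionaries, not dsgrid's implementation; the function name, column names, and record IDs are all made up for the example.

```python
def add_trivial_dimensions(rows, trivial_values):
    """Return copies of rows with one constant column per trivial dimension."""
    expanded = []
    for row in rows:
        overlap = set(row) & set(trivial_values)
        if overlap:
            # A trivial dimension should not also be stored as a data column.
            raise ValueError(f"columns already present: {sorted(overlap)}")
        expanded.append({**row, **trivial_values})
    return expanded


# Hypothetical historical dataset: no scenario or sector columns on disk.
rows = [
    {"geography": "CO", "model_year": "2019", "value": 1.5},
    {"geography": "WY", "model_year": "2019", "value": 0.4},
]
# The one-element dimensions are declared once instead of repeated per row.
trivial_values = {"scenario": "historical", "sector": "res"}

full_rows = add_trivial_dimensions(rows, trivial_values)
print(full_rows[0])
```

The stored rows stay narrow, and the constant columns exist only in the expanded view.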

Inline Dimensions vs. Dimension References

A dataset config specifies its dimensions in one of two ways:

dimensions: Define a dimension inline by providing its records file and metadata directly in the config. dsgrid will automatically register the dimension during dataset registration. Use this when the dimension is unique to your dataset (e.g., a custom set of building subsectors or metric end uses).
dimension_references: Reference an already-registered dimension by its dimension_id (a UUID assigned when the dimension was first registered), dimension_type, and version. Use this when you want to reuse a dimension from the project or from another dataset — for example, a shared geography or scenario dimension that the project admin has already registered. You do not need to look up UUIDs manually: dsgrid registry datasets generate-config automatically writes dimension_references entries for any dimensions it matches in the registry.

You can mix both styles in the same config. A common pattern is to reference project-defined dimensions (geography, scenario, model year, weather year) and define dataset-specific dimensions inline (subsector, metric).
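The mixed pattern might look roughly like the following, written here as a Python dictionary rather than an actual config file. Only the dimensions and dimension_references keys and the dimension_id, dimension_type, and version fields come from the description above; the keys inside each inline entry, the file paths, and the placeholder UUIDs are assumptions, so consult the schema reference for the real field names.

```python
# Illustrative config fragment; values and inline-entry keys are hypothetical.
dataset_config = {
    "dimensions": [
        {
            "dimension_type": "subsector",
            "name": "Example building types",
            "file": "dimensions/subsectors.csv",
        },
        {
            "dimension_type": "metric",
            "name": "Example end uses",
            "file": "dimensions/end_uses.csv",
        },
    ],
    "dimension_references": [
        {
            "dimension_id": "00000000-0000-0000-0000-000000000001",  # placeholder
            "dimension_type": "geography",
            "version": "1.0.0",
        },
        {
            "dimension_id": "00000000-0000-0000-0000-000000000002",  # placeholder
            "dimension_type": "scenario",
            "version": "1.0.0",
        },
    ],
}


def summarize_dimension_styles(config):
    """List which dimension types are inline and which are referenced.

    Assumption for this sketch: a dimension type is specified in one style only.
    """
    inline = [d["dimension_type"] for d in config.get("dimensions", [])]
    referenced = [d["dimension_type"] for d in config.get("dimension_references", [])]
    duplicates = sorted(set(inline) & set(referenced))
    if duplicates:
        raise ValueError(f"specified both inline and by reference: {duplicates}")
    return {"inline": sorted(inline), "referenced": sorted(referenced)}


summary = summarize_dimension_styles(dataset_config)
print(summary)
```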

How Datasets Relate to Projects

Datasets are standalone entities — they can be registered independently of any project. A project assembles multiple datasets into a coherent whole by defining common base dimensions that all datasets must map onto.

Three Dataset Operations

Getting a dataset into a project involves up to three operations:

Registration — validates the dataset’s internal consistency: schema, dimensions, and data completeness. Data completeness can be customized via expected_associations and missing_associations (see How to Handle Dimension Associations). The dataset becomes a versioned entity in the registry. No project is required. (dsgrid registry datasets register)
Submission — submits a registered dataset to a specific project. This step requires dimension mappings that align each dataset dimension to the corresponding project base dimension. dsgrid validates that the mappings are consistent and that the dataset provides all expected data points. (dsgrid registry projects submit-dataset)
Combined register-and-submit — performs both operations in a single command, which is convenient during iterative development. (dsgrid registry projects register-and-submit-dataset)

Dimension mappings are often the most labor-intensive part of the process. A mapping defines how each dataset dimension record corresponds to one or more project dimension records — for example, mapping ComStock building types to the project’s standard building categories, or aggregating census-tract geographies up to counties.
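Conceptually, a mapping is a table of (dataset record, project record, fraction) rows. The sketch below applies such a table to aggregate census-tract values up to counties. It is a simplified illustration with invented IDs and a hypothetical apply_mapping helper; dsgrid's own mapping files and processing are described in the how-to guides.

```python
from collections import defaultdict


def apply_mapping(values, mapping):
    """Map values keyed by dataset record ID onto project record IDs.

    values:  {from_id: value}
    mapping: iterable of (from_id, to_id, fraction) tuples
    """
    result = defaultdict(float)
    for from_id, to_id, fraction in mapping:
        if from_id in values:
            result[to_id] += values[from_id] * fraction
    return dict(result)


# Hypothetical tract-level values and a tract-to-county aggregation mapping.
tract_values = {"tract_a": 10.0, "tract_b": 5.0, "tract_c": 2.0}
tract_to_county = [
    ("tract_a", "county_1", 1.0),
    ("tract_b", "county_1", 1.0),
    ("tract_c", "county_2", 1.0),
]

county_values = apply_mapping(tract_values, tract_to_county)
print(county_values)
```

A fraction of 1.0 moves a value unchanged; fractions below 1.0 would split one dataset record across several project records.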

See Dataset Submitters for the full workflow and How to Create Dataset Dimensions for guidance on dimension records.

Configuration Options

Most dataset config fields are self-explanatory or covered by the schema reference. Two boolean flags deserve additional explanation:

use_project_geography_time_zone: When true, dsgrid derives each record’s time zone from the project’s geography dimension (which must include a time_zone column) rather than from the dataset’s own geography records. Set this to true when your timestamps represent local time but your dataset’s geography dimension does not include a time_zone column — for example, TEMPO and dGen datasets whose time values are local to the modeled location. When false (the default), the dataset’s own geography records must provide the time_zone column. See Time Formats for details on how dsgrid handles time zones.
enable_unit_conversion: When true (the default), dsgrid performs automatic unit conversion at query time by comparing the unit column in the dataset’s metric dimension records with the corresponding project metric records. Set this to false only when the dataset’s dimension mapping for the metric dimension already accounts for the unit difference through its mapping fractions. In that case, dsgrid’s built-in conversion would double-count the scaling.
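The double-counting risk with enable_unit_conversion is easiest to see with numbers. The sketch below assumes a dataset in MWh, a project in kWh, and a metric mapping whose fraction of 1000 already performs the conversion. The to_project_units helper is hypothetical and only mimics the order of operations described above.

```python
KWH_PER_MWH = 1000.0


def to_project_units(value_mwh, mapping_fraction, enable_unit_conversion):
    """Apply a metric mapping fraction, then optionally an automatic MWh-to-kWh conversion."""
    value = value_mwh * mapping_fraction
    if enable_unit_conversion:
        value *= KWH_PER_MWH
    return value


# The mapping fraction already converts 2 MWh to 2000 kWh.
correct = to_project_units(2.0, 1000.0, enable_unit_conversion=False)

# Leaving automatic conversion on applies the factor of 1000 a second time.
double_counted = to_project_units(2.0, 1000.0, enable_unit_conversion=True)

print(correct, double_counted)
```

If the mapping fraction were 1.0 instead, the automatic conversion alone would produce the correct 2000 kWh, which is why the flag defaults to true.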

File Format

A dataset must comply with a supported dsgrid data file format. The main choices are:

Table format: one-table or two-table
Value format: stacked or pivoted

See Data File Formats for requirements, recommendations, and detailed examples.
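As a rough picture of the value-format choice, the sketch below converts rows pivoted on metric (one column per metric record) into stacked rows (one metric column plus one value column). The column names, record IDs, and the stack_pivoted_rows helper are assumptions for illustration; the exact layouts dsgrid expects are defined in Data File Formats.

```python
def stack_pivoted_rows(rows, metric_columns, name_column="metric", value_column="value"):
    """Turn one wide row per observation into one narrow row per metric."""
    stacked = []
    for row in rows:
        shared = {k: v for k, v in row.items() if k not in metric_columns}
        for metric in metric_columns:
            stacked.append({**shared, name_column: metric, value_column: row[metric]})
    return stacked


# Hypothetical pivoted rows with two metric columns.
pivoted = [
    {"timestamp": "2012-01-01T00:00", "geography": "08031", "heating": 1.2, "cooling": 0.0},
    {"timestamp": "2012-01-01T01:00", "geography": "08031", "heating": 1.1, "cooling": 0.1},
]

stacked = stack_pivoted_rows(pivoted, ["heating", "cooling"])
for row in stacked:
    print(row)
```

Pivoted tables have fewer rows and more columns; stacked tables keep a fixed set of columns regardless of how many metric records exist.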

Examples

The dsgrid-StandardScenarios repository contains dataset configs that illustrate a range of domains and formats:

Historical

EIA 861 Utility Customer Sales (MWh) by State by Sector by Year for 2010-2020 — annual historical utility sales data; one-table stacked format.

Modeled

ResStock — hourly residential building energy simulations; two-table pivoted on metric.
ComStock — hourly commercial building energy simulations; two-table pivoted on metric.
TEMPO — EV charging load profiles; two-table pivoted on metric with representative-period time.

Benchmark / Growth Factors

AEO 2021 Reference Case Residential Energy End Use Annual Growth Factors — unitless growth rates; one-table pivoted on metric.