# Dataset Concepts A **dsgrid dataset** is a self-contained collection of metric data, typically the output of a single model or data source, together with a [dataset config](../../software_reference/data_models/dataset_model). The configuration file describes metadata, including data provenance, other attributes, dimensionality, and file format. Datasets typically cover a specific domain (e.g., buildings, electric vehicles, distributed generation, historical utility sales, etc.) and contribute to the larger energy picture assembled by a dsgrid project. ## Kinds of Datasets ### By Type The `dataset_type` field classifies the origin of the data: - **Modeled** — data generated an energy model (e.g., ComStock building energy simulations, TEMPO EV charging profiles, dGen distributed PV capacity projections). - **Historical** — real-world observed data from past measurements (e.g., EIA 861 utility customer sales). - **Benchmark** — reference datasets used for calibration or comparison (e.g., AEO energy by end-use projections). ### By Domain dsgrid projects typically assemble datasets from multiple domains to build a comprehensive picture of energy demand. Common examples include: - **Sector-level modeled energy use** — hourly building energy simulations covering end uses like heating, cooling, and lighting for residential or commercial building types (e.g., ResStock, ComStock); industrial energy demand projected by an integrated assessment model (e.g., GCAM) and downscaled using other data sources. - **Distributed energy resources** — capacity and hourly generation profiles for technologies such as rooftop solar PV (e.g., dGen). A single technology may produce two linked datasets: one for installed capacity and one for normalized generation profiles. - **Transportation electrification** — EV charging load profiles by vehicle type and charging level (e.g., TEMPO). - **Historical electricity data** — observed utility sales or grid load, often at annual or monthly resolution (e.g., EIA 861). - **Benchmark projections and growth factors** — reference-case energy projections or annual growth rates used to scale or calibrate modeled data (e.g., AEO end-use projections). ### By Data Qualifier The `dataset_qualifier_metadata` field describes the nature of the values: - **Quantity** (default) — absolute values such as energy consumption (kWh), capacity (kW), or count. - **Growth rate** — multiplicative factors applied to scale other datasets over time. Growth-rate datasets include additional metadata such as `growth_rate_type` (e.g., `exponential_annual`). ## Dataset Dimensions Each dataset defines values over up to eight [dimension types](dimension_concepts.md#dimension-types). A dataset config either defines dimensions inline (with records files) or references already-registered dimensions by ID. ### Shared vs. Dataset-Specific Dimensions In practice, some dimensions are often **shared across a project** because all contributing datasets need to align on the same elements: - **Geography** — spatial units (counties, states, census regions). Often project-defined, though datasets may use finer resolution (e.g., census tracts) and map to the project level. - **Scenario** — modeling scenarios (e.g., reference, high electrification). Usually project-defined. - **Model year** and **weather year** — typically project-defined. Other dimensions are more **dataset-specific** because they reflect the internal structure of a particular model: - **Subsector** — building types, industries, vehicle classes, etc. Each model has its own categorization. - **Metric** — measured quantities (energy end uses, capacity, charging profiles, population). Different models measure different things. - **Sector** — while the project defines the overall sectors (residential, commercial, industrial, transportation), individual datasets usually cover only one. ### Multiple Metric Types in a Project Different datasets in the same project may measure fundamentally different things — for example, energy consumption (kWh) vs. installed capacity (kW) vs. vehicle counts. In dsgrid, each metric type is a separate dataset with its own metric dimension records and [record class](dimension_concepts.md#dimension-record-classes). A project can define **multiple base dimensions of the same type** to accommodate this. For example, a project might have three metric base dimensions: one for energy end uses, one for DPV generation profiles, and one for DPV capacity. Each dataset is assigned to the appropriate metric dimension through the project's `required_dimensions` configuration. ### Trivial Dimensions Not all dataset dimensions are significant. For example, historical data will generally have a trivial (i.e., one-element) `scenario` dimension, and a single-sector dataset will have a trivial `sector` dimension. These one-element dimensions are called **trivial dimensions**. Trivial dimensions do not need to appear as columns in the data files — they are declared in the dataset config and added by dsgrid at runtime. This saves storage space and simplifies the data files. See [Trivial Dimensions](dimension_concepts.md#trivial-dimensions) for details. ### Inline Dimensions vs. Dimension References A dataset config specifies its dimensions in one of two ways: `dimensions` : Define a dimension **inline** by providing its records file and metadata directly in the config. dsgrid will automatically register the dimension during dataset registration. Use this when the dimension is unique to your dataset (e.g., a custom set of building subsectors or metric end uses). `dimension_references` : **Reference** an already-registered dimension by its `dimension_id` (a UUID assigned when the dimension was first registered), `dimension_type`, and `version`. Use this when you want to reuse a dimension from the project or from another dataset — for example, a shared geography or scenario dimension that the project admin has already registered. You do not need to look up UUIDs manually: `dsgrid registry datasets generate-config` automatically writes `dimension_references` entries for any dimensions it matches in the registry. You can mix both styles in the same config. A common pattern is to reference project-defined dimensions (geography, scenario, model year, weather year) and define dataset-specific dimensions inline (subsector, metric). ## How Datasets Relate to Projects Datasets are **standalone entities** — they can be registered independently of any project. A **project** assembles multiple datasets into a coherent whole by defining common base dimensions that all datasets must map onto. ### Three Dataset Operations Getting a dataset into a project involves up to three operations: 1. **Registration** — validates the dataset's internal consistency: schema, dimensions, and data completeness. The dataset becomes a versioned entity in the registry. No project is required. (`dsgrid registry datasets register`) 2. **Submission** — submits a registered dataset to a specific project. This step requires **dimension mappings** that align each dataset dimension to the corresponding project base dimension. dsgrid validates that the mappings are consistent and that the dataset provides all expected data points. (`dsgrid registry projects submit-dataset`) 3. **Combined register-and-submit** — performs both operations in a single command, which is convenient during iterative development. (`dsgrid registry projects register-and-submit-dataset`) Dimension mappings are often the most labor-intensive part of the process. A mapping defines how each dataset dimension record corresponds to one or more project dimension records — for example, mapping ComStock building types to the project's standard building categories, or aggregating census-tract geographies up to counties. See [Dataset Submitters](../../getting_started/dataset_submitters) for the full workflow and [How to Create Dataset Dimensions](../how_tos/how_to_dimensions) for guidance on dimension records. ## Configuration Options Most dataset config fields are self-explanatory or covered by the [schema reference](../../software_reference/data_models/dataset_model). Two boolean flags deserve additional explanation: `use_project_geography_time_zone` : When `true`, dsgrid derives each record's time zone from the **project's** geography dimension (which must include a `time_zone` column) rather than from the dataset's own geography records. Set this to `true` when your timestamps represent local time but your dataset's geography dimension does not include a `time_zone` column — for example, TEMPO and dGen datasets whose time values are local to the modeled location. When `false` (the default), the dataset's own geography records must provide the `time_zone` column. See [Time Formats](data_file_formats.md#time-formats) for details on how dsgrid handles time zones. `enable_unit_conversion` : When `true` (the default), dsgrid performs automatic unit conversion at query time by comparing the `unit` column in the dataset's metric dimension records with the corresponding project metric records. Set this to `false` only when the dataset's **dimension mapping** for the metric dimension already accounts for the unit difference through its mapping fractions. In that case, dsgrid's built-in conversion would double-count the scaling. ## File Format A dataset must comply with a supported dsgrid data file format. The main choices are: - **Table format**: [one-table](data_file_formats.md#one-table-format) or [two-table](data_file_formats.md#two-table-format) - **Value format**: [stacked or pivoted](data_file_formats.md#value-formats) See [Data File Formats](data_file_formats) for requirements, recommendations, and detailed examples. ## Examples The [dsgrid-StandardScenarios repository](https://github.com/dsgrid/dsgrid-project-StandardScenarios/tree/main/dsgrid_project/datasets) contains dataset configs that illustrate a range of domains and formats: ### Historical - [EIA 861 Utility Customer Sales (MWh) by State by Sector by Year for 2010-2020](https://github.com/dsgrid/dsgrid-project-StandardScenarios/blob/main/dsgrid_project/datasets/historical/eia_861_annual_energy_use_state_sector/dataset.json5) — annual historical utility sales data; one-table stacked format. ### Modeled - [ResStock](https://github.com/dsgrid/dsgrid-project-StandardScenarios/blob/main/dsgrid_project/datasets/modeled/resstock/dataset.json5) — hourly residential building energy simulations; two-table pivoted on metric. - [ComStock](https://github.com/dsgrid/dsgrid-project-StandardScenarios/blob/main/dsgrid_project/datasets/modeled/comstock/dataset.json5) — hourly commercial building energy simulations; two-table pivoted on metric. - [TEMPO](https://github.com/dsgrid/dsgrid-project-StandardScenarios/blob/main/dsgrid_project/datasets/modeled/tempo/dataset.json5) — EV charging load profiles; two-table pivoted on metric with representative-period time. ### Benchmark / Growth Factors - [AEO 2021 Reference Case Residential Energy End Use Annual Growth Factors](https://github.com/dsgrid/dsgrid-project-StandardScenarios/blob/main/dsgrid_project/datasets/modeled/aeo2021_reference/residential/End_Use_Growth_Factors/dataset.json5) — unitless growth rates; one-table pivoted on metric. :::{seealso} - [Dataset Data Model](../../software_reference/data_models/dataset_model) — full schema reference for the dataset config - [Data File Formats](data_file_formats) — file format requirements and examples - [Dimension Concepts](dimension_concepts) — dimension types, records, and record classes - [Dataset Submitters](../../getting_started/dataset_submitters) — step-by-step workflow for registration and submission :::