dsgrid documentation

What is dsgrid?

The dsgrid Python package is the central tool for creating, managing, contributing to, and accessing demand-side grid (dsgrid) toolkit projects. The dsgrid toolkit enables the compilation of high-resolution load datasets suitable for forward-looking power system and other analyses. For more information and completed work products, please see https://www.nrel.gov/analysis/dsgrid.html.

Documentation Overview

If you are new to dsgrid, you’ll likely want to start by reading the rest of this page, reading the how-to guide on installation, and then choosing a tutorial that corresponds to how you expect to be using dsgrid in the near future.

For general use, the documentation is organized into:

  • Tutorials, which provide step-by-step instructions (via examples) for common high-level tasks;

  • How-To Guides, which provide quick reminder recipes for key workflows;

  • Explanations, which describe concepts and answer “why” questions to facilitate deeper understanding; and

  • Reference, which provides complete information on various interfaces (e.g., command line, data formats, data models, public Python API).

Please note that for now:

⚠️ dsgrid is under active development and does not yet have a formal package release. ⚠️

and details listed here are subject to change. Please reach out to the dsgrid coordination team with any questions or other feedback.

dsgrid Overview

dsgrid is a tool for collecting and aligning datasets containing timeseries information describing future energy use, especially electricity load, to be used in planning studies. Datasets are defined over specific scenario, model year, weather year, geography, time, sector, subsector, and metric Dimensions and can range in size from less than 1 megabyte to over 1 terabyte. Typically, datasets are organized into Projects with specific base dimensions with the help of Dimension Mappings. Projects use Queries to consolidate information into Derived Datasets that, together with the standalone datasets, eventually enable a comprehensive description of the electricity load or other energy use being modeled. Projects can also be queried to produce output data for ingestion into another model or for direct analysis.

The people who interact with dsgrid are typically:

Because the dsgrid data can be quite large (on the order of terabytes), and are compiled from a variety of data sources, dsgrid uses two key technologies to facilitate its workflows:

  • A graph database (currently ArangoDB) to hold dsgrid registries (metadata on and relationships between dsgrid components, e.g., dimensions, datasets, dimension mappings, projects, derived datasets, queries)

  • A big-data engine (currently Apache Spark) to perform database operations across a cluster of computational nodes

Both of these technologies mean that some care must be taken when choosing and setting up a computational environment for a particular task. The dsgrid coordination team currently supports two computational environments:

  • Standalone, single-node environments e.g., personal laptops or a single server or virtual machine, which is suitable for:

    • Small-scale development and testing

    • Submitting a single dataset to a project in offline mode

    • Very small dsgrid projects (no large datasets and little to no downscaling)

    Note that use on standalone Windows machines is especially limited.

  • NREL High Performance Computing, where users can:

    • Directly work on projects through the shared registry

    • Develop code and test data and workflows using their own registry

    In either case, users will need to launch Apache Spark clusters with sufficient computational, memory, and disk resources.

Components

Dimensions

dsgrid datasets and projects are multi-dimensional, and some dimensions are defined over thousands of elements (e.g., counties in the United States, hours in a year). It is also typical for different datasets that nominally describe the same thing to use different labels (e.g., ResStock building types and EIA Residential Energy Consumption Survey (RECS) building types).

To manage this complexity while allowing different analysts and modeling teams to use their own labels (to facilitate transparency, easy maintenance and debugging), dsgrid requires its users to explicitly define each data dimension by specifying its dimension type, metadata, and a table listing each dimension record’s id and name. dsgrid uses this information to ensure that submitted data is as-expected.

Datasets

A dsgrid dataset describes energy use or another metric (e.g., population, stock of certain assets, delivered energy service, growth rates) resolved over a number of different dimensions, e.g., scenario, geography, time, etc. When registering a dataset, the data submitter must define a dimension for each dsgrid dimension type, but that does not mean that datasets are required to be resolved (i.e., have multiple entries) for each dimension type–any dimension can be “trivial”, in which case it is defined by a single record (e.g., ‘unspecified’ subsector, or a single ‘2012’ weather year) and is not included in the data files.

Registering a dataset requires a dataset config file, which lists dataset and dimension metadata, the actual data file(s), and dimension record files (csvs) for any dimensions not already in the dsgrid registry. The data files must conform to one of the dsgrid Dataset Formats, currently either the One Table Format or the Two Table Format (Standard). Upon registration, dsgrid checks the data files, which contain dimension records and numerical values, for consistency with the specified dimensions. Inconsistent data fails registration to prevent compounding downstream errors.

Projects

A dsgrid project is a collection of datasets that describe energy demand for a specific region over a specific timeframe. Because datasets describing different sectors’ energy use are defined in different ways, the key task of a dsgrid project is to enable and perform mappings from datasets’ dimensions into the project base dimensions. Project base dimensions are defined in the same way as dataset dimensions (i.e., there is a project base dimension for each dimension type), however, they are used differently. Whereas dataset dimensions are descriptive–they describe what you will find if you look in a dataset’s data files, project base dimensions are prescriptive–they define what dataset submitters must map their dimensions into. dsgrid projects are also highly prescriptive about what datasets they expect to be submitted, and what data dimensions they are expecting each dataset to provide (post-mapping, see Dimension Mappings). dsgrid uses all of the prescribed information to check that submitted datasets are as expected, and throws errors if they are not.

dsgrid projects are also the starting point for Queries. Queries are the process whereby datasets are actually mapped into the project base dimensions, concatenated, and further transformed. The two straightforward applications of queries are:

  1. Output data suitable for use in another model, and

  2. Analyze project data.

Either way, users generally do not want full detail along all dimensions. dsgrid supports this by letting users specify what supplemental dimensions they want their query results in, as well as if they want any dimension records filtered out and what mathematical operations they want to use for aggregations. For example, a query can be written to filter out non-electricity energy use; sum electricity use over all sectors, subsectors, and end-uses; and map to a specific power system model’s geography to create load data for capacity expansion or production cost modeling.

Dimension Mappings

While many data sources provide information by, e.g., scenario, geographic place, and sector, different data sources often define such dimensions differently and/or simply report out at a different level of resolution. Because dsgrid joins many datasets together to create a coherent description of energy for a specific place over a spectific timeframe, we need a mechanism for reconciling these differences. For example:

  • How should census division data be downscaled to counties?

  • What’s the best mapping between EIA AEO commercial building types and NREL ComStock commercial building types?

  • Residential, res, and Res. should all be interpreted the same way, as referring to residential energy use or housing stock, etc.

The mappings that answer these questions are explicitly registered with dsgrid as dimension mappings. This way they are clearly documented and usable in automated queries. Explicit, programmatically checked and used dimensions and dimension mappings are key features that help dsgrid efficiently and reliably assemble detailed datasets of energy demand from a combination of historical and modeled data.

Currently, dsgrid supports two different types of mappings:

  1. Dataset-to-Project: These are mappings from a dataset’s dimension to a project’s base dimension of the same dimension type. They are declared and registered when a dataset is submitted to a project.

  2. Base-to-Supplemental: These are mappings from a project’s base dimensions to its supplemental dimensions, which are the alternate data resolutions available for use in queries. Base-to-supplemental dimensions are defined when registering a project.

Queries

TODO

Derived Datasets

During the creation of a dsgrid project, one of the key tasks is to use queries to create derived datasets. As their name implies, derived datasets are dsgrid datasets and must meet all requirements that status implies, but they are created by combining multiple datasets already in the project. Because they are formed from datasets already mapped to the project base dimensions, defining the dimensions of derived datasets is typically straightforward and is either fully or mostly automated by dsgrid. Derived datasets are the mechanism dsgrid provides to do things like apply growth rates or calculate residuals. Registering derived datasets with the project enables dsgrid modelers to bootstrap data into providing a complete and straightforward accounting of a region’s energy demand over a specified timeframe.

Published Projects

TODO

Tasks

Project Coordinators

dsgrid project coordinators create projects, collaborate with dataset contributors to get datasets added to the project, create derived datasets, write queries, analyze and publish data.

Dataset Contributors

The role of dataset contributors is primarily to create, register, and submit datasets. Of course, dataset contributors might also be project coordinators and/or data users.

Data Users

Data users might access data provided by a project coordinator through formal publication or other dissemination channels, or they might write their own queries.

Computational Environment

dsgrid is cross-platform software that can be used with any datasets that conform to the metadata and data formatting requirements. However, the typical dsgrid project involves large, simulated datasets that are “exploded” to align across multiple dimension types and to produce the variety of different views needed by different data users and analysts. Thus while some development, testing, and work with single datasets or very small projects can be performed on Standalone machines, currently a lot of dsgrid work requires NREL High Performance Computing.

Standalone

TODO: List the key set-up tasks (spin up and connect to ArangoDB, spin up and connect to Apache Spark cluster, and generally configure dsgrid to be pointing to all the right places) and link to the appropriate how-tos

NREL High Performance Computing

TODO: List the key set-up tasks (spin up and connect to ArangoDB, spin up and connect to Apache Spark cluster, and generally configure dsgrid to be pointing to all the right places) and link to the appropriate how-tos

Indices, Tables, and Contents