Create and Submit a Dataset
In this tutorial you will learn how to create a dsgrid dataset and register it with a project by following the example of the TEMPO dataset from dsgrid-project-StandardScenarios.
Note
It is possible to register a dataset without connecting it to a project. In that case, the information about dimension mappings below is not relevant.
Dataset attributes and dimensions must be defined in a dataset config, as described in Dataset Config. This tutorial gives step-by-step instructions on how to assign values in that file.
Create a dataset.json5 with these fields (but choose your own values):
{
  dataset_id: "tempo_conus_2022",
  dataset_type: "modeled",
  data_source: "tempo",
  sector_description: "transportation",
  description: "CONUS 2022 TEMPO county-level results for PEV charging.",
  origin_creator: "Arthur Yip",
  origin_organization: "NREL",
  origin_contributors: [
    "Arthur Yip",
    "Brian Bush",
    "Christopher Hoehne",
    "Paige Jadun",
    "Catherine Ledna",
    "Matteo Muratori",
  ],
  origin_project: "dsgrid CONUS 2022",
  origin_date: "Dec 21 2021",
  origin_version: "dsgrid",
  source: "https://github.nrel.gov/mmurator/TMM/commit/54f715e0fe015a578dd0594c7031c046cfe27907",
  data_classification: "low",
  tags: [
    "transportation",
    "conus-2022",
    "reference",
    "tempo",
  ],
}
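Before registering, you can optionally parse the file locally to catch syntax mistakes early. A minimal sketch, assuming the third-party json5 package is installed (dsgrid performs its own validation at registration time):

# Optional local sanity check (assumes `pip install json5`); this is not a
# dsgrid command, just an early syntax check.
import json5

with open("dataset.json5") as f:
    cfg = json5.load(f)
# Spot-check a couple of required fields.
print(cfg["dataset_id"], cfg["data_source"])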
Choose a data format as described in Dataset Formats. Please note all requirements listed on that page.
This dataset uses the two-table (standard) format with dimension information stored like this:
Time: Representative period format with hourly data for one week per month.
Metric: Record IDs are pivoted in the data table.
Geography: Stored in lookup table.
Subsector: Stored in lookup table.
Model Year: Stored in lookup table.
Scenario: Stored in lookup table.
Sector: Trivial.
Weather Year: Trivial.
Add the relevant settings to your dataset.json5.
data_schema_type: "standard",
trivial_dimensions: [
  "sector",
  "weather_year",
],
// The time in this dataset has no time zone. It is based on the local time perceived by the
// people being modeled. dsgrid will map times to the project's geography time zone.
use_project_geography_time_zone: true,
data_schema: {
  table_format: {
    format_type: "pivoted",
    pivoted_dimension_type: "metric",
  },
},
Identify the dimension records required by the project. Here is what is listed in the project.json5:
required_dimensions: {
  single_dimensional: {
    sector: {
      base: ["trans"],
    },
    subsector: {
      supplemental: [
        {
          name: "Subsectors by Sector Collapsed",
          record_ids: ["transportation_subsectors"],
        },
      ],
    },
    metric: {
      supplemental: [
        {
          name: "transportation-end-uses-collapsed",
          record_ids: ["transportation_end_uses"],
        },
      ],
    },
  },
}
Follow the instructions at Project Viewer to start the viewer. Once you have the data table loaded in your browser, type transportation in the filter row of the dimension_query_name column and then select transportation_subsectors or transportation_end_uses. The dimension record table will be populated with record IDs.
If you prefer working in an interactive Python session, here is example code to do the same thing:
In [1]: from dsgrid.dimension.base_models import DimensionType
...: from dsgrid.registry.registry_manager import RegistryManager
...: from dsgrid.registry.registry_database import DatabaseConnection
...:
...: manager = RegistryManager.load(
...: DatabaseConnection(
...: hostname="dsgrid-registry.hpc.nrel.gov",
...: database="standard-scenarios",
...: )
...: )
...: project = manager.project_manager.load_project("dsgrid_conus_2022")
...: project.config.get_dimension_records("transportation_end_uses").show()
...: project.config.get_dimension_records("transportation_subsectors").show()
+-------------------+--------------------+-----------+----+
| id| name| fuel_id|unit|
+-------------------+--------------------+-----------+----+
|electricity_ev_l1l2|Electricity EV L1/L2|electricity| kWh|
|electricity_ev_dcfc| Electricty EV DCFC|electricity| kWh|
+-------------------+--------------------+-----------+----+
+------------+------------+
| id| name|
+------------+------------+
| bev_compact| Bev Compact|
| bev_midsize| Bev Midsize|
| bev_pickup| Bev Pickup|
| bev_suv| Bev Suv|
|phev_compact|Phev Compact|
|phev_midsize|Phev Midsize|
| phev_pickup| Phev Pickup|
| phev_suv| Phev Suv|
+------------+------------+
Alternatively, you can browse the source files, such as this records file.
Add dimension configs to the dimensions section of your dataset.json5 for each dimension that is unique to your dataset. If you use a dimension from the project or another dataset, add its dimension ID to the dimension_references section of the file. For example,
dimensions: [
  {
    "class": "County",
    type: "geography",
    name: "ACS County 2018",
    display_name: "County",
    file: "dimensions/counties.csv",
    description: "American Community Survey US counties, 2018.",
  },
]
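For reference, a dimension records file is a CSV whose rows define the record IDs and names (the id and name columns also appear in the query output shown earlier). An illustrative sketch of what dimensions/counties.csv might contain; these rows are hypothetical examples using real FIPS codes:

id,name
01001,"Autauga County, AL"
01003,"Baldwin County, AL"
01005,"Barbour County, AL"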
Create dimension mappings for all dimensions that differ from the project's. Add the mappings to dimension_mappings.json5 and the records to dimension_mappings/<your-mapping>.csv. Here are two examples.
- The TEMPO dataset uses a superset of county records compared to the project (it includes Alaska and Hawaii). The counties in common have the same IDs. Here is the resulting dimension mapping metadata and records. All IDs that exist in TEMPO but not in the project have a null entry for to_id.
{
  description: "ACS County 2018 to ACS County 2020 L48",
  file: "dimension_mappings/county_to_county.csv",
  dimension_type: "geography",
  mapping_type: "many_to_one_aggregation",
},
Records file snippet:
from_id,to_id
01001,01001
01003,01003
01005,01005
01007,01007
02013,
02016,
- The TEMPO dataset projects electricity load from 2018 to 2050 using only even years, while the project expects model years from 2010 to 2050. The TEMPO dataset uses the following mapping to meet the project requirements (a sanity-check sketch follows the records snippet).
{
  description: "2010-2050 from interpolating for every other year and 0 for 2010-2017",
  dimension_type: "model_year",
  file: "dimension_mappings/model_year_to_model_year.csv",
  mapping_type: "many_to_many_explicit_multipliers",
},
Records file snippet:
from_id,to_id,from_fraction
2018,2010,0
2018,2011,0
2018,2012,0
2018,2013,0
2018,2014,0
2018,2015,0
2018,2016,0
2018,2017,0
2018,2018,1
2018,2019,0.5
2020,2019,0.5
2020,2020,1
2020,2021,0.5
2022,2021,0.5
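It can be worth sanity-checking mapping records like those above before registering. A minimal sketch with pandas, assuming the file path shown earlier: every to_id from 2018 onward should have fractions summing to 1.0, while 2010-2017 are intentionally zero.

import pandas as pd

# Read IDs as strings, matching dsgrid's requirement that dimension
# columns be strings.
mapping = pd.read_csv(
    "dimension_mappings/model_year_to_model_year.csv",
    dtype={"from_id": str, "to_id": str},
)
sums = mapping.groupby("to_id")["from_fraction"].sum()
# Years before 2018 carry zero load by design; later years should
# interpolate to exactly 1.0.
bad = sums[(sums.index >= "2018") & (sums != 1.0)]
assert bad.empty, f"fractions do not sum to 1:\n{bad}"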
Create load_data.parquet. This data table includes time columns (day_of_week, hour, month) and metric columns (L1andL2 and DCFC). All other dimensions go into load_data_lookup.parquet. Each unique time array needs a unique id. The TEMPO team decided to encode internal information into specific bytes of each id value, but that is optional; other datasets simply number the arrays 1 to N. Refer to Dataset Formats for guidance about partitions.
>>> spark.read.parquet("tempo_conus_2022/1.0.0/load_data.parquet").show()
+-----------+----+-----+---------+---------+---------+
|day_of_week|hour|month| L1andL2| DCFC| id|
+-----------+----+-----+---------+---------+---------+
| 0| 0| 12|484.81393|405.39902|109450511|
| 0| 1| 12|150.94759| 0.0|109450511|
| 0| 2| 12| 0.0| 0.0|109450511|
| 0| 3| 12| 0.0| 0.0|109450511|
| 0| 4| 12| 0.0| 0.0|109450511|
| 0| 5| 12| 0.0| 0.0|109450511|
| 0| 6| 12| 0.0| 0.0|109450511|
| 0| 7| 12| 0.0| 0.0|109450511|
| 0| 8| 12| 0.0| 0.0|109450511|
| 0| 9| 12| 0.0| 0.0|109450511|
| 0| 10| 12| 0.0| 0.0|109450511|
| 0| 11| 12| 0.0| 0.0|109450511|
| 0| 12| 12|312.24542| 0.0|109450511|
| 0| 13| 12| 270.221| 0.0|109450511|
| 0| 14| 12|180.36609| 0.0|109450511|
| 0| 15| 12|1078.6263| 0.0|109450511|
| 0| 16| 12| 656.5123| 0.0|109450511|
| 0| 17| 12|1092.3519| 0.0|109450511|
| 0| 18| 12| 959.8675| 0.0|109450511|
| 0| 19| 12| 841.9459| 0.0|109450511|
+-----------+----+-----+---------+---------+---------+
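For illustration, here is a minimal PySpark sketch that writes a load_data.parquet with this layout. The two rows are copied from the output above; the schema (time columns, one pivoted column per metric record ID, and the id linking to the lookup table) is what matters:

from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()
# Representative-period time columns, pivoted metric columns, and the id
# that joins to load_data_lookup.parquet.
rows = [
    (0, 0, 12, 484.81393, 405.39902, 109450511),
    (0, 1, 12, 150.94759, 0.0, 109450511),
]
columns = ["day_of_week", "hour", "month", "L1andL2", "DCFC", "id"]
spark.createDataFrame(rows, columns).write.mode("overwrite").parquet(
    "tempo_conus_2022/1.0.0/load_data.parquet"
)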
Create load_data_lookup.parquet. The id column should match the values in load_data.parquet so that a single table can be produced by joining the two tables on that column. If the dataset is missing data for specific dimension combinations, include a row for each combination and set id to null.
Warning
All dimension columns must be strings, including columns that look like numbers, such as model_year.
Warning
If your dataset uses FIPS county codes, be careful not to inadvertently drop leading zeros.
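One way to satisfy both warnings when building the lookup table from a CSV source is to declare the dimension columns as strings explicitly instead of letting Spark infer types. A minimal sketch; the source file name is hypothetical:

from pyspark.sql.types import LongType, StringType, StructField, StructType

# Explicit StringType keeps FIPS codes like "06085" intact and prevents
# model_year from being inferred as an integer.
schema = StructType([
    StructField("geography", StringType()),
    StructField("subsector", StringType()),
    StructField("model_year", StringType()),
    StructField("id", LongType()),
    StructField("scenario", StringType()),
])
lookup = spark.read.csv("lookup_source.csv", header=True, schema=schema)
lookup.write.mode("overwrite").parquet(
    "tempo_conus_2022/1.0.0/load_data_lookup.parquet"
)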
>>> spark.read.parquet("tempo_conus_2022/1.0.0/load_data_lookup.parquet").show()
+---------+--------------------+----------+--------+------------------+
|geography| subsector|model_year| id| scenario|
+---------+--------------------+----------+--------+------------------+
| 06085|Single_Driver+Low...| 2022| 1060853|ldv_sales_evs_2035|
| 06085|Single_Driver+Low...| 2022| 2060853|ldv_sales_evs_2035|
| 06085|Single_Driver+Low...| 2022| 3060853|ldv_sales_evs_2035|
| 06085|Single_Driver+Low...| 2022| 4060853|ldv_sales_evs_2035|
| 06085|Single_Driver+Low...| 2022| 5060853|ldv_sales_evs_2035|
| 06085|Single_Driver+Low...| 2022| 6060853|ldv_sales_evs_2035|
| 06085|Single_Driver+Low...| 2022| 7060853|ldv_sales_evs_2035|
| 06085|Single_Driver+Low...| 2022| 8060853|ldv_sales_evs_2035|
| 06085|Single_Driver+Low...| 2022| 9060853|ldv_sales_evs_2035|
| 06085|Single_Driver+Low...| 2022|10060853|ldv_sales_evs_2035|
| 06085|Single_Driver+Low...| 2022|11060853|ldv_sales_evs_2035|
| 06085|Single_Driver+Low...| 2022|12060853|ldv_sales_evs_2035|
| 06085|Single_Driver+Low...| 2022|13060853|ldv_sales_evs_2035|
| 06085|Single_Driver+Low...| 2022|14060853|ldv_sales_evs_2035|
| 06085|Single_Driver+Low...| 2022|15060853|ldv_sales_evs_2035|
| 06085|Single_Driver+Low...| 2022|16060853|ldv_sales_evs_2035|
| 06085|Single_Driver+Low...| 2022|17060853|ldv_sales_evs_2035|
| 06085|Single_Driver+Low...| 2022|18060853|ldv_sales_evs_2035|
| 06085|Single_Driver+Low...| 2022|19060853|ldv_sales_evs_2035|
| 06085|Single_Driver+Low...| 2022|20060853|ldv_sales_evs_2035|
+---------+--------------------+----------+--------+------------------+
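Before submitting, a quick local consistency check can save an expensive failed registration. This is not a dsgrid command, just a sketch verifying that every non-null lookup id has matching rows in load_data:

load_data = spark.read.parquet("tempo_conus_2022/1.0.0/load_data.parquet")
lookup = spark.read.parquet("tempo_conus_2022/1.0.0/load_data_lookup.parquet")
# A left-anti join returns lookup rows with no matching id in load_data.
orphans = lookup.where("id IS NOT NULL").join(load_data, "id", "left_anti")
assert orphans.count() == 0, "lookup rows reference missing load_data ids"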
Register and submit the dataset. This requires a properly configured Spark cluster because of the data size; smaller datasets may succeed with Spark in local mode. Refer to Apache Spark to set up a Spark cluster.
This command assumes that dataset.json5, dimension_mappings.json5, and the directory containing load_data.parquet and load_data_lookup.parquet are all in a directory called base_dir. When you run this command, dsgrid performs numerous validations to verify dataset consistency and confirm that the project requirements are met. It may take up to an hour on an HPC compute node.
TODO: offline mode for verification, online mode for the final registration.
$ spark-submit --master=spark://<master_hostname>:7077 $(which dsgrid-cli.py) registry \
projects \
register-and-submit-dataset \
--project-id dsgrid_conus_2022 \
--dimension-mapping-file base_dir/dimension_mappings.json5 \
--log-message "Register and submit TEMPO dataset" \
base_dir/dataset.json5 \
base_dir/tempo_load_data