Create a Derived Dataset¶
In this tutorial you will learn how to query a dsgrid project to produce and register a derived dataset. The tutorial uses the comstock_conus_2022_projected derived dataset from dsgrid-project-StandardScenarios as an example.
You can run all commands in this tutorial except the last one on NREL's Kestrel HPC cluster (the dataset is already registered there).
Steps¶
SSH to a login node to begin the tutorial.
Step 1: Set Up Your Spark Cluster¶
You will need at least four compute nodes to create this dataset in about an hour. Follow the instructions at Run dsgrid on Kestrel if you have not already done so. The remaining instructions assume that you are logged in to the compute node that is running the Spark master process.
Step 2: Copy the Query File¶
Copy the query file for this derived dataset (comstock_conus_2022_projected.json5) from the dsgrid-project-StandardScenarios repository on GitHub.
Step 3: Set Environment Variables¶
Set these environment variables to avoid repeated typing:
Note
The value of 2400 for NUM_PARTITIONS is based on observations from processing this ~1 TB dataset.
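As a back-of-the-envelope check (our own heuristic, not an official dsgrid sizing rule), 2400 partitions keeps each shuffle partition at a few hundred MB for a ~1 TB dataset:

```shell
# Rough partition sizing (heuristic, not an official dsgrid rule):
# divide the approximate dataset size by the partition count.
DATASET_MB=$(( 1000 * 1024 ))   # ~1 TB expressed in MB
echo "$(( DATASET_MB / 2400 )) MB per partition"
```

If your dataset is much larger or smaller, scale NUM_PARTITIONS accordingly so each partition stays in a similar size range.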
export SPARK_CLUSTER=spark://$(hostname):7077
export QUERY_OUTPUT_DIR=query-output
export DSGRID_CLI=$(which dsgrid-cli.py)
export NUM_PARTITIONS=2400
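Before launching any long-running Spark jobs, it can help to confirm all four variables are set. This check is our own optional helper (it uses Bash indirect expansion, not anything dsgrid-specific):

```shell
# Optional sanity check: print each required variable, or an error if unset.
for var in SPARK_CLUSTER QUERY_OUTPUT_DIR DSGRID_CLI NUM_PARTITIONS; do
  if [ -z "${!var:-}" ]; then
    echo "ERROR: $var is not set" >&2
  else
    echo "$var=${!var}"
  fi
done
```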
Step 4: Create the Derived Dataset¶
Create the comstock_conus_2022_projected dataset. The comstock_conus_2022_reference dataset has load data for a single year. This query applies the aeo2021_reference_commercial_energy_use_growth_factors dataset to project the load values through the model year 2050.
spark-submit \
--master ${SPARK_CLUSTER} \
--conf spark.sql.shuffle.partitions=${NUM_PARTITIONS} \
${DSGRID_CLI} \
query \
project \
run \
comstock_conus_2022_projected.json5 \
-o ${QUERY_OUTPUT_DIR}
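Conceptually, the projection multiplies the single-year load values by a growth factor for each model year. A toy illustration with made-up numbers (not actual AEO 2021 growth factors):

```shell
# Toy illustration of the projection concept: reference-year load times a
# per-model-year growth factor. All numbers are made up for illustration.
awk 'BEGIN {
  load_2022 = 100.0                 # reference-year load (made-up)
  gf_2030 = 1.08; gf_2050 = 1.21    # made-up growth factors
  printf "2030: %.1f\n", load_2022 * gf_2030
  printf "2050: %.1f\n", load_2022 * gf_2050
}'
```

The actual query performs this multiplication across every dimension combination in the dataset, which is why it benefits from a multi-node Spark cluster.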
Step 5: Create Derived Dataset Config Files¶
Generate the configuration files for the derived dataset:
spark-submit \
--master ${SPARK_CLUSTER} \
--conf spark.sql.shuffle.partitions=${NUM_PARTITIONS} \
${DSGRID_CLI} \
query \
project \
create-derived-dataset-config \
${QUERY_OUTPUT_DIR}/comstock_conus_2022_projected \
comstock-dd
Step 6: Edit Configuration Files¶
Edit the generated configuration files in comstock-dd (dataset.json5 and dimension_mapping_references.json5) as needed.
Step 7: Register the Derived Dataset¶
Register the derived dataset with the dsgrid registry:
spark-submit \
--master ${SPARK_CLUSTER} \
--conf spark.sql.shuffle.partitions=${NUM_PARTITIONS} \
${DSGRID_CLI} \
registry \
datasets \
register \
comstock-dd/dataset.json5 \
${QUERY_OUTPUT_DIR}/comstock_conus_2022_projected \
-l Register_comstock_conus_2022_projected
Step 8: Submit to the Project¶
Submit the derived dataset to the project:
spark-submit \
--master ${SPARK_CLUSTER} \
--conf spark.sql.shuffle.partitions=${NUM_PARTITIONS} \
${DSGRID_CLI} \
registry \
projects \
submit-dataset \
-p dsgrid_conus_2022 \
-d comstock_conus_2022_projected \
-r comstock-dd/dimension_mapping_references.json5 \
-l Submit_comstock_conus_2022_projected
Next Steps¶
Learn about derived dataset concepts
Understand query processing in more detail
Explore querying project data for analysis