# Create a Derived Dataset

In this tutorial you will learn how to query a dsgrid project to produce and register a derived dataset. The tutorial uses the comstock_conus_2022_projected derived dataset from [dsgrid-project-StandardScenarios](https://github.com/dsgrid/dsgrid-project-StandardScenarios) as an example.

You can run all commands in this tutorial on NREL's Kestrel HPC cluster except the last one (the dataset is already registered).

## Steps

SSH to a login node to begin the tutorial.

### Step 1: Set Up Your Spark Cluster

You will need at least four compute nodes to create this dataset in about an hour. Follow the instructions at [Run dsgrid on Kestrel](../how_tos/run_on_kestrel) if you have not already done so.

The rest of this tutorial assumes that you are logged in to the compute node that is running the Spark master process.

### Step 2: Copy the Query File

Copy the query file for this derived dataset from [GitHub](https://github.com/dsgrid/dsgrid-project-StandardScenarios/blob/main/dsgrid_project/derived_datasets/comstock_conus_2022_projected.json5), for example by downloading the raw file with `wget` or `curl`.

### Step 3: Set Environment Variables

Set these environment variables to avoid repeated typing:

:::{note}
The value of 2400 for NUM_PARTITIONS is based on observations from processing this ~1 TB dataset.
:::

```bash
export SPARK_CLUSTER=spark://$(hostname):7077
export QUERY_OUTPUT=query-output
export DSGRID_CLI=$(which dsgrid-cli.py)
export NUM_PARTITIONS=2400
```

### Step 4: Create the Derived Dataset

Create the `comstock_conus_2022_projected` dataset. The `comstock_conus_2022_reference` dataset has load data for a single year. This query applies the `aeo2021_reference_commercial_energy_use_growth_factors` dataset to project the load values through the model year 2050.

```bash
spark-submit \
  --master ${SPARK_CLUSTER} \
  --conf spark.sql.shuffle.partitions=${NUM_PARTITIONS} \
  ${DSGRID_CLI} \
  query \
  project \
  run \
  comstock_conus_2022_projected.json5 \
  -o ${QUERY_OUTPUT}
```

### Step 5: Create Derived Dataset Config Files

Generate the configuration files for the derived dataset:

```bash
spark-submit \
  --master ${SPARK_CLUSTER} \
  --conf spark.sql.shuffle.partitions=${NUM_PARTITIONS} \
  ${DSGRID_CLI} \
  query \
  project \
  create-derived-dataset-config \
  ${QUERY_OUTPUT}/comstock_conus_2022_projected \
  comstock-dd
```

### Step 6: Edit Configuration Files

Edit the output files in `comstock-dd` as desired.

### Step 7: Register the Derived Dataset

Register the derived dataset with the dsgrid registry:

```bash
spark-submit \
  --master ${SPARK_CLUSTER} \
  --conf spark.sql.shuffle.partitions=${NUM_PARTITIONS} \
  ${DSGRID_CLI} \
  registry \
  datasets \
  register \
  comstock-dd/dataset.json5 \
  ${QUERY_OUTPUT}/comstock_conus_2022_projected \
  -l Register_comstock_conus_2022_projected
```

### Step 8: Submit to the Project

Submit the derived dataset to the project:

```bash
spark-submit \
  --master ${SPARK_CLUSTER} \
  --conf spark.sql.shuffle.partitions=${NUM_PARTITIONS} \
  ${DSGRID_CLI} \
  registry \
  projects \
  submit-dataset \
  -p dsgrid_conus_2022 \
  -d comstock_conus_2022_projected \
  -r comstock-dd/dimension_mapping_references.json5 \
  -l Submit_comstock_conus_2022_projected
```

If you submit against your own registry, a hedged sketch for checking the result follows the Next Steps list.

## Next Steps

- Learn about [derived dataset concepts](../project_derived_datasets/concepts)
- Understand [query processing](../project_queries/concepts) in more detail
- Explore [querying project data](query_project) for analysis
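
If you ran Steps 7 and 8 against your own registry (on Kestrel the dataset is already registered, so you will have skipped them), you may want to confirm that the dataset is now visible. The following is a minimal sketch, not the documented workflow: it assumes your dsgrid version provides a `registry datasets list` subcommand and that your registry connection is already configured.

```bash
# Sketch: check that the derived dataset now appears in the registry.
# Assumes `dsgrid-cli.py registry datasets list` exists in your dsgrid version;
# consult `dsgrid-cli.py registry datasets --help` if it does not.
spark-submit \
  --master ${SPARK_CLUSTER} \
  ${DSGRID_CLI} \
  registry \
  datasets \
  list | grep comstock_conus_2022_projected
```

If the dataset id appears in the output, the registration and submission succeeded.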