Query a Project

In this tutorial you will learn how to query a dsgrid project for aggregated data. The query will use data from the dsgrid registry stored on NREL’s HPC Kestrel cluster.

Query objectives

This query will accomplish the following:

  • Read data from the datasets tempo_conus_2022_mapped, resstock_conus_2022_projected, and comstock_conus_2022_projected.

  • Filter data for only model years 2030, 2040, and 2050.

  • Aggregate county-level load data into state-level.

  • Aggregate load data by fuel type.

  • Aggregate hourly data into annual data.

  • Drop the subsector dimension.

Steps

ssh to a login node to begin the tutorial.

  1. Follow the instructions at How to Run dsgrid on Kestrel if you have not already done so.

  2. Add this content to a file called query.json5:

Note

There is a CLI command that can generate a query file for your project. Refer to dsgrid query project create --help.

{
  "name": "load-per-state",
  "version": "0.1.0",
  "project": {
    "project_id": "dsgrid_conus_2022",
    "dataset": {
      "dataset_id": "load-per-state",
      "source_datasets": [
        {
          "dataset_id": "comstock_conus_2022_projected",
          "dataset_type": "standalone"
        },
        {
          "dataset_id": "resstock_conus_2022_projected",
          "dataset_type": "standalone"
        },
        {
          "dataset_id": "tempo_conus_2022_mapped",
          "dataset_type": "standalone"
        },
      ],
      "params": {
        "dimension_filters": [
          {
            "dimension_type": "model_year",
            "dimension_query_name": "model_year",
            "column": "id",
            "operator": "isin",
            "value": [
              "2030",
              "2040",
              "2050",
            ],
            "filter_type": "DimensionFilterColumnOperatorModel"
          }
        ],
      }
    },
  },
  "result": {
    "replace_ids_with_names": false,
    "aggregations": [
      {
        "aggregation_function": "sum",
        "dimensions": {
          "geography": [
            {
              "dimension_query_name": "state",
            }
          ],
          "metric": [
            {
              "dimension_query_name": "end_uses_by_fuel_type",
            }
          ],
          "model_year": [
            {
              "dimension_query_name": "model_year",
            }
          ],
          "scenario": [
            {
              "dimension_query_name": "scenario",
            }
          ],
          "sector": [
            {
              "dimension_query_name": "sector",
            }
          ],
          "subsector": [
          ],
          "time": [
            {
              "dimension_query_name": "time_est",
              "function": "year",
              "alias": "year"
            }
          ],
          "weather_year": [
            {
              "dimension_query_name": "weather_2012",
            }
          ]
        }
      }
    ],
    "reports": [],
    "column_type": "dimension_query_names",
    "dimension_filters": [],
    "time_zone": null
  }
}

If you only care about a limited number of fuel types, you could add this filter to the dataset params:

"params": {
  "dimension_filters": [
    {
      "dimension_type": "metric",
      "dimension_query_name": "end_uses_by_fuel_type",
      "column": "fuel_id",
      "value": [
        "electricity",
        "natural_gas"
      ],
      "operator": "isin",
      "negate": false,
      "filter_type": "SupplementalDimensionFilterColumnOperatorModel"
    }
  ],
}
  1. Start a Spark cluster with two compute nodes as described in How to Start a Spark Cluster on Kestrel.

  2. Activate a Python virtual environment that includes dsgrid.

$ conda activate dsgrid
  1. Run the query.

$ spark-submit --master=spark://$(hostname):7077 $(which dsgrid-cli.py) query project run query.json5

The query may take ~55 minutes.

  1. Inspect the output table.

$ pyspark --master=spark://$(hostname):7077
>>> df = spark.read.load("query_output/load-per-state-2030/table.parquet")
>>> columns = ["time_est", "state", "scenario", "sector", "weather_2012", "all_electricity"]
>>> df.sort("state", "scenario", "model_year", "time_est").show()
+-----+----------+------------+------+-------------------+------------+--------------------+-------------------+--------------------+------------------+
|state|model_year|    scenario|sector|           time_est|weather_2012|electricity_end_uses|  fuel_oil_end_uses|natural_gas_end_uses|  propane_end_uses|
+-----+----------+------------+------+-------------------+------------+--------------------+-------------------+--------------------+------------------+
|   AL|      2030|efs_high_ldv|   com|2011-12-31 22:00:00|        2012|   620.7906599513221| 0.5387437841876448|  129.36033825268063| 5.420073700645743|
|   AL|      2030|efs_high_ldv|   com|2011-12-31 23:00:00|        2012|   614.3206498969266| 0.5416918956851451|  124.89964054800879|5.5127600846910925|
|   AL|      2030|efs_high_ldv|   com|2012-01-01 00:00:00|        2012|   608.2849965404984| 0.5769061150253406|   131.3726191747269| 5.634768746851266|
|   AL|      2030|efs_high_ldv|   com|2012-01-01 01:00:00|        2012|   611.5344463308626| 0.5971660979790878|  143.24735266593729| 5.788247589054716|
|   AL|      2030|efs_high_ldv|   com|2012-01-01 02:00:00|        2012|   642.3981904857268| 0.6859885122836309|  182.33194073437588|  7.97610263995906|
|   AL|      2030|efs_high_ldv|   com|2012-01-01 03:00:00|        2012|   717.3520732997924|  4.024472764883984|  370.80760961376876|22.227861344037187|
|   AL|      2030|efs_high_ldv|   com|2012-01-01 04:00:00|        2012|   613.8294652248369| 2.0143366691532707|   343.2025577876601| 16.94645372648664|
|   AL|      2030|efs_high_ldv|   com|2012-01-01 05:00:00|        2012|   658.7328024466709|  1.353741822119555|  350.85640893192993| 14.51586872394028|
|   AL|      2030|efs_high_ldv|   com|2012-01-01 06:00:00|        2012|   699.8174234582644| 0.8924590078874647|   389.4158152004862| 10.21988642248965|
|   AL|      2030|efs_high_ldv|   com|2012-01-01 07:00:00|        2012|   743.2481362935839| 0.6068532986319386|   450.8920847000712| 7.672432329899141|
|   AL|      2030|efs_high_ldv|   com|2012-01-01 08:00:00|        2012|     793.64723585044| 0.5041067604373506|   475.3169294837448| 6.838798780678826|
|   AL|      2030|efs_high_ldv|   com|2012-01-01 09:00:00|        2012|   841.1101704879942|0.45131205367098215|  467.61967258296016| 6.426079631558903|
|   AL|      2030|efs_high_ldv|   com|2012-01-01 10:00:00|        2012|   869.4957512282607|0.35165281820491173|   442.2650153173674| 6.157433321806227|
|   AL|      2030|efs_high_ldv|   com|2012-01-01 11:00:00|        2012|   882.6659925381028|0.33634962431492477|   407.6767924458409| 6.193132473615856|
|   AL|      2030|efs_high_ldv|   com|2012-01-01 12:00:00|        2012|   871.6219175694675| 0.4538962808891562|   406.1393196887077|6.4655789088596896|
|   AL|      2030|efs_high_ldv|   com|2012-01-01 13:00:00|        2012|   866.2804144237266| 0.7825813246602221|  425.44571896883167|  7.59699881137887|
|   AL|      2030|efs_high_ldv|   com|2012-01-01 14:00:00|        2012|   898.4427579031129| 1.1631267804567154|  413.79402978076445| 9.040045654010711|
|   AL|      2030|efs_high_ldv|   com|2012-01-01 15:00:00|        2012|   884.7541250936392| 1.6279990839937193|   409.6531761201896|10.819410305573035|
|   AL|      2030|efs_high_ldv|   com|2012-01-01 16:00:00|        2012|   861.7277658774752| 2.1224667695070067|  398.15739373650064|12.415366975813644|
|   AL|      2030|efs_high_ldv|   com|2012-01-01 17:00:00|        2012|    832.487984827325| 2.6171310594480213|   375.1621027331863|13.728169793981515|
+-----+----------+------------+------+-------------------+------------+--------------------+-------------------+--------------------+------------------+