Query a Dataset with Python¶
In this tutorial you will learn how to query data from a published dsgrid dataset using DuckDB and Python. We’ll show you how to access data from OEDI and from NREL’s HPC.
This tutorial uses the state_level_simplified dataset from the TEMPO project as an example and accomplishes the following:
Reads data from an OEDI S3 bucket or the NREL HPC filesystem
Filters data for given model years, geographies, scenarios, and other dimensions
Exports the query results to a pandas DataFrame or CSV
Required Knowledge¶
How to set up a Python virtual environment
How to install Python modules
How to use a Jupyter notebook or Python interpreter
Set Up a Python Environment¶
From a terminal run the following commands to set up a virtual environment, activate it, and install the requisite packages:
On the NREL HPC:
module load python
python -m venv dsgrid-tutorial
source dsgrid-tutorial/bin/activate
pip install duckdb pandas
On Mac or Linux:
python -m venv dsgrid-tutorial
source dsgrid-tutorial/bin/activate
pip install duckdb pandas
On Windows:
python -m venv dsgrid-tutorial
dsgrid-tutorial\Scripts\activate # For PowerShell: dsgrid-tutorial\Scripts\Activate.ps1
pip install duckdb pandas
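To verify the installation, check that the packages import cleanly:
python -c "import duckdb, pandas; print(duckdb.__version__)"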
Load Data¶
1. Enter a Python interpreter¶
python
2. Load .parquet files into a table¶
If reading from OEDI (S3):
import duckdb
tablename = "tbl"
data_dir = "s3://nrel-pds-dsgrid/tempo/tempo-2022/v1.0.0"
dataset_name = "state_level_simplified"
filepath = f"{data_dir}/{dataset_name}"
duckdb.sql(f"""CREATE TABLE {tablename} AS SELECT *
FROM read_parquet('{filepath}/table.parquet/**/*.parquet',
hive_partitioning=true, hive_types_autocast=false)""")
duckdb.sql(f"DESCRIBE {tablename}") # shows columns and types
duckdb.sql(f"SELECT * FROM {tablename} LIMIT 5").to_df() # shows first 5 rows
If reading from the NREL HPC filesystem (this example creates a VIEW rather than a TABLE; re-reading local files on each query is cheap, whereas the S3 example materializes the data once to avoid repeated remote reads):
import duckdb
tablename = "tbl"
data_dir = "/datasets/dsgrid/dsgrid-tempo-v2022"
dataset_name = "state_level_simplified"
filepath = f"{data_dir}/{dataset_name}"
duckdb.sql(f"""CREATE VIEW {tablename} AS SELECT *
FROM read_parquet("{filepath}/table.parquet/**/*.parquet",
hive_partitioning=true, hive_types_autocast=false)""")
duckdb.sql(f"DESCRIBE {tablename}") # shows columns and types
duckdb.sql(f"SELECT * FROM {tablename} LIMIT 5").to_df() # shows first 5 rows
Filter Data with DuckDB¶
One of the main advantages of using DuckDB is the ability to filter data while loading. If a table is created with a filter, DuckDB does not have to read all of the data to build the requested table; filters on hive-partition columns let it skip entire directories. This can make queries much more efficient.
Using the same tablename and filepath from the sections above:
duckdb.sql(f"""CREATE TABLE {tablename} AS SELECT *
FROM read_parquet('{filepath}/table.parquet/**/*.parquet',
hive_partitioning=true, hive_types_autocast=false)
WHERE state='MI' AND scenario='efs_high_ldv'
""")
Aggregation and Metadata¶
This example covers two distinct topics:
Aggregation with DuckDB
How to use dsgrid metadata in a query
dsgrid datasets contain a metadata.json file that specifies the dimensions, their column names and query_names, and the value column of the dataset. The best way to use this metadata is to load it as a TableMetadata object using the provided dsgrid/scripts/table_metadata.py file. Loading TableMetadata requires pydantic; if using OEDI, pyarrow is also needed to read metadata.json from S3.
To use the table_metadata script, either copy it from GitHub into a local directory or clone the dsgrid repository. In the Read Metadata step, set the scripts_path variable to the directory that contains table_metadata.py; in a dsgrid clone, that is the dsgrid/scripts directory.
Setup¶
If using OEDI:
pip install pydantic pyarrow
If using the NREL HPC:
pip install pydantic
Read Metadata¶
If using the NREL HPC:
from pathlib import Path
import sys
scripts_path = Path("<insert path here>")
sys.path.append(str(scripts_path))
from table_metadata import TableMetadata
dataset_path = "/datasets/dsgrid/dsgrid-tempo-v2022/state_level_simplified"
metadata_path = f"{dataset_path}/metadata.json"
table_metadata = TableMetadata.from_file(metadata_path)
If using OEDI:
from pathlib import Path
import sys
scripts_path = Path("<insert path here>")
sys.path.append(str(scripts_path))
from table_metadata import TableMetadata
bucket = "nrel-pds-dsgrid"
filepath = "tempo/tempo-2022/v1.0.0/state_level_simplified/metadata.json"
table_metadata = TableMetadata.from_s3(bucket, filepath)
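Before writing a query, it can help to inspect what the metadata describes. The snippet below uses the same value_column attribute and list_columns method that appear in the aggregation example that follows:
# Show the name of the value column and the columns backing two dimensions.
print(table_metadata.value_column)
print(table_metadata.list_columns("geography"))
print(table_metadata.list_columns("scenario"))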
Use Metadata for Aggregation¶
The metadata attributes columns_by_type and value_column can be used to write queries that apply to different datasets. The following example queries the state_level_simplified dataset and aggregates the results by model_year, scenario, geography, and subsector, with a column for the value summed across groups. Each dimension can map to multiple columns, so we first build group_by_cols from the metadata and then use that list to create the table.
group_by_dimensions = ['model_year', 'scenario', 'geography', 'subsector']
group_by_cols = []
for dimension in group_by_dimensions:
group_by_cols.extend(table_metadata.list_columns(dimension))
group_by_str = ", ".join(group_by_cols)
value_column = table_metadata.value_column
duckdb.sql(f"""CREATE TABLE {tablename} AS
SELECT SUM({value_column}) AS value_sum, {group_by_str}
FROM read_parquet('{filepath}/table.parquet/**/*.parquet')
GROUP BY {group_by_str}
""")
Note
This query would also work on the full_dataset by using the metadata for its dimensions, but it could take hours or fail because of memory limitations.
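If you do need to aggregate a larger dataset, one way to keep the query tractable is to filter while aggregating, combining the WHERE clause from the filtering section with the GROUP BY above. A sketch, with illustrative filter values:
# Aggregate a single scenario; partition pruning keeps the scan small.
duckdb.sql(f"""CREATE OR REPLACE TABLE {tablename} AS
SELECT SUM({value_column}) AS value_sum, {group_by_str}
FROM read_parquet('{filepath}/table.parquet/**/*.parquet',
hive_partitioning=true, hive_types_autocast=false)
WHERE scenario='efs_high_ldv'
GROUP BY {group_by_str}
""")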
Export Data¶
1. Create a pandas DataFrame¶
After loading and possibly filtering data from the previous steps:
dataframe = duckdb.sql(f"SELECT * FROM {tablename}").df()
2. Export DataFrame to CSV¶
dataframe.to_csv('mydata.csv')
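For large results you can also skip pandas entirely and have DuckDB write the file with COPY (standard DuckDB syntax; the output file names are illustrative):
# Write the table straight to disk without building a DataFrame first.
duckdb.sql(f"COPY {tablename} TO 'mydata.csv' (HEADER, DELIMITER ',')")
duckdb.sql(f"COPY {tablename} TO 'mydata.parquet' (FORMAT PARQUET)")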
Next Steps¶
See how to work with Apache Spark for larger datasets