# Query a Dataset with Python

In this tutorial you will learn how to query data from a published dsgrid dataset using DuckDB and Python. We'll show you how to access data from [OEDI](https://data.openei.org/) and from [NREL's HPC](https://www.nrel.gov/hpc/).

This tutorial uses the `state_level_simplified` dataset from the [TEMPO project](https://github.com/dsgrid/dsgrid-project-StandardScenarios/tree/main/tempo_project) as an example and accomplishes the following:

- Reads data from an OEDI S3 bucket or the NREL HPC filesystem
- Filters data for given model years, geography, scenario, and other dimensions
- Exports the query results to a pandas DataFrame or CSV

## Required Knowledge

- How to set up a Python virtual environment
- How to install Python modules
- How to use a Jupyter notebook or Python interpreter

## Setup Python Environment

From a terminal, run the following commands to set up a virtual environment, activate it, and install the requisite packages:

````{tab-set}
```{tab-item} NREL HPC
```bash
module load python
python -m venv dsgrid-tutorial
source dsgrid-tutorial/bin/activate
pip install duckdb pandas
```
```{tab-item} Mac/Linux
```bash
python -m venv dsgrid-tutorial
source dsgrid-tutorial/bin/activate
pip install duckdb pandas
```
```{tab-item} Windows
```bash
python -m venv dsgrid-tutorial
dsgrid-tutorial\Scripts\activate
# For PowerShell: dsgrid-tutorial\Scripts\Activate.ps1
pip install duckdb pandas
```
````

## Load Data

### 1. Enter a Python interpreter

```bash
python
```

### 2. Load .parquet files into a table

````{tab-set}
```{tab-item} OEDI
```python
import duckdb

tablename = "tbl"
data_dir = "s3://nrel-pds-dsgrid/tempo/tempo-2022/v1.0.0"
dataset_name = "state_level_simplified"
filepath = f"{data_dir}/{dataset_name}"
duckdb.sql(f"""CREATE TABLE {tablename} AS
    SELECT * FROM read_parquet('{filepath}/table.parquet/**/*.parquet',
                               hive_partitioning=true, hive_types_autocast=false)""")
duckdb.sql(f"DESCRIBE {tablename}")  # shows columns and types
duckdb.sql(f"SELECT * FROM {tablename} LIMIT 5").to_df()  # shows the first 5 rows
```
```{tab-item} NREL HPC (currently Kestrel)
```python
import duckdb

tablename = "tbl"
data_dir = "/datasets/dsgrid/dsgrid-tempo-v2022"
dataset_name = "state_level_simplified"
filepath = f"{data_dir}/{dataset_name}"
duckdb.sql(f"""CREATE VIEW {tablename} AS
    SELECT * FROM read_parquet('{filepath}/table.parquet/**/*.parquet',
                               hive_partitioning=true, hive_types_autocast=false)""")
duckdb.sql(f"DESCRIBE {tablename}")  # shows columns and types
duckdb.sql(f"SELECT * FROM {tablename} LIMIT 5").to_df()  # shows the first 5 rows
```
````

## Filter Data with DuckDB

One of the main advantages of using DuckDB is the ability to filter data while loading it. If a table is created with a filter, DuckDB does not have to read all of the data to generate the requested table, which can make queries much more efficient.

Using the same `tablename` and `filepath` from the sections above (if `tbl` already exists from the previous step, drop it first or choose a different `tablename`):

```python
duckdb.sql(f"""CREATE TABLE {tablename} AS
    SELECT * FROM read_parquet('{filepath}/table.parquet/**/*.parquet',
                               hive_partitioning=true, hive_types_autocast=false)
    WHERE state='MI' AND scenario='efs_high_ldv'
""")
```

## Aggregation and Metadata

This example covers two distinct topics:

- Aggregation with DuckDB
- How to use dsgrid metadata in a query

dsgrid datasets contain a `metadata.json` file that specifies the dimensions, their column names and `query_names`, and the value column of the dataset.
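If you want to see exactly what the metadata file contains, you can open it directly. The following is a minimal sketch, assuming you are on the NREL HPC filesystem where `metadata.json` is a plain local file; on OEDI the file lives in S3, so use the `TableMetadata.from_s3` helper described below instead.

```python
import json

# Peek at the raw metadata file on the NREL HPC filesystem (illustrative only;
# the exact structure is defined by dsgrid, so we just pretty-print it here).
metadata_path = "/datasets/dsgrid/dsgrid-tempo-v2022/state_level_simplified/metadata.json"
with open(metadata_path) as f:
    metadata = json.load(f)
print(json.dumps(metadata, indent=2))
```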
The best way to use this metadata is to load it as a `TableMetadata` object using the provided `dsgrid/scripts/table_metadata.py` file. Loading `TableMetadata` requires pydantic, and if you are reading from OEDI, pyarrow is also needed to load the metadata.json from S3. To get the script, either copy `table_metadata.py` from GitHub into a directory of your choosing, or clone the dsgrid repository and use the copy in its `scripts` directory. In the Read Metadata step, set the `scripts_path` variable to the directory that contains `table_metadata.py`; for a cloned repository, that is the `dsgrid/scripts` directory.

### Setup

````{tab-set}
```{tab-item} OEDI
```bash
pip install pydantic pyarrow
```
```{tab-item} NREL HPC
```bash
pip install pydantic
```
````

### Read Metadata

````{tab-set}
```{tab-item} NREL HPC
```python
from pathlib import Path
import sys

scripts_path = Path("")  # directory that contains table_metadata.py
sys.path.append(str(scripts_path))
from table_metadata import TableMetadata

dataset_path = "/datasets/dsgrid/dsgrid-tempo-v2022/state_level_simplified"
metadata_path = f"{dataset_path}/metadata.json"
table_metadata = TableMetadata.from_file(metadata_path)
```
```{tab-item} OEDI
```python
from pathlib import Path
import sys

scripts_path = Path("")  # directory that contains table_metadata.py
sys.path.append(str(scripts_path))
from table_metadata import TableMetadata

bucket = "nrel-pds-dsgrid"
# Use a name other than filepath so the parquet path defined earlier is preserved.
metadata_key = "tempo/tempo-2022/v1.0.0/state_level_simplified/metadata.json"
table_metadata = TableMetadata.from_s3(bucket, metadata_key)
```
````

### Use Metadata for Aggregation

The metadata's `columns_by_type` and `value_column` fields can be used to write queries that apply across different datasets. The following example queries the `state_level_simplified` dataset and aggregates the results by `model_year`, `scenario`, `geography`, and `subsector`, summing the value column within each group. Each dimension can map to multiple columns, so we first build `group_by_cols` from the metadata and then use that list to create the table.

```python
group_by_dimensions = ['model_year', 'scenario', 'geography', 'subsector']
group_by_cols = []
for dimension in group_by_dimensions:
    group_by_cols.extend(table_metadata.list_columns(dimension))
group_by_str = ", ".join(group_by_cols)
value_column = table_metadata.value_column

duckdb.sql(f"""CREATE TABLE {tablename} AS
    SELECT SUM({value_column}) AS value_sum, {group_by_str}
    FROM read_parquet('{filepath}/table.parquet/**/*.parquet',
                      hive_partitioning=true, hive_types_autocast=false)
    GROUP BY {group_by_str}
""")
```

:::{note}
This query would also work on the `full_dataset` by using its metadata for dimensions, but that query could take hours or fail because of memory limitations.
:::

## Export Data

### 1. Create a pandas DataFrame

After loading, and possibly filtering, data in the previous steps:

```python
dataframe = duckdb.sql(f"SELECT * FROM {tablename}").df()
```

### 2. Export the DataFrame to CSV

```python
dataframe.to_csv('mydata.csv')
```

## Next Steps

- See [how to work with Apache Spark](../../user_guide/apache_spark/index) for larger datasets