Dataset Submitters¶
Dataset submitters prepare and register datasets for inclusion in a dsgrid project. The first step, dataset registration, involves defining dimensions, creating a dataset configuration file, and verifying the dataset’s internal consistency (schema, dimensions, and data completeness). Once that is complete, the dataset submitter prepares for project submittal by creating dimension mappings and an associated mappings configuration file. At submission time, dsgrid verifies the internal consistency of the dimension mappings and checks that the dataset provides all expected data points.
Dataset registration is supported by the commands dsgrid registry datasets generate-config and dsgrid registry datasets register. The intention is for dataset submitters to go through these steps themselves, in the same computational environments they used to create the original dataset(s). Project submittal and subsequent use sometimes involves exploding out the dimensions of the dataset, in which case project submission might be performed primarily by the project coordinator using Apache Spark. For smaller datasets and projects, the dataset submitter might perform this step themselves as well, using dsgrid registry projects submit-dataset or dsgrid registry projects register-and-submit-dataset.
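The registration-phase commands above can be sketched as a short session. The file and directory arguments here are illustrative placeholders, not verbatim from the dsgrid documentation; check the CLI help for the exact options:

```shell
# Phase 1 sketch: draft a config from your data, then register.
# Paths and argument order are placeholders -- run `dsgrid registry datasets generate-config --help`
# and `dsgrid registry datasets register --help` for the actual interface.
dsgrid registry datasets generate-config my_dataset.parquet

# ... review and edit the generated dataset.json5 and dimension record CSVs ...

dsgrid registry datasets register dataset.json5
```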
Prerequisites¶
Install dsgrid on your system
Create or access a dsgrid registry (a pre-populated dsgrid registry can assist with identifying dimensions)
Your dataset in a supported format (see Data File Formats)
Familiarity with, or an integrated development environment (IDE) extension for, JSON5 syntax
Access to the project config file and optionally the project registry
The config file is typically available in a project-specific repository of config files (e.g., dsgrid-project-IEF)
Be prepared to iterate with the project coordinator to bring the project and dataset configurations into alignment
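JSON5 is JSON plus a few ergonomic extensions: comments, unquoted keys, and trailing commas. A generic illustration of the syntax follows; the keys here are invented for demonstration and are not the actual dsgrid dataset schema:

```json5
{
  // Comments are allowed in JSON5.
  dataset_id: "example_dataset",   // unquoted keys are valid
  tags: [
    "illustrative",
    "not-the-real-schema",         // trailing commas are fine
  ],
}
```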
Workflow Overview¶
Phase 1 — Dataset Registration¶
Registers the dataset as a standalone entity in the registry. Validates internal integrity (schema, dimensions, and data completeness). No dsgrid project is required.
1. Understand the fundamentals — Read Dimension Concepts and Dataset Concepts to understand how dsgrid organizes data.
2. Create an initial draft of the config and dimension record files — Run dsgrid registry datasets generate-config to auto-generate a dataset.json5 and dimension record CSVs from your data file(s). The tool searches the registry for matching dimensions (prioritizing project base dimensions if a -P argument is passed).
3. Refine your dataset config and dimensions — Review and edit the generated config and dimension record files. Regarding the config file, see Dataset Concepts for guidance and the Dataset Data Model for the full schema. Follow How to Create Dataset Dimensions for guidance on dimension records.
4. Register your dataset — Run dsgrid registry datasets register. This validates internal integrity: schema, dimensions, and data completeness.
5. Address missing dimension associations — If registration fails with missing records, dsgrid writes the missing combinations to a Parquet file and runs pattern analysis (via find_minimal_patterns) to help identify root causes. Either fix the data gaps or declare expected missing associations in the dataset config using the missing_associations field (and use the -M CLI option to specify a base directory for the missing association files as needed). Iterate on steps 4–5 as needed. See How to Handle Missing Dimension Associations for the full workflow.
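Dimension record files (step 3) are plain CSVs, one row per dimension record. A minimal sketch of writing one programmatically follows, assuming a simple id/name column layout; the actual required columns come from the files that generate-config produces, so check those rather than this example:

```python
import csv

# Hypothetical geography dimension records. The real column set is defined
# by the CSVs that `dsgrid registry datasets generate-config` emits.
records = [
    {"id": "06037", "name": "Los Angeles County"},
    {"id": "06073", "name": "San Diego County"},
]

with open("geography.csv", "w", newline="") as f:
    writer = csv.DictWriter(f, fieldnames=["id", "name"])
    writer.writeheader()
    writer.writerows(records)
```

Editing the generated files by hand works equally well; a script like this is only useful when the record list is long or derived from another source.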
Phase 2 — Project Submittal¶
Submits the registered dataset to a specific project. Dimension mappings are usually required to align dataset dimensions with project base dimensions. Validates that dimension mappings are consistent and that the dataset provides all expected data points.
1. Review project requirements — Check what dimensions and data points the project expects from your dataset. Browse the project’s repository of config files or use How to Browse the Registry to inspect the project’s base dimensions.
2. Create dimension mappings — Map dataset dimensions to project base dimensions.
3. Submit your dataset to the project — Run dsgrid registry projects submit-dataset (or use register-and-submit-dataset for a combined operation). Follow the Dataset Submission Process for details.
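The submission step can be sketched as below. The dataset and project identifiers are hypothetical placeholders, and the option names are not taken from the dsgrid documentation; consult the CLI help for the real interface:

```shell
# Phase 2 sketch: submit a registered dataset to a project.
# <dataset-id>, <project-id>, and the mappings path are placeholders --
# run `dsgrid registry projects submit-dataset --help` for actual options.
dsgrid registry projects submit-dataset <dataset-id> <project-id>

# Or register and submit in one combined operation:
dsgrid registry projects register-and-submit-dataset <dataset-config> <project-id>
```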
When You Need Apache Spark¶
Small datasets can be registered using the default DuckDB backend. If your dataset is large or maps onto high-resolution project dimensions (e.g., hourly × county), the submission step may require Spark for adequate performance. In that case:
Install the Spark extras: pip install "dsgrid-toolkit[spark]"
See How to Start a Spark Cluster on Kestrel for running on NREL HPC systems.