Dataset Submitters

Dataset submitters prepare and register datasets for inclusion in a dsgrid project. The first step, dataset registration, involves defining dimensions, creating a dataset configuration file, and verifying the dataset’s internal consistency (schema, dimensions, and data completeness). Once registration is complete, the dataset submitter prepares for project submittal by creating dimension mappings and an associated mappings configuration file. When everything is submitted, the project verifies that the dimension mappings are internally consistent and that the dataset provides all expected data points.

Dataset registration is supported by the commands dsgrid registry datasets generate-config and dsgrid registry datasets register. The intention is for dataset submitters to go through these steps themselves, in the same computational environment they used to create the original dataset(s). Project submittal and subsequent use sometimes involve expanding ("exploding") the dataset across the project's dimensions, in which case project submittal might be performed primarily by the project coordinator using Apache Spark. For smaller datasets and projects, the dataset submitter might perform this step as well, using dsgrid registry projects submit-dataset or dsgrid registry projects register-and-submit-dataset.
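
As a quick orientation, the four commands named above can be collected in a small helper that prints each one with a help flag, so you can check the arguments your installed version expects before running anything. This is only a sketch: the subcommand names come from this guide, --help is assumed to be available (it is standard for command-line tools of this kind), and build_command and help_commands are hypothetical helper names. No dataset-specific arguments are shown because they are not specified here.

```python
import shlex

# Subcommand paths exactly as named in this guide.
DSGRID_COMMANDS = {
    "generate-config": ("registry", "datasets", "generate-config"),
    "register": ("registry", "datasets", "register"),
    "submit-dataset": ("registry", "projects", "submit-dataset"),
    "register-and-submit-dataset": (
        "registry", "projects", "register-and-submit-dataset",
    ),
}


def build_command(step: str, *extra_args: str) -> list[str]:
    """Return the argument list for one step, with extra arguments appended as given."""
    return ["dsgrid", *DSGRID_COMMANDS[step], *extra_args]


def help_commands() -> list[str]:
    """Return one shell-ready help invocation per step (--help is an assumption)."""
    return [shlex.join(build_command(step, "--help")) for step in DSGRID_COMMANDS]


if __name__ == "__main__":
    for line in help_commands():
        print(line)
```

The helper only assembles strings; it does not call dsgrid. Passing the resulting list to subprocess.run would execute a step once you have confirmed its arguments.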

Prerequisites

  • Install dsgrid on your system

  • Create or access a dsgrid registry (a pre-populated dsgrid registry can assist with identifying dimensions)

  • Prepare your dataset in a supported format (see Data File Formats)

  • Familiarity with, or an integrated development environment (IDE) extension for, JSON5 syntax

  • Access to the project config file and optionally the project registry

    • The config file is typically available in a project-specific repository of config files (e.g., dsgrid-project-IEF)

    • Be prepared to iterate with the project coordinator to bring the project and dataset configurations into alignment

Workflow Overview

Phase 1 — Dataset Registration

Registers the dataset as a standalone entity in the registry. Validates internal integrity (schema, dimensions, and data completeness). No dsgrid project is required.

  1. Understand the fundamentals — Read Dimension Concepts and Dataset Concepts to understand how dsgrid organizes data.

  2. Create an initial draft of the config and dimension record files — Run dsgrid registry datasets generate-config to auto-generate a dataset.json5 and dimension record CSVs from your data file(s). The tool searches the registry for matching dimensions (prioritizing project base dimensions if a -P argument is passed).

  3. Refine your dataset config and dimensions — Review and edit the generated config and dimension record files. Regarding the config file, see Dataset Concepts for guidance and the Dataset Data Model for the full schema. Follow How to Create Dataset Dimensions for guidance on dimension records.

  4. Register your dataset — Run dsgrid registry datasets register. This validates internal integrity: schema, dimensions, and data completeness.

  5. Address missing dimension associations — If registration fails with missing records, dsgrid writes the missing combinations to a Parquet file and runs pattern analysis (via find_minimal_patterns) to help identify root causes. Either fix the data gaps or declare expected missing associations in the dataset config using the missing_associations field (and use the -M CLI option to specify a base directory for the missing association files as needed). Iterate on steps 4–5 as needed. See How to Handle Missing Dimension Associations for the full workflow.
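
To make the idea of missing dimension associations in step 5 concrete, the sketch below compares the full cross product of dimension records with the combinations actually present in a table, then separates gaps that were declared as expected from gaps that still need attention. It is a conceptual illustration in plain Python, not dsgrid's implementation: the dimension names, record IDs, and the find_missing_associations helper are all hypothetical.

```python
from itertools import product


def find_missing_associations(dimension_records, observed_rows):
    """Return combinations in the full cross product that the data does not contain."""
    names = list(dimension_records)
    expected = set(product(*(dimension_records[name] for name in names)))
    observed = {tuple(row[name] for name in names) for row in observed_rows}
    return sorted(expected - observed)


# Hypothetical dimension records and data rows, for illustration only.
dimension_records = {
    "geography": ["06037", "06073"],
    "sector": ["com", "res"],
}
observed_rows = [
    {"geography": "06037", "sector": "com"},
    {"geography": "06037", "sector": "res"},
    {"geography": "06073", "sector": "res"},
]

missing = find_missing_associations(dimension_records, observed_rows)
print(missing)  # [('06073', 'com')]

# Gaps declared as expected (the role played by missing_associations in the
# dataset config) are set aside; anything left is a data gap to fix.
declared_missing = {("06073", "com")}
unexplained = [combo for combo in missing if combo not in declared_missing]
print(unexplained)  # []
```

The two outcomes correspond to the two options in step 5: fix the data so that the first list is empty, or declare the gaps so that the second list is empty.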

Phase 2 — Project Submittal

Submits the registered dataset to a specific project. Dimension mappings are usually required to align dataset dimensions with project base dimensions. Validates that dimension mappings are consistent and that the dataset provides all expected data points.

  1. Review project requirements — Check what dimensions and data points the project expects from your dataset. Browse the project’s repository of config files or use How to Browse the Registry to inspect the project’s base dimensions.

  2. Create dimension mappings — Map dataset dimensions to project base dimensions.

  3. Submit your dataset to the project — Run dsgrid registry projects submit-dataset (or use register-and-submit-dataset for a combined operation). Follow the Dataset Submission Process for details.
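
The consistency check on dimension mappings can be pictured as follows. This is a minimal sketch under stated assumptions: mapping records are modeled as dictionaries with from_id and to_id keys (assumed names, which may differ from the actual mapping file schema), the record IDs are invented, and check_mapping is a hypothetical helper, not part of dsgrid.

```python
def check_mapping(dataset_ids, project_ids, mapping_records):
    """Return (dataset IDs with no mapping, mapping targets unknown to the project)."""
    mapped_from = {record["from_id"] for record in mapping_records}
    mapped_to = {record["to_id"] for record in mapping_records}
    unmapped = sorted(set(dataset_ids) - mapped_from)
    unknown_targets = sorted(mapped_to - set(project_ids))
    return unmapped, unknown_targets


# Hypothetical sector dimension: dataset records on the left, project base
# dimension records on the right.
dataset_ids = ["commercial", "industrial", "residential"]
project_ids = ["com", "ind", "res"]
mapping_records = [
    {"from_id": "commercial", "to_id": "com"},
    {"from_id": "residential", "to_id": "res"},
]

unmapped, unknown_targets = check_mapping(dataset_ids, project_ids, mapping_records)
print(unmapped)         # ['industrial']
print(unknown_targets)  # []
```

Running a check like this before submission is one way to catch an incomplete mapping early; the project's own validation remains the authority on what is accepted.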

When You Need Apache Spark

Small datasets can be registered using the default DuckDB backend. If your dataset is large or maps onto high-resolution project dimensions (e.g., hourly × county), the submission step may require Spark for adequate performance. In that case, the submission might be performed primarily by the project coordinator, as noted in the introduction.
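
One rough way to gauge whether a dataset falls into the high-resolution category is to multiply the sizes of the project dimensions it will be mapped onto. The sketch below does only that arithmetic. The dimension sizes are round, hypothetical numbers chosen for illustration, exploded_row_count is a hypothetical helper, and no particular row count is claimed here as the point at which Spark becomes necessary.

```python
from math import prod


def exploded_row_count(dimension_sizes):
    """Upper bound on row count: the product of all dimension sizes."""
    return prod(dimension_sizes.values())


# Hypothetical sizes: hourly time for one non-leap year, a round number of
# counties, and a small number of subsectors.
dimension_sizes = {
    "time": 8760,
    "geography": 3000,
    "subsector": 10,
}

rows = exploded_row_count(dimension_sizes)
print(f"{rows:,}")  # 262,800,000
```

The result is an upper bound, since not every combination necessarily carries data, but it gives a quick sense of scale to discuss with the project coordinator.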
