***************************************
Map a dataset to a project's dimensions
***************************************
It is often beneficial to map a dataset to a project's dimensions before running queries with
other datasets that perform aggregations or filters. Mapping a dataset with Spark can be an
expensive operation that takes several iterations to get right, and it is easier to debug in
isolation. Once complete, the cached result can be used for subsequent queries.

This page assumes that you have already registered a dataset and submitted it to a project. It
also assumes that you have populated your ``~/.dsgrid.json5`` file with the location of your
dsgrid registry. Spark runtime details are not covered here; refer to :ref:`spark-overview`.

Basic operation
===============
dsgrid offers a CLI command to perform the mapping operation. This is its simplest form:

.. code-block:: console

    $ dsgrid query project map-dataset my-project-id my-dataset-id

By default, this will attempt to map all dimensions by performing three Spark queries:

1. Map all dimensions other than time that do not already match the project. Apply scaling
   factors if assigned and automatically convert units if applicable. Persist the result to
   the filesystem.
2. Map the time dimension. Persist the result to the filesystem.
3. Finalize the table: apply user-defined options, such as column names, and add null rows as
   necessary.

If the dataset is smaller than 10 GB, this process should run smoothly with Spark. If the
dataset grows to hundreds of GBs or more, you may experience problems. Our recommendation is
to use the dsgrid mapping plan features described below to work in an iterative manner.

Mapping plan
============
Create a mapping plan as shown in the data model at :ref:`dataset_mapping-plan-reference`.
This plan allows you to specify the order of mapping operations as well as whether to persist
intermediate tables. If you set ``persist=true`` for an operation, dsgrid will persist the
query result to the filesystem and record a metadata file. It can resume from that checkpoint
on subsequent iterations.

Points to consider when creating a mapping plan:

- If a dimension mapping operation will reduce the size of the data, perhaps because it
  aggregates data, list that operation first and persist it.
- If a dimension mapping operation will increase the size of the data, such as a
  disaggregation or duplication, list that operation last and persist the query just before
  it. We have experienced the most problems with Spark with this type of operation.
- Some disaggregation operations can cause data skew. dsgrid will automatically enable
  techniques to handle this condition with certain mapping types. If you experience this
  problem, you may need to set ``handle_data_skew: true`` in the mapping plan for that
  operation, as shown in the sketch after this list. Refer to
  :ref:`executors-spilling-to-disk` for information on how to identify this condition.
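For example, a plan that opts in to skew handling for a troublesome disaggregation operation
might look like the following. This is a minimal sketch: the operation name is hypothetical,
and the full set of fields is defined at :ref:`dataset_mapping-plan-reference`.

.. code-block:: JavaScript

    {
      dataset_id: "my-dataset-id",
      mappings: [
        {
          // Hypothetical disaggregation operation that is skewing data
          // across executors.
          name: "county",
          handle_data_skew: true,
          persist: true,
        },
      ],
    }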
Below is an example mapping plan in JSON5 format. The dataset in this example has a one-to-one
mapping for the scenario dimension, a many-to-many mapping for the model_year dimension, and a
disaggregation from state to county for the geography dimension. The Spark query for the
geography disaggregation is failing. Here is our rationale for the plan:

1. Persist the result after mapping the scenario and model_year dimensions. This part is
   working, but takes some time. We may have to run the geography disaggregation several
   times, and so we want to avoid repeating this work.
2. Persist the result after mapping the geography dimension so that we don't have to repeat
   the work once we figure out the solution.

.. code-block:: JavaScript

    {
      dataset_id: "my-dataset-id",
      mappings: [
        {
          name: "scenario",
        },
        {
          name: "model_year",
          persist: true,
        },
        {
          name: "county",
          persist: true,
        },
      ],
    }

Execution with a mapping plan
=============================

.. code-block:: console

    $ dsgrid query project map-dataset my-project-id my-dataset-id \
        --mapping-plan plan.json5

Observe progress in the console. Whenever dsgrid persists an intermediate query, it will log a
message like this:

.. code-block:: console

    2025-07-08 14:29:21,762 - INFO [dsgrid.dataset.dataset_mapping_manager dataset_mapping_manager.py:99] : Saved checkpoint in /kfs3/scratch/dthom/dsgrid-project/__dsgrid_scratch__/tmpgn_6xbst.json

If the job fails, you can resume by specifying that checkpoint file as follows:

.. code-block:: console

    $ dsgrid query project map-dataset my-project-id my-dataset-id \
        --mapping-plan plan.json5 \
        --checkpoint-file /kfs3/scratch/dthom/dsgrid-project/__dsgrid_scratch__/tmpgn_6xbst.json

Note that the checkpoint file records which mapping operations completed and contains a
reference to the persisted table. You can use that table to perform your own debugging, as
shown in the sketch below the checkpoint file contents. You could look at the size and number
of partitions of the table, for example, to see if they are what you expect.

.. code-block:: console

    $ cat /kfs3/scratch/dthom/dsgrid-project/__dsgrid_scratch__/tmpgn_6xbst.json
    {
      "dataset_id": "my-dataset-id",
      "completed_operation_names": [
        "scenario",
        "model_year",
      ],
      "persisted_table_filename": "/kfs3/scratch/dthom/dsgrid-project/__dsgrid_scratch__/tmpcrpladhx.parquet",
      "mapping_plan_hash": "558083c65760db8fc7bcbbaf48cc94fd1364198b941b6ad845213877d794200c",
      "timestamp": "2025-07-08T14:29:21.746195"
    }
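To inspect that persisted table yourself, you can open it with PySpark. This is a minimal
sketch, assuming ``pyspark`` is available in your environment; the path comes from the
``persisted_table_filename`` field in the checkpoint file above.

.. code-block:: python

    # Minimal sketch: inspect a persisted checkpoint table with PySpark.
    from pyspark.sql import SparkSession

    spark = SparkSession.builder.appName("inspect-checkpoint").getOrCreate()

    # Path taken from "persisted_table_filename" in the checkpoint file.
    path = "/kfs3/scratch/dthom/dsgrid-project/__dsgrid_scratch__/tmpcrpladhx.parquet"
    df = spark.read.parquet(path)

    # Check whether the row count and number of partitions match expectations.
    print(f"rows: {df.count()}")
    print(f"partitions: {df.rdd.getNumPartitions()}")
    df.printSchema()

    spark.stop()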