.. _dataset-formats:

***************
Dataset Formats
***************

dsgrid aims to support all dataset formats that users need for efficient queries and analysis. If
you need a new format, please contact the dsgrid team to discuss it.

Requirements
============

1. Metric data should usually be stored in Parquet files. Tiny datasets like those for growth
   rates can be stored in CSV files. If you need or want another optimized columnar file format,
   please contact the dsgrid team.
2. If the data tables contain time-series data, each unique time array must contain an identical
   range of timestamps.
3. Values of dimension columns must be strings. This includes ``model_year`` and
   ``weather_year``.
4. Each dimension column name except time must match a dsgrid dimension type (geography, sector,
   subsector, etc.).
5. The values in each dimension column must match the dataset's dimension records.

Recommendations
===============

1. Enable compression in all Parquet files. ``Snappy`` is preferred.
2. The recommended size of individual Parquet files is 128 MiB. Making the files too big can
   cause memory issues. Making them too small adds overhead and hurts performance.
3. Trivial dimensions (one-element records) should not be stored in the data files. They should
   instead be defined in the dataset config. dsgrid will add them dynamically at runtime.
4. Consider the appropriate floating-point precision. 64-bit floats may be needed but will
   double the storage space. 32-bit floats may be acceptable.

.. warning:: Currently, the pivoted dimension must be the metric dimension. This limitation is
   expected to be fixed soon.

Time
====

Time zones
----------

Timestamps must be converted to UTC when written to the Parquet files. Do not use the Pandas
feature that records time zone information in the Parquet metadata.

We recommend that you use Spark to create the Parquet files, but that is not required. If you do
use Spark, note the following (a sketch follows the list):

- Spark implicitly interprets timestamps in the time zone of the current SQL session and
  converts them to UTC when writing dataframes to Parquet.
- You can override the SQL session time zone programmatically or in your Spark configuration
  file. The setting is ``spark.sql.session.timeZone``.
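For illustration, here is a minimal PySpark sketch of that pattern: set
``spark.sql.session.timeZone`` to UTC so that no implicit local-time conversion happens, then
write the dataframe to Parquet. The file names and app name are hypothetical; only the
configuration key itself comes from Spark.

.. code-block:: python

    from pyspark.sql import SparkSession

    # Force Spark to interpret naive timestamps as UTC so that no implicit
    # local-time conversion happens when the dataframe is written to Parquet.
    spark = (
        SparkSession.builder
        .appName("write_dsgrid_dataset")  # hypothetical app name
        .config("spark.sql.session.timeZone", "UTC")
        .getOrCreate()
    )

    df = spark.read.parquet("raw_data.parquet")  # hypothetical input file

    # Snappy compression is the recommended codec (see Recommendations above).
    df.write.option("compression", "snappy").parquet("load_data.parquet")

The same setting can instead be placed in your Spark configuration file so that every session
uses UTC by default.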
Time zone aware timestamps
~~~~~~~~~~~~~~~~~~~~~~~~~~

``dsgrid`` can convert timestamps in data tables to the proper time zone by looking up the time
dimension.

Time zone unaware timestamps
~~~~~~~~~~~~~~~~~~~~~~~~~~~~

Time-zone-unaware timestamps that will be interpreted as local time should be written as UTC
timestamps (i.e., 12pm with no time zone should be written as 12pm UTC).

Formats
=======

Input datasets can use a one-table or two-table format, as described below. Both formats
optionally support pivoting the record IDs of one dimension.

- ``Pivoted``: All dimensions except the pivoted dimension are columns in the table. The record
  IDs of the pivoted dimension are columns in the table. Several dsgrid datasets pivot the
  metric dimension in order to avoid many repeated rows of other dimensions. This saves storage
  space but can make queries more complicated. dsgrid handles that complexity on the back end,
  but this point can still apply to users who inspect the raw datasets.
- ``Unpivoted``: The table has one column per dimension (except time, which might have more than
  one column) and a column called ``value`` that contains the data values. This format makes
  queries simpler. It is also good for cases when there is not a sensible dimension to pivot.

.. _one-table-format:

One Table Format
----------------

All metric data and dimension records are stored in one Parquet file.

::

    +-------------------+---------+------------------+--------------------+
    |          timestamp|geography|          scenario|           subsector|
    +-------------------+---------+------------------+--------------------+
    |2011-12-31 22:00:00|    01001|      efs_high_ldv|full_service_rest...|
    |2011-12-31 22:00:00|    01001|      efs_high_ldv|      primary_school|
    |2011-12-31 22:00:00|    01001|      efs_high_ldv|quick_service_res...|
    |2011-12-31 22:00:00|    01001|      efs_high_ldv|   retail_standalone|
    |2011-12-31 22:00:00|    01001|      efs_high_ldv|    retail_stripmall|
    |2011-12-31 22:00:00|    01001|      efs_high_ldv|         small_hotel|
    |2011-12-31 22:00:00|    01001|      efs_high_ldv|        small_office|
    |2011-12-31 22:00:00|    01001|      efs_high_ldv|           warehouse|
    |2011-12-31 22:00:00|    01001|ldv_sales_evs_2035|full_service_rest...|
    |2011-12-31 22:00:00|    01001|ldv_sales_evs_2035|      primary_school|
    |2011-12-31 22:00:00|    01001|ldv_sales_evs_2035|quick_service_res...|
    |2011-12-31 22:00:00|    01001|ldv_sales_evs_2035|   retail_standalone|
    |2011-12-31 22:00:00|    01001|ldv_sales_evs_2035|    retail_stripmall|
    |2011-12-31 22:00:00|    01001|ldv_sales_evs_2035|         small_hotel|
    |2011-12-31 22:00:00|    01001|ldv_sales_evs_2035|        small_office|
    |2011-12-31 22:00:00|    01001|ldv_sales_evs_2035|           warehouse|
    |2011-12-31 22:00:00|    01001|         reference|full_service_rest...|
    |2011-12-31 22:00:00|    01001|         reference|      primary_school|
    |2011-12-31 22:00:00|    01001|         reference|quick_service_res...|
    |2011-12-31 22:00:00|    01001|         reference|   retail_standalone|
    +-------------------+---------+------------------+--------------------+

.. _two-table-format:

Two Table Format (Standard)
---------------------------

Two Parquet files comprise the dataset:

- ``load_data.parquet``: Metric data, usually with time-series data. This example pivots the
  metric dimension records.

  ::

      +-------------------+----+-------------------+--------------------+--------------------+
      |          timestamp|  id|            heating|             cooling|  interior_equipment|
      +-------------------+----+-------------------+--------------------+--------------------+
      |2012-01-01 00:00:00|9106| 0.2143171631469727|0.001987764734408426|0.051049410357755676|
      |2012-01-01 01:00:00|9106| 0.3290653818000351|9.035294172606012E-5|0.051049410357755676|
      |2012-01-01 02:00:00|9106|0.36927244565896444|                 0.0| 0.06622870555383997|
      |2012-01-01 03:00:00|9106| 0.3869816717726735|                 0.0| 0.06622870555383997|
      |2012-01-01 04:00:00|9106| 0.3872526947775441|                 0.0| 0.06622870555383997|
      +-------------------+----+-------------------+--------------------+--------------------+

- ``load_data_lookup.parquet``: Metadata that connects dimension records with the metric data.
  Must include a row with a null ``id`` for every combination of required dimensions that does
  not exist in ``load_data``.

  ::

      +---------+------+----------+-------+
      |geography|sector| subsector|     id|
      +---------+------+----------+-------+
      |    53061|   com|  Hospital|      1|
      |    53053|   com|  Hospital|      2|
      |    53005|   com|  Hospital|      3|
      |    53025|   com|  Hospital|      4|
      |    53045|   com|  Hospital|      5|
      +---------+------+----------+-------+

Each unique time array in ``load_data`` must be denoted by an ID that corresponds to a record in
``load_data_lookup``. The ID is user-defined. Users may want to use a sequentially increasing
integer or encode other information into specific bytes of each integer.

The table may optionally include the column ``scaling_factor`` to account for cases where the
value columns need to be multiplied by a scaling factor. If ``scaling_factor`` does not apply,
the value in the row can be ``1.0`` or ``null``.
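To illustrate how the two tables fit together, here is a sketch (assuming PySpark and the example
columns above) that joins ``load_data`` to ``load_data_lookup`` on ``id``, drops the null-``id``
placeholder rows, and applies ``scaling_factor`` to one of the pivoted value columns. This is a
plausible way to explore a raw dataset, not dsgrid's internal implementation.

.. code-block:: python

    import pyspark.sql.functions as F
    from pyspark.sql import SparkSession

    spark = SparkSession.builder.getOrCreate()
    load_data = spark.read.parquet("load_data.parquet")
    lookup = spark.read.parquet("load_data_lookup.parquet")

    # Null-id rows mark dimension combinations with no data; an inner join
    # drops them while attaching dimension records to each time array.
    df = load_data.join(lookup.where(F.col("id").isNotNull()), on="id")

    # Treat a null scaling_factor as 1.0 (no scaling) and apply it to a
    # value column from the example above.
    factor = F.coalesce(F.col("scaling_factor"), F.lit(1.0))
    df = df.withColumn("cooling", F.col("cooling") * factor)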
This format minimizes file storage because

1. Time arrays can be shared across combinations of dimension records, possibly with different
   scaling factors.
2. Dimension information is not repeated for every timestamp. (This benefit could be minimal
   because of compression inside the Parquet files.)

Time Formats
============

DateTime
--------

The load data table has one column representing time, typically called ``timestamp``. When
written to Parquet files, the type should be the ``TIMESTAMP`` logical type (integer, not
string), adjusted to UTC. When read into Spark, the type should be ``TimestampType`` (not
``TimestampNTZType``).

Handling of no-time-zone timestamps (Spark type ``TimestampNTZType``) is possible. Contact the
dsgrid team if you need this.

Annual
------

Load data contains one value per model year.

::

    [2020, 2021, 2022]

Representative Period
---------------------

Metric data contains timestamps that represent multiple periods. dsgrid supports the following
formats:

one_week_per_month_by_hour
~~~~~~~~~~~~~~~~~~~~~~~~~~

Each time array contains one week of hourly data (24 hours per day) that applies to an entire
month. The times represent local time (no time zone). There are no shifts, missing hours, or
extra hours for daylight saving time.

- All time columns must be integers.
- ``month`` is one-based, starting in January. ``Jan`` -> 1, ``Feb`` -> 2, etc.
- ``day_of_week`` is zero-based, starting on Monday. ``Mon`` -> 0, ``Tue`` -> 1, etc.
- ``hour`` is zero-based, starting at midnight.

::

    +---+-----+-----------+----+--------+
    | id|month|day_of_week|hour|dim_col1|
    +---+-----+-----------+----+--------+
    |  1|    4|          0|   0|     1.0|
    |  1|    4|          0|   1|     1.0|
    |  1|    4|          0|   2|     1.0|
    |  1|    4|          0|   3|     1.0|
    |  1|    4|          0|   4|     1.0|
    |  1|    4|          0|   5|     1.0|
    |  1|    4|          0|   6|     1.0|
    |  1|    4|          0|   7|     1.0|
    |  1|    4|          0|   8|     1.0|
    |  1|    4|          0|   9|     1.0|
    |  1|    4|          0|  10|     1.0|
    |  1|    4|          0|  11|     1.0|
    |  1|    4|          0|  12|     1.0|
    |  1|    4|          0|  13|     1.0|
    |  1|    4|          0|  14|     1.0|
    |  1|    4|          0|  15|     1.0|
    |  1|    4|          0|  16|     1.0|
    |  1|    4|          0|  17|     1.0|
    |  1|    4|          0|  18|     1.0|
    |  1|    4|          0|  19|     1.0|
    |  1|    4|          0|  20|     1.0|
    |  1|    4|          0|  21|     1.0|
    |  1|    4|          0|  22|     1.0|
    |  1|    4|          0|  23|     1.0|
    |  1|    4|          1|   0|     1.0|
    +---+-----+-----------+----+--------+

dsgrid can add support for other period formats. Please submit requests as needed.
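To make the ``one_week_per_month_by_hour`` layout above concrete, here is a sketch that builds
the 168 rows (7 days x 24 hours) of one representative week for a single month with PySpark. The
month, ``id``, and constant ``dim_col1`` value mirror the example table and are purely
illustrative.

.. code-block:: python

    from pyspark.sql import SparkSession

    spark = SparkSession.builder.getOrCreate()

    # One representative week for April (month 4): all time columns are
    # integers, with day_of_week 0 = Monday and hour 0 = midnight.
    rows = [
        (1, 4, dow, hour, 1.0)
        for dow in range(7)
        for hour in range(24)
    ]
    schema = "id int, month int, day_of_week int, hour int, dim_col1 double"
    df = spark.createDataFrame(rows, schema=schema)
    df.write.option("compression", "snappy").parquet("load_data.parquet")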