Data File Formats

dsgrid aims to support all data file formats that users need for efficient queries and analysis. If you need a new format, please contact the dsgrid team to discuss it.

Requirements

  1. Metric data should usually be stored in Parquet files. CSV files are also supported. If you need or want another optimized columnar file format, please contact the dsgrid team.

  2. If the data tables contain time-series data, each unique time array must contain an identical range of timestamps.

  3. Values of dimension columns except model_year and weather_year must be strings. model_year and weather_year can be integers.

  4. Each dimension column name except time must match dsgrid dimension types (geography, sector, subsector, etc.) either directly or by specifying a column mapping.

  5. The values in each dimension column must match the dataset’s dimension records.

Recommendations

  1. Enable compression in all Parquet files. Snappy is preferred.

  2. The recommended size of individual Parquet files is 128 MiB. Making the files too big can cause memory issues. Making them too small adds overhead and hurts performance.

  3. Trivial dimensions (one-element records) should not be stored in the data files. They should instead be defined in the dataset config. dsgrid will add them dynamically at runtime.

  4. Floating point data can be 64-bit or 32-bit. 64-bit floats provide more precision but require twice as much storage space as 32-bit floats.

Table Formats

Input datasets can use a one-table or two-table format.

One Table Format

All metric data and dimension records are stored in one Parquet file (or set of Parquet files in a directory).

The following example shows a stacked one-table format. The metric column contains dimension record IDs and the value column contains the data:

load_data.parquet

timestamp

geography

scenario

subsector

metric

value

2012-01-01T00:00:00+00:00

01001

reference

full_service_restaurant

heating

1.234

2012-01-01T00:00:00+00:00

01001

reference

full_service_restaurant

cooling

0.002

2012-01-01T00:00:00+00:00

01001

reference

full_service_restaurant

interior_equipment

0.051

2012-01-01T00:00:00+00:00

01001

reference

primary_school

heating

2.345

2012-01-01T01:00:00+00:00

01001

reference

full_service_restaurant

heating

1.456

Note

This example omits the sector, model_year, and weather_year columns because they are trivial dimensions — each has only a single record (e.g., sector = "com", model_year = 2020, weather_year = 2012). Trivial dimensions are declared in the dataset config and added by dsgrid at runtime, so they do not need to appear in the data files.

Two Table Format

Two Parquet files comprise the dataset: a data table with time-series metric values and a lookup table that maps an integer ID to combinations of dimension records.

Data table (load_data.parquet) — required columns and content:

  • id (integer) — links each row to a dimension-record combination in the lookup table.

  • Time columns — one or more columns representing time (e.g., timestamp). See Time Formats below.

  • Value data — either a single value column (stacked) or one column per metric record ID (pivoted). See Value Formats below.

  • Dimension columns are optional. Any non-time, non-trivial dimension not present here must appear in the lookup table. In practice, most dimensions go in the lookup table.

Lookup table (load_data_lookup.parquet) — required columns and content:

  • id (integer) — matches the IDs in the data table.

  • Every non-time, non-trivial dimension that is not already a column in the data table. This typically includes all non-time dimensions (geography, sector, subsector, metric, scenario, model_year, weather_year, etc.).

  • scaling_factor (float, optional) — see Lookup Table with Scaling Factor below.

Pivoted, No Scaling Factor

load_data.parquet — Time-series values for all metric data. This example pivots the metric dimension, so the metric record IDs (heating, cooling, interior_equipment) appear as column names. The id column links each row to a dimension record combination in the lookup table:

load_data.parquet (time-series metric values; example shows a pivoted metric dimension)

timestamp

id

heating

cooling

interior_equipment

2012-01-01T00:00:00+00:00

1

0.214

0.002

0.051

2012-01-01T01:00:00+00:00

1

0.329

0.000

0.051

2012-01-01T02:00:00+00:00

1

0.369

0.000

0.066

2012-01-01T00:00:00+00:00

2

1.023

0.015

0.102

2012-01-01T01:00:00+00:00

2

1.156

0.012

0.102

load_data_lookup.parquet — Maps each id to a combination of dimension records:

load_data_lookup.parquet

id

geography

sector

subsector

scenario

model_year

weather_year

1

01001

com

full_service_restaurant

reference

2020

2012

2

01001

com

primary_school

reference

2020

2012

3

01003

com

full_service_restaurant

reference

2020

2012

4

01003

com

primary_school

reference

2020

2012

Note

All non-time, non-trivial dimensions should appear in the lookup table. Trivial dimensions can still be omitted and declared in the dataset config. In this example, if sector, model_year, and weather_year were all trivial, the lookup table would only need id, geography, subsector, and scenario columns.

Each unique time array in load_data must be denoted with an integer ID that corresponds to a record in load_data_lookup. The ID is user-defined. Users may want to use a sequentially-increasing integer or encode other information into specific bytes/digits of each integer.

Lookup Table with Scaling Factor

The lookup table may optionally include a scaling_factor column. When present, dsgrid multiplies each value column by the row’s scaling factor at query time and then drops the scaling_factor column from the result.

This is useful when the data table stores normalized profiles and the actual magnitude varies by dimension combination. For example, a distributed generation dataset might store a single set of hourly capacity-factor profiles in the data table, while the lookup table records the installed capacity (in kW) for each geography. At query time, dsgrid multiplies the normalized profile by the installed capacity to produce absolute generation values. This format avoids duplicating 8,760 hourly rows for every geography.

load_data_lookup.parquet with scaling_factor

id

geography

sector

subsector

model_year

scaling_factor

1

01001

com

rooftop_pv

2020

10.5

1

01001

com

rooftop_pv

2025

102.3

1

01001

com

rooftop_pv

2030

245.7

Multiple rows can share the same hourly profile shape in load_data by referencing the same ID, but produce different absolute values because their scaling factors differ.

Note

If scaling_factor is null for a given row, the value passes through unchanged (i.e., no multiplication is applied). A scaling factor of 1.0 explicitly multiplies by one, which also leaves the value unchanged. Either is acceptable for rows that do not need scaling.

This format minimizes file storage because:

  1. Time arrays can be shared across combinations of dimension records, each with its own scaling factor.

  2. Dimension information is not repeated for every timestamp. (This could be minimal because of compression inside the Parquet files.)

Value Formats

Both table formats support two value layouts:

Stacked

Each non-time dimension has its own column, plus a single value column containing the data. A metric column identifies which metric each row represents. This format is a good default choice.

The one-table example above uses stacked format. The same layout works with two-table format: the value, metric, and any time columns appear in the data table, and other dimensions go in the lookup table.

Pivoted

The record IDs of one dimension become column names in the data table, and each row contains all values for that combination of other (non-pivoted) dimensions at once. In practice, the pivoted dimension is almost always metric. The code allows pivoting on other dimensions, but this is uncommon.

The two-table example above uses pivoted format. Here is a one-table pivoted equivalent in which the metric and value columns are replaced by one column per metric record ID:

One-table pivoted example

timestamp

geography

scenario

subsector

heating

cooling

interior_equipment

2012-01-01T00:00:00+00:00

01001

reference

full_service_restaurant

1.234

0.002

0.051

2012-01-01T00:00:00+00:00

01001

reference

primary_school

2.345

0.008

0.073

2012-01-01T01:00:00+00:00

01001

reference

full_service_restaurant

1.456

0.003

0.049

Pivoted format saves storage space by avoiding repeated dimension values across rows, but can make inspection and data operations more complex because the column names carry semantic meaning. dsgrid converts pivoted data to stacked format during registration, so all downstream processing and queries operate on stacked data.

Data Layout

The data_layout section of a dataset configuration defines the data file locations, table format, and value format. It has the following structure:

data_layout: {
  table_format: "two_table",       // or "one_table"
  value_format: "pivoted",         // or "stacked"
  pivoted_dimension_type: "metric",  // required if value_format is "pivoted"
  data_file: {
    path: "load_data.parquet",
    columns: [                     // optional
      {
        name: "column_name",       // actual name in the file
        data_type: "STRING",       // optional, for type override
        dimension_type: "geography",  // optional, for column renaming
      },
    ],
    ignore_columns: ["col1", "col2"],  // optional, columns to drop
  },
  lookup_data_file: {              // required for "two_table" format
    path: "load_data_lookup.parquet",
    columns: [],                   // optional, same structure as data_file
    ignore_columns: [],            // optional, same as data_file
  },
  missing_associations: [          // optional, list of paths
    "missing_associations.parquet",
    "additional_missing",
  ],
}

Fields:

  • table_format: Defines the table structure: "two_table" or "one_table".

  • value_format: Defines whether values are "pivoted" or "stacked".

  • pivoted_dimension_type: The dimension type whose records are columns (required when value_format is "pivoted").

  • data_file: Main data file configuration (required).

    • path: Path to the data file. Can be absolute or relative to the config file. You can also use the --data-base-dir CLI option to specify a different base directory for resolving relative paths.

    • columns: Optional list of column definitions for type overrides and renaming.

      • name: The actual column name in the file (required).

      • data_type: Data type override (optional). See Specifying Column Data Types for supported types.

      • dimension_type: The dsgrid dimension type this column represents (optional, but required if this column is not a time column and name is not already a dimension type). When specified, the column will be renamed to match the dimension type.

    • ignore_columns: Optional list of column names to drop when reading the file. Cannot overlap with columns defined in columns.

  • lookup_data_file: Lookup file configuration (required for two_table format). Has the same structure as data_file.

  • missing_associations: List of paths to files or directories defining missing dimension combinations (optional). Paths can be absolute or relative to the config file. You can also use the --missing-associations-base-dir CLI option to specify a different base directory for resolving relative paths. See How to Handle Missing Dimension Associations for details.

Column Configuration

CSV Files

While not generally recommended, dsgrid does support CSV data files. By default, dsgrid will let Spark and DuckDB attempt to infer the schema of the file. Because there may be cases of type ambiguities, such as integer vs string, integer vs float, and timestamps with time zones, dsgrid provides a mechanism for defining column data types directly in the dataset configuration.

Consider this example that uses county FIPS codes to identify the geography of each data point:

timestamp

geography

scenario

subsector

value

2011-12-31T22:00:00-07:00

01001

efs_high_ldv

full_service_restaurant

1.234

2011-12-31T22:00:00-07:00

01001

efs_high_ldv

primary_school

2.345

The default behavior of IO libraries like Pandas, Spark, and DuckDB is to infer data types by inspecting the data. They will all decide that the geography column contains integers and drop the leading zeros. This will result in an invalid geography column, which is required to be a string data type, and will not match the project’s geography dimension (assuming that the project is also defined over county FIPS codes).

Secondly, you may want to specify the minimum required size for each number. For example, if you don’t need the precision that comes with 8-byte floats, choose FLOAT and Spark/DuckDB will store all values in 4-byte floats, halving the required storage size.

Specifying Column Data Types

To specify column data types, add a columns field to the data_file (or lookup_data_file) section in your dataset configuration. You can specify types for all columns or just a subset — columns without explicit types will have their types inferred.

The supported data types are (case-insensitive):

  • BOOLEAN: boolean

  • INT: 4-byte integer

  • INTEGER: 4-byte integer

  • TINYINT: 1-byte integer

  • SMALLINT: 2-byte integer

  • BIGINT: 8-byte integer

  • FLOAT: 4-byte float

  • DOUBLE: 8-byte float

  • STRING: string

  • TEXT: string

  • VARCHAR: string

  • TIMESTAMP_TZ: timestamp with time zone

  • TIMESTAMP_NTZ: timestamp without time zone

Example dataset configuration with column data types:

data_layout: {
  table_format: "one_table",
  value_format: "stacked",
  data_file: {
    path: "load_data.csv",
    columns: [
      {
        name: "timestamp",
        data_type: "TIMESTAMP_TZ",
      },
      {
        name: "geography",
        data_type: "STRING",
      },
      {
        name: "scenario",
        data_type: "STRING",
      },
      {
        name: "subsector",
        data_type: "STRING",
      },
      {
        name: "value",
        data_type: "FLOAT",
      },
    ],
  },
}

You can also specify types for only the columns that need explicit typing:

data_file: {
  path: "load_data.csv",
  columns: [
    {
      name: "geography",
      data_type: "STRING",  // Prevent FIPS codes from being read as integers
    },
  ],
}

Custom Column Names

By default, dsgrid expects data files to have columns named after the standard dimension types (geography, sector, subsector, metric, etc.). Data files with value format “stacked” are also expected to have a value column. However, your data files may use different column names. dsgrid provides a mechanism to map custom column names to the expected dimension types.

To rename columns, add the dimension_type field to the column definition. This tells dsgrid what dimension the column represents, and dsgrid will automatically rename it at runtime.

This feature works for all file formats (Parquet, CSV), not just CSV files.

Example with custom column names:

data_file: {
  path: "load_data.parquet",
  columns: [
    {
      name: "county",           // Actual column name in the file
      dimension_type: "geography",  // Will be renamed to "geography"
    },
    {
      name: "end_use",          // Actual column name in the file
      dimension_type: "metric",     // Will be renamed to "metric"
    },
    {
      name: "building_type",    // Actual column name in the file
      dimension_type: "subsector",  // Will be renamed to "subsector"
    },
  ],
}

You can combine dimension_type with data_type when using CSV files:

data_file: {
  path: "load_data.csv",
  columns: [
    {
      name: "fips_code",
      data_type: "STRING",
      dimension_type: "geography",
    },
    {
      name: "value",
      data_type: "DOUBLE",
    },
  ],
}

Ignoring Columns

Data files may contain columns that are not needed for dsgrid processing. Rather than modifying your source files, you can tell dsgrid to ignore (drop) specific columns when reading the data.

To ignore columns, add an ignore_columns field to the data_file (or lookup_data_file) section in your dataset configuration. This field accepts a list of column names to drop.

This feature works for all file formats (Parquet, CSV).

Example with ignored columns:

data_file: {
  path: "load_data.parquet",
  ignore_columns: ["internal_id", "notes"],
}

You can combine ignore_columns with columns for type overrides and renaming:

data_file: {
  path: "load_data.csv",
  columns: [
    {
      name: "county",
      data_type: "STRING",
      dimension_type: "geography",
    },
    {
      name: "value",
      data_type: "DOUBLE",
    },
  ],
  ignore_columns: ["internal_id", "notes"],
}

Note

A column cannot appear in both columns and ignore_columns — dsgrid will raise an error if there is any overlap.

Time

dsgrid automates timeseries data alignment by requiring datasets and projects to clearly specify time conventions in their respective configuration files and then applying standardized transformations to align dataset time with project time. Project time is typically hourly, aligned with one or more weather years, and defined over a specific time zone that uses standard time (no daylight saving time). For interoperability, dsgrid stores timestamps in UTC.

Datasets might have timestamps already aligned for a single time zone, or they might be in local standard time, they might use representative time (daily load shapes for weekdays and weekends varying by month), etc. The intention is for dsgrid to handle the time transformations implied by dataset-project time mismatches, because these are typically a time sink and common source of errors. Thus if your dataset’s time convention is not yet supported, please contact the dsgrid team to discuss.

Time Zones

Timestamps must be converted to UTC when written to the Parquet files. Do not use the Pandas feature where it records time zone information into the Parquet metadata.

We recommend that you use Spark to create the Parquet files, but that is not required. If you do use Spark, note the following:

  • Spark implicitly interprets timestamps in the time zone of the current SQL session and converts them to UTC when writing dataframes to Parquet.

  • You can override the SQL session time zone programmatically or in your Spark configuration file. The setting is spark.sql.session.timeZone.

Time Zone Aware Timestamps

dsgrid can convert timestamps in data tables to the proper time zone by looking up the time dimension.

Time Zone Unaware Timestamps

Time-zone-unaware timestamps that will be interpreted as local time should be written as UTC timestamps (i.e., 12pm with no time zone should be written as 12pm UTC).

To resolve local times to absolute timestamps, dsgrid needs a time zone for each geographic location. There are two ways to provide this:

  • Dataset geography records (default) — include a time_zone column with IANA time zone strings (e.g., US/Eastern) in the dataset’s geography dimension records file.

  • Project geography records — set use_project_geography_time_zone: true in the dataset config. dsgrid will look up time zones from the project’s geography dimension instead. This is useful when the dataset’s geography records do not include time zone information.

See Dataset Concepts — Configuration Options for more on this setting.

Time Formats

DateTime

Datetime time dimensions use the column_format field in the time dimension config to describe how timestamps are represented in the data files. Three formats are supported:

Timezone-aware timestamps (TIMESTAMP_TZ)

This is the default. The load data table has one column representing time (typically called timestamp). When written to Parquet files the type should be the TIMESTAMP logical type (integer, not string) and be adjusted to UTC. When read into Spark the type should be TimestampType.

// Time dimension config (default — can be omitted)
column_format: {
  dtype: "TIMESTAMP_TZ",
  time_column: "timestamp",
}

The one-table and two-table examples above both use this format — note the UTC-offset suffix on each timestamp (e.g., 2012-01-01T00:00:00+00:00).

Timezone-naive timestamps (TIMESTAMP_NTZ)

For datasets where timestamps are local time without an explicit time zone. The data must include a time_zone column (or the time zone must be derivable from the geography dimension) so dsgrid can localize the values. When read into Spark the type should be TimestampNTZType.

column_format: {
  dtype: "TIMESTAMP_NTZ",
  time_column: "timestamp",
}

In the data file the timestamp column contains naive timestamps with no time zone suffix:

TIMESTAMP_NTZ data file example (two-table, pivoted)

timestamp

id

heating

cooling

2012-01-01 00:00:00

1

0.214

0.002

2012-01-01 01:00:00

1

0.329

0.000

Time-in-parts

For datasets that store time components in separate columns (year, month, day, and optionally hour) rather than a single timestamp column. During registration, dsgrid automatically converts these columns into a single timestamp column.

column_format: {
  dtype: "time_format_in_parts",
  year_column: "year",
  month_column: "month",
  day_column: "day",
  hour_column: "hour",    // optional — omit for daily data
  time_zone: "US/Eastern", // optional — if omitted, derived from geography
}

When time_zone is specified, the resulting timestamps are timezone-aware (TIMESTAMP_TZ) and the year, month, day, and hour columns are assumed to already be aligned in the specified time zone (US/Eastern in the example).

When time_zone is omitted, dsgrid assumes that year, month, day, and hour correspond to local time. To create specific timestamps in this case, dsgrid joins the geography dimension’s records onto the data and pulls in the time_zone column — this requires that the geography dimension’s records file includes a column named time_zone containing IANA time zone strings. The resulting data has a naive timestamp column (TIMESTAMP_NTZ) alongside the time_zone column, such that the combination is a fully specified point in time.

In the data file, time components appear as separate integer columns:

Time-in-parts data file example (two-table, stacked)

year

month

day

hour

id

metric

value

2012

1

1

0

1

heating

0.214

2012

1

1

0

1

cooling

0.002

2012

1

1

1

1

heating

0.329

Note

The time-in-parts columns are dropped from the data during registration and replaced by a single timestamp column. The original data files are not modified.

Datetime with External Time Zone

For datasets where timestamps are timezone-naive and the time zone is derived from the geography dimension. This uses a dedicated time dimension model (DatetimeExternalTimeZoneDimensionModel) rather than the column_format field used by the standard DateTime format.

The data files look the same as TIMESTAMP_NTZ — naive timestamps with no time zone suffix. The geography dimension’s records must include a time_zone column with IANA time zone strings so dsgrid can localize the timestamps.

// Time dimension config
time_type: "datetime_external_tz",
time_zone_format: {
  format_type: "aligned_in_clock_time",
  time_zones: ["US/Eastern", "US/Central", "US/Mountain", "US/Pacific"],
},
ranges: [{
  start: "2012-01-01 00:00:00",
  end: "2012-12-31 23:00:00",
  str_format: "%Y-%m-%d %H:%M:%S",
  frequency: "01:00:00",
}],
time_interval_type: "period_beginning",
measurement_type: "total",

See DatetimeExternalTimeZoneDimensionModel for all config fields.

Annual

Annual time dimensions contain one value per year per dimension combination. The time dimension config specifies the range of years:

// Time dimension config
time_type: "annual",
ranges: [{
  start: "2020",
  end: "2030",
  str_format: "%Y",
  frequency: 1,
}],
measurement_type: "total",

The data file includes a model_year column with integer year values:

Annual data file example (one-table, stacked)

model_year

geography

subsector

metric

value

2020

01001

rooftop_pv

capacity_kw

10.5

2021

01001

rooftop_pv

capacity_kw

25.3

2022

01001

rooftop_pv

capacity_kw

42.7

2020

01003

rooftop_pv

capacity_kw

8.1

See AnnualTimeDimensionModel for all config fields.

Index

Index time dimensions use integer time steps rather than timestamp values. The time dimension config maps each index to a point in time via a starting timestamp and frequency:

// Time dimension config
time_type: "index",
ranges: [{
  start: 0,
  end: 8783,
  starting_timestamp: "2012-01-01 00:00:00",
  str_format: "%Y-%m-%d %H:%M:%S",
  frequency: "P0DT1H0M0.000000S",  // 1 hour
}],
time_interval_type: "period_beginning",
measurement_type: "total",

In the data file the time column is an integer named time_index:

Index data file example (two-table, pivoted)

time_index

id

heating

cooling

0

1

0.214

0.002

1

1

0.329

0.000

2

1

0.369

0.000

8783

1

0.198

0.001

During registration, dsgrid converts the integer indices to timestamps using the config’s starting timestamp and frequency. The original integer column is replaced by a proper timestamp column in the registered data.

See IndexTimeDimensionModel for all config fields.

Representative Period

Metric data contains timestamps that represent multiple periods. dsgrid supports the following formats:

one_week_per_month_by_hour

Each time array contains one week of hourly data (24 hours per day) that applies to an entire month. The times represent local time (no time zone). There are no shifts, missing hours, or extra hours for daylight savings time.

  • All time columns must be integers.

  • month is one-based, starting in January. Jan → 1, Feb → 2, etc.

  • day_of_week is zero-based, starting on Monday. Mon → 0, Tue → 1, etc.

  • hour is zero-based, starting at midnight.

one_week_per_month_by_hour (two-table, pivoted metrics)

id

month

day_of_week

hour

L1andL2

DCFC

1

1

0

0

0.523

0.102

1

1

0

1

0.498

0.095

1

1

0

2

0.412

0.078

1

1

0

3

0.387

0.064

1

1

6

23

0.601

0.134

1

2

0

0

0.545

0.110

Note

This example uses the two-table format with pivoted metrics (L1andL2, DCFC). The id column links to the lookup table, which maps each ID to a combination of geography, scenario, subsector, etc. Only a subset of rows is shown — a complete dataset would have 2,016 rows per ID (12 months × 7 days × 24 hours).

one_weekday_day_and_one_weekend_day_per_month_by_hour

Each time array contains one representative weekday and one representative weekend day of hourly data for each month. The times represent local time (no time zone).

  • All time columns must be integers except is_weekday, which is boolean.

  • month is one-based, starting in January. Jan → 1, Feb → 2, etc.

  • hour is zero-based, starting at midnight.

  • is_weekday is true for the weekday profile and false for the weekend profile.

one_weekday_day_and_one_weekend_day_per_month_by_hour (one-table, stacked)

month

is_weekday

hour

geography

metric

value

1

false

0

06037

electricity

0.594

1

false

1

06037

electricity

0.162

1

false

2

06037

electricity

0.379

1

true

0

06037

electricity

0.612

Note

A complete dataset has 576 time rows per dimension combination (12 months × 2 day types × 24 hours).

See RepresentativePeriodTimeDimensionModel and RepresentativePeriodFormat for all config fields and supported formats.

NoOp (No Time Dimension)

For time-invariant datasets that have no time component — for example, annual growth factors or static capacity values. The data files contain no time columns.

// Time dimension config
time_type: "noop",
NoOp data file example (one-table, stacked)

geography

subsector

metric

value

01001

full_service_restaurant

growth_factor

1.05

01001

primary_school

growth_factor

1.02

01003

full_service_restaurant

growth_factor

1.08

See NoOpTimeDimensionModel for all config fields.