dsgrid.dataformat package

Submodules

dsgrid.dataformat.datafile module

class dsgrid.dataformat.datafile.Datafile(filepath, sector_enum, geography_enum, enduse_enum, time_enum, loading=False, version='0.4.0')[source]

Bases: collections.abc.Mapping

Create a new Datafile object. Use Datafile.load to open existing files.

Parameters
  • filepath (str) -- file to create. typically has a .dsg file extension.

  • sector_enum (dsgrid.dataformat.enumeration.SectorEnumeration) -- enumeration of sectors to be stored in this Datafile

  • geography_enum (dsgrid.dataformat.enumeration.GeographyEnumeration) -- enumeration of geographies for this Datafile. typically these are geographical units at the same level of resolution. the Datafile does not have to specify values for every geography.

  • enduse_enum (dsgrid.dataformat.enumeration.EndUseEnumerationBase) -- enumeration of end-uses. there are multiple EndUseEnumerationBase class types. typically one would use SingleFuelEndUseEnumeration or MultiFuelEndUseEnumeration.

  • time_enum (dsgrid.dataformat.enumeration.TimeEnumeration) -- enumeration specifying the time resolution of this Datafile

  • loading (bool) -- NOT FOR GENERAL USE -- Use Datafile.load to open existing files.

  • version (str) -- NOT FOR GENERAL USE -- New files are marked with the current VERSION. The load and update methods are used to manage version indicators for backward compatibility.

contains(an_enum)[source]
classmethod load(filepath, upgrade=True, overwrite=False, new_filepath=None, **kwargs)[source]
upgrade(OLD_VERSIONS, overwrite=False, new_filepath=None)[source]

Upgrade this Datafile to the latest version. This method should not usually be called directly; it is called by Datafile.load if upgrade is True.

Parameters
  • OLD_VERSIONS (OrderedDict of {version string : dsgrid.dataformat.upgrade.UpgradeDatafile}) -- import from dsgrid.dataformat.upgrade

  • overwrite (bool) -- if True, the upgraded Datafile overwrites the original

  • new_filepath (str) -- if not overwrite and new_filepath is not None, the new file is saved to new_filepath. If not overwrite and new_filepath is None, the upgraded file is saved to the same directory as the original file, with the new version number appended to the filename and the '.dsg' extension.

save(filepath)[source]

Save self to filepath and return the newly created Datafile.

add_sector(sector_id, enduses=None, times=None)[source]

Adds a SectorDataset to this file and returns it.

map_dimension(filepath, mapping)[source]
scale_data(filepath, factor=0.001)[source]

Scale all the data in self by factor, creating a new HDF5 file and corresponding Datafile.

Parameters
  • filepath (str) -- Location for the new HDF5 file to be created

  • factor (float) -- Factor by which all the data in the file is to be multiplied. The default value of 0.001 corresponds to converting the bottom-up data from kWh to MWh

dsgrid.dataformat.datatable module

class dsgrid.dataformat.datatable.Datatable(datafile, sort=True, verify_integrity=True)[source]

Bases: object

sort()[source]

dsgrid.dataformat.dimmap module

class dsgrid.dataformat.dimmap.DimensionMap(from_enum, to_enum)[source]

Bases: object

map(from_id)[source]

Returns the appropriate to_id.

scale_factor(from_id)[source]
class dsgrid.dataformat.dimmap.TautologyMapping(from_to_enum)[source]

Bases: dsgrid.dataformat.dimmap.DimensionMap

map(from_id)[source]

Returns the appropriate to_id.

class dsgrid.dataformat.dimmap.FullAggregationMap(from_enum, to_enum, exclude_list=[])[source]

Bases: dsgrid.dataformat.dimmap.DimensionMap

map(from_id)[source]

Returns the appropriate to_id.

class dsgrid.dataformat.dimmap.FilterToSubsetMap(from_enum, to_enum)[source]

Bases: dsgrid.dataformat.dimmap.DimensionMap

map(from_id)[source]

Returns the appropriate to_id.

class dsgrid.dataformat.dimmap.FilterToSingleFuelMap(from_enum, fuel_to_keep)[source]

Bases: dsgrid.dataformat.dimmap.DimensionMap

map(from_id)[source]

Returns the appropriate to_id.

class dsgrid.dataformat.dimmap.ExplicitMap(from_enum, to_enum, dictmap)[source]

Bases: dsgrid.dataformat.dimmap.DimensionMap

map(from_id)[source]

Returns the appropriate to_id.

classmethod create_from_csv(from_enum, to_enum, filepath)[source]
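As an illustration of the dictmap that create_from_csv builds, a minimal stand-alone sketch follows; the two-column from_id,to_id layout and the county-to-state ids are assumptions for illustration, not the documented file format:

```python
import csv
import io

def read_dictmap(csv_text):
    """Build a {from_id: to_id} dictmap from two-column CSV text.

    Illustrative sketch only: the real ExplicitMap.create_from_csv reads
    from a filepath and returns an ExplicitMap; the 'from_id,to_id' column
    layout here is an assumption, not the documented format.
    """
    reader = csv.DictReader(io.StringIO(csv_text))
    return {row["from_id"]: row["to_id"] for row in reader}

# Hypothetical county-to-state mapping
text = "from_id,to_id\n08031,CO\n08059,CO\n53033,WA\n"
dictmap = read_dictmap(text)
```

A map(from_id) call then reduces to a dict lookup, e.g. dictmap["08031"] yields "CO".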
class dsgrid.dataformat.dimmap.ExplicitDisaggregation(from_enum, to_enum, dictmap, scaling_datafile=None)[source]

Bases: dsgrid.dataformat.dimmap.ExplicitMap

If no scaling_datafile, scaling factors are assumed to be 1.0.

property default_scaling
property scaling_datatable
get_scalings(to_ids)[source]

Return an array of scalings for to_ids.

classmethod create_from_csv(from_enum, to_enum, filepath, scaling_datafile=None)[source]
class dsgrid.dataformat.dimmap.ExplicitAggregation(from_enum, to_enum, dictmap)[source]

Bases: dsgrid.dataformat.dimmap.ExplicitMap

class dsgrid.dataformat.dimmap.UnitConversionMap(from_enum, from_units, to_units)[source]

Bases: dsgrid.dataformat.dimmap.DimensionMap

Convert from_units to to_units.

Parameters
  • from_enum (EndUseEnumerationBase) --

  • from_units (list of str) -- List of units in from_enum that are to be converted

  • to_units (list of str) -- List of units to convert to. Same length list as from_units.

CONVERSION_FACTORS = {('GWh', 'TWh'): 0.001, ('MWh', 'GWh'): 0.001, ('kWh', 'MWh'): 0.001}
map(from_id)[source]

Returns the appropriate to_id.

scale_factor(from_id)[source]
classmethod scaling_factor(from_unit, to_unit)[source]
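For example, a direct lookup in CONVERSION_FACTORS combined with one-step chaining could implement scaling_factor as sketched below; whether dsgrid actually chains through intermediate units is an assumption here:

```python
CONVERSION_FACTORS = {('GWh', 'TWh'): 0.001, ('MWh', 'GWh'): 0.001, ('kWh', 'MWh'): 0.001}

def scaling_factor(from_unit, to_unit):
    """Factor for converting from_unit to to_unit.

    Sketch only: direct pairs come straight from CONVERSION_FACTORS;
    chaining through intermediate units (e.g. kWh -> GWh via MWh) is an
    assumption about behavior, not confirmed by the documentation.
    """
    if from_unit == to_unit:
        return 1.0
    if (from_unit, to_unit) in CONVERSION_FACTORS:
        return CONVERSION_FACTORS[(from_unit, to_unit)]
    # chain one step at a time, e.g. kWh -> MWh -> GWh
    for (a, b), factor in CONVERSION_FACTORS.items():
        if a == from_unit:
            return factor * scaling_factor(b, to_unit)
    raise ValueError(f"no conversion from {from_unit} to {to_unit}")
```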
class dsgrid.dataformat.dimmap.Mappings[source]

Bases: object

add_mapping(mapping)[source]
get_mapping(datafile, to_enum)[source]

dsgrid.dataformat.enumeration module

class dsgrid.dataformat.enumeration.Enumeration(name, ids, names)[source]

Bases: object

max_id_len = 64
max_name_len = 128
enum_dtype = dtype([('id', 'S64'), ('name', 'S128')])
dimension = None
checkvalues()[source]
get_name(id)[source]
create_subset_enum(ids)[source]

Returns a new enumeration that is a subset of this one, based on keeping the items in ids.

Parameters

ids (list) -- subset of self.ids that should be kept in the new enumeration

Returns

Return type

self.__class__

is_subset(other_enum)[source]

Returns true if this Enumeration is a subset of other_enum.
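A minimal sketch of the create_subset_enum and is_subset semantics, using a bare-bones stand-in class (the real Enumeration also enforces max_id_len/max_name_len and supports HDF5 persistence, which are omitted here):

```python
class MiniEnumeration:
    """Bare-bones stand-in for dsgrid's Enumeration: parallel ids and names."""

    def __init__(self, name, ids, names):
        self.name = name
        self.ids = list(ids)
        self.names = list(names)

    def create_subset_enum(self, keep_ids):
        # keep only the (id, name) pairs whose id is in keep_ids,
        # preserving this enumeration's original order
        keep = set(keep_ids)
        pairs = [(i, n) for i, n in zip(self.ids, self.names) if i in keep]
        return MiniEnumeration(self.name + "_subset",
                               [i for i, _ in pairs],
                               [n for _, n in pairs])

    def is_subset(self, other_enum):
        return set(self.ids) <= set(other_enum.ids)

states = MiniEnumeration("states", ["CO", "WA", "TX"],
                         ["Colorado", "Washington", "Texas"])
west = states.create_subset_enum(["CO", "WA"])
```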

persist(h5group)[source]
classmethod load(h5group)[source]
classmethod read_csv(filepath, name=None)[source]
to_csv(filedir=None, filepath=None, overwrite=False)[source]
class dsgrid.dataformat.enumeration.SectorEnumeration(name, ids, names)[source]

Bases: dsgrid.dataformat.enumeration.Enumeration

dimension = 'sector'
class dsgrid.dataformat.enumeration.GeographyEnumeration(name, ids, names)[source]

Bases: dsgrid.dataformat.enumeration.Enumeration

dimension = 'geography'
class dsgrid.dataformat.enumeration.EndUseEnumerationBase(name, ids, names)[source]

Bases: dsgrid.dataformat.enumeration.Enumeration

dimension = 'enduse'
fuel(id)[source]
units(id)[source]
classmethod load(h5group)[source]
classmethod read_csv(filepath, name=None)[source]

Infer and read into the correct derived class.

class dsgrid.dataformat.enumeration.TimeEnumeration(name, ids, names)[source]

Bases: dsgrid.dataformat.enumeration.Enumeration

dimension = 'time'
class TIMESTAMP_POSITION(value)

Bases: enum.Enum

An enumeration.

period_beginning = 1
period_midpoint = 2
period_ending = 3
TIMEZONE_DISPLAY_NAMES = {'Etc/GMT+5': 'EST', 'Etc/GMT+6': 'CST', 'Etc/GMT+7': 'MST', 'Etc/GMT+8': 'PST'}
TIMEZONE_LOOKUP = {'CST': 'Etc/GMT+6', 'EST': 'Etc/GMT+5', 'MST': 'Etc/GMT+7', 'PST': 'Etc/GMT+8'}
classmethod create(enum_name, start, duration, resolution, extent_timezone=<UTC>, store_timezone=None, timestamp_position=TIMESTAMP_POSITION.period_ending)[source]

Create a new time enumeration based on the specified temporal extents, resolution, and timezone.

Parameters
  • enum_name (str) -- name for this enumeration, ideally descriptive of the parameters used for creation

  • start (datetime.datetime) -- beginning of the time period to be represented by the timestamps

  • duration (datetime.timedelta) -- total length of time to be covered

  • resolution (datetime.timedelta) -- timestep for the enumeration

  • extent_timezone (pytz.timezone) -- timezone that should be used to interpret the extent parameters

  • store_timezone (None or pytz.timezone) -- timezone to write the ids and names in. If None, extent_timezone is used.

  • timestamp_position (TimeEnumeration.TIMESTAMP_POSITION or convertible str) -- whether timestamps are placed at the beginning, midpoint, or ending of the time period being described

Returns

Return type

TimeEnumeration
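The timestamp generation that create performs can be sketched with the standard library; the id string format and the handling of timestamp_position below are assumptions for illustration:

```python
from datetime import datetime, timedelta

def make_time_ids(start, duration, resolution, period_ending=True):
    """Generate timestamp strings covering [start, start + duration].

    Sketch of the idea behind TimeEnumeration.create: with period-ending
    stamps, the first id falls one resolution after start. The exact id
    string format used by dsgrid is an assumption here.
    """
    n = int(duration / resolution)
    offset = resolution if period_ending else timedelta(0)
    return [(start + offset + i * resolution).isoformat(sep=" ")
            for i in range(n)]

# one day of hourly, period-ending timestamps
ids = make_time_ids(datetime(2012, 1, 1),
                    duration=timedelta(days=1),
                    resolution=timedelta(hours=1))
```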

property store_timezone

Examines the first id to determine what timezone this TimeEnumeration is stored in. Assumes the usage of datetime, pytz, and the "standard" timezones, e.g.,

  • pytz.timezone('Etc/GMT+5') = EST

  • pytz.timezone('Etc/GMT+6') = CST

  • pytz.timezone('Etc/GMT+7') = MST

  • pytz.timezone('Etc/GMT+8') = PST

property store_timezone_display_name

Interprets self.ids[0] to report what timezone this enumeration is stored in. Converts from pytz strings to what we typically use, namely EST, CST, MST, or PST.

Returns

timezone this TimeEnumeration is stored in, per self.store_timezone and self.TIMEZONE_DISPLAY_NAMES

Return type

str

property resolution

The resolution of this TimeEnumeration.

Returns

Returns a single value if the intervals are all of the same length. Returns a vector of values if they are different.

Return type

dt.timedelta or array of dt.timedelta

get_extents(report_timezone=None, timestamp_position=TIMESTAMP_POSITION.period_ending)[source]

Returns the inclusive temporal extents represented in this TimeEnumeration. That interpretation requires knowledge of the timestamp_position: beginning, midpoint, or ending of the period being described.

Parameters

report_timezone (pytz.timezone) -- Timezone in which to report out the result

Returns

Tuple of start and end times, inclusive of all time represented based on the timestamp position, and in report_timezone.

Return type

(datetime.datetime,datetime.datetime)
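A sketch of this interpretation for period-ending timestamps; the reading of timestamp_position as shifting the covered extent by one resolution is an assumption based on the description above:

```python
from datetime import datetime, timedelta

def get_extents(timestamps, resolution, period_ending=True):
    """Inclusive extents covered by a list of timestamps.

    Sketch only: a period-ending stamp labels the interval that ends at
    that time, so the covered extent starts one resolution before the
    first stamp. This reading of timestamp_position is an assumption.
    """
    first, last = min(timestamps), max(timestamps)
    if period_ending:
        return first - resolution, last
    return first, last + resolution

# hourly period-ending stamps for Jan 1: 01:00 through 24:00
stamps = [datetime(2012, 1, 1, h) for h in range(1, 24)] + [datetime(2012, 1, 2)]
extent = get_extents(stamps, timedelta(hours=1))
```

Here the stamps label hours 01:00 through 24:00, so the inclusive extent runs from midnight to midnight.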

to_datetime_index(return_timezone=None)[source]

Return a Pandas DatetimeIndex corresponding to this TimeEnumeration. By default, localizes the timestamps to the timezone inferred from the text of the first enumeration id. If return_timezone is None, this localized index is returned; otherwise, the index is converted to return_timezone before being returned.

Parameters

return_timezone (None or pytz.timezone) -- timezone of the returned index. If None, this is inferred from self.ids[0]

Returns

same length as self.ids, but strings are converted to datetime.datetime objects and localized to a timezone.

Return type

pandas.DatetimeIndex

get_datetime_map(return_timezone=None)[source]

Converts self.ids and the result of to_datetime_index into a dict that can be used to map ids to datetimes in contexts other than a single DataFrame index.

Parameters

return_timezone (None or pytz.timezone) -- timezone of the returned index. If None, this is inferred from self.ids[0]

Returns

{id: localized datetime}

Return type

dict

class dsgrid.dataformat.enumeration.EndUseEnumeration(name, ids, names)[source]

Bases: dsgrid.dataformat.enumeration.EndUseEnumerationBase

Provided for backward compatibility with dsgrid v0.1.0 datasets.

fuel(id)[source]
units(id)[source]
classmethod read_csv(filepath, name=None)[source]

Infer and read into the correct derived class.

class dsgrid.dataformat.enumeration.SingleFuelEndUseEnumeration(name, ids, names, fuel='Electricity', units='MWh')[source]

Bases: dsgrid.dataformat.enumeration.EndUseEnumerationBase

If the end-use enumeration only applies to a single fuel type, and all the data is in the same units, just give the fuel and units.

fuel(id)[source]
units(id)[source]
create_subset_enum(ids)[source]

Returns a new enumeration that is a subset of this one, based on keeping the items in ids.

Parameters

ids (list) -- subset of self.ids that should be kept in the new enumeration

Returns

Return type

self.__class__

persist(h5group)[source]
classmethod read_csv(filepath, name=None, fuel='Electricity', units='MWh')[source]

Infer and read into the correct derived class.

to_csv(filedir=None, filepath=None, overwrite=False)[source]
class dsgrid.dataformat.enumeration.FuelEnumeration(name, ids, names, units)[source]

Bases: dsgrid.dataformat.enumeration.Enumeration

dimension = 'fuel'
enum_dtype = dtype([('id', 'S64'), ('name', 'S128'), ('units', 'S64')])
checkvalues()[source]
get_units(id)[source]
create_subset_enum(ids)[source]

Returns a new enumeration that is a subset of this one, based on keeping the items in ids.

Parameters

ids (list) -- subset of self.ids that should be kept in the new enumeration

Returns

Return type

self.__class__

persist(h5group)[source]
classmethod load(h5group)[source]
classmethod read_csv(filepath, name=None)[source]
to_csv(filedir=None, filepath=None, overwrite=False)[source]
class dsgrid.dataformat.enumeration.MultiFuelEndUseEnumeration(name, ids, names, fuel_enum, fuel_ids)[source]

Bases: dsgrid.dataformat.enumeration.EndUseEnumerationBase

enum_dtype = dtype([('id', 'S64'), ('name', 'S128'), ('fuel_id', 'S64')])
checkvalues()[source]
property ids
property names
fuel(id)[source]
units(id)[source]
create_subset_enum(ids)[source]

Returns a new enumeration that is a subset of this one, based on keeping the items in ids.

Parameters

ids (list of 2-tuples) -- subset of self.ids that should be kept in the new enumeration

Returns

Return type

MultiFuelEndUseEnumeration

persist(h5group)[source]
classmethod load(h5group)[source]
classmethod read_csv(filepath, name=None, fuel_enum=None)[source]

id, name, fuel_id + pass in fuel_enum

or

id, name, fuel_id, fuel_name, units

or

id, name, fuel_id, units (and fuel_name will be guessed from fuel_id)
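For example, the third accepted layout (id, name, fuel_id, units) could be parsed as sketched below; the sample rows and the title-casing used to guess fuel names are assumptions for illustration:

```python
import csv
import io

# Hypothetical CSV in the "id, name, fuel_id, units" layout described above
text = (
    "id,name,fuel_id,units\n"
    "heating,Space Heating,gas,MMBtu\n"
    "cooling,Space Cooling,elec,MWh\n"
)

rows = list(csv.DictReader(io.StringIO(text)))

# Per the docstring, fuel names are guessed from fuel_id when absent;
# title-casing is an assumption about how that guess might look.
fuel_names = {r["fuel_id"]: r["fuel_id"].title() for r in rows}
units = {r["fuel_id"]: r["units"] for r in rows}
```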

to_csv(filedir=None, filepath=None, overwrite=False)[source]

dsgrid.dataformat.sectordataset module

class dsgrid.dataformat.sectordataset.Datamap(value)[source]

Bases: object

Map between a Datafile-level enumeration (enum) and a SectorDataset-level sub-enumeration (enum_ids). The sub-enumeration may also have non-unity scaling factors. Per SectorDataset, these Datamaps link each enumeration value with:

  1. a particular index in the dataset along the Enumeration's dimension; and

  2. a scaling parameter to apply to the associated underlying data.

Multiple enumeration values can refer to the same index in the dataset's enumeration dimension, with the option to apply different scaling factors. The index is represented as a 32-bit unsigned integer, which limits dataset size to 2^32 - 2 in each dimension, with NULL_IDX (2^32 - 1) serving as the sentinel value assigned to enumeration values not described in the dataset (looking up data associated with such an enumeration value simply returns zeros).

value

datamap vector of length len(enum.ids) with 'idx' and 'scale' dimensions. For the example of (j, scale) = self.value[i],

  • i = position of enum_id in enum.ids (datafile-level enum)

  • j = position of enum_id in enum_ids (sectordataset-level sub-enum)

  • scale = scaling factor to apply to this enumeration element

We also have j = self.value[i]['idx'], scale = self.value[i]['scale'].

Type

numpy.ndarray
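A sketch of this layout using a numpy structured array; the field names 'idx' and 'scale' come from the description above, while the exact dtype widths are assumptions:

```python
import numpy as np

NULL_IDX = 2**32 - 1  # sentinel: enumeration value has no data in the dataset

# Datafile-level enum has four ids; the dataset only stores two of them.
enum_ids = ["res", "com", "ind", "trans"]
sub_enum_ids = ["res", "com"]

# One (idx, scale) record per datafile-level enum id, initialized to
# "no data". Field widths ('u4', 'f8') are an assumption for illustration.
value = np.zeros(len(enum_ids), dtype=[("idx", "u4"), ("scale", "f8")])
value["idx"] = NULL_IDX

for i, enum_id in enumerate(enum_ids):
    if enum_id in sub_enum_ids:
        # j = position in the sub-enum, with a unity scaling factor
        value[i] = (sub_enum_ids.index(enum_id), 1.0)
```

Looking up value[i] then gives (j, scale) exactly as described above, with value["idx"] equal to NULL_IDX for the ids the dataset does not cover.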

classmethod create(enum, enum_ids, enum_scales=None)[source]
Parameters
  • enum (dsgrid.enumeration.Enumeration) --

  • enum_ids (list) -- List of items in enum.ids

  • enum_scales (None or list) -- if list, is list of floats the same length as enum_ids

Returns

Return type

Datamap

classmethod load(dataset)[source]
Parameters

dataset (h5py.Dataset) -- a Datamap serialized to h5py

Returns

Return type

Datamap

update(dataset)[source]

Updates dataset with this Datamap's value. Overwrites current dataset[:,'idx'] and dataset[:,'scale'].

Parameters

dataset (h5py.Dataset) -- a Datamap serialized to h5py

property num_entries

The number of non-null idx values in this Datamap's value. Corresponds, e.g., to the number of distinct (up to a scaling factor) entries along a dimension.

get_subenum(enum)[source]
Parameters

enum (dsgrid.dataformat.enumeration.Enumeration) -- Datafile-level enumeration

Returns

ids in enum for which there is data in this SectorDataset, returned in the order used by this SectorDataset

Return type

list of enum.ids

is_empty(enum_id, enum)[source]
Parameters
Returns

True if the dataset has no data for enum_id, False otherwise

Return type

bool

get_map(enum)[source]

Get the data in this map in ordered, hashed form, with the dataset-level idx as the key.

Parameters

enum (dsgrid.dataformat.enumeration.Enumeration) -- Datafile-level enumeration

Returns

idx: (list of enum.ids, scales)

Return type

OrderedDict

ids(idx, enum)[source]

Returns the enum.ids that are mapped to the dataset idx

Parameters
  • idx (int) -- dataset-level sub-enumeration index

  • enum (dsgrid.dataformat.Enumeration) -- datafile-level enumeration the sub-enum is based on

Returns

the elements of enum.ids that are mapped to idx, in the order specified by enum.ids

Return type

list of enum.ids

scales(idx)[source]

Returns the scaling factors that correspond to the dataset idx

Parameters

idx (int) -- dataset-level sub-enumeration index

Returns

one for each of the .ids(idx,enum), and in the same order

Return type

list of float

append_element(new_elem_idx, enum_ids, enum, scalings=[])[source]

Appends a new non-null element for this Datamap that defines the index (new_elem_idx) for data that corresponds to enum_ids in enum.

Parameters
  • new_elem_idx (int) -- index value for the new element

  • enum_ids (list) -- list of distinct elements in enum.ids

  • enum (dsgrid.dataformat.enumeration.Enumeration) -- should be same Enumeration originally used to .create this Datamap

  • scalings (list of numeric) -- if empty, scaling factors default to 1.0; otherwise must be the same length as enum_ids

dsgrid.dataformat.sectordataset.append_element_to_dataset_dimension(dataset, new_elem_idx, enum_ids, enum, scalings=[])[source]

Helper method to do all the work of adding a new element to a SectorDataset dimension.

Parameters
  • dataset (h5py.Dataset) -- a Datamap serialized to h5py

  • new_elem_idx (int) -- index value for the new element

  • enum_ids (list) -- list of distinct elements in enum.ids

  • enum (dsgrid.dataformat.enumeration.Enumeration) -- should be same Enumeration originally used to .create this Datamap

  • scalings (list of numeric) -- if empty, scaling factors default to 1.0; otherwise must be the same length as enum_ids

class dsgrid.dataformat.sectordataset.SectorDataset(datafile, sector_id, enduses, times)[source]

Bases: object

Creates a SectorDataset object. Note that this does not read from or write to datafile in any way, and should generally not be called directly. Instead, use the SectorDataset.load or SectorDataset.new class methods.

classmethod new(datafile, sector_id, enduses=None, times=None)[source]
classmethod load(datafile, f, sector_id)[source]
classmethod loadall(datafile, f, _upgrade_class=None)[source]
add_data_batch(dataframes, geo_ids, scalings=None, full_validation=True)[source]

Add a batch of new data to this SectorDataset. Uses the basic add_data functionality, but handles the h5 file so as to write data to memory first and only write to disk upon closing.

Parameters
  • dataframes (iterable) -- One dataframe per call to add_data

  • geo_ids (list) -- List of geo_ids arguments to pass to add_data

  • scalings (None or list of lists) -- If None, [] will be passed in to each call of add_data. Otherwise, this must be a list of the scalings arguments to pass, which must be lists of the same size as the geo_ids argument for that call.

  • full_validation (bool) -- If True, checks that all enumeration ids (time, enduse, and geography) are valid for every item. If False, performs these checks only for the first item.

add_data(dataframe, geo_ids, scalings=[], full_validation=True, _batch_file_object=None)[source]

Add new data to this SectorDataset, as part of the self.datafile HDF5.

Parameters
  • dataframe (pandas.DataFrame) -- Data to add, indexed by times, and with columns equal to enduses.

  • geo_ids (id or list of ids) -- Id or list of ids from datafile.geo_enum that the data applies to

  • scalings (list of float) -- If non-empty, must be same length as geo_ids and represents the scaling factors for the geo_ids in order. Otherwise, a uniform value of 1.0 is assumed for all geo_ids.

  • full_validation (bool) -- If true, checks that all enumeration ids (time, enduse, and geography) are valid.

has_data(geo_id)[source]
get_datamap(dim_key)[source]
get_data(dataset_geo_index)[source]

Get data in this file's native format.

Parameters

dataset_geo_index (int) -- Index into the geography dimension of this dataset. Is an integer in the range [0,self.n_geos) that corresponds to the values in this dataset's geographies[:,'idx'] that are not equal to NULL_IDX

Returns

  • pandas.DataFrame -- data indexed by time and differentiated by enduse (as columns)

  • list of .datafile.geo_enum.ids -- geographic enum values this data applies to

  • list of float -- one scaling factor for each geographic enum value

copy_data(other_sectordataset, full_validation=True)[source]

Copy data from this SectorDataset into other_sectordataset.

Parameters
  • other_sectordataset (SectorDataset) -- target for this SectorDataset's data to be copied into

  • full_validation (bool) -- flag for SectorDataset.add_data

map_dimension(new_datafile, mapping)[source]
scale_data(new_datafile, factor=0.001)[source]

Scale all the data in self by factor, writing the scaled data into new_datafile.

Parameters
  • new_datafile (dsgrid.dataformat.datafile.Datafile) -- Datafile to receive the scaled data

  • factor (float) -- Factor by which all the data in the file is to be multiplied. The default value of 0.001 corresponds to converting the bottom-up data from kWh to MWh.

dsgrid.dataformat.upgrade module

class dsgrid.dataformat.upgrade.UpgradeDatafile[source]

Bases: object

from_version = None
to_version = None
classmethod upgrade(datafile, f)[source]
classmethod load_datafile(filepath)[source]

Load enough to return a Datafile object. Object should not be expected to be fully functional.

Parameters

filepath (str) -- path to Datafile

Returns

(partially) loaded Datafile in old format

Return type

dsgrid.dataformat.datafile.Datafile

classmethod load_sectordataset(datafile, f, sector_id)[source]

Load enough to return a SectorDataset object. Object should not be expected to be fully functional.

class dsgrid.dataformat.upgrade.DSG_0_1_0[source]

Bases: dsgrid.dataformat.upgrade.UpgradeDatafile

from_version = '0.1.0'
to_version = '0.2.0'
ZERO_IDX = 65535
classmethod load_sectordataset(datafile, f, sector_id)[source]

Load enough to return a SectorDataset object. Object should not be expected to be fully functional.

dsgrid.dataformat.upgrade.make_fuel_and_units_explicit(datafile, filepath, fuel='Electricity', units='MWh')[source]

Module contents

dsgrid.dataformat.get_str(a_str_or_bytes)[source]