dsgrid.dataformat package¶

Submodules¶

dsgrid.dataformat.datafile module¶

class dsgrid.dataformat.datafile.Datafile(filepath, sector_enum, geography_enum, enduse_enum, time_enum, loading=False, version='0.4.0')[source]¶

Bases: collections.abc.Mapping

Create a new Datafile object. Use Datafile.load to open existing files.

Parameters

filepath (str) -- file to create. typically has a .dsg file extension.
sector_enum (dsgrid.dataformat.enumeration.SectorEnumeration) -- enumeration of sectors to be stored in this Datafile
geography_enum (dsgrid.dataformat.enumeration.GeographyEnumeration) -- enumeration of geographies for this Datafile. typically these are geographical units at the same level of resolution. the Datafile does not have to specify values for every geography.
enduse_enum (dsgrid.dataformat.enumeration.EndUseEnumerationBase) -- enumeration of end-uses. there are multiple EndUseEnumerationBase class types. typically one would use SingleFuelEndUseEnumeration or MultiFuelEndUseEnumeration.
time_enum (dsgrid.dataformat.enumeration.TimeEnumeration) -- enumeration specifying the time resolution of this Datafile
loading (bool) -- NOT FOR GENERAL USE -- Use Datafile.load to open existing files.
version (str) -- NOT FOR GENERAL USE -- New files are marked with the current VERSION. The load and upgrade methods are used to manage version indicators for backward compatibility.

contains(an_enum)[source]¶

classmethod load(filepath, upgrade=True, overwrite=False, new_filepath=None, **kwargs)[source]¶

upgrade(OLD_VERSIONS, overwrite=False, new_filepath=None)[source]¶

Upgrade this Datafile to the latest version. This method should not usually be called directly; it is called by Datafile.load if upgrade is True.

Parameters

OLD_VERSIONS (OrderedDict of {version string : dsgrid.dataformat.upgrade.UpgradeDatafile}) -- import from dsgrid.dataformat.upgrade
overwrite (bool) -- if True, the upgraded Datafile overwrites the original
new_filepath (str) -- if overwrite is False and new_filepath is not None, the upgraded file is saved to new_filepath. If overwrite is False and new_filepath is None, the upgraded file is saved to the same directory as the original file, with the new version number appended to the filename, and the extension '.dsg'
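The default output location can be sketched as below. This is a minimal illustration, not the dsgrid implementation: the helper name default_upgrade_path is hypothetical, and the exact filename pattern (stem, underscore, version) is an assumption, since the description above says only that the version number is appended and the extension is '.dsg'.

```python
from pathlib import PurePosixPath


def default_upgrade_path(original_path, new_version):
    """Hypothetical helper: build a path next to the original file."""
    original = PurePosixPath(original_path)
    # Assumed pattern: <stem>_<version>.dsg in the same directory.
    return str(original.with_name(f"{original.stem}_{new_version}.dsg"))


upgraded = default_upgrade_path("data/residential.dsg", "0.4.0")
```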

save(filepath)[source]¶: Save self to filepath and return newly created Datafile

add_sector(sector_id, enduses=None, times=None)[source]¶: Adds a SectorDataset to this file and returns it.

map_dimension(filepath, mapping)[source]¶

scale_data(filepath, factor=0.001)[source]¶

Scale all the data in self by factor, creating a new HDF5 file and corresponding Datafile.

Parameters

filepath (str) -- Location for the new HDF5 file to be created
factor (float) -- Factor by which all the data in the file is to be multiplied. The default value of 0.001 corresponds to converting the bottom-up data from kWh to MWh
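The arithmetic behind the default factor can be shown with plain Python lists. The function scale_values is a hypothetical stand-in and does not touch HDF5 files; it only illustrates why 0.001 converts kWh to MWh.

```python
def scale_values(values, factor=0.001):
    """Multiply every value by factor; the default converts kWh to MWh."""
    return [v * factor for v in values]


load_kwh = [1500.0, 250.0, 0.0]
load_mwh = scale_values(load_kwh)
```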

dsgrid.dataformat.datatable module¶

class dsgrid.dataformat.datatable.Datatable(datafile, sort=True, verify_integrity=True)[source]¶

Bases: object

sort()[source]¶

dsgrid.dataformat.dimmap module¶

class dsgrid.dataformat.dimmap.DimensionMap(from_enum, to_enum)[source]¶

Bases: object

map(from_id)[source]¶: Returns the appropriate to_id.

scale_factor(from_id)[source]¶

class dsgrid.dataformat.dimmap.TautologyMapping(from_to_enum)[source]¶

Bases: dsgrid.dataformat.dimmap.DimensionMap

map(from_id)[source]¶: Returns the appropriate to_id.

class dsgrid.dataformat.dimmap.FullAggregationMap(from_enum, to_enum, exclude_list=[])[source]¶

Bases: dsgrid.dataformat.dimmap.DimensionMap

Parameters

from_enum (dsgrid.dataformat.enumeration.Enumeration) --
to_enum (dsgrid.dataformat.enumeration.Enumeration) -- Class must correspond to the same dimension as from_enum, and the enumeration must have exactly one element
exclude_list (list of from_enum.ids) -- from_enum values that should be dropped from the aggregation

map(from_id)[source]¶: Returns the appropriate to_id.
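A full aggregation can be pictured as a map that sends every retained id to the single element of to_enum. The class below is a stand-in that works on plain lists of ids rather than Enumeration objects, and returning None for excluded ids is an assumed convention for illustration only.

```python
class FullAggregationSketch:
    """Illustrative stand-in; not the dsgrid implementation."""

    def __init__(self, from_ids, to_ids, exclude_list=None):
        if len(to_ids) != 1:
            raise ValueError("to_ids must contain exactly one element")
        self.from_ids = list(from_ids)
        self.to_id = to_ids[0]
        self.exclude = set(exclude_list or [])

    def map(self, from_id):
        # Excluded ids are dropped (represented here as None, an assumption).
        if from_id in self.exclude:
            return None
        return self.to_id


agg = FullAggregationSketch(["CO", "WY", "HI"], ["conus"], exclude_list=["HI"])
mapped = [agg.map(i) for i in agg.from_ids]
```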

class dsgrid.dataformat.dimmap.FilterToSubsetMap(from_enum, to_enum)[source]¶

Bases: dsgrid.dataformat.dimmap.DimensionMap

Parameters: to_enum --

map(from_id)[source]¶: Returns the appropriate to_id.

class dsgrid.dataformat.dimmap.FilterToSingleFuelMap(from_enum, fuel_to_keep)[source]¶

Bases: dsgrid.dataformat.dimmap.DimensionMap

map(from_id)[source]¶: Returns the appropriate to_id.

class dsgrid.dataformat.dimmap.ExplicitMap(from_enum, to_enum, dictmap)[source]¶

Bases: dsgrid.dataformat.dimmap.DimensionMap

map(from_id)[source]¶: Returns the appropriate to_id.

classmethod create_from_csv(from_enum, to_enum, filepath)[source]¶

class dsgrid.dataformat.dimmap.ExplicitDisaggregation(from_enum, to_enum, dictmap, scaling_datafile=None)[source]¶

Bases: dsgrid.dataformat.dimmap.ExplicitMap

If no scaling_datafile, scaling factors are assumed to be 1.0.

property default_scaling¶

property scaling_datatable¶

get_scalings(to_ids)[source]¶: Return an array of scalings for to_ids.

classmethod create_from_csv(from_enum, to_enum, filepath, scaling_datafile=None)[source]¶

class dsgrid.dataformat.dimmap.ExplicitAggregation(from_enum, to_enum, dictmap)[source]¶: Bases: dsgrid.dataformat.dimmap.ExplicitMap

class dsgrid.dataformat.dimmap.UnitConversionMap(from_enum, from_units, to_units)[source]¶

Bases: dsgrid.dataformat.dimmap.DimensionMap

Convert from_units to to_units.

Parameters

from_enum (EndUseEnumerationBase) --
from_units (list of str) -- List of units in from_enum that are to be converted
to_units (list of str) -- List of units to convert to. Same length list as from_units.

CONVERSION_FACTORS = {('GWh', 'TWh'): 0.001, ('MWh', 'GWh'): 0.001, ('kWh', 'MWh'): 0.001}¶

map(from_id)[source]¶: Returns the appropriate to_id.

scale_factor(from_id)[source]¶

classmethod scaling_factor(from_unit, to_unit)[source]¶
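A lookup against the CONVERSION_FACTORS table might look like the sketch below. The function name scaling_factor_sketch is hypothetical, and the handling of identical units and of reverse conversions (taking the reciprocal) is an assumption; the source documents only the three forward factors.

```python
CONVERSION_FACTORS = {
    ('GWh', 'TWh'): 0.001,
    ('MWh', 'GWh'): 0.001,
    ('kWh', 'MWh'): 0.001,
}


def scaling_factor_sketch(from_unit, to_unit):
    """Return the multiplier that converts from_unit values to to_unit."""
    if from_unit == to_unit:
        return 1.0
    if (from_unit, to_unit) in CONVERSION_FACTORS:
        return CONVERSION_FACTORS[(from_unit, to_unit)]
    # Assumed behavior: invert a known factor for the reverse direction.
    if (to_unit, from_unit) in CONVERSION_FACTORS:
        return 1.0 / CONVERSION_FACTORS[(to_unit, from_unit)]
    raise ValueError(f"No conversion from {from_unit} to {to_unit}")
```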

class dsgrid.dataformat.dimmap.Mappings[source]¶

Bases: object

add_mapping(mapping)[source]¶

get_mapping(datafile, to_enum)[source]¶

dsgrid.dataformat.enumeration module¶

class dsgrid.dataformat.enumeration.Enumeration(name, ids, names)[source]¶

Bases: object

max_id_len = 64¶

max_name_len = 128¶

enum_dtype = dtype([('id', 'S64'), ('name', 'S128')])¶

dimension = None¶

checkvalues()[source]¶

get_name(id)[source]¶

create_subset_enum(ids)[source]¶

Returns a new enumeration that is a subset of this one, based on keeping the items in ids.

Parameters: ids (list) -- subset of self.ids that should be kept in the new enumeration
Returns
Return type: self.__class__

is_subset(other_enum)[source]¶: Returns true if this Enumeration is a subset of other_enum.
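The relationship between create_subset_enum and is_subset can be sketched with a small stand-in class. EnumSketch is hypothetical; keeping the parent's ordering in the subset and comparing both ids and names in is_subset are assumptions, not documented behavior.

```python
class EnumSketch:
    """Illustrative stand-in for an Enumeration with ids and names."""

    def __init__(self, name, ids, names):
        if len(ids) != len(names):
            raise ValueError("ids and names must have the same length")
        self.name = name
        self.ids = list(ids)
        self.names = list(names)

    def create_subset_enum(self, ids):
        keep = set(ids)
        # Assumed: the subset preserves the order of the parent enumeration.
        pairs = [(i, n) for i, n in zip(self.ids, self.names) if i in keep]
        return self.__class__(
            self.name, [p[0] for p in pairs], [p[1] for p in pairs]
        )

    def is_subset(self, other_enum):
        other_names = dict(zip(other_enum.ids, other_enum.names))
        return all(
            other_names.get(i) == n for i, n in zip(self.ids, self.names)
        )


states = EnumSketch("states", ["CO", "WY", "UT"], ["Colorado", "Wyoming", "Utah"])
subset = states.create_subset_enum(["UT", "CO"])
```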

persist(h5group)[source]¶

classmethod load(h5group)[source]¶

classmethod read_csv(filepath, name=None)[source]¶

to_csv(filedir=None, filepath=None, overwrite=False)[source]¶

class dsgrid.dataformat.enumeration.SectorEnumeration(name, ids, names)[source]¶

Bases: dsgrid.dataformat.enumeration.Enumeration

dimension = 'sector'¶

class dsgrid.dataformat.enumeration.GeographyEnumeration(name, ids, names)[source]¶

Bases: dsgrid.dataformat.enumeration.Enumeration

dimension = 'geography'¶

class dsgrid.dataformat.enumeration.EndUseEnumerationBase(name, ids, names)[source]¶

Bases: dsgrid.dataformat.enumeration.Enumeration

dimension = 'enduse'¶

fuel(id)[source]¶

units(id)[source]¶

classmethod load(h5group)[source]¶

classmethod read_csv(filepath, name=None)[source]¶: Infer and read into the correct derived class.

class dsgrid.dataformat.enumeration.TimeEnumeration(name, ids, names)[source]¶

Bases: dsgrid.dataformat.enumeration.Enumeration

dimension = 'time'¶

class TIMESTAMP_POSITION(value)¶

Bases: enum.Enum

An enumeration.

period_beginning = 1¶

period_midpoint = 2¶

period_ending = 3¶

TIMEZONE_DISPLAY_NAMES = {'Etc/GMT+5': 'EST', 'Etc/GMT+6': 'CST', 'Etc/GMT+7': 'MST', 'Etc/GMT+8': 'PST'}¶

TIMEZONE_LOOKUP = {'CST': 'Etc/GMT+6', 'EST': 'Etc/GMT+5', 'MST': 'Etc/GMT+7', 'PST': 'Etc/GMT+8'}¶

classmethod create(enum_name, start, duration, resolution, extent_timezone=<UTC>, store_timezone=None, timestamp_position=TIMESTAMP_POSITION.period_ending)[source]¶

Create a new time enumeration based on the specified temporal extents, resolution, and timezone.

Parameters

enum_name (str) -- name for this enumeration, ideally descriptive of the parameters used for creation
start (datetime.datetime) -- beginning of the time period to be represented by the timestamps
duration (datetime.timedelta) -- total length of time to be covered
resolution (datetime.timedelta) -- timestep for the enumeration
extent_timezone (pytz.timezone) -- timezone that should be used to interpret the extent parameters
store_timezone (None or pytz.timezone) -- timezone to write the ids and names in. If None, extent_timezone is used.
timestamp_position (TimeEnumeration.TIMESTAMP_POSITION or convertible str) -- whether timestamps are placed at the beginning, ending, or midpoint of the time period being described

Returns

Return type

TimeEnumeration
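How start, duration, resolution, and timestamp_position interact can be sketched with the standard library alone. The function make_timestamps is hypothetical and returns datetime objects rather than a TimeEnumeration; timezone handling is reduced to a fixed UTC tzinfo for brevity.

```python
from datetime import datetime, timedelta, timezone


def make_timestamps(start, duration, resolution, position="period_ending"):
    """Return one timestamp per period of length resolution."""
    # Offset of the timestamp from the beginning of the period it labels.
    offset = {
        "period_beginning": timedelta(0),
        "period_midpoint": resolution / 2,
        "period_ending": resolution,
    }[position]
    n_periods = duration // resolution  # timedelta // timedelta gives an int
    return [start + k * resolution + offset for k in range(n_periods)]


day_start = datetime(2012, 1, 1, tzinfo=timezone.utc)
stamps = make_timestamps(day_start, timedelta(days=1), timedelta(hours=1))
```

With the period-ending default, the first timestamp of an hourly day is 01:00 and the last is midnight of the following day.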

property store_timezone¶

Examines the first id to determine what timezone this TimeEnumeration is stored in. Assumes the usage of datetime, pytz, and the "standard" timezones, e.g.,

pytz.timezone('Etc/GMT+5') = EST

pytz.timezone('Etc/GMT+6') = CST

pytz.timezone('Etc/GMT+7') = MST

pytz.timezone('Etc/GMT+8') = PST

property store_timezone_display_name¶

Interprets self.ids[0] to report what timezone this enumeration is stored in. Converts from pytz strings to what we typically use, namely EST, CST, MST, or PST.

Returns: timezone this TimeEnumeration is stored in, per self.store_timezone and self.TIMEZONE_DISPLAY_NAMES
Return type: str
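The two lookup tables are inverses of each other, and the 'Etc/GMT+N' names follow the POSIX sign convention, meaning N hours behind UTC. The sketch below uses fixed offsets from the standard library instead of pytz; the helper fixed_offset is hypothetical.

```python
from datetime import timedelta, timezone

TIMEZONE_DISPLAY_NAMES = {
    'Etc/GMT+5': 'EST',
    'Etc/GMT+6': 'CST',
    'Etc/GMT+7': 'MST',
    'Etc/GMT+8': 'PST',
}
# Inverting the display-name table reproduces TIMEZONE_LOOKUP.
TIMEZONE_LOOKUP = {v: k for k, v in TIMEZONE_DISPLAY_NAMES.items()}


def fixed_offset(etc_name):
    """Hypothetical helper: 'Etc/GMT+5' becomes a UTC-05:00 tzinfo."""
    hours_behind_utc = int(etc_name.split("GMT+")[1])
    return timezone(
        timedelta(hours=-hours_behind_utc), TIMEZONE_DISPLAY_NAMES[etc_name]
    )


est = fixed_offset(TIMEZONE_LOOKUP['EST'])
```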

property resolution¶

The resolution of this TimeEnumeration.

Returns: Returns a single value if the intervals are all of the same length. Returns a vector of values if they are different.
Return type: dt.timedelta or array of dt.timedelta

get_extents(report_timezone=None, timestamp_position=TIMESTAMP_POSITION.period_ending)[source]¶

Returns the inclusive temporal extents represented in this TimeEnumeration. That interpretation requires knowledge of the timestamp_position (beginning, end, or midpoint of the period being described).

Parameters: report_timezone (pytz.timezone) -- Timezone in which to report out the result
Returns: Tuple of start and end times, inclusive of all time represented based on the timestamp position, and in report_timezone.
Return type: (datetime.datetime,datetime.datetime)
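Why the timestamp position matters for the extents can be sketched as follows. The function extents_sketch is hypothetical, assumes a uniform resolution, and ignores timezone conversion.

```python
from datetime import datetime, timedelta, timezone


def extents_sketch(timestamps, resolution, position="period_ending"):
    """Return (start, end) covering every period the timestamps label."""
    # How far each timestamp sits after the start of its period.
    lag = {
        "period_beginning": timedelta(0),
        "period_midpoint": resolution / 2,
        "period_ending": resolution,
    }[position]
    start = timestamps[0] - lag
    end = timestamps[-1] - lag + resolution
    return start, end


utc = timezone.utc
# Period-ending hourly timestamps for one day: 01:00 through 24:00.
hourly = [datetime(2012, 1, 1, h, tzinfo=utc) for h in range(1, 24)]
hourly.append(datetime(2012, 1, 2, 0, tzinfo=utc))
span = extents_sketch(hourly, timedelta(hours=1))
```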

to_datetime_index(return_timezone=None)[source]¶

Return a Pandas DatetimeIndex corresponding to this TimeEnumeration. By default, localizes the timestamps to the timezone inferred based on the text of the first enumeration id. If return_timezone is None, this is what is returned. If return_timezone is not None, the index is converted to that timezone before being returned.

Parameters: return_timezone (None or pytz.timezone) -- timezone of the returned index. If None, this is inferred from self.ids[0]
Returns: same length as self.ids, but strings are converted to datetime.datetime objects and localized to a timezone.
Return type: pandas.DatetimeIndex

get_datetime_map(return_timezone=None)[source]¶

Converts self.ids and result of to_datetime_index into dict that can be used to map ids to datetimes in contexts other than a single DataFrame index.

Parameters: return_timezone (None or pytz.timezone) -- timezone of the returned index. If None, this is inferred from self.ids[0]
Returns: {id: localized datetime}
Return type: dict

class dsgrid.dataformat.enumeration.EndUseEnumeration(name, ids, names)[source]¶

Bases: dsgrid.dataformat.enumeration.EndUseEnumerationBase

Provided for backward compatibility with dsgrid v0.1.0 datasets.

fuel(id)[source]¶

units(id)[source]¶

classmethod read_csv(filepath, name=None)[source]¶: Infer and read into the correct derived class.

class dsgrid.dataformat.enumeration.SingleFuelEndUseEnumeration(name, ids, names, fuel='Electricity', units='MWh')[source]¶

Bases: dsgrid.dataformat.enumeration.EndUseEnumerationBase

If the end-use enumeration only applies to a single fuel type, and all the data is in the same units, just give the fuel and units.

fuel(id)[source]¶

units(id)[source]¶

create_subset_enum(ids)[source]¶

Returns a new enumeration that is a subset of this one, based on keeping the items in ids.

Parameters: ids (list) -- subset of self.ids that should be kept in the new enumeration
Returns
Return type: self.__class__

persist(h5group)[source]¶

classmethod read_csv(filepath, name=None, fuel='Electricity', units='MWh')[source]¶: Infer and read into the correct derived class.

to_csv(filedir=None, filepath=None, overwrite=False)[source]¶

class dsgrid.dataformat.enumeration.FuelEnumeration(name, ids, names, units)[source]¶

Bases: dsgrid.dataformat.enumeration.Enumeration

dimension = 'fuel'¶

enum_dtype = dtype([('id', 'S64'), ('name', 'S128'), ('units', 'S64')])¶

checkvalues()[source]¶

get_units(id)[source]¶

create_subset_enum(ids)[source]¶

Returns a new enumeration that is a subset of this one, based on keeping the items in ids.

Parameters: ids (list) -- subset of self.ids that should be kept in the new enumeration
Returns
Return type: self.__class__

persist(h5group)[source]¶

classmethod load(h5group)[source]¶

classmethod read_csv(filepath, name=None)[source]¶

to_csv(filedir=None, filepath=None, overwrite=False)[source]¶

class dsgrid.dataformat.enumeration.MultiFuelEndUseEnumeration(name, ids, names, fuel_enum, fuel_ids)[source]¶

Bases: dsgrid.dataformat.enumeration.EndUseEnumerationBase

enum_dtype = dtype([('id', 'S64'), ('name', 'S128'), ('fuel_id', 'S64')])¶

checkvalues()[source]¶

property ids¶

property names¶

fuel(id)[source]¶

units(id)[source]¶

create_subset_enum(ids)[source]¶

Returns a new enumeration that is a subset of this one, based on keeping the items in ids.

Parameters: ids (list of 2-tuples) -- subset of self.ids that should be kept in the new enumeration
Returns
Return type: MultiFuelEndUseEnumeration

persist(h5group)[source]¶

classmethod load(h5group)[source]¶

classmethod read_csv(filepath, name=None, fuel_enum=None)[source]¶

id, name, fuel_id + pass in fuel_enum

or

id, name, fuel_id, fuel_name, units

or

id, name, fuel_id, units (and fuel_name will be guessed from fuel_id)
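The three column layouts can be told apart from the CSV header alone. The sketch below uses only the standard csv module; classify_layout and its return strings are hypothetical and say nothing about how dsgrid itself parses the file.

```python
import csv
import io


def classify_layout(csv_text):
    """Identify which of the three documented column layouts a CSV uses."""
    header = next(csv.reader(io.StringIO(csv_text)))
    columns = [c.strip() for c in header]
    if columns == ["id", "name", "fuel_id"]:
        return "fuel_enum must be passed in"
    if columns == ["id", "name", "fuel_id", "fuel_name", "units"]:
        return "fuel names and units given explicitly"
    if columns == ["id", "name", "fuel_id", "units"]:
        return "fuel names guessed from fuel_id"
    raise ValueError(f"Unrecognized columns: {columns}")


example = "id,name,fuel_id,units\nheating,Space Heating,gas,MMBtu\n"
layout = classify_layout(example)
```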

to_csv(filedir=None, filepath=None, overwrite=False)[source]¶

dsgrid.dataformat.sectordataset module¶

class dsgrid.dataformat.sectordataset.Datamap(value)[source]¶

Bases: object

Map between Datafile-level enumeration (enum) and Sectordataset-level sub-enumeration (enum_ids). Sub-enumeration may also have non-unity scaling factors. Per Sectordataset, these Datamaps link each enumeration value with:

a particular index in the dataset along the Enumeration's dimension; and

a scaling parameter to apply to the associated underlying data.

Multiple enumeration values can refer to the same index in the dataset's enumeration dimension, with the option to apply different scaling factors. The index is represented as a 32-bit unsigned integer, which limits dataset size to 2^32 - 2 in each dimension, with NULL_IDX (2^32 - 1) serving as the sentinel value assigned to enumeration values not described in the dataset (looking up data associated with such an enumeration value will simply return zeros).

value¶

datamap vector of length len(enum.ids) with 'idx' and 'scale' dimensions. For the example of (j, scale) = self.value[i],

i = position of enum_id in enum.ids (datafile-level enum)

j = position of enum_id in enum_ids (sectordataset-level sub-enum)

scale = scaling factor to apply to this enumeration element

We also have j = self.value[i]['idx'], scale = self.value[i]['scale'].

Type: numpy.ndarray
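The layout of value can be sketched with plain tuples standing in for the structured numpy array. The function create_datamap_sketch is hypothetical, and assigning a scale of 1.0 to entries that carry NULL_IDX is an assumption.

```python
NULL_IDX = 2**32 - 1  # sentinel for enumeration values with no data


def create_datamap_sketch(enum_ids, sub_ids, scales=None):
    """Return one (idx, scale) pair per element of enum_ids."""
    scales = scales or [1.0] * len(sub_ids)
    position = {enum_id: j for j, enum_id in enumerate(sub_ids)}
    scale_of = dict(zip(sub_ids, scales))
    return [
        (position.get(enum_id, NULL_IDX), scale_of.get(enum_id, 1.0))
        for enum_id in enum_ids
    ]


# Datafile-level ids a, b, c; the dataset stores c first, then a.
value = create_datamap_sketch(["a", "b", "c"], ["c", "a"], [2.0, 0.5])
j, scale = value[0]  # i = 0 is the position of "a" in the datafile-level enum
```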

classmethod create(enum, enum_ids, enum_scales=None)[source]¶

Parameters

enum (dsgrid.dataformat.enumeration.Enumeration) --
enum_ids (list) -- List of items in enum.ids
enum_scales (None or list) -- if list, is list of floats the same length as enum_ids

Returns

Return type

Datamap

classmethod load(dataset)[source]¶

Parameters: dataset (h5py.Dataset) -- a Datamap serialized to h5py
Returns
Return type: Datamap

update(dataset)[source]¶

Updates dataset with this Datamap's value. Overwrites current dataset[:,'idx'] and dataset[:,'scale'].

Parameters: dataset (h5py.Dataset) -- a Datamap serialized to h5py

property num_entries¶: The number of non-null idx values in this Datamap's value. Corresponds, e.g. to the number of distinct (up to a scaling factor) entries along a dimension.

get_subenum(enum)[source]¶

Parameters: enum (dsgrid.dataformat.enumeration.Enumeration) -- Datafile-level enumeration
Returns: ids in enum for which there is data in this SectorDataset. The ids are returned in the order imposed by this SectorDataset.
Return type: list of enum.ids

is_empty(enum_id, enum)[source]¶

Parameters

enum_id (str) -- element of enum.ids
enum (dsgrid.dataformat.enumeration.Enumeration) -- Datafile-level enumeration

Returns

True if the dataset has no data for enum_id, False otherwise

Return type

bool

get_map(enum)[source]¶

Get the data in this map in ordered, hashed form, with the dataset-level idx as the key.

Parameters: enum (dsgrid.dataformat.enumeration.Enumeration) -- Datafile-level enumeration
Returns: idx: (list of enum.ids, scales)
Return type: OrderedDict
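Grouping by dataset-level idx can be sketched as below, again with plain tuples in place of the numpy array. The name get_map_sketch is hypothetical, and sorting the keys by idx is an assumed reading of "ordered".

```python
from collections import OrderedDict

NULL_IDX = 2**32 - 1


def get_map_sketch(value, enum_ids):
    """Group enum ids and scales by the dataset index they point to."""
    grouped = {}
    for enum_id, (idx, scale) in zip(enum_ids, value):
        if idx == NULL_IDX:
            continue  # no data stored for this enumeration value
        ids, scales = grouped.setdefault(idx, ([], []))
        ids.append(enum_id)
        scales.append(scale)
    return OrderedDict(sorted(grouped.items()))


# "a" and "d" share dataset index 1 with different scaling factors.
pairs = [(1, 0.5), (NULL_IDX, 1.0), (0, 2.0), (1, 0.25)]
by_idx = get_map_sketch(pairs, ["a", "b", "c", "d"])
```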

ids(idx, enum)[source]¶

Returns the enum.ids that are mapped to the dataset idx

Parameters

idx (int) -- dataset-level sub-enumeration index
enum (dsgrid.dataformat.enumeration.Enumeration) -- datafile-level enumeration the sub-enum is based on

Returns

in particular, the ones that are mapped to idx, in the order specified by enum.ids

Return type

list of enum.ids

scales(idx)[source]¶

Returns the scaling factors that correspond to the dataset idx

Parameters: idx (int) -- dataset-level sub-enumeration index
Returns: one for each of the .ids(idx,enum), and in the same order
Return type: list of float

append_element(new_elem_idx, enum_ids, enum, scalings=[])[source]¶

Appends a new non-null element for this Datamap that defines the index (new_elem_idx) for data that corresponds to enum_ids in enum.

Parameters

new_elem_idx (int) -- index value for the new element
enum_ids (list) -- list of distinct elements in enum.ids
enum (dsgrid.dataformat.enumeration.Enumeration) -- should be same Enumeration originally used to .create this Datamap
scalings (list of numeric) -- if empty, will be defaulted to 1.0. otherwise should be the same length as enum_ids

dsgrid.dataformat.sectordataset.append_element_to_dataset_dimension(dataset, new_elem_idx, enum_ids, enum, scalings=[])[source]¶

Helper method to do all the work of adding a new element to a SectorDataset dimension.

Parameters

dataset (h5py.Dataset) -- a Datamap serialized to h5py
new_elem_idx (int) -- index value for the new element
enum_ids (list) -- list of distinct elements in enum.ids
enum (dsgrid.dataformat.enumeration.Enumeration) -- should be same Enumeration originally used to .create this Datamap
scalings (list of numeric) -- if empty, will be defaulted to 1.0. otherwise should be the same length as enum_ids

class dsgrid.dataformat.sectordataset.SectorDataset(datafile, sector_id, enduses, times)[source]¶

Bases: object

Creates a SectorDataset object. Note that this does not read from or write to datafile in any way, and should generally not be called directly. Instead, use the SectorDataset.load or SectorDataset.new class methods.

classmethod new(datafile, sector_id, enduses=None, times=None)[source]¶

classmethod load(datafile, f, sector_id)[source]¶

classmethod loadall(datafile, f, _upgrade_class=None)[source]¶

add_data_batch(dataframes, geo_ids, scalings=None, full_validation=True)[source]¶

Add a batch of new data to this SectorDataset. Uses the basic add_data functionality, but handles the h5 file so as to write data to memory first and only write to disk upon closing.

Parameters

dataframes (iterable) -- One dataframe per call to add_data
geo_ids (list) -- List of geo_ids arguments to pass to add_data
scalings (None or list of lists) -- If None, [] will be passed in to each call of add_data. Otherwise, this must be a list of the scalings arguments to pass, which must be lists of the same size as the geo_ids argument for that call.
full_validation (bool) -- If true, checks that all enumeration ids (time, enduse, and geography) are valid for every item. If false, performs this check only for the first item.

add_data(dataframe, geo_ids, scalings=[], full_validation=True, _batch_file_object=None)[source]¶

Add new data to this SectorDataset, as part of the self.datafile HDF5.

Parameters

dataframe (pandas.DataFrame) -- Data to add, indexed by times, and with columns equal to enduses.
geo_ids (id or list of ids) -- Ids map to datafile.geo_enum
scalings (list of float) -- If non-empty, must be same length as geo_ids and represents the scaling factors for the geo_ids in order. Otherwise, a uniform value of 1.0 is assumed for all geo_ids.
full_validation (bool) -- If true, checks that all enumeration ids (time, enduse, and geography) are valid.
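The geo_ids and scalings conventions described above can be sketched as a small normalization step. The function normalize_geo_args is hypothetical and only mirrors the documented rules: a single id is treated as a one-element list, and empty scalings default to 1.0 for every geography.

```python
def normalize_geo_args(geo_ids, scalings=None):
    """Return (list of geo ids, list of scaling factors of equal length)."""
    if not isinstance(geo_ids, (list, tuple)):
        geo_ids = [geo_ids]  # a single id was passed
    geo_ids = list(geo_ids)
    if not scalings:
        scalings = [1.0] * len(geo_ids)
    elif len(scalings) != len(geo_ids):
        raise ValueError("scalings must be the same length as geo_ids")
    return geo_ids, list(scalings)


single = normalize_geo_args("08001")
shared = normalize_geo_args(["08001", "08003"], [0.7, 0.3])
```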

has_data(geo_id)[source]¶

get_datamap(dim_key)[source]¶

get_data(dataset_geo_index)[source]¶

Get data in this file's native format.

Parameters

dataset_geo_index (int) -- Index into the geography dimension of this dataset. Is an integer in the range [0,self.n_geos) that corresponds to the values in this dataset's geographies[:,'idx'] that are not equal to NULL_IDX

Returns

pandas.DataFrame -- data indexed by time and differentiated by enduse (as columns)
list of .datafile.geo_enum.ids -- geographic enum values this data applies to
list of float -- one scaling factor for each geographic enum value

copy_data(other_sectordataset, full_validation=True)[source]¶

Copy data from this SectorDataset into other_sectordataset.

Parameters

other_sectordataset (SectorDataset) -- target for this SectorDataset's data to be copied into
full_validation (bool) -- flag for SectorDataset.add_data

map_dimension(new_datafile, mapping)[source]¶

scale_data(new_datafile, factor=0.001)[source]¶

Scale all the data in self by factor, creating a new HDF5 file and corresponding Datafile.

Parameters

filepath (str) -- Location for the new HDF5 file to be created
factor (float) -- Factor by which all the data in the file is to be multiplied. The default value of 0.001 corresponds to converting the bottom-up data from kWh to MWh.

dsgrid.dataformat.upgrade module¶

class dsgrid.dataformat.upgrade.UpgradeDatafile[source]¶

Bases: object

from_version = None¶

to_version = None¶

classmethod upgrade(datafile, f)[source]¶

classmethod load_datafile(filepath)[source]¶

Load enough to return a Datafile object. Object should not be expected to be fully functional.

Parameters: filepath (str) -- path to Datafile
Returns: (partially) loaded Datafile in old format
Return type: dsgrid.dataformat.datafile.Datafile

classmethod load_sectordataset(datafile, f, sector_id)[source]¶: Load enough to return a SectorDataset object. Object should not be expected to be fully functional.

class dsgrid.dataformat.upgrade.DSG_0_1_0[source]¶

Bases: dsgrid.dataformat.upgrade.UpgradeDatafile

from_version = '0.1.0'¶

to_version = '0.2.0'¶

ZERO_IDX = 65535¶

classmethod load_sectordataset(datafile, f, sector_id)[source]¶: Load enough to return a SectorDataset object. Object should not be expected to be fully functional.

dsgrid.dataformat.upgrade.make_fuel_and_units_explicit(datafile, filepath, fuel='Electricity', units='MWh')[source]¶

Module contents¶

dsgrid.dataformat.get_str(a_str_or_bytes)[source]¶