dsgrid.dataformat package¶
Submodules¶
dsgrid.dataformat.datafile module¶
- class dsgrid.dataformat.datafile.Datafile(filepath, sector_enum, geography_enum, enduse_enum, time_enum, loading=False, version='0.4.0')[source]¶
Bases:
collections.abc.Mapping
Create a new Datafile object. Use Datafile.load to open existing files.
- Parameters
filepath (str) -- file to create. typically has a .dsg file extension.
sector_enum (dsgrid.dataformat.enumeration.SectorEnumeration) -- enumeration of sectors to be stored in this Datafile
geography_enum (dsgrid.dataformat.enumeration.GeographyEnumeration) -- enumeration of geographies for this Datafile. typically these are geographical units at the same level of resolution. the Datafile does not have to specify values for every geography.
enduse_enum (dsgrid.dataformat.enumeration.EndUseEnumerationBase) -- enumeration of end-uses. there are multiple EndUseEnumerationBase class types. typically one would use SingleFuelEndUseEnumeration or MultiFuelEndUseEnumeration.
time_enum (dsgrid.dataformat.enumeration.TimeEnumeration) -- enumeration specifying the time resolution of this Datafile
loading (bool) -- NOT FOR GENERAL USE -- Use Datafile.load to open existing files.
version (str) -- NOT FOR GENERAL USE -- New files are marked with the current VERSION. The load and upgrade methods are used to manage version indicators for backward compatibility.
- upgrade(OLD_VERSIONS, overwrite=False, new_filepath=None)[source]¶
Upgrade this Datafile to the latest version. This method should not usually be called directly; it is called by Datafile.load if upgrade is True.
- Parameters
OLD_VERSIONS (OrderedDict of {version string : dsgrid.dataformat.upgrade.UpgradeDataFile}) -- import from dsgrid.dataformat.upgrade
overwrite (bool) -- if True, the upgraded Datafile overwrites the original
new_filepath (str) -- if not overwrite and new_filepath is not None, the new file is saved to new_filepath. If not overwrite and new_filepath is None, the upgraded file is saved to the same directory as the original file, with the new version number appended to the filename, and the extension '.dsg'
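The default output location described for new_filepath can be sketched as follows. This is an illustration only: the helper name default_upgrade_path and the exact way the version is appended (an underscore followed by the version string) are assumptions, not dsgrid's actual naming scheme.

```python
import os


def default_upgrade_path(original_path, new_version):
    """Hypothetical helper: same directory, version appended, '.dsg' extension."""
    directory, filename = os.path.split(original_path)
    stem, _ext = os.path.splitext(filename)
    # Assumed naming scheme: <stem>_<version>.dsg
    return os.path.join(directory, "{}_{}.dsg".format(stem, new_version))


example = default_upgrade_path(os.path.join("data", "residential.dsg"), "0.4.0")
print(example)
```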
- add_sector(sector_id, enduses=None, times=None)[source]¶
Adds a SectorDataset to this file and returns it.
- scale_data(filepath, factor=0.001)[source]¶
Scale all the data in self by factor, creating a new HDF5 file and corresponding Datafile.
- Parameters
filepath (str) -- Location for the new HDF5 file to be created
factor (float) -- Factor by which all the data in the file is to be multiplied. The default value of 0.001 corresponds to converting the bottom-up data from kWh to MWh
dsgrid.dataformat.datatable module¶
dsgrid.dataformat.dimmap module¶
- class dsgrid.dataformat.dimmap.FullAggregationMap(from_enum, to_enum, exclude_list=[])[source]¶
Bases:
dsgrid.dataformat.dimmap.DimensionMap
- Parameters
from_enum (dsgrid.dataformat.enumeration.Enumeration) --
to_enum (dsgrid.dataformat.enumeration.Enumeration) -- Class must correspond to the same dimension as from_enum, and the enumeration must have exactly one element
exclude_list (list of from_enum.ids) -- from_enum values that should be dropped from the aggregation
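The aggregation described by these parameters can be pictured with plain Python lists standing in for the enumerations. This is a minimal sketch, not the FullAggregationMap implementation; the function name and the example ids are hypothetical.

```python
def full_aggregation_mapping(from_ids, to_ids, exclude_list=None):
    """Map every id in from_ids to the single id in to_ids, dropping excluded ids."""
    if len(to_ids) != 1:
        raise ValueError("to_enum must have exactly one element")
    excluded = set(exclude_list or [])
    return {from_id: to_ids[0] for from_id in from_ids if from_id not in excluded}


mapping = full_aggregation_mapping(
    ["CA", "OR", "WA", "HI"], ["west"], exclude_list=["HI"]
)
print(mapping)
```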
- class dsgrid.dataformat.dimmap.FilterToSubsetMap(from_enum, to_enum)[source]¶
Bases:
dsgrid.dataformat.dimmap.DimensionMap
- Parameters
to_enum (-) --
- class dsgrid.dataformat.dimmap.ExplicitDisaggregation(from_enum, to_enum, dictmap, scaling_datafile=None)[source]¶
Bases:
dsgrid.dataformat.dimmap.ExplicitMap
If no scaling_datafile, scaling factors are assumed to be 1.0.
- property default_scaling¶
- property scaling_datatable¶
- class dsgrid.dataformat.dimmap.UnitConversionMap(from_enum, from_units, to_units)[source]¶
Bases:
dsgrid.dataformat.dimmap.DimensionMap
Convert from_units to to_units.
- Parameters
from_enum (EndUseEnumerationBase) --
from_units (list of str) -- List of units in from_enum that are to be converted
to_units (list of str) -- List of units to convert to. Same length list as from_units.
- CONVERSION_FACTORS = {('GWh', 'TWh'): 0.001, ('MWh', 'GWh'): 0.001, ('kWh', 'MWh'): 0.001}¶
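As a rough illustration of how the from_units and to_units lists pair up with CONVERSION_FACTORS, the sketch below looks up one factor per pair. The dictionary is copied from the attribute above; the helper conversion_scales is hypothetical and is not part of UnitConversionMap.

```python
CONVERSION_FACTORS = {
    ("GWh", "TWh"): 0.001,
    ("MWh", "GWh"): 0.001,
    ("kWh", "MWh"): 0.001,
}


def conversion_scales(from_units, to_units):
    """Return one scaling factor per (from_unit, to_unit) pair."""
    if len(from_units) != len(to_units):
        raise ValueError("from_units and to_units must have the same length")
    scales = []
    for pair in zip(from_units, to_units):
        if pair not in CONVERSION_FACTORS:
            raise KeyError("no conversion factor listed for {}".format(pair))
        scales.append(CONVERSION_FACTORS[pair])
    return scales


scales = conversion_scales(["kWh", "MWh"], ["MWh", "GWh"])
print(scales)
```

Only the three listed pairs are supported in this sketch; a pair such as ('kWh', 'GWh') raises KeyError rather than being chained automatically.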
dsgrid.dataformat.enumeration module¶
- class dsgrid.dataformat.enumeration.Enumeration(name, ids, names)[source]¶
Bases:
object
- max_id_len = 64¶
- max_name_len = 128¶
- enum_dtype = dtype([('id', 'S64'), ('name', 'S128')])¶
- dimension = None¶
- class dsgrid.dataformat.enumeration.SectorEnumeration(name, ids, names)[source]¶
Bases:
dsgrid.dataformat.enumeration.Enumeration
- dimension = 'sector'¶
- class dsgrid.dataformat.enumeration.GeographyEnumeration(name, ids, names)[source]¶
Bases:
dsgrid.dataformat.enumeration.Enumeration
- dimension = 'geography'¶
- class dsgrid.dataformat.enumeration.EndUseEnumerationBase(name, ids, names)[source]¶
Bases:
dsgrid.dataformat.enumeration.Enumeration
- dimension = 'enduse'¶
- class dsgrid.dataformat.enumeration.TimeEnumeration(name, ids, names)[source]¶
Bases:
dsgrid.dataformat.enumeration.Enumeration
- dimension = 'time'¶
- class TIMESTAMP_POSITION(value)¶
Bases:
enum.Enum
An enumeration.
- period_beginning = 1¶
- period_midpoint = 2¶
- period_ending = 3¶
- TIMEZONE_DISPLAY_NAMES = {'Etc/GMT+5': 'EST', 'Etc/GMT+6': 'CST', 'Etc/GMT+7': 'MST', 'Etc/GMT+8': 'PST'}¶
- TIMEZONE_LOOKUP = {'CST': 'Etc/GMT+6', 'EST': 'Etc/GMT+5', 'MST': 'Etc/GMT+7', 'PST': 'Etc/GMT+8'}¶
- classmethod create(enum_name, start, duration, resolution, extent_timezone=<UTC>, store_timezone=None, timestamp_position=TIMESTAMP_POSITION.period_ending)[source]¶
Create a new time enumeration based on the specified temporal extents, resolution, and timezone.
- Parameters
enum_name (str) -- name for this enumeration, ideally descriptive of the parameters used for creation
start (datetime.datetime) -- beginning of the time period to be represented by the timestamps
duration (datetime.timedelta) -- total length of time to be covered
resolution (datetime.timedelta) -- timestep for the enumeration
extent_timezone (pytz.timezone) -- timezone that should be used to interpret the extent parameters
store_timezone (None or pytz.timezone) -- timezone to write the ids and names in. If None, extent_timezone is used.
timestamp_position (TimeEnumeration.TIMESTAMP_POSITION or convertible str) -- whether timestamps are placed at the beginning, ending, or midpoint of the time period being described
- Returns
- Return type
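The relationship between start, duration, resolution, and timestamp_position can be sketched with the standard library alone. This is not the TimeEnumeration.create implementation: the function make_timestamps is hypothetical, a fixed-offset datetime.timezone stands in for a pytz timezone, and the way each position shifts the timestamp is an assumption based on the parameter descriptions.

```python
import datetime as dt


def make_timestamps(start, duration, resolution, timestamp_position="period_ending"):
    """One timestamp per timestep covering the span from start to start + duration."""
    offsets = {
        "period_beginning": dt.timedelta(0),
        "period_midpoint": resolution / 2,
        "period_ending": resolution,
    }
    offset = offsets[timestamp_position]
    n_steps = duration // resolution  # whole number of timesteps
    return [start + i * resolution + offset for i in range(n_steps)]


est = dt.timezone(dt.timedelta(hours=-5), "EST")  # fixed offset, no daylight saving
start = dt.datetime(2012, 1, 1, tzinfo=est)
stamps = make_timestamps(start, dt.timedelta(days=1), dt.timedelta(hours=1))
print(len(stamps), stamps[0].isoformat(), stamps[-1].isoformat())
```

With period_ending, the first hourly timestamp falls one hour after start and the last falls exactly at start + duration.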
- property store_timezone¶
Examines the first id to determine what timezone this TimeEnumeration is stored in. Assumes the usage of datetime, pytz, and the "standard" timezones, e.g.,
pytz.timezone('Etc/GMT+5') = EST
pytz.timezone('Etc/GMT+6') = CST
pytz.timezone('Etc/GMT+7') = MST
pytz.timezone('Etc/GMT+8') = PST
- property store_timezone_display_name¶
Interprets self.ids[0] to report what timezone this enumeration is stored in. Converts from pytz strings to what we typically use, namely EST, CST, MST, or PST.
- Returns
timezone this TimeEnumeration is stored in, per self.store_timezone and self.TIMEZONE_DISPLAY_NAMES
- Return type
str
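The two class-level dictionaries are inverses of one another, and the 'Etc/GMT+N' names follow the POSIX sign convention, so 'Etc/GMT+5' means five hours behind UTC. The sketch below copies the dictionaries from the attributes above; the helper utc_offset_for is hypothetical and only parses the name rather than consulting pytz.

```python
import datetime as dt

TIMEZONE_DISPLAY_NAMES = {"Etc/GMT+5": "EST", "Etc/GMT+6": "CST",
                          "Etc/GMT+7": "MST", "Etc/GMT+8": "PST"}
TIMEZONE_LOOKUP = {"CST": "Etc/GMT+6", "EST": "Etc/GMT+5",
                   "MST": "Etc/GMT+7", "PST": "Etc/GMT+8"}


def utc_offset_for(etc_name):
    """'Etc/GMT+N' denotes UTC-N, so the sign is inverted."""
    hours = int(etc_name.replace("Etc/GMT", ""))
    return dt.timedelta(hours=-hours)


for etc_name, display_name in TIMEZONE_DISPLAY_NAMES.items():
    print(display_name, etc_name, utc_offset_for(etc_name))
```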
- property resolution¶
The resolution of this TimeEnumeration.
- Returns
Returns a single value if the intervals are all of the same length. Returns a vector of values if they are different.
- Return type
dt.timedelta or array of dt.timedelta
- get_extents(report_timezone=None, timestamp_position=TIMESTAMP_POSITION.period_ending)[source]¶
Returns the inclusive temporal extents represented in this TimeEnumeration. That interpretation requires knowledge of the timestamp_position (beginning, end, or midpoint of the period being described).
- Parameters
report_timezone (pytz.timezone) -- Timezone in which to report out the result
- Returns
Tuple of start and end times, inclusive of all time represented based on the timestamp position, and in report_timezone.
- Return type
(datetime.datetime,datetime.datetime)
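One way to read the dependence on timestamp_position is sketched below for a uniform resolution. The function extents is hypothetical, and the arithmetic is an assumed interpretation of "inclusive of all time represented", not the get_extents source.

```python
import datetime as dt


def extents(first, last, resolution, timestamp_position="period_ending"):
    """Return (start, end) covering every period labeled by first..last."""
    if timestamp_position == "period_beginning":
        return first, last + resolution
    if timestamp_position == "period_midpoint":
        return first - resolution / 2, last + resolution / 2
    if timestamp_position == "period_ending":
        return first - resolution, last
    raise ValueError("unknown timestamp position: {}".format(timestamp_position))


hour = dt.timedelta(hours=1)
first = dt.datetime(2012, 1, 1, 1)   # first period-ending timestamp
last = dt.datetime(2012, 1, 2, 0)    # last period-ending timestamp
span = extents(first, last, hour, "period_ending")
print(span)
```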
- to_datetime_index(return_timezone=None)[source]¶
Return a Pandas DatetimeIndex corresponding to this TimeEnumeration. By default, localizes the timestamps to the timezone inferred based on the text of the first enumeration id. If return_timezone is None, this is what is returned. If return_timezone is not None, the index is converted to that timezone before being returned.
- Parameters
return_timezone (None or pytz.timezone) -- timezone of the returned index. If None, this is inferred from self.ids[0]
- Returns
same length as self.ids, but strings are converted to datetime.datetime objects and localized to a timezone.
- Return type
pandas.DatetimeIndex
- get_datetime_map(return_timezone=None)[source]¶
Converts self.ids and result of to_datetime_index into dict that can be used to map ids to datetimes in contexts other than a single DataFrame index.
- Parameters
return_timezone (None or pytz.timezone) -- timezone of the returned index. If None, this is inferred from self.ids[0]
- Returns
{id: localized datetime}
- Return type
dict
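The shape of the returned dict can be sketched as follows. The id strings here are assumed to be ISO 8601 timestamps with a UTC offset, which may not match the format dsgrid actually stores; the function datetime_map is hypothetical and uses the standard library rather than pandas or pytz.

```python
import datetime as dt


def datetime_map(ids, return_timezone=None):
    """Build {id: timezone-aware datetime}, optionally converted to return_timezone."""
    result = {}
    for id_ in ids:
        stamp = dt.datetime.fromisoformat(id_)  # assumes ISO 8601 ids with an offset
        if return_timezone is not None:
            stamp = stamp.astimezone(return_timezone)
        result[id_] = stamp
    return result


ids = ["2012-01-01 01:00:00-05:00", "2012-01-01 02:00:00-05:00"]
as_utc = datetime_map(ids, return_timezone=dt.timezone.utc)
print(as_utc[ids[0]].isoformat())
```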
- class dsgrid.dataformat.enumeration.EndUseEnumeration(name, ids, names)[source]¶
Bases:
dsgrid.dataformat.enumeration.EndUseEnumerationBase
Provided for backward compatibility with dsgrid v0.1.0 datasets.
- class dsgrid.dataformat.enumeration.SingleFuelEndUseEnumeration(name, ids, names, fuel='Electricity', units='MWh')[source]¶
Bases:
dsgrid.dataformat.enumeration.EndUseEnumerationBase
If the end-use enumeration only applies to a single fuel type, and all the data is in the same units, just give the fuel and units.
- create_subset_enum(ids)[source]¶
Returns a new enumeration that is a subset of this one, based on keeping the items in ids.
- Parameters
ids (list) -- subset of self.ids that should be kept in the new enumeration
- Returns
- Return type
self.__class__
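The subsetting behavior can be pictured with parallel lists of ids and names. This is a sketch under assumptions: the function subset_ids_and_names is hypothetical, and keeping the original ordering and rejecting unknown ids are guesses about sensible behavior rather than documented facts.

```python
def subset_ids_and_names(ids, names, keep_ids):
    """Keep only the entries whose id is in keep_ids, preserving the original order."""
    keep = set(keep_ids)
    missing = keep - set(ids)
    if missing:
        raise ValueError("ids not in the enumeration: {}".format(sorted(missing)))
    pairs = [(i, n) for i, n in zip(ids, names) if i in keep]
    return [i for i, _ in pairs], [n for _, n in pairs]


ids = ["heating", "cooling", "lighting"]
names = ["Space Heating", "Space Cooling", "Interior Lighting"]
sub_ids, sub_names = subset_ids_and_names(ids, names, ["lighting", "heating"])
print(sub_ids, sub_names)
```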
- class dsgrid.dataformat.enumeration.FuelEnumeration(name, ids, names, units)[source]¶
Bases:
dsgrid.dataformat.enumeration.Enumeration
- dimension = 'fuel'¶
- enum_dtype = dtype([('id', 'S64'), ('name', 'S128'), ('units', 'S64')])¶
- class dsgrid.dataformat.enumeration.MultiFuelEndUseEnumeration(name, ids, names, fuel_enum, fuel_ids)[source]¶
Bases:
dsgrid.dataformat.enumeration.EndUseEnumerationBase
- enum_dtype = dtype([('id', 'S64'), ('name', 'S128'), ('fuel_id', 'S64')])¶
- property ids¶
- property names¶
- create_subset_enum(ids)[source]¶
Returns a new enumeration that is a subset of this one, based on keeping the items in ids.
- Parameters
ids (list of 2-tuples) -- subset of self.ids that should be kept in the new enumeration
- Returns
- Return type
dsgrid.dataformat.sectordataset module¶
- class dsgrid.dataformat.sectordataset.Datamap(value)[source]¶
Bases:
object
Map between Datafile-level enumeration (enum) and Sectordataset-level sub-enumeration (enum_ids). Sub-enumeration may also have non-unity scaling factors. Per Sectordataset, these Datamaps link each enumeration value with:
a particular index in the dataset along the Enumeration's dimension; and
a scaling parameter to apply to the associated underlying data.
Multiple enumeration values can refer to the same index in the dataset's enumeration dimension, with the option to apply different scaling factors. The index is represented as a 32-bit unsigned integer, which limits dataset size to 2^32 - 2 in each dimension, with NULL_IDX (2^32 - 1) serving as the sentinel value assigned to enumeration values not described in the dataset. Looking up data associated with such an enumeration value simply returns zeros.
- value¶
datamap vector of length len(enum.ids) with 'idx' and 'scale' dimensions. For the example of (j, scale) = self.value[i],
i = position of enum_id in enum.ids (datafile-level enum)
j = position of enum_id in enum_ids (sectordataset-level sub-enum)
scale = scaling factor to apply to this enumeration element
We also have j = self.value[i]['idx'], scale = self.value[i]['scale'].
- Type
numpy.ndarray
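The (idx, scale) layout described above can be illustrated with a plain list of tuples in place of the structured numpy array. This is a minimal sketch: the function create_datamap_value is hypothetical, and the scale of 1.0 stored alongside NULL_IDX is an assumption.

```python
NULL_IDX = 2**32 - 1  # sentinel for enumeration values with no data


def create_datamap_value(all_ids, enum_ids, enum_scales=None):
    """Return [(idx, scale), ...] with one entry per id in all_ids."""
    if enum_scales is None:
        enum_scales = [1.0] * len(enum_ids)
    position = {enum_id: j for j, enum_id in enumerate(enum_ids)}
    value = []
    for enum_id in all_ids:  # i = position of enum_id in the datafile-level enum
        if enum_id in position:
            j = position[enum_id]  # j = position in the sub-enumeration
            value.append((j, enum_scales[j]))
        else:
            value.append((NULL_IDX, 1.0))  # assumed scale for null entries
    return value


value = create_datamap_value(["a", "b", "c"], ["c", "a"], [2.0, 0.5])
print(value)
```

Here 'b' is absent from the sub-enumeration, so its entry carries NULL_IDX.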
- classmethod create(enum, enum_ids, enum_scales=None)[source]¶
- Parameters
enum (dsgrid.dataformat.enumeration.Enumeration) --
enum_ids (list) -- List of items in enum.ids
enum_scales (None or list) -- if list, is list of floats the same length as enum_ids
- Returns
- Return type
- classmethod load(dataset)[source]¶
- Parameters
dataset (h5py.Dataset) -- a Datamap serialized to h5py
- Returns
- Return type
- update(dataset)[source]¶
Updates dataset with this Datamap's value. Overwrites current dataset[:,'idx'] and dataset[:,'scale'].
- Parameters
dataset (h5py.Dataset) -- a Datamap serialized to h5py
- property num_entries¶
The number of non-null idx values in this Datamap's value. Corresponds, e.g., to the number of distinct (up to a scaling factor) entries along a dimension.
- get_subenum(enum)[source]¶
- Parameters
enum (dsgrid.dataformat.enumeration.Enumeration) -- Datafile-level enumeration
- Returns
ids in enum for which there is data in this SectorDataset. The ids are returned in the order imposed by this SectorDataset.
- Return type
list of enum.ids
- is_empty(enum_id, enum)[source]¶
- Parameters
enum_id (str) -- element of enum.ids
enum (dsgrid.dataformat.enumeration.Enumeration) -- Datafile-level enumeration
- Returns
True if the dataset has no data for enum_id, False otherwise
- Return type
bool
- get_map(enum)[source]¶
Get the data in this map in ordered, hashed form, with the dataset-level idx as the key.
- Parameters
enum (dsgrid.dataformat.enumeration.Enumeration) -- Datafile-level enumeration
- Returns
idx: (list of enum.ids, scales)
- Return type
OrderedDict
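The grouping performed by get_map can be sketched using the list-of-tuples stand-in for the datamap value. The function group_by_idx is hypothetical, and sorting the keys by idx is an assumption about what "ordered" means here.

```python
from collections import OrderedDict

NULL_IDX = 2**32 - 1


def group_by_idx(value, all_ids):
    """Return OrderedDict {idx: (list of enum ids, list of scales)}, skipping nulls."""
    grouped = {}
    for enum_id, (idx, scale) in zip(all_ids, value):
        if idx == NULL_IDX:
            continue  # no data for this enumeration value
        ids, scales = grouped.setdefault(idx, ([], []))
        ids.append(enum_id)
        scales.append(scale)
    return OrderedDict(sorted(grouped.items(), key=lambda item: item[0]))


value = [(0, 1.0), (NULL_IDX, 1.0), (0, 0.5), (1, 1.0)]
datamap = group_by_idx(value, ["a", "b", "c", "d"])
print(datamap)
```

In this example 'a' and 'c' share dataset index 0 with different scaling factors, which is the many-to-one case described in the class docstring.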
- ids(idx, enum)[source]¶
Returns the enum.ids that are mapped to the dataset idx
- Parameters
idx (int) -- dataset-level sub-enumeration index
enum (dsgrid.dataformat.enumeration.Enumeration) -- datafile-level enumeration the sub-enum is based on
- Returns
the enum.ids that are mapped to idx, in the order specified by enum.ids
- Return type
list of enum.ids
- scales(idx)[source]¶
Returns the scaling factors that correspond to the dataset idx
- Parameters
idx (int) -- dataset-level sub-enumeration index
- Returns
one for each of the .ids(idx,enum), and in the same order
- Return type
list of float
- append_element(new_elem_idx, enum_ids, enum, scalings=[])[source]¶
Appends a new non-null element for this Datamap that defines the index (new_elem_idx) for data that corresponds to enum_ids in enum.
- Parameters
new_elem_idx (int) -- index value for the new element
enum_ids (list) -- list of distinct elements in enum.ids
enum (dsgrid.dataformat.enumeration.Enumeration) -- should be same Enumeration originally used to .create this Datamap
scalings (list of numeric) -- if empty, will be defaulted to 1.0. otherwise should be the same length as enum_ids
- dsgrid.dataformat.sectordataset.append_element_to_dataset_dimension(dataset, new_elem_idx, enum_ids, enum, scalings=[])[source]¶
Helper method to do all the work of adding a new element to a SectorDataset dimension.
- Parameters
dataset (h5py.Dataset) -- a Datamap serialized to h5py
new_elem_idx (int) -- index value for the new element
enum_ids (list) -- list of distinct elements in enum.ids
enum (dsgrid.dataformat.enumeration.Enumeration) -- should be same Enumeration originally used to .create this Datamap
scalings (list of numeric) -- if empty, will be defaulted to 1.0. otherwise should be the same length as enum_ids
- class dsgrid.dataformat.sectordataset.SectorDataset(datafile, sector_id, enduses, times)[source]¶
Bases:
object
Creates a SectorDataset object. Note that this does not read from or write to datafile in any way, and should generally not be called directly. Instead, use the SectorDataset.load or SectorDataset.new class methods.
- add_data_batch(dataframes, geo_ids, scalings=None, full_validation=True)[source]¶
Add a batch of new data to this SectorDataset. Uses the basic add_data functionality, but handles the h5 file so as to write data to memory first and only write to disk upon closing.
- Parameters
dataframes (iterable) -- One dataframe per call to add_data
geo_ids (list) -- List of geo_ids arguments to pass to add_data
scalings (None or list of lists) -- If None, [] will be passed in to each call of add_data. Otherwise, this must be a list of the scalings arguments to pass, which must be lists of the same size as the geo_ids argument for that call.
full_validation (bool) -- If True, checks that all enumeration ids (time, enduse, and geography) are valid for every item. If False, performs this check only for the first item.
- add_data(dataframe, geo_ids, scalings=[], full_validation=True, _batch_file_object=None)[source]¶
Add new data to this SectorDataset, as part of the self.datafile HDF5.
- Parameters
dataframe (pandas.DataFrame) -- Data to add, indexed by times, and with columns equal to enduses.
geo_ids (id or list of ids) -- Ids map to datafile.geo_enum
scalings (list of float) -- If non-empty, must be same length as geo_ids and represents the scaling factors for the geo_ids in order. Otherwise, a uniform value of 1.0 is assumed for all geo_ids.
full_validation (bool) -- If true, checks that all enumeration ids (time, enduse, and geography) are valid.
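The handling of geo_ids and scalings described above can be sketched without any HDF5 access. The helper normalize_geo_args is hypothetical and only illustrates the documented defaults: a single id is treated as a one-element list, and an empty scalings list means 1.0 for every geography.

```python
def normalize_geo_args(geo_ids, scalings=None):
    """Return (list of geo ids, list of scaling factors) of equal length."""
    if not isinstance(geo_ids, (list, tuple)):
        geo_ids = [geo_ids]  # a single id was passed
    geo_ids = list(geo_ids)
    if not scalings:
        scalings = [1.0] * len(geo_ids)  # uniform default
    elif len(scalings) != len(geo_ids):
        raise ValueError("scalings must be the same length as geo_ids")
    return geo_ids, list(scalings)


single = normalize_geo_args("01001")
shared = normalize_geo_args(["01001", "01003"], [0.4, 0.6])
print(single, shared)
```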
- get_data(dataset_geo_index)[source]¶
Get data in this file's native format.
- Parameters
dataset_geo_index (int) -- Index into the geography dimension of this dataset. Is an integer in the range [0,self.n_geos) that corresponds to the values in this dataset's geographies[:,'idx'] that are not equal to NULL_IDX
- Returns
pandas.DataFrame -- data indexed by time and differentiated by enduse (as columns)
list of .datafile.geo_enum.ids -- geographic enum values this data applies to
list of float -- one scaling factor for each geographic enum value
- copy_data(other_sectordataset, full_validation=True)[source]¶
Copy data from this SectorDataset into other_sectordataset.
- Parameters
other_sectordataset (SectorDataset) -- target for this SectorDataset's data to be copied into
full_validation (bool) -- flag for SectorDataset.add_data
- scale_data(new_datafile, factor=0.001)[source]¶
Scale all the data in self by factor, creating a new HDF5 file and corresponding Datafile.
- Parameters
filepath (str) -- Location for the new HDF5 file to be created
factor (float) -- Factor by which all the data in the file is to be multiplied. The default value of 0.001 corresponds to converting the bottom-up data from kWh to MWh.
dsgrid.dataformat.upgrade module¶
- class dsgrid.dataformat.upgrade.UpgradeDatafile[source]¶
Bases:
object
- from_version = None¶
- to_version = None¶
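The from_version and to_version attributes suggest that upgrades are chained, one step per old version, using an ordered mapping like the OLD_VERSIONS argument of Datafile.upgrade. The sketch below is an assumption about that pattern, not the dsgrid.dataformat.upgrade source: the classes, the function upgrade_steps, and the intermediate version '0.2.0' are all hypothetical.

```python
from collections import OrderedDict


class UpgradeStep:
    """Stand-in for an upgrade class with from_version and to_version attributes."""
    from_version = None
    to_version = None


class V010ToV020(UpgradeStep):
    from_version = "0.1.0"
    to_version = "0.2.0"  # hypothetical intermediate version


class V020ToV040(UpgradeStep):
    from_version = "0.2.0"
    to_version = "0.4.0"


OLD_VERSIONS = OrderedDict([
    (V010ToV020.from_version, V010ToV020),
    (V020ToV040.from_version, V020ToV040),
])


def upgrade_steps(version, old_versions, current_version):
    """List the steps needed to bring a file at version up to current_version."""
    steps = []
    while version != current_version:
        step = old_versions[version]  # KeyError if no upgrade path is known
        steps.append(step)
        version = step.to_version
    return steps


plan = upgrade_steps("0.1.0", OLD_VERSIONS, "0.4.0")
print([step.__name__ for step in plan])
```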