Dataset Mapping Plan¶
- pydantic model dsgrid.query.dataset_mapping_plan.DatasetMappingPlan[source]¶
Defines how to map a dataset to a list of dimensions.
Create a new model by parsing and validating input data from keyword arguments.
Raises ValidationError if the input data cannot be validated to form a valid model.
self is explicitly positional-only to allow self as a field name.
- Fields:
apply_fraction_op (dsgrid.query.dataset_mapping_plan.MapOperation)
apply_scaling_factor_op (dsgrid.query.dataset_mapping_plan.MapOperation)
convert_units_op (dsgrid.query.dataset_mapping_plan.MapOperation)
dataset_id (str)
keep_intermediate_files (bool)
map_time_op (dsgrid.query.dataset_mapping_plan.MapOperation)
mappings (list[dsgrid.query.dataset_mapping_plan.MapOperation])
- Validators:
check_names » all fields
- field apply_fraction_op: MapOperation = MapOperation(name='apply_fraction_op', handle_data_skew=False, persist=False, mapping_reference=None)¶
Defines handling of the query that applies the from_fraction value after mapping all dimensions.
- Validated by: check_names
- field apply_scaling_factor_op: MapOperation = MapOperation(name='apply_scaling_factor_op', handle_data_skew=False, persist=False, mapping_reference=None)¶
Defines handling of the query that applies the scaling factor, if one exists. This happens after apply_fraction_op.
- Validated by: check_names
- field convert_units_op: MapOperation = MapOperation(name='convert_units_op', handle_data_skew=False, persist=False, mapping_reference=None)¶
Defines handling of the query that converts units. This happens after apply_fraction_op and before mapping time. It is strongly recommended not to persist this table because the code currently always persists before mapping time.
- Validated by: check_names
- field dataset_id: str [Required]¶
ID of the dataset to be mapped.
- Validated by: check_names
- field keep_intermediate_files: bool = False¶
If True, keep the intermediate tables created during the mapping process. This is useful for debugging and benchmarking, but will consume more disk space.
- Validated by: check_names
- field map_time_op: MapOperation = MapOperation(name='map_time', handle_data_skew=False, persist=False, mapping_reference=None)¶
Defines handling of the query that maps the time dimension. This happens after convert_units_op. Unlike the other dimension mappings, this does not use the generic mapping code; it relies on time-type-specific handling in chronify.
- Validated by: check_names
- field mappings: list[MapOperation] = []¶
Defines how to map each dimension of the dataset.
- Validated by: check_names
- list_mapping_operations() → list[MapOperation] [source]¶
List all mapping operations in the plan, in order (see the usage sketch below).
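For orientation, here is a minimal usage sketch. The dataset ID and per-dimension operation names are hypothetical, and the ordering comment reflects the field descriptions above rather than a guaranteed API contract:

```python
from dsgrid.query.dataset_mapping_plan import DatasetMappingPlan, MapOperation

# Hypothetical dataset ID and operation names, for illustration only.
plan = DatasetMappingPlan(
    dataset_id="my_dataset",
    mappings=[
        MapOperation(name="map_geography"),
        MapOperation(name="map_sector", persist=True),  # checkpoint after this step
    ],
)

# Per the field descriptions above, the per-dimension mappings come first,
# followed by the fraction, scaling-factor, unit-conversion, and time operations.
for op in plan.list_mapping_operations():
    print(op.name, op.persist)
```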
- pydantic model dsgrid.query.dataset_mapping_plan.MapOperation[source]¶
Defines one mapping operation for a dataset.
Create a new model by parsing and validating input data from keyword arguments.
Raises ValidationError if the input data cannot be validated to form a valid model.
self is explicitly positional-only to allow self as a field name.
- Fields:
handle_data_skew (bool | None)
mapping_reference (DimensionMappingReferenceModel | None)
name (str)
persist (bool)
- field handle_data_skew: bool | None = None¶
Use a salting technique to handle data skew in this mapping operation. Skew can occur when some partitions contain significantly more data than others, resulting in unbalanced task execution times. If this value is None, dsgrid decides whether to apply the technique based on the characteristics of the mapping operation; setting it to True or False overrides that determination. Enabling data skew handling automatically triggers a persist to the filesystem (implicitly setting persist to True).
- field mapping_reference: DimensionMappingReferenceModel | None = None¶
Reference to the model used to map the dimension. Set at runtime by dsgrid.
- field name: str [Required]¶
Identifier for the mapping operation; the name must be unique within the mapping plan.
- field persist: bool = False¶
Persist the intermediate dataset to the filesystem after mapping this dimension. This can be useful to prevent the query from becoming too large, and for benchmarking and debugging purposes (see the sketch below).
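A short sketch of configuring individual operations; the operation names are hypothetical:

```python
from dsgrid.query.dataset_mapping_plan import MapOperation

# Hypothetical operation name. Setting handle_data_skew=True salts the data
# to balance skewed partitions and implicitly persists the intermediate table.
skewed_op = MapOperation(name="map_geography", handle_data_skew=True)

# Leaving handle_data_skew as None (the default) lets dsgrid decide based on
# the characteristics of the mapping operation.
default_op = MapOperation(name="map_subsector")
assert default_op.handle_data_skew is None
assert default_op.persist is False
```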
- pydantic model dsgrid.query.dataset_mapping_plan.MapOperationCheckpoint[source]¶
Defines a completed mapping operation that has been persisted to the filesystem.
Create a new model by parsing and validating input data from keyword arguments.
Raises ValidationError if the input data cannot be validated to form a valid model.
self is explicitly positional-only to allow self as a field name.
- Fields:
completed_operation_names (list[str])
dataset_id (str)
mapping_plan_hash (str)
persisted_table_filename (pathlib.Path)
timestamp (datetime.datetime)
- field completed_operation_names: list[str] [Required]¶
Names of the completed mapping operations.
- field dataset_id: str [Required]¶
ID of the dataset being mapped.
- field mapping_plan_hash: str [Required]¶
Hash of the mapping plan, used to verify that the plan hasn’t changed since the checkpoint was created.
- field persisted_table_filename: Path [Required]¶
Path to the persisted table file.
- field timestamp: datetime [Optional]¶
Timestamp of when the operation was completed.
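A sketch of what a checkpoint record might contain. All values are hypothetical; dsgrid creates and consumes these records itself during mapping:

```python
from datetime import datetime
from pathlib import Path

from dsgrid.query.dataset_mapping_plan import MapOperationCheckpoint

# All values are hypothetical; dsgrid normally writes these records itself
# after persisting an intermediate table during a mapping operation.
checkpoint = MapOperationCheckpoint(
    dataset_id="my_dataset",
    completed_operation_names=["map_geography", "map_sector"],
    mapping_plan_hash="<hash of the serialized mapping plan>",
    persisted_table_filename=Path("scratch/my_dataset_map_sector.parquet"),
    timestamp=datetime.now(),
)

# On a later run, a matching mapping_plan_hash indicates the plan is unchanged,
# so the persisted table can be reloaded instead of recomputing earlier steps.
print(checkpoint.persisted_table_filename)
```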