How to Filter a Query
dsgrid offers several ways to filter the result of a query. It is important to understand some dsgrid behaviors to get an optimal result. Please refer to Project Queries for details.
The examples below show how to define the filters in JSON5 or Python, as well as the equivalent implementation if you were to filter the dataframe with Spark in Python (pyspark).
All examples except DimensionFilterBetweenColumnOperatorModel assume that the dataframe being filtered is the dimension record table. DimensionFilterBetweenColumnOperatorModel assumes that the table is the load data dataframe with time-series information.
Note
Whenever multiple filters are provided in an array, dsgrid performs an and across all filters.
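For example, combining the geography expression filter and the sector filter from the examples below keeps only the rows that satisfy both conditions. In pyspark terms this is a sketch of the same logic, chaining two filter calls:
df.filter("geography == '06037'").filter(df["sector"].startswith("com"))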
Filter the table where a dimension column matches an expression. This example filters the geography dimension by selecting only data where the county matches the ID 06037 (Los Angeles, CA). This is equivalent to column == "06037". You can use any SQL expression. Refer to the Spark documentation for more information.
Note
All values for dimensions in the filters must be strings.
dimension_filters: [
{
dimension_type: "geography",
dimension_query_name: "county",
operator: "==",
value: "06037",
filter_type: "expression",
negate: false,
},
]
dimension_filters=[
DimensionFilterExpressionModel(
dimension_type=DimensionType.GEOGRAPHY,
dimension_query_name="county",
operator="==",
value="06037",
negate=False,
),
]
df.filter("geography == '06037'")
Similar to the first example, but using a raw expression. DimensionFilterExpressionModel creates a string by inserting the input parameters and adding the required quotes. DimensionFilterExpressionRawModel uses your exact value, which allows you to write a more complex, custom expression. The following example reaches the same result as above. You can use any expression supported by pyspark.sql.DataFrame.filter.
dimension_filters: [
{
dimension_type: "geography",
dimension_query_name: "county",
value: "== '06037'",
filter_type: "expression_raw",
negate: false,
},
]
dimension_filters=[
DimensionFilterExpressionRawModel(
dimension_type=DimensionType.GEOGRAPHY,
dimension_query_name="county",
value="== '06037'",
negate=False,
),
]
df.filter("geography == '06037'")
Filter a table where the specified column matches the specified value(s) according to the Spark SQL operator. This is useful for cases where you want to match partial strings or use a list of possible values.
dimension_filters: [
{
dimension_type: "model_year",
dimension_query_name: "model_year",
column: "id",
operator: "isin",
value: [
"2030",
"2040",
"2050",
],
filter_type: "column_operator"
negate: false,
},
{
dimension_type: "sector",
dimension_query_name: "sector",
column: "id",
operator: "startswith",
value: "com",
filter_type: "column_operator"
negate: false,
},
],
dimension_filters=[
DimensionFilterColumnOperatorModel(
dimension_type=DimensionType.MODEL_YEAR,
dimension_query_name="model_year",
column="id",
operator="isin",
value=[
"2030",
"2040",
"2050",
],
negate=False,
),
DimensionFilterColumnOperatorModel(
dimension_type="sector",
dimension_query_name="sector",
column="id",
operator="startswith",
value="com",
filter_type="column_operator"
negate=False,
),
]
df.filter(df["model_year"].isin(["2030", "2040", "2050"])) \
.filter(df["sector"].startswith("com"))
Filter a table with values from a subset dimension. This example filters the table to include only electricity end uses.
dimension_filters: [
{
dimension_type: "metric",
dimension_query_names: ["electricity_end_uses"],
filter_type: "subset"
},
],
dimension_filters=[
SubsetDimensionFilterModel(
dimension_type=DimensionType.METRIC,
dimension_query_names=["electricity_end_uses"],
),
]
# Note that these are example dimension record IDs for demonstration purposes.
df.filter(df["end_use"].isin(["electricity_cooling", "electricity_heating"]))
Filter a table with records from a supplemental dimension. This example filters the table to include only counties in Colorado or New Mexico. Note that it does not change the dimensionality of the data or perform aggregations. It only tells dsgrid to filter out counties that don’t have a mapping in the supplemental dimension records.
dimension_filters: [
{
dimension_type: "geography",
dimension_query_name: "state",
column: "id",
operator: "isin",
value: ["CO", "NM"],
filter_type: "supplemental_column_operator"
},
],
dimension_filters=[
SupplementalDimensionFilterColumnOperatorModel(
dimension_type=DimensionType.GEOGRAPHY,
dimension_query_name="state",
column="id",
operator="isin",
value=["CO", "NM"],
),
]
df.filter(df["id"].isin(["CO", "NM"]))
Filter a table with times between two timestamps (inclusive on both sides).
dimension_filters: [
{
dimension_type: "time",
dimension_query_name: "time_est",
column: "time_est",
lower_bound: "2012-07-01 00:00:00",
upper_bound: "2012-08-01 00:00:00",
filter_type: "between_column_operator"
negate: false,
},
],
dimension_filters=[
DimensionFilterBetweenColumnOperatorModel(
dimension_type=DimensionType.TIME,
dimension_query_name="time_est",
column="time_est",
lower_bound="2012-07-01 00:00:00",
upper_bound="2012-08-01 00:00:00",
negate=False,
),
]
df.filter(df["timestamp"].between("2012-07-01 00:00:00", "2012-08-01 00:00:00"))