How to Filter a Query¶
dsgrid offers several ways to filter the result of a query. It is important to understand some dsgrid behaviors to get an optimal result. Please refer to query concepts for details.
The examples below show how to define the filters in JSON5 or Python as well as the equivalent implementation if you were to filter the dataframe with Spark in Python (pyspark).
All examples except DimensionFilterBetweenColumnOperatorModel assume that the dataframe being filtered is the dimension record table. DimensionFilterBetweenColumnOperatorModel assumes that the table is the load data dataframe with time-series information.
Note
Whenever multiple filters are provided in an array, dsgrid performs an and across all filters.
Filter Types¶
1. Expression Filter¶
Filter the table where a dimension column matches an expression. This example filters the geography dimension by selecting only data where the county matches the ID 06037 (Los Angeles, CA). This is equivalent to column == "06037". You can use any SQL expression.
Note
All values for dimensions in the filters must be strings.
dimension_filters: [
{
dimension_type: "geography",
dimension_name: "county",
operator: "==",
value: "06037",
filter_type: "expression",
negate: false,
},
]
dimension_filters=[
DimensionFilterExpressionModel(
dimension_type=DimensionType.GEOGRAPHY,
dimension_name="county",
operator="==",
value="06037",
negate=False,
),
]
df.filter("geography == '06037'")
2. Raw Expression Filter¶
Similar to the first but use a raw expression. DimensionFilterExpressionModel creates a string by inserting the input parameters and adding required quotes. DimensionFilterExpressionRawModel uses your exact value. This allows you to make a more complex, custom expression.
dimension_filters: [
{
dimension_type: "geography",
dimension_name: "county",
value: "== '06037'",
filter_type: "expression_raw",
negate: false,
},
]
dimension_filters=[
DimensionFilterExpressionRawModel(
dimension_type=DimensionType.GEOGRAPHY,
dimension_name="county",
value="== '06037'",
negate=False,
),
]
df.filter("geography == '06037'")
3. Column Operator Filter¶
Filter a table where the specified column matches the specified value(s) according to the Spark SQL operator. This is useful for cases where you want to match partial strings or use a list of possible values.
Example: Filter by multiple values
dimension_filters: [
{
dimension_type: "geography",
dimension_name: "county",
column: "id",
operator: "isin",
value: ["06037", "06073"],
filter_type: "column_operator",
negate: false,
},
]
dimension_filters=[
DimensionFilterColumnOperatorModel(
dimension_type=DimensionType.GEOGRAPHY,
dimension_name="county",
column="id",
operator="isin",
value=["06037", "06073"],
negate=False,
),
]
df.filter(col("geography").isin(["06037", "06073"]))
Example: Pattern matching with LIKE
dimension_filters: [
{
dimension_type: "geography",
dimension_name: "county",
column: "name",
operator: "like",
value: "%County",
filter_type: "column_operator",
negate: false,
},
]
dimension_filters=[
DimensionFilterColumnOperatorModel(
dimension_type=DimensionType.GEOGRAPHY,
dimension_name="county",
column="name",
operator="like",
value="%County",
negate=False,
),
]
df.filter(col("name").like("%County"))
4. Supplemental Dimension Filter¶
Filter on supplemental dimension records. This example filters the metric supplemental dimension by fuel type.
dimension_filters: [
{
dimension_type: "metric",
dimension_name: "end_uses_by_fuel_type",
column: "fuel_id",
operator: "isin",
value: ["electricity", "natural_gas"],
filter_type: "supplemental_column_operator",
negate: false,
},
]
dimension_filters=[
SupplementalDimensionFilterColumnOperatorModel(
dimension_type=DimensionType.METRIC,
dimension_name="end_uses_by_fuel_type",
column="fuel_id",
operator="isin",
value=["electricity", "natural_gas"],
negate=False,
),
]
5. Time-Based Filter¶
Filter data between two time columns. This is useful for selecting data within specific time ranges.
dimension_filters: [
{
dimension_type: "time",
dimension_name: "time_est",
column1: "timestamp",
column2: "timestamp",
operator: "between",
value: ["2012-01-01 00:00:00", "2012-01-31 23:59:59"],
filter_type: "between_column_operator",
negate: false,
},
]
dimension_filters=[
DimensionFilterBetweenColumnOperatorModel(
dimension_type=DimensionType.TIME,
dimension_name="time_est",
column1="timestamp",
column2="timestamp",
operator="between",
value=["2012-01-01 00:00:00", "2012-01-31 23:59:59"],
negate=False,
),
]
df.filter(
(col("timestamp") >= "2012-01-01 00:00:00") &
(col("timestamp") <= "2012-01-31 23:59:59")
)
Common Operators¶
Operator |
Description |
Example Value |
|---|---|---|
|
Equals |
|
|
Not equals |
|
|
Greater than |
|
|
Greater than or equal |
|
|
Less than |
|
|
Less than or equal |
|
|
In list |
|
|
Pattern match |
|
|
Regex match |
|
|
Between two values |
|
Negating Filters¶
Set negate: true to invert any filter. For example, to exclude specific counties:
dimension_filters: [
{
dimension_type: "geography",
dimension_name: "county",
column: "id",
operator: "isin",
value: ["02013", "02016"], // Alaska counties
filter_type: "column_operator",
negate: true, // Exclude these counties
},
]
Combining Multiple Filters¶
When multiple filters are provided in an array, dsgrid applies them with an AND operation. To filter for California counties in years 2030-2050:
dimension_filters: [
{
dimension_type: "geography",
dimension_name: "county",
column: "id",
operator: "like",
value: "06%", // CA counties start with 06
filter_type: "column_operator",
negate: false,
},
{
dimension_type: "model_year",
dimension_name: "model_year",
column: "id",
operator: "between",
value: ["2030", "2050"],
filter_type: "column_operator",
negate: false,
},
]
Best Practices¶
Filter early: Apply filters at the dataset level when possible to reduce data processing
Use appropriate filter types: Choose the most specific filter type for your use case
Leverage supplemental dimensions: Use supplemental dimension filters for complex aggregations
Test incrementally: Start with simple filters and add complexity
Check dimension names: Ensure dimension names match those defined in the project
Next Steps¶
Learn about query concepts for understanding query processing
Follow the query project tutorial
Explore the CLI reference for command-line query options