.. _filter-a-query:

*********************
How to Filter a Query
*********************
dsgrid offers several ways to filter the result of a query. It is important to
understand some dsgrid behaviors to get an optimal result. Please refer to
:ref:`project-queries-explanation` for details.

The examples below show how to define the filters in ``JSON5`` or ``Python`` as
well as the equivalent implementation if you were to filter the dataframe with
``Spark`` in ``Python`` (``pyspark``). All examples except
``DimensionFilterBetweenColumnOperatorModel`` assume that the dataframe being
filtered is the dimension record table.
``DimensionFilterBetweenColumnOperatorModel`` assumes that the table is the
load data dataframe with time-series information.

.. note:: Whenever multiple filters are provided in an array, dsgrid performs
   an ``and`` across all filters.

1. Filter the table where a dimension column matches an expression. This
   example filters the geography dimension by selecting only data where the
   county matches the ID ``06037`` (Los Angeles, CA). This is equivalent to
   ``column == "06037"``. You can use any SQL expression. Refer to the `Spark
   documentation <https://spark.apache.org/docs/latest/sql-ref.html>`_ for
   more information.

   .. note:: All values for dimensions in the filters must be strings.

   .. tabs::

      .. code-tab:: js JSON5

         dimension_filters: [
           {
             dimension_type: "geography",
             dimension_query_name: "county",
             operator: "==",
             value: "06037",
             filter_type: "expression",
             negate: false,
           },
         ]

      .. code-tab:: py

         dimension_filters=[
             DimensionFilterExpressionModel(
                 dimension_type=DimensionType.GEOGRAPHY,
                 dimension_query_name="county",
                 operator="==",
                 value="06037",
                 negate=False,
             ),
         ]

      .. code-tab:: py pyspark

         df.filter("geography == '06037'")

2. Similar to the first, but with a raw expression.
   ``DimensionFilterExpressionModel`` creates a string by inserting the input
   parameters and adding required quotes.
   ``DimensionFilterExpressionRawModel`` uses your exact value, which allows
   you to write a more complex, custom expression. The following example
   reaches the same result as above. You can use any expression supported by
   `pyspark.sql.DataFrame.filter
   <https://spark.apache.org/docs/latest/api/python/reference/pyspark.sql/api/pyspark.sql.DataFrame.filter.html>`_.

   .. tabs::

      .. code-tab:: js JSON5

         dimension_filters: [
           {
             dimension_type: "geography",
             dimension_query_name: "county",
             value: "== '06037'",
             filter_type: "expression_raw",
             negate: false,
           },
         ]

      .. code-tab:: py

         dimension_filters=[
             DimensionFilterExpressionRawModel(
                 dimension_type=DimensionType.GEOGRAPHY,
                 dimension_query_name="county",
                 value="== '06037'",
                 negate=False,
             ),
         ]

      .. code-tab:: py pyspark

         df.filter("geography == '06037'")

3. Filter a table where the specified column matches the specified value(s)
   according to the specified Spark SQL operator. This is useful for cases
   where you want to match partial strings or use a list of possible values.

   .. tabs::

      .. code-tab:: js JSON5

         dimension_filters: [
           {
             dimension_type: "model_year",
             dimension_query_name: "model_year",
             column: "id",
             operator: "isin",
             value: [
               "2030",
               "2040",
               "2050",
             ],
             filter_type: "column_operator",
             negate: false,
           },
           {
             dimension_type: "sector",
             dimension_query_name: "sector",
             column: "id",
             operator: "startswith",
             value: "com",
             filter_type: "column_operator",
             negate: false,
           },
         ],

      .. code-tab:: py

         dimension_filters=[
             DimensionFilterColumnOperatorModel(
                 dimension_type=DimensionType.MODEL_YEAR,
                 dimension_query_name="model_year",
                 column="id",
                 operator="isin",
                 value=[
                     "2030",
                     "2040",
                     "2050",
                 ],
                 negate=False,
             ),
             DimensionFilterColumnOperatorModel(
                 dimension_type=DimensionType.SECTOR,
                 dimension_query_name="sector",
                 column="id",
                 operator="startswith",
                 value="com",
                 negate=False,
             ),
         ]

      .. code-tab:: py pyspark

         df.filter(df["model_year"].isin(["2030", "2040", "2050"])) \
             .filter(df["sector"].startswith("com"))
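   The filters shown with ``negate: false`` also accept ``negate: true``,
   which inverts the condition. As a minimal pyspark sketch (illustrative
   only; dsgrid applies the inversion internally), negating the ``isin``
   filter above would keep only model years *not* in the list:

   .. code-block:: py

      # Sketch of negate: true applied to the isin filter above: keep rows
      # whose model_year is NOT 2030, 2040, or 2050.
      df.filter(~df["model_year"].isin(["2030", "2040", "2050"]))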
4. Filter a table with values from a subset dimension. This example filters
   the table to include only electricity end uses.

   .. tabs::

      .. code-tab:: js JSON5

         dimension_filters: [
           {
             dimension_type: "metric",
             dimension_query_names: ["electricity_end_uses"],
             filter_type: "subset",
           },
         ],

      .. code-tab:: py

         dimension_filters=[
             SubsetDimensionFilterModel(
                 dimension_type=DimensionType.METRIC,
                 dimension_query_names=["electricity_end_uses"],
             ),
         ]

      .. code-tab:: py pyspark

         # Note that these are example dimension record IDs for demonstration purposes.
         df.filter(df["end_use"].isin(["electricity_cooling", "electricity_heating"]))

5. Filter a table with records from a supplemental dimension. This example
   filters the table to include only counties in Colorado or New Mexico. Note
   that it does not change the dimensionality of the data or perform
   aggregations. It only tells dsgrid to filter out counties that don't have a
   mapping in the supplemental dimension records.

   .. tabs::

      .. code-tab:: js JSON5

         dimension_filters: [
           {
             dimension_type: "geography",
             dimension_query_name: "state",
             column: "id",
             operator: "isin",
             value: ["CO", "NM"],
             filter_type: "supplemental_column_operator",
           },
         ],

      .. code-tab:: py

         dimension_filters=[
             SupplementalDimensionFilterColumnOperatorModel(
                 dimension_type=DimensionType.GEOGRAPHY,
                 dimension_query_name="state",
                 column="id",
                 operator="isin",
                 value=["CO", "NM"],
             ),
         ]

      .. code-tab:: py pyspark

         df.filter(df["id"].isin(["CO", "NM"]))

6. Filter a table with times between two timestamps (inclusive on both sides).

   .. tabs::

      .. code-tab:: js JSON5

         dimension_filters: [
           {
             dimension_type: "time",
             dimension_query_name: "time_est",
             column: "time_est",
             lower_bound: "2012-07-01 00:00:00",
             upper_bound: "2012-08-01 00:00:00",
             filter_type: "between_column_operator",
             negate: false,
           },
         ],

      .. code-tab:: py

         dimension_filters=[
             DimensionFilterBetweenColumnOperatorModel(
                 dimension_type=DimensionType.TIME,
                 dimension_query_name="time_est",
                 column="time_est",
                 lower_bound="2012-07-01 00:00:00",
                 upper_bound="2012-08-01 00:00:00",
                 negate=False,
             ),
         ]

      .. code-tab:: py pyspark

         df.filter(df["timestamp"].between("2012-07-01 00:00:00", "2012-08-01 00:00:00"))
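As noted at the top, dsgrid applies an ``and`` across every entry in
``dimension_filters``. A minimal pyspark sketch of what combining two of the
filters above (the geography expression filter and the sector ``startswith``
filter) amounts to; the chaining itself is handled by dsgrid:

.. code-block:: py

   # Illustrative sketch: multiple filters in the array act as a logical
   # "and", equivalent to chained pyspark filter calls.
   df.filter("geography == '06037'") \
       .filter(df["sector"].startswith("com"))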