
Filtering a PySpark DataFrame using the SQL IN clause

python · dataframe · filtering · pyspark
by Nikita Barsukov · Jan 22, 2025
TLDR

You can easily filter rows in a PySpark DataFrame where a column matches any value in a given set using isin():

# 'cause every party needs guests...
vals = [1, 3, 5]

# Party where 'col' can be any of these guests
filtered_df = df.filter(df['col'].isin(vals))

This returns rows where 'col' is 1, 3, or 5. Just think of it as hosting a party with your favorite number buddies!
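Need SQL's NOT IN instead? Negate the condition with ~ — a minimal sketch reusing the df and vals from above:

# Uninvite the guests: keep rows where 'col' is NOT in vals
excluded_df = df.filter(~df['col'].isin(vals))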

Dealing with differing data types and case insensitivity

Remember, in the game of type matching, data must be of the same ilk - integers stick with integers, strings with strings. For case-insensitive matching, consider applying lower():

from pyspark.sql.functions import col, lower

# Apple a day keeps the doctor away, but don't forget bananas and cherries...
vals_str = ["Apple", "banana", "Cherry"]

# Lowercase party for all your fruity friends - let's put case-sensitivity in time-out!
filtered_df_str = df.filter(lower(col("fruit")).isin([x.lower() for x in vals_str]))

Leveraging broadcast for a performance boost

Dealing with a big set of values? You can stash the list in a Spark broadcast variable so it is shipped to the cluster once and can be reused across jobs:

from pyspark.sql.functions import col

# Party for 10000, anyone?
vals_large = list(range(10000))

# Shout out your party invite (aka broadcasting the large list across the cluster)
bc_vals = spark.sparkContext.broadcast(vals_large)
filtered_df_large = df.filter(col("id").isin(bc_vals.value))
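Keep in mind that isin() still bakes every literal into the query plan, so for genuinely huge value sets a broadcast join is often the better party trick. A minimal sketch, assuming df has an id column:

from pyspark.sql.functions import broadcast

# Put the values in a one-column DataFrame...
vals_df = spark.createDataFrame([(v,) for v in vals_large], ["id"])

# ...and semi-join: keeps only the rows of df whose id appears in vals_df
filtered_via_join = df.join(broadcast(vals_df), on="id", how="left_semi")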

Embracing dynamic and secure queries for smarter selections

isin() accepts any Python sequence, so the value list can be built at runtime — flexibility that also spares you from hand-assembling error-prone SQL strings:

# When the party list is top secret and comes from a secret source
dynamic_vals = get_dynamic_value_list_somehow()  # don't ask, can't tell

# Even secret parties need the right guests
filtered_dynamic_df = df.filter(col("category").isin(dynamic_vals))

Embedding variables directly in SQL strings should be done with caution, because injection attacks and quoting errors lurk for incautious souls.
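To make the danger concrete, here's a hedged sketch: a value with a stray quote subverts a handcrafted SQL string, while isin() treats it as plain data (the category value is made up for illustration):

# Risky: naive string interpolation - one stray quote and the filter is hijacked
sneaky = "Books' OR '1'='1"
query = f"SELECT * FROM df_table WHERE category IN ('{sneaky}')"  # don't do this

# Safer: keep values out of the SQL text entirely
safe_df = df.filter(col("category").isin([sneaky]))  # just a weird string, not SQL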

Enlighten yourself with explain

Curious how your filter actually executes under the hood? Deploy explain(), your GPS in the world of execution plans:

# Party planning itinerary
filtered_df.explain()

From the resulting physical plan, spot opportunities to optimize your filter. Consider it your checklist to party-planning goodness!
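For more detail, explain() also takes arguments (the mode parameter requires Spark 3.0+):

# All plans: parsed, analyzed, optimized, and physical
filtered_df.explain(extended=True)

# Spark 3.0+: a tidier, formatted physical plan
filtered_df.explain(mode="formatted")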

Filters and their relationship with data types

The syntax for filters depends on whether you're dealing with integer or string data types:

  • Integers: Directly pass the values to isin().
  • Strings: Quote them in SQL; for case-insensitive matching, wrap both sides with lower() (see the sketch after this list).
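A minimal side-by-side sketch (the id and fruit column names are assumptions):

from pyspark.sql.functions import col, lower

# Integers: pass the values straight in (isin also takes varargs)
int_filtered = df.filter(col("id").isin(1, 3, 5))

# Strings: pass Python strings; lower() both sides for case-insensitive matching
str_filtered = df.filter(lower(col("fruit")).isin("apple", "banana"))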

Transitioning between DSL and SQL smoothly

DataFrame DSL affords more adaptability than SQL strings, but SQL enthusiasts can use temporary views to stay loyal:

# The view to kill for
df.createOrReplaceTempView("df_table")

# The partying SQL style
query = """SELECT * FROM df_table WHERE col IN (1, 3, 5)"""
result = spark.sql(query)

Dynamic query creation needs careful handling: compared with the DSL, handcrafted SQL strings are easier to get wrong and harder to keep secure.
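If you do need to build the IN list dynamically, validate the values first. A minimal sketch, assuming the dynamic values are meant to be integers:

# Coerce to int: anything that isn't a clean integer raises,
# so nothing funny can sneak into the SQL text
safe_vals = ", ".join(str(int(v)) for v in dynamic_vals)
result = spark.sql(f"SELECT * FROM df_table WHERE col IN ({safe_vals})")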