Explain Codes LogoExplain Codes Logo

How to filter Pandas dataframe using 'in' and 'not in' like in SQL

python
dataframe
list-comprehensions
performance
Anton ShumikhinbyAnton Shumikhin·Nov 1, 2024
TLDR

If you've got no time to read because the coffee is getting cold and your dataset is hotter than a machine learning startup in 2022, here's the SQL mirror in the Pandas world 🐼:

# 'IN' equivalent in Pandas: this is like finding a needle in the haystack filtered_in = df[df['column'].isin(['desired', 'values'])] # 'NOT IN' equivalent in Pandas: removing those pesky needles from the haystack filtered_not_in = df[~df['column'].isin(['excluded', 'values'])]

Short and to the point, this lets you swim through rows like a dolphin through the ocean.

Boosting performance with numpy and sets

If your filtering needs are as numerous as the grains of sand on a beach, you may need to up the ante on speed. In such cases, numpy.isin and set conversion might be the flux capacitor to your performance DeLorean:

import numpy as np # 'IN' filtering with numpy: like cutting through butter with a hot knife filtered_in_np = df[np.isin(df['column'], {1, 2, 3})] # 'NOT IN' filtering with numpy: like removing unwanted cookies from the cookie jar filtered_not_in_np = df[~np.isin(df['column'], {4, 5, 6})]

Remember, though, it's not about the size but how you use it 😉. Test these performance boosts according to your specific data adventure.

Making your life easier with DataFrame.query

Who doesn't love SQL-like querying 🍕? Let .query() method take you down the easy—and efficient—road:

values_to_select = ['desired', 'values'] values_to_exclude = ['excluded', 'values'] # 'IN' logic using query: like finding your favorite on the menu filtered_in_query = df.query('column in @values_to_select') # 'NOT IN' logic using query: like telling the waiter what you want to avoid filtered_not_in_query = df.query('column not in @values_to_exclude')

When life gets complex—as it tends to do—the .query() method has your back. With numexpr for performance, it's the beach chair of data frame filtering, especially for large data.

Unleashing the power of comprehensions

For those moments when your filtering looks more like rocket science, list comprehensions bring you flexibility:

# Example for string partial match 'IN': like finding all happy face emojis in a chat log filtered_in_comprehension = df[[x in ['part1', 'part2'] for x in df['column']]] # Custom 'NOT IN' condition: like avoiding all the grumpy face emojis filtered_not_in_comprehension = df[[not x.startswith('exclude') for x in df['column']]]

Wielding the power of Python's list comprehensions, you can tailor-make your filtering to your analytical desires.

Use caution signs ⚠️: pitfalls and edge cases

Dealing with potential trip-up steps is part of every data journey. Here are some slip-ups to avoid:

  • Always handle missing values (NaN) with care: they can make isin trips and falls.
  • If you .query() with user inputs, protect your code from SQL injection attacks.
  • Heed the SettingWithCopyWarning: prevent this error with proper boolean indexing.

Visualization: 'IN' and 'NOT IN' in pictures


Let's visualize 'IN' and 'NOT IN' like a traffic light controlling the flow of data:

Data Stream (🚗): Contains ['Car1', 'Car2', 'Car3', 'Car5'] Include Filter (🟢): Include these ['Car2', 'Car4'] Exclude Filter (🔴): Exclude these ['Car1', 'Car3']

'IN': Shows the specified cars allowed to proceed in the Green Light:

🚗➡️🟢: ['Car2']

'NOT IN': Displays the cars remaining after the Red Light has held back specified vehicles:

🚗➖🔴: ['Car2', 'Car5']

A picture is worth a thousand lines of code, isn't it?

Let's dive deeper: advanced tips

Merge vs. isin

When faced with filtering based on another DataFrame's values, you may be pondering between .merge() and .isin(). It's like choosing between pizza and pasta—they are both good but serve a different purpose!

  • Choose .merge() if you like 2 for 1 deals: get filtering plus data from another DataFrame.
  • Go with .isin() if you want a trip to the grocery store: you just want to pick what you need.

Creative condition inversion

To be or not to be? Inverse conditions are not just about ~. There're other ways:

to_exclude_set = set([4, 5, 6]) # 'NOT IN' using sets: It's like a party exclusive for everyone but the gatecrashers filtered_not_in_sets = df[df['column'].apply(lambda x: x not in to_exclude_set)]

This method offers more clarity and readability especially when your data is as complex as a Rubik's cube.

Custom filters row-wise

Sometimes, you need to apply a conditional filter that includes multiple columns—or a full row. In such cases, apply() with axis=1 is your trusted ally:

# Custom row-wise filter: It's like putting together a puzzle where each piece matters filtered_custom_rows = df[df.apply(lambda row: row['col1'] in ['A', 'B'] and row['col2'] > 5, axis=1)]

This enables powerful complex, row-level condition evaluation. It's your personal assistant in multi-tasking.