Filter dataframe rows if value in column matches a set list of values

by Nikita Barsukov · Oct 28, 2024

# Assume 'df' is your DataFrame and 'col' is the column under interrogation
matches = {1, 3, 5}
filtered_df = df.query('col in @matches')  # 'col'! Are you in 'matches'? Rows that answer yes stay; the rest are dropped.

Key points:

@matches pulls the Python variable matches into the query() expression. The @ prefix is what tells pandas to look for a variable rather than a column.
Simplicity: the condition reads like plain English and runs the same membership test as .isin(), minus the bracket drama.
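To see the query() approach end to end, here is a minimal sketch. The sample DataFrame and its 'name' column are made up for illustration, and the values live in a list rather than a set purely to keep the demo simple.

```python
import pandas as pd

# Hypothetical sample data; 'col' mirrors the column name used above.
df = pd.DataFrame({"col": [1, 2, 3, 4, 5], "name": ["a", "b", "c", "d", "e"]})

matches = [1, 3, 5]

# '@matches' tells query() to look up the Python variable named 'matches'.
filtered_df = df.query("col in @matches")

# The same selection written as a boolean mask, for comparison.
same_rows = df[df["col"].isin(matches)]

print(filtered_df)
```

Both spellings select the same rows, so the choice is mostly about which one reads better in your code.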

When to use `isin` — Oh no, not the set!

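Use isin when you exhibit symptoms of chronic scrolling syndrome caused by nested loops or long chains of == comparisons. It checks every value in the column against a list of allowed values and returns a boolean Series that selects the matching rows directly.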

values_list = [1, 3, 5]
filtered_df = df[df['column'].isin(values_list)]  # 'column' joined "IsIn" Anonymous. No more loops. Just lists.

Just remember, isin is your dad's Extremely-Specific Set of Skills that will find and will filter your DataFrame rows.
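A small runnable sketch of isin, plus its mirror image using ~ to keep the rows that are not in the list. The sample data is invented for the demo.

```python
import pandas as pd

# Hypothetical sample data; note the repeated 3.
df = pd.DataFrame({"column": [1, 2, 3, 4, 5, 3]})
values_list = [1, 3, 5]

# Rows whose value appears in values_list.
in_list = df[df["column"].isin(values_list)]

# '~' flips the boolean mask, keeping rows whose value is NOT in the list.
not_in_list = df[~df["column"].isin(values_list)]

print(in_list)
print(not_in_list)
```

Every matching row is kept, duplicates included, and the original index labels survive the filter.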

Sniffing out patterns with regular expressions

Is it just me, or is there a strong whiff of partial matches in the air?

Have a crack at str.contains:

df[df['flower'].str.contains('sunflower|maple', case=False, regex=True)]  # Detectives at work: Looking for 'sunflower' or 'maple'. Case doesn't matter. They're not picky.

Words of Wisdom: Use case=False for case-insensitive hunts and | to match any one of several patterns.
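When the search terms come from a list, the pattern can be assembled instead of typed by hand. This is a sketch under a few assumptions: the sample data and the terms list are invented, re.escape is added so that characters like parentheses are matched literally, and na=False is passed so missing values count as non-matches instead of breaking the boolean index.

```python
import re

import pandas as pd

# Hypothetical sample data, including a missing value.
df = pd.DataFrame(
    {"flower": ["Sunflower", "maple leaf", "rose", None, "tulip (red)"]}
)

# Hypothetical search terms; '(red)' contains regex metacharacters.
terms = ["sunflower", "maple", "(red)"]

# Escape each term so it is matched literally, then join with '|' (regex OR).
pattern = "|".join(re.escape(term) for term in terms)

# na=False turns missing values into False rather than NaN.
mask = df["flower"].str.contains(pattern, case=False, regex=True, na=False)
filtered_df = df[mask]

print(filtered_df)
```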

Logically filtering with operators

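Feeling strained by control issues? Divide and conquer with the element-wise operators '&', '|' and '~' for 'and', 'or' and 'not'. Wrap each condition in parentheses, because these operators bind more tightly than comparisons such as '>' and '<'.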

filtered_df = df[(df['value'] > 10) & (df['value'] < 20)]  # When the going gets tough, the tough get 'and' going!

Filtering efficiently in a numeric range? Yes, please!
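Here is a short sketch of all three operators on made-up data. It also shows Series.between as an alternative for range checks; the inclusive="neither" spelling assumes pandas 1.3 or newer.

```python
import pandas as pd

# Hypothetical sample data.
df = pd.DataFrame({"value": [5, 10, 15, 20, 25]})

# AND: strictly between 10 and 20. Each condition sits in its own parentheses.
strict_range = df[(df["value"] > 10) & (df["value"] < 20)]

# OR: at or below 10, or at or above 20.
outside = df[(df["value"] <= 10) | (df["value"] >= 20)]

# NOT: negating the range mask gives the same rows as 'outside'.
negated = df[~((df["value"] > 10) & (df["value"] < 20))]

# between() expresses the range check in a single call.
between = df[df["value"].between(10, 20, inclusive="neither")]

print(strict_range)
```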

Performant filtering for data the size of a mammoth

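With datasets the size of a mammoth, let the vectorized .isin do the heavy lifting and pair it with .loc:

filtered_df = df.loc[df['column'].isin(values_list)]  # Survived "University of Large Analysis" with a `.loc` degree!

Pro tip: .loc is not faster than plain df[mask], since both apply the same boolean mask. What .loc adds is explicitness: you can pick columns in the same step and assign to the selection without chained-indexing trouble. The speed comes from isin being vectorized.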
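A minimal sketch of what .loc adds on top of a plain boolean filter: selecting columns in the same step and assigning to the matched rows. The column names 'label' and 'extra' are invented for the demo.

```python
import pandas as pd

# Hypothetical sample data.
df = pd.DataFrame(
    {
        "column": [1, 2, 3, 4],
        "label": ["a", "b", "c", "d"],
        "extra": [0.1, 0.2, 0.3, 0.4],
    }
)
values_list = [1, 3, 5]

mask = df["column"].isin(values_list)

# Rows and columns chosen in one .loc call; the result is a new DataFrame.
subset = df.loc[mask, ["column", "label"]]

# Assigning through .loc updates the original DataFrame in place.
df.loc[mask, "label"] = "matched"

print(subset)
print(df)
```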

Create masks — Put on your face pack, dear DataFrame!

Masks are deep-cleansing boolean facials that leave your DataFrame squeaky clean.

mask = df['column'].isin(values_list)  # Lather on the mask.
filtered_df = df[mask]  # Rinse and repeat.

Skincare Guru Tip: Storing masks in named variables lets you combine and reuse them for complex filtering, leaving DataFrames radiant and code squeaky clean!
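As a sketch of that tip, two named masks can be built separately and then combined. The 'value' column and the 10 to 20 range are assumptions made up for this example.

```python
import pandas as pd

# Hypothetical sample data.
df = pd.DataFrame({"column": [1, 2, 3, 4, 5], "value": [8, 12, 15, 30, 18]})
values_list = [1, 3, 5]

# Each mask gets a name that says what it checks.
in_list = df["column"].isin(values_list)
in_range = df["value"].between(10, 20)  # both ends inclusive by default

# Combine the masks with '&' and apply them once.
filtered_df = df[in_list & in_range]

print(filtered_df)
```

Named masks also make it easy to inspect each condition on its own, for example with in_list.sum() to count the matches.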

Merging (AKA Dealing with Big Data)

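Got dynasty-sized datasets, with the allowed values already sitting in their own DataFrame? An inner merge keeps only the rows of large_df whose key_column also appears in values_df:

filtered_df = pd.merge(large_df, values_df, on='key_column')  # Default how='inner': big data, meet 'merge'. Only matching keys make the cut.

Mind the duplicates: if a key appears more than once in values_df, the matching rows of large_df are repeated. Whether merge beats isin on speed depends on your data, so time both.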
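A runnable sketch of merge used as a filter, on invented data. It assumes the lookup table may contain repeated keys, so it removes them first, and it compares the result against isin.

```python
import pandas as pd

# Hypothetical sample data; values_df repeats the key 3 on purpose.
large_df = pd.DataFrame(
    {"key_column": [1, 2, 3, 4, 5], "payload": ["a", "b", "c", "d", "e"]}
)
values_df = pd.DataFrame({"key_column": [1, 3, 3, 5]})

# Without deduplication the repeated key yields a repeated row.
with_duplicates = pd.merge(large_df, values_df, on="key_column")

# Deduplicate the lookup keys so each row of large_df appears at most once.
unique_values = values_df.drop_duplicates(subset="key_column")
filtered_df = pd.merge(large_df, unique_values, on="key_column", how="inner")

# The same filter expressed with isin, for comparison.
via_isin = large_df[large_df["key_column"].isin(values_df["key_column"])]

print(filtered_df)
```

Note that merge returns a fresh 0-based index, while isin keeps the original index labels of large_df.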
