
Filter dataframe rows if value in column matches a set list of values

python
dataframe
filtering
performance
by Nikita Barsukov · Oct 28, 2024
TLDR
# Pretend 'df' is your DataFrame and 'col' the questionable identity
matches = {1, 3, 5}
filtered_df = df.query('col in @matches')
# Muscle: 'col'! Are you in '@matches'? Answer: Yes, No, Maybe.

Key points:

  • The @ prefix lets query() read the Python variable matches from the surrounding scope. But don't tell anyone, it's a secret.
  • Simplicity: the in check inside query() does the same job as .isin(), minus the bracket drama; keep it short, keep it clean!
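To make the TLDR concrete, here is a minimal sketch on made-up data. The column names 'col' and 'label' and the sample values are assumptions for illustration only, and a list is used for matches to keep things plain.

```python
import pandas as pd

# Hypothetical sample data; 'col' and 'label' are illustrative names.
df = pd.DataFrame({"col": [1, 2, 3, 4, 5], "label": list("abcde")})

matches = [1, 3, 5]

# '@matches' tells query() to look up the Python variable 'matches'.
via_query = df.query("col in @matches")

# The same filter spelled out with isin(), for comparison.
via_isin = df[df["col"].isin(matches)]

print(via_query)
```

Both spellings should select the same rows, so picking one is mostly a matter of taste.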

When to use isin — Oh no, not the set!

Use isin when you exhibit symptoms of chronic scrolling syndrome caused by nested loops. It checks every value in a column against a list of allowed values in one vectorized call and hands back a boolean mask for direct row selection.

values_list = [1, 3, 5]
filtered_df = df[df['column'].isin(values_list)]
# 'column' joined "IsIn" Anonymous. No more loops. Just lists.

Just remember, isin is your dad's Extremely-Specific Set of Skills that will find and will filter your DataFrame rows.
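A small sketch of isin in both directions, on assumed sample data: the plain mask keeps rows whose value is in the list, and flipping it with ~ keeps everything else.

```python
import pandas as pd

# Hypothetical sample data for illustration.
df = pd.DataFrame({"column": [1, 2, 3, 4, 5]})
values_list = [1, 3, 5]

# Rows whose value appears in the list.
in_list = df[df["column"].isin(values_list)]

# '~' inverts the boolean mask: rows whose value is NOT in the list.
not_in_list = df[~df["column"].isin(values_list)]

print(in_list)
print(not_in_list)
```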

Sniffing out patterns with regular expressions

Is it just me, or is there a strong whiff of partial matches in the air?

Have a crack at str.contains:

df[df['flower'].str.contains('sunflower|maple', case=False, regex=True)]
# Detectives at work: Looking for 'sunflower' or 'maple'. Case doesn't matter. They're not picky.

Words of Wisdom: Use case=False for case-insensitive hunts and | for hunting parties.
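When the search terms arrive as a list, one way to build the pattern is to escape each term and join them with |. The sketch below assumes made-up flower names (including one with regex-special characters and one missing value) and uses na=False so missing entries count as non-matches.

```python
import re
import pandas as pd

# Hypothetical sample data; None stands in for a missing value.
df = pd.DataFrame(
    {"flower": ["Sunflower", "red maple", "rose", None, "c++ lily"]}
)

terms = ["sunflower", "maple", "c++"]

# re.escape() keeps characters like '+' from being read as regex syntax.
pattern = "|".join(re.escape(term) for term in terms)

# na=False treats missing values as "no match" instead of propagating NaN.
mask = df["flower"].str.contains(pattern, case=False, regex=True, na=False)
partial_matches = df[mask]

print(partial_matches)
```

Without the escaping step, a term such as 'c++' would be parsed as a regular expression rather than as literal text.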

Logically filtering with operators

Feeling strained by control issues? Divide and conquer with logical operators '&', '|' and '~' for 'and', 'or', 'not'! Wrap each comparison in parentheses, because these operators bind more tightly than '>' and '<'.

filtered_df = df[(df['value'] > 10) & (df['value'] < 20)]
# When the going gets tough, the tough get 'and' going!

Filtering efficiently in a numeric range? Yes, please!
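For numeric ranges, Series.between can stand in for the pair of comparisons. This is a sketch on assumed data; the inclusive="neither" argument makes both bounds exclusive and needs a reasonably recent pandas (1.3 or later, as far as I know).

```python
import pandas as pd

# Hypothetical sample data for illustration.
df = pd.DataFrame({"value": [5, 10, 15, 20, 25]})

# Two comparisons joined with '&'; note the parentheses.
strictly_inside = df[(df["value"] > 10) & (df["value"] < 20)]

# The same open interval expressed with between().
same_with_between = df[df["value"].between(10, 20, inclusive="neither")]

# '~' negates the mask: everything outside the open interval.
outside = df[~df["value"].between(10, 20, inclusive="neither")]

print(strictly_inside)
print(outside)
```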

Performant filtering for data the size of a mammoth

When the DataFrame is huge and the list of values is long, pair .isin with .loc! A boolean mask passed to .loc selects the same rows as plain brackets, and .loc can pick columns in the same step.

filtered_df = df.loc[df['column'].isin(values_list)]
# Survived "University of Large Analysis" with a `.loc` degree!

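Pro tip: With a boolean mask, .loc and plain brackets do about the same work, so don't expect a free speed boost. Reach for .loc when you also want to choose columns or assign to the filtered rows.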
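Here is a minimal sketch of what .loc adds on top of the mask: selecting columns in the same step and assigning to the matching rows. The column names 'name' and 'score' and all values are assumptions for illustration.

```python
import pandas as pd

# Hypothetical sample data for illustration.
df = pd.DataFrame(
    {
        "column": [1, 2, 3, 4],
        "name": ["a", "b", "c", "d"],
        "score": [10, 20, 30, 40],
    }
)
values_list = [1, 3]
mask = df["column"].isin(values_list)

# Filter rows and pick columns in a single .loc call.
subset = df.loc[mask, ["name", "score"]]

# Assign to the matching rows of the original DataFrame in place.
df.loc[mask, "score"] = 0

print(subset)
print(df)
```

Assigning through df.loc[mask, "score"] writes to df itself, which sidesteps the chained-indexing pitfalls of df[mask]["score"] = 0.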

Create masks — Put on your face pack, dear DataFrame!

Masks are deep-cleansing boolean facials that leave your DataFrame squeaky clean.

mask = df['column'].isin(values_list)  # Lather on the mask.
filtered_df = df[mask]  # Rinse and repeat.

Skincare Guru Tip: Storing masks in named variables for complex filtering leaves DataFrames radiant and code squeaky clean!
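A sketch of the named-mask idea on assumed data: each condition gets its own variable, and the final filter reads almost like a sentence.

```python
import pandas as pd

# Hypothetical sample data for illustration.
df = pd.DataFrame(
    {"column": [1, 2, 3, 4, 5], "value": [12, 18, 15, 14, 9]}
)
values_list = [1, 3, 5]

# One named mask per condition.
in_list = df["column"].isin(values_list)
in_range = df["value"].between(10, 20)  # inclusive on both ends by default

# Combine the masks with '&' and apply them once.
filtered_df = df[in_list & in_range]

print(filtered_df)
```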

Merging (AKA Dealing with Big Data)

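Got dynasty-sized datasets, with the values to match living in a DataFrame of their own? Merge like it's a reality TV show. The default inner merge keeps only the rows whose key_column appears in both frames: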

filtered_df = pd.merge(large_df, values_df, on='key_column')
# Big data meet 'merge'. 'merge', meet "Got 99 Problems but Speed Ain't One".
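One caveat worth sketching: if values_df repeats a key, an inner merge repeats the matching rows too. The example below uses made-up frames and drops duplicate keys first, then compares the result with an isin filter. Which approach is faster depends on the data, so treat this as an illustration rather than a benchmark.

```python
import pandas as pd

# Hypothetical sample data; note the duplicate key 3 in values_df.
large_df = pd.DataFrame(
    {"key_column": [1, 2, 3, 3, 4], "payload": list("abcde")}
)
values_df = pd.DataFrame({"key_column": [3, 1, 3]})

# Merging as-is multiplies rows for the duplicated key.
naive_merge = pd.merge(large_df, values_df, on="key_column")

# Dropping duplicate keys first makes the merge act as a pure filter.
unique_values = values_df.drop_duplicates(subset="key_column")
via_merge = pd.merge(large_df, unique_values, on="key_column", how="inner")

# The isin() equivalent ignores duplicates on its own.
via_isin = large_df[large_df["key_column"].isin(values_df["key_column"])]

print(len(naive_merge), len(via_merge), len(via_isin))
```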