
Search for "does-not-contain" on a DataFrame in pandas

python
dataframe
functions
best-practices
by Anton Shumikhin · Jan 20, 2025
TLDR

Need to filter a DataFrame so you keep only the rows where a chosen column ('column') does not contain 'exclude_this'?

result = df[~df['column'].str.contains('exclude_this', na=False)]

Use ~ to express "does-not-contain", keeping everything except rows that match 'exclude_this'. The na=False parameter marks NaN entries as "does not contain", so after negation those rows are kept instead of tripping up the boolean mask (which is what would ordinarily happen).
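Here's a minimal, self-contained sketch of that one-liner in action (the column name 'column' and the toy values are just placeholders):

import pandas as pd

df = pd.DataFrame({'column': ['keep me', 'please exclude_this row', None]})

# Rows containing 'exclude_this' are dropped; the NaN row survives thanks to na=False
result = df[~df['column'].str.contains('exclude_this', na=False)]
print(result)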

Deep dive into filtering strategies

Combo breaker: Tackling multiple patterns

When you've got multiple words to exclude, it's time for a regular expression tag-team, using | (OR):

patterns = 'exclude_this|also_this|and_this_too'
result = df[~df['column'].str.contains(patterns, na=False)]  # Bingo! All unwanted patterns KO'd.

Make sure your patterns string lists every term you want gone, separated by | symbols; str.contains reads it as one regular expression, so each alternative adds to the powerhouse of pattern removal.
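One hedge worth knowing: because the pattern is treated as a regex, terms containing special characters (dots, parentheses, brackets) need escaping. A sketch that builds the pattern safely with re.escape, using made-up terms:

import re

words_to_exclude = ['exclude_this', 'price (USD)', 'v1.2']  # hypothetical terms
patterns = '|'.join(re.escape(word) for word in words_to_exclude)  # escapes regex metacharacters
result = df[~df['column'].str.contains(patterns, na=False)]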

Don't yell: Case insensitivity

Should all casings ('EXCLUDE_THIS', 'Exclude_This', and 'exclude_this') be treated as equals, add the case=False parameter for a case-insensitive search:

result = df[~df['column'].str.contains('exclude_this', case=False, na=False)] # Let "mood" be dictated by content, not casing.

Show-off time: Lambda for complex filtering

When you feel the need to introduce advanced logic—perhaps excluding multiple words—lambda functions have got your back:

result = df[df['column'].apply(lambda x: all(word not in x for word in ['word1', 'word2']))] # Special forces of filtering, upgraded.
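One caveat: word not in x assumes every value is a string, so a stray NaN will throw a TypeError. A possible hedge (assuming pandas is imported as pd and you want to keep missing values) is to check for them explicitly:

def keep_row(value):
    if pd.isna(value):  # keep missing values instead of crashing on them
        return True
    return all(word not in value for word in ['word1', 'word2'])

result = df[df['column'].apply(keep_row)]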

Potential fix: How to deal with NULLs and TypeErrors

Before filtering with negation, make sure your DataFrame is free of TypeError-inducing null values and inconsistent data types:

df['column'] = df['column'].fillna('default_value')  # No NaN, just neat data.
df['column'] = df['column'].astype(str)  # Keep everything a string for consistency. It's the way to go!
result = df[~df['column'].str.contains('exclude_this')]  # Ah, smooth sailing.

Boundless conditionality: loc method

Got complicated conditions to meet? Why not use loc, the Swiss Army knife of data filtering:

result = df.loc[~df['column'].str.contains('exclude_this', na=False) & (df['another_column'] > 50)] # 'column' must not contain 'exclude_this' + 'another_column' > 50 = Double trouble!
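As a quick sanity check, here's a toy run of that combined filter (the column names and the threshold of 50 are illustrative):

import pandas as pd

df = pd.DataFrame({
    'column': ['good data', 'exclude_this please', None, 'also fine'],
    'another_column': [80, 90, 60, 30],
})

result = df.loc[~df['column'].str.contains('exclude_this', na=False)
                & (df['another_column'] > 50)]
print(result)  # keeps 'good data' (80) and the NaN row (60); drops the match and the low score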