
Detect and exclude outliers in a pandas DataFrame

python
pandas
outliers
data-cleaning
by Alex Kataev · Aug 16, 2024
TLDR

Here's a quick way to remove outliers from a DataFrame, leveraging the Interquartile Range (IQR).

# Calculate IQR thresholds
Q1, Q3 = df['Data'].quantile([0.25, 0.75])
IQR = Q3 - Q1
lower, upper = Q1 - 1.5 * IQR, Q3 + 1.5 * IQR  # these bounds are not feeling "out of bound"

# Filter out outliers
df_clean = df[(df['Data'] >= lower) & (df['Data'] <= upper)]  # only good data in this party!

The df_clean DataFrame is now free from outliers, filtered efficiently using IQR boundaries.

Advanced techniques for diverse data

Just as you wouldn't use a hammer to fix a computer, some data situations demand a more tailored approach to handle outliers. Here are a few techniques to handle those situations:

Z-score: Handling normality like a pro

For normally distributed data, the Z-score is your buddy. It measures how many standard deviations a data point is from the mean.

from scipy import stats
import numpy as np

# Calculate Z-scores (assumes every column is numeric)
z_scores = np.abs(stats.zscore(df))  # here, in the kingdom of normality

# Exclude outliers (just saying we don't like extremes here)
df_clean_z = df[(z_scores < 3).all(axis=1)]  # 3 is not a crowd, it's a threshold!

Roll 'em for Time-Series

Time-series data has a tinge of drama with serial correlation. A rolling window approach takes care of this sequel saga.

# Define a rolling window size
window_size = 5

# Rolling mean and std give each point a *local* Z-score
rolling_mean = df['Data'].rolling(window=window_size).mean()
rolling_std = df['Data'].rolling(window=window_size).std()
rolling_z = (df['Data'] - rolling_mean) / rolling_std

# Knock out points that stray too far from their neighborhood
df_clean_rolling = df[rolling_z.abs() < 3]

Handling the skewed ones with robust methods

Life isn't fair, neither is data. Some distributions are skewed. Enter Median and IQR.

# Calculate the median and IQR because mean is 'mean'
median = df['Data'].median()
Q1, Q3 = df['Data'].quantile([0.25, 0.75])
IQR = Q3 - Q1

# Define robust bounds around the median
lower, upper = median - 1.5 * IQR, median + 1.5 * IQR

# Filter using those robust bounds
df_robust = df[(df['Data'] >= lower) & (df['Data'] <= upper)]  # robustness for the win

The Art of dealing with outliers

Removing outliers brings harmony to your data, similar to decluttering a room:

Before: 🛹🧸📕🎱❗🐠👟🦖 After: 🛹🧸📕🎱🐠👟

Like exiling the T-Rex toy 🦖 and any excess punctuation ❗, we tidy the DataFrame with:

# Data cleaning: it's not just any task, it's a mission
df = df[(df['value'] >= lower_bound) & (df['value'] <= upper_bound)]

Data quality improves, enhancing the analysis while silencing the noise.

Dealing with caveats, because 'Data happens'

Conditional replacement and the 'Keep the Size' Challenge

When you can't afford to remove outliers but still need to handle them, replace outliers with central values or NaN.

import numpy as np

# Replace outliers with NaN
df_conditional = df.where((df >= lower) & (df <= upper), np.nan)

# Drop NaN if necessary
df_conditional_dropped = df_conditional.dropna()  # dropping any "NaNsense"

Scaling Outliers in Multiple Dimensions

In multivariate data, outliers tend to play hide and seek. Consider scaling and applying PCA to find these mischievous data points!
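A minimal sketch of that idea, using a hypothetical DataFrame with one planted outlier (PCA is done by hand with NumPy's SVD here; in practice you might reach for scikit-learn's `StandardScaler` and `PCA` instead):

```python
import numpy as np
import pandas as pd

# Hypothetical multivariate DataFrame with one planted outlier
rng = np.random.default_rng(0)
df = pd.DataFrame(rng.normal(size=(100, 3)), columns=['a', 'b', 'c'])
df.loc[0] = [10.0, -10.0, 10.0]  # the hide-and-seek champion

# Scale each feature to zero mean and unit variance
scaled = (df - df.mean()) / df.std()

# PCA via SVD: rows of Vt are the principal directions
U, S, Vt = np.linalg.svd(scaled, full_matrices=False)
scores = pd.DataFrame(scaled.values @ Vt.T, index=df.index)

# Flag rows whose component scores are extreme (> 3 std devs)
mask = (scores.abs() < 3 * scores.std()).all(axis=1)
df_multivariate_clean = df[mask]
```

The scaling step matters: without it, whichever feature has the largest raw variance dominates the principal components, and outliers in small-scale features stay hidden.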

Outliers in Categorical Data

Ever wonder how to deal with outliers in categorical data? Convert them using techniques such as one-hot encoding, label encoding, or even fancier methods like embeddings!
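One lightweight take on this, assuming a hypothetical `pets` Series: one-hot encode with `pd.get_dummies`, then treat categories below a frequency threshold (5% here, an arbitrary choice) as the "outliers":

```python
import pandas as pd

# Hypothetical categorical column with one rare label
pets = pd.Series(['cat'] * 50 + ['dog'] * 45 + ['axolotl'])

# One-hot encode, then flag low-frequency categories as outliers
encoded = pd.get_dummies(pets)
freq = encoded.mean()                    # share of rows per category
rare = freq[freq < 0.05].index.tolist()  # categories under 5% frequency

# Keep only rows whose category is not rare
pets_clean = pets[~pets.isin(rare)]
```

For high-cardinality columns, embeddings or target encoding may serve better, but frequency-based filtering is often a reasonable first pass.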