
Detect and exclude outliers in a pandas DataFrame

python
pandas
outliers
data-cleaning
by Alex Kataev · Aug 16, 2024
TLDR

Here's a quick way to remove outliers from a DataFrame, leveraging the Interquartile Range (IQR).

# Calculate IQR thresholds
Q1, Q3 = df['Data'].quantile([0.25, 0.75])
IQR = Q3 - Q1
lower, upper = Q1 - 1.5 * IQR, Q3 + 1.5 * IQR  # these bounds are not feeling "out of bound"

# Filter out outliers
df_clean = df[(df['Data'] >= lower) & (df['Data'] <= upper)]  # only good data in this party!

The df_clean DataFrame is now free from outliers, filtered efficiently using IQR boundaries.

Advanced techniques for diverse data

Just as you wouldn't use a hammer to fix a computer, some data situations demand a more tailored approach to handle outliers. Here are a few techniques to handle those situations:

Z-score: Handling normality like a pro

For normally distributed data, the Z-score is your buddy. It measures how many standard deviations a data point is from the mean.

from scipy import stats
import numpy as np

# Calculate Z-scores (assumes every column is numeric)
z_scores = np.abs(stats.zscore(df))  # here, in the kingdom of normality

# Exclude outliers (just saying we don't like extremes here)
df_clean_z = df[(z_scores < 3).all(axis=1)]  # 3 is not a crowd, it's a threshold!

Roll 'em for Time-Series

Time-series data has a tinge of drama with serial correlation. A rolling window approach takes care of this sequel saga.

# Define a rolling window size
window_size = 5

# Rolling mean and std give each point a *local* Z-score
rolling_mean = df['Data'].rolling(window=window_size).mean()
rolling_std = df['Data'].rolling(window=window_size).std()
rolling_z = (df['Data'] - rolling_mean) / rolling_std

# Knock out points that stray too far from their neighborhood
df_clean_rolling = df[rolling_z.abs() < 3]

Handling the skewed ones with robust methods

Life isn't fair, neither is data. Some distributions are skewed. Enter Median and IQR.

# Calculate the median and IQR because mean is 'mean'
median = df['Data'].median()
Q1, Q3 = df['Data'].quantile([0.25, 0.75])
IQR = Q3 - Q1

# Define robust bounds around the median
lower, upper = median - 1.5 * IQR, median + 1.5 * IQR

# Filter using those robust bounds
df_robust = df[(df['Data'] >= lower) & (df['Data'] <= upper)]  # robustness for the win

The Art of dealing with outliers

Removing outliers brings harmony to your data, similar to decluttering a room:

Before: 🛹🧸📕🎱❗🐠👟🦖 After: 🛹🧸📕🎱🐠👟

Like exiling the T-Rex toy 🦖 and any excess punctuation ❗, we tidy the DataFrame with:

# Data cleaning: it's not just any task, it's a mission
df = df[(df['value'] >= lower_bound) & (df['value'] <= upper_bound)]

Data quality improves, enhancing the analysis while silencing the noise.

Dealing with caveats, because 'Data happens'

Conditional replacement and the 'Keep the Size' Challenge

When you can't afford to remove outliers but still need to handle them, replace outliers with central values or NaN.

import numpy as np

# Replace outliers with NaN
df_conditional = df.where((df >= lower) & (df <= upper), np.nan)

# Drop NaN if necessary
df_conditional_dropped = df_conditional.dropna()  # dropping any "NaNsense"

Scaling Outliers in Multiple Dimensions

In multivariate data, outliers tend to play hide and seek. Consider scaling and applying PCA to find these mischievous data points!
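A minimal sketch of that idea, using a hypothetical DataFrame with one planted outlier (PCA is done by hand with NumPy's SVD here; in practice you might reach for scikit-learn's `StandardScaler` and `PCA` instead):

```python
import numpy as np
import pandas as pd

# Hypothetical multivariate DataFrame with one planted outlier
rng = np.random.default_rng(0)
df = pd.DataFrame(rng.normal(size=(100, 3)), columns=['a', 'b', 'c'])
df.loc[0] = [10.0, -10.0, 10.0]  # the hide-and-seek champion

# Scale each feature to zero mean and unit variance
scaled = (df - df.mean()) / df.std()

# PCA via SVD: rows of Vt are the principal directions
U, S, Vt = np.linalg.svd(scaled, full_matrices=False)
scores = pd.DataFrame(scaled.values @ Vt.T, index=df.index)

# Flag rows whose component scores are extreme (> 3 std devs)
mask = (scores.abs() < 3 * scores.std()).all(axis=1)
df_multivariate_clean = df[mask]
```

The scaling step matters: without it, whichever feature has the largest raw variance dominates the principal components, and outliers in small-scale features stay hidden.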

Outliers in Categorical Data

Ever wonder how to deal with outliers in categorical data? Convert them using techniques such as one-hot encoding, label encoding, or even fancier methods like embeddings!
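One lightweight take on this, assuming a hypothetical `pets` Series: one-hot encode with `pd.get_dummies`, then treat categories below a frequency threshold (5% here, an arbitrary choice) as the "outliers":

```python
import pandas as pd

# Hypothetical categorical column with one rare label
pets = pd.Series(['cat'] * 50 + ['dog'] * 45 + ['axolotl'])

# One-hot encode, then flag low-frequency categories as outliers
encoded = pd.get_dummies(pets)
freq = encoded.mean()                    # share of rows per category
rare = freq[freq < 0.05].index.tolist()  # categories under 5% frequency

# Keep only rows whose category is not rare
pets_clean = pets[~pets.isin(rare)]
```

For high-cardinality columns, embeddings or target encoding may serve better, but frequency-based filtering is often a reasonable first pass.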