
Pandas DataFrame: replace NaN values with average of columns

python
pandas
dataframe
fillna
by Nikita Barsukov · Dec 31, 2024
TLDR

Quickly fill NaN values in a pandas DataFrame with column averages using:

df.fillna(df.mean(), inplace=True)

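This one-liner computes each column's mean and fills that column's missing values with it, modifying the DataFrame in place. If the DataFrame also holds non-numeric columns, pass numeric_only=True to df.mean(), because pandas 2.0 and later raise an error when asked to average columns such as strings. (It's just that easy!)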
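As a quick illustration, here is a minimal sketch on a made-up two-column DataFrame. The column names and values are assumptions chosen for the example, and it returns a new DataFrame rather than using inplace=True so the original stays available for comparison.

```python
import numpy as np
import pandas as pd

# Hypothetical sample data: two numeric columns, each with one missing value.
df = pd.DataFrame({"a": [1.0, np.nan, 3.0], "b": [4.0, 6.0, np.nan]})

# numeric_only=True skips non-numeric columns when computing the means.
# It is not strictly needed for this all-numeric frame, but it is safer on mixed data.
col_means = df.mean(numeric_only=True)

# fillna with a Series matches the Series index against the column labels,
# so each column is filled with its own mean.
filled = df.fillna(col_means)

print(filled)
```

Column a has mean 2.0 and column b has mean 5.0, so those are the values that land in the gaps.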

Filling gaps with column averages

Why turn column averages into a dictionary?

The fillna method is versatile: its value argument can be a scalar, a dict, or a Series. To fill columns selectively, consider turning your column averages into a dictionary:

mean_values = df.mean().to_dict()  # Conjuring a dictionary of column averages
df.fillna(mean_values, inplace=True)

Because the dictionary is keyed by column name, you can drop or override entries before passing it to fillna. Columns without an entry are left untouched, which gives you granular control over column treatment.
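To make the selective part concrete, the sketch below imputes only one of two columns. The column names price and quantity and the variable selected are hypothetical, picked purely for illustration.

```python
import numpy as np
import pandas as pd

# Hypothetical data: both columns contain a gap.
df = pd.DataFrame({
    "price": [10.0, np.nan, 30.0],
    "quantity": [1.0, 2.0, np.nan],
})

# Dictionary of column averages, keyed by column name.
mean_values = df.mean(numeric_only=True).to_dict()

# Keep only the entries for the columns we actually want to impute.
selected = {col: mean_values[col] for col in ["price"]}

# Columns missing from the dictionary (here: quantity) are left as they are.
result = df.fillna(selected)

print(result)
```

Here price is filled with its mean of 20.0, while the NaN in quantity survives because the dictionary has no entry for it.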

Saving computational resources in large DataFrames

When wrestling with extensive DataFrames, restricting the work to columns that actually contain NaN values may improve performance:

nan_columns = df.columns[df.isna().any()].tolist()  # Selecting the columns having NaN values
df[nan_columns] = df[nan_columns].fillna(df[nan_columns].mean())

The code identifies the columns with NaNs and computes means and fills for those columns only. How much this saves depends on how many columns have gaps, so it is worth timing on your own data.
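A small sketch of this column-selection pattern follows, assuming a made-up DataFrame with one complete column and one column with a gap. It shows which columns are picked up and that the complete column is not touched; it makes no claim about timings.

```python
import numpy as np
import pandas as pd

# Hypothetical data: only the "gappy" column has a missing value.
df = pd.DataFrame({
    "complete": [1.0, 2.0, 3.0],
    "gappy": [np.nan, 4.0, 8.0],
})

# Boolean mask over columns, then the matching column labels as a list.
nan_columns = df.columns[df.isna().any()].tolist()

# Compute means and fill only for the selected columns.
df[nan_columns] = df[nan_columns].fillna(df[nan_columns].mean())

print(nan_columns)
print(df)
```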

Working with lambda and apply()

For column-wise precision and control, a lambda function combined with apply() is a good choice:

df = df.apply(lambda x: x.fillna(x.mean()) if x.isna().any() else x)

This ensures each column is evaluated individually, applying fillna() only when necessary. Coding ninjas will love this functionality!
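One caveat with the column-by-column approach is that x.mean() fails on columns that cannot be averaged, such as strings. The sketch below is one possible way to guard against that, using is_numeric_dtype from pandas.api.types. The helper name fill_with_mean and the sample columns are hypothetical.

```python
import numpy as np
import pandas as pd
from pandas.api.types import is_numeric_dtype

# Hypothetical mixed data: a numeric column and a text column, each with a gap.
df = pd.DataFrame({
    "score": [1.0, np.nan, 5.0],
    "label": ["x", None, "z"],
})

def fill_with_mean(col):
    """Hypothetical helper: fill a numeric column's NaNs with its mean."""
    if is_numeric_dtype(col) and col.isna().any():
        return col.fillna(col.mean())
    # Non-numeric or already complete columns are returned unchanged.
    return col

# apply() calls the helper once per column by default (axis=0).
filled = df.apply(fill_with_mean)

print(filled)
```

The score column is filled with 3.0, while the missing entry in label is left alone for you to handle separately.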

Use cases and warnings

When average fits well

For columns with zero variance (all non-NaN values are the same), filling with the column mean makes sense because the filled value matches the existing data. Beyond that case, the context of the dataset ultimately determines whether such an imputation is appropriate.

A word of caution

While filling missing data with the column average is convenient and common, it can distort the results of your subsequent analysis. The mean itself is preserved, but the spread of the column shrinks because every filled value sits exactly at the center. So, always tread into the world of statistical inference with caution.
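The effect on spread is easy to see on a tiny made-up Series. The values below are an assumption chosen so the arithmetic is simple; the sketch compares the sample standard deviation before and after mean imputation.

```python
import numpy as np
import pandas as pd

# Hypothetical data: five observations, two of them missing.
s = pd.Series([2.0, 4.0, np.nan, np.nan, 12.0])

# Mean imputation: both gaps become 6.0, the mean of the observed values.
filled = s.fillna(s.mean())

# pandas skips NaN when computing std, so std_before uses the 3 observed values.
std_before = s.std()
std_after = filled.std()

print(std_before, std_after)  # roughly 5.29 before, 3.74 after
```

The mean stays at 6.0, but the standard deviation drops, which can make the data look more certain than it really is.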

Other available approaches

Besides the mean, pandas supports other ways of replacing missing data, such as filling with the median or the mode, as well as more complex techniques. Always consider what fits best for your data.
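As a rough sketch of two of these alternatives, the example below fills a numeric column with its median and a text column with its most frequent value. The data and the column names income and city are made up for illustration.

```python
import numpy as np
import pandas as pd

# Hypothetical data: a skewed numeric column and a categorical column.
df = pd.DataFrame({
    "income": [30.0, 32.0, np.nan, 500.0],
    "city": ["Oslo", "Oslo", None, "Rome"],
})

# The median is less sensitive to the outlier (500.0) than the mean would be.
median_filled = df["income"].fillna(df["income"].median())

# mode() returns a Series because ties are possible; take the first entry.
mode_filled = df["city"].fillna(df["city"].mode().iloc[0])

print(median_filled.tolist())
print(mode_filled.tolist())
```

Here the median of the observed incomes is 32.0, whereas the mean would be pulled far upward by the single large value.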