
Normalize columns of a dataframe

Tags: python, dataframe, pandas, data-preprocessing
by Anton Shumikhin · Sep 17, 2024
TLDR

For quick results, use Pandas for min-max normalization or StandardScaler from sklearn for z-score standardization:

    import pandas as pd
    from sklearn.preprocessing import StandardScaler

    # Check your standard deviations before z-score standardization, just to be sure ;)
    # StandardScaler returns a NumPy array, so rebuild the DataFrame with the
    # original column names and index.
    df_zscore = pd.DataFrame(StandardScaler().fit_transform(df), columns=df.columns, index=df.index)

    # As easy as calculating min-max, but be careful with the minus sign ;)
    df_minmax = (df - df.min()) / (df.max() - df.min())

Choose wisely: z-score rescales each column to mean = 0 and std = 1, while min-max scales the party between 0 and 1.

Understanding normalization

What is normalization?

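In 3 words: Scaling without distortion. Normalization doesn't spoil the party but changes the scale of the dance floor: both techniques below are linear rescalings, so the ordering of values and the shape of each column's distribution stay the same.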

Different normalization techniques

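  • Min-max normalization: Fits the dance into the [0, 1] room by mapping each column's minimum to 0 and its maximum to 1.
  • Z-score standardization: Zeros the mean and standardizes the dance steps, so each column ends up with mean 0 and standard deviation 1.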
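Written out for a single column x with values x_i, the two techniques look like this. The symbols are notation chosen here for illustration, not taken from any library:

```latex
x_i^{\text{minmax}} = \frac{x_i - \min(x)}{\max(x) - \min(x)},
\qquad
z_i = \frac{x_i - \mu}{\sigma}
```

Here \mu is the column mean and \sigma its standard deviation. One detail worth knowing: pandas' std() uses the sample standard deviation (ddof=1) by default, while sklearn's StandardScaler uses the population standard deviation (ddof=0), so the two z-scores differ slightly on small datasets.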

The art of choosing the right normalization

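Different dances (algorithms) need different floors (normalization). The key is knowing the concert requirements and the dancers' distribution: min-max is handy when a model expects bounded inputs but is sensitive to outliers, since a single extreme value squeezes everything else into a narrow band, while z-score suits features that are roughly bell-shaped and does not bound the output.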

For a 'larger than life' machine learning performance, use the apply() function for more control. It allows you to normalize each column as you please, using lambda functions:

    # Z-score each column; constant columns (std == 0) are left unchanged to avoid dividing by zero.
    # "If I had a dollar for every time x.std() equals zero..."
    df_custom = df.apply(lambda x: (x - x.mean()) / x.std() if x.std() != 0 else x, axis=0)

Practical nuances

Tackling negative values

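For data spanning both positive and negative values, the min-max formula works unchanged: subtracting the column minimum shifts everything to start at zero, so the result still lands in [0, 1]. It's like a chameleon, changing color but keeping personality.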
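A minimal sketch of that behavior, using a made-up `temp` column (the data and column name are assumptions for illustration). It also shows one optional way to stretch the result to [-1, 1] if you would rather keep a signed range:

```python
import pandas as pd

# Hypothetical column mixing negative and positive values
df = pd.DataFrame({"temp": [-10.0, 0.0, 5.0, 30.0]})

# Standard min-max: the minimum (-10) maps to 0, the maximum (30) maps to 1
df_minmax = (df - df.min()) / (df.max() - df.min())

# Optional variant: stretch the [0, 1] result to [-1, 1]
df_signed = df_minmax * 2 - 1

print(df_minmax["temp"].tolist())
print(df_signed["temp"].tolist())
```

Note that the sign of the original values is not preserved by plain min-max: 0.0 in the input becomes 0.25 here, because only the position between the minimum and maximum matters.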

When to move on to sklearn

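For 'YUGE' datasets and Pandas-Scikit-Learn romantic duets, MinMaxScaler is your foreman. Its main advantage is that it remembers the minimum and maximum it was fitted on, so the exact same scaling can be reapplied to new data or plugged into a scikit-learn pipeline.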
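Here is a small sketch of that fit-once, reuse-later workflow. The `age` and `income` columns and the `new_rows` frame are invented for illustration:

```python
import pandas as pd
from sklearn.preprocessing import MinMaxScaler

# Hypothetical data: the frame we fit on, plus a later batch of new rows
df = pd.DataFrame({"age": [20, 30, 40], "income": [1000.0, 3000.0, 5000.0]})
new_rows = pd.DataFrame({"age": [25], "income": [4000.0]})

scaler = MinMaxScaler()  # default feature_range=(0, 1)
scaler.fit(df)           # learns the per-column minimum and maximum

scaled = scaler.transform(df)            # NumPy array, each column in [0, 1]
scaled_new = scaler.transform(new_rows)  # reuses the min/max learned above

print(scaled)
print(scaled_new)
```

Because the scaler reuses the fitted minimum and maximum, new values outside the original range will fall outside [0, 1] unless you clip them.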

Pandas vs. sklearn: Stick or Shift?

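Pandas keeps DataFrames intact, while sklearn deliveries are numpy-array wrapped by default: column names and the index are dropped, so you need to wrap the result back into a DataFrame yourself. In scikit-learn 1.2 and later, calling set_output(transform="pandas") on the scaler makes it return a DataFrame instead.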
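A quick side-by-side sketch of the difference, on a toy frame with a custom index (the data is made up for illustration):

```python
import numpy as np
import pandas as pd
from sklearn.preprocessing import StandardScaler

# Hypothetical single-column frame with a non-default index
df = pd.DataFrame({"x": [1.0, 2.0, 3.0]}, index=["a", "b", "c"])

# Pure pandas: the result is still a DataFrame (std() uses ddof=1 by default)
df_pandas = (df - df.mean()) / df.std()

# sklearn: the result is a plain NumPy array (StandardScaler uses ddof=0)
arr = StandardScaler().fit_transform(df)

# Wrap the array back up, restoring column names and the original index
df_sklearn = pd.DataFrame(arr, columns=df.columns, index=df.index)

print(type(arr).__name__)
print(df_pandas)
print(df_sklearn)
```

The two results are not numerically identical here because of the different ddof defaults; on larger datasets the gap shrinks.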

Always conduct a post-treatment review

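A post-normalization wellness checkup is essential. Make sure to check the minimum and maximum values of each column and keep an eye out for intruder constants: a column with a single repeated value has zero range, so the min-max formula divides zero by zero and fills it with NaN.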
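One way such a checkup might look, sketched on a toy frame that deliberately includes a constant column (the column names are assumptions for illustration):

```python
import pandas as pd

# Hypothetical data with one ordinary column and one constant column
df = pd.DataFrame({"a": [1.0, 2.0, 3.0], "const": [5.0, 5.0, 5.0]})
df_minmax = (df - df.min()) / (df.max() - df.min())

# Per-column minimum and maximum after scaling (expect 0 and 1)
summary = df_minmax.agg(["min", "max"])

# Columns with at most one distinct value in the original data
constant_cols = df.columns[df.nunique() <= 1].tolist()

# Columns where scaling produced NaN (0 / 0 for constant columns)
nan_cols = df_minmax.columns[df_minmax.isna().any()].tolist()

print(summary)
print("constant:", constant_cols, "NaN after scaling:", nan_cols)
```

Running the constant-column check before scaling lets you drop or special-case those columns instead of discovering NaN values later.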

Know the power of scaling

Normalization is the King Midas of feature scaling. It cannot change your donkey dataset into a unicorn, but hey, for scale-sensitive models it can give a golden touch to the model performance.

Beware the Leaky Cauldron

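Prevent data leakage! Fit the scaler on the training set only, then use that same fitted scaler to transform both the train and test sets. Fitting on the full dataset lets test-set statistics sneak into training and makes your evaluation look better than it really is.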
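A minimal sketch of the leak-free pattern, assuming a toy frame with made-up feature names `f1` and `f2` and an arbitrary 70/30 split:

```python
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

# Hypothetical feature table
df = pd.DataFrame({"f1": range(10), "f2": [v * 10.0 for v in range(10)]})

# Split first, scale second
train, test = train_test_split(df, test_size=0.3, random_state=0)

# Fit on the training rows only: mean and std come from train alone
scaler = StandardScaler().fit(train)

# Apply the same fitted scaler to both splits
train_scaled = pd.DataFrame(scaler.transform(train), columns=train.columns, index=train.index)
test_scaled = pd.DataFrame(scaler.transform(test), columns=test.columns, index=test.index)

print(train_scaled.mean())  # approximately 0 for every column
print(test_scaled.mean())   # generally not 0, and that is expected
```

The test set will usually not come out with mean exactly 0 and std exactly 1. That is the point: it is being measured with the training set's ruler.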