
Label encoding across multiple columns in scikit-learn

By Alex Kataev · Nov 27, 2024

Use OrdinalEncoder from scikit-learn to encode multiple columns in one go:

    from sklearn.preprocessing import OrdinalEncoder

    encoder = OrdinalEncoder()
    df[['col1', 'col2', 'col3']] = encoder.fit_transform(df[['col1', 'col2', 'col3']])

Each listed column in dataframe df is replaced by numeric codes, one code per distinct category, with the same encoder settings applied to every column.
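To see what the encoder actually learned, here is a minimal sketch on a hypothetical toy frame (the data is made up; only the column names follow the snippet above). It inspects the fitted `categories_` attribute and round-trips the codes with `inverse_transform`:

```python
import pandas as pd
from sklearn.preprocessing import OrdinalEncoder

# Hypothetical toy data; the column names mirror the snippet above.
df = pd.DataFrame({
    "col1": ["red", "green", "red"],
    "col2": ["S", "M", "L"],
    "col3": ["yes", "no", "yes"],
})
cols = ["col1", "col2", "col3"]

encoder = OrdinalEncoder()
# Build a fresh frame so the original labels stay available for comparison.
encoded = pd.DataFrame(encoder.fit_transform(df[cols]), columns=cols, index=df.index)

# categories_ holds one sorted array of labels per column;
# a label's position in that array is its code.
learned = {col: list(cats) for col, cats in zip(cols, encoder.categories_)}

# inverse_transform maps the codes back to the original labels.
restored = encoder.inverse_transform(encoded[cols])
```

Note that the codes come back as floats by default, and that the order is alphabetical, not semantic: "L", "M", "S" get 0, 1, 2 here. If the order matters, the `categories` argument of `OrdinalEncoder` lets you spell it out.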

Customizing your approach

Handling mixed data types with ColumnTransformer

Retain the granularity of your preprocessing by applying an encoder (OneHotEncoder here) only to selected columns via ColumnTransformer, and passing the rest through unchanged:

    from sklearn.compose import ColumnTransformer
    from sklearn.preprocessing import OneHotEncoder

    # sparse_output replaced the older sparse argument in scikit-learn 1.2
    ct = ColumnTransformer(
        [('oh_enc', OneHotEncoder(sparse_output=False), ['col1', 'col3']),  # performs magic on col1, col3
         ('rest', 'passthrough', ['col2'])])  # col2 is 'un-magical' enough
    transformed = ct.fit_transform(df)

Building flexibility with custom encoders

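Architect custom encoders for maximum control over which attributes get encoded (for instance, leaving already numeric ones alone) and over the structure of the output. Implement fit and transform; fit_transform can be written out explicitly, as below, or inherited from TransformerMixin: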

    from sklearn.base import BaseEstimator, TransformerMixin
    from sklearn.preprocessing import LabelEncoder

    class CustomLabelEncoder(BaseEstimator, TransformerMixin):
        def __init__(self):
            self.encoders = {}  # "I've got a lovely bunch of LabelEncoders..."

        def fit(self, X, y=None):
            # Fit one LabelEncoder per column of the DataFrame
            for column in X.columns:
                le = LabelEncoder()
                le.fit(X[column])
                self.encoders[column] = le  # "...here they are standing in a row"
            return self

        def transform(self, X):
            X = X.copy()  # "first principle in life: never leave a copy behind"
            for column in X.columns:
                le = self.encoders[column]
                X[column] = le.transform(X[column])
            return X

        def fit_transform(self, X, y=None):
            return self.fit(X, y).transform(X)  # "...Cause fit and transform are always better together..."

    # Usage:
    custom_encoder = CustomLabelEncoder()
    df_encoded = custom_encoder.fit_transform(df)
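One way to extend this idea is sketched below. The class name `ColumnwiseLabelEncoder` and its `columns` parameter are hypothetical, not part of scikit-learn: the variant encodes only the columns you list, leaves numeric columns untouched, and adds an `inverse_transform`. It relies on `TransformerMixin` for `fit_transform`.

```python
import pandas as pd
from sklearn.base import BaseEstimator, TransformerMixin
from sklearn.preprocessing import LabelEncoder


class ColumnwiseLabelEncoder(BaseEstimator, TransformerMixin):
    """Hypothetical variant: label-encode only the listed columns."""

    def __init__(self, columns=None):
        self.columns = columns  # None means "encode every column"

    def fit(self, X, y=None):
        cols = list(X.columns) if self.columns is None else list(self.columns)
        # Fitted state goes in a trailing-underscore attribute, per scikit-learn convention.
        self.encoders_ = {col: LabelEncoder().fit(X[col]) for col in cols}
        return self

    def transform(self, X):
        X = X.copy()  # leave the caller's DataFrame unchanged
        for col, le in self.encoders_.items():
            X[col] = le.transform(X[col])
        return X

    def inverse_transform(self, X):
        X = X.copy()
        for col, le in self.encoders_.items():
            X[col] = le.inverse_transform(X[col])
        return X


# Toy data: col2 is already numeric and should be left alone.
df = pd.DataFrame({"col1": ["a", "b", "a"], "col2": [1.5, 2.5, 3.5], "col3": ["x", "y", "x"]})
enc = ColumnwiseLabelEncoder(columns=["col1", "col3"])
df_encoded = enc.fit_transform(df)          # fit_transform comes from TransformerMixin
df_restored = enc.inverse_transform(df_encoded)
```

Keeping constructor arguments as plain attributes and fitted state in `encoders_` is what lets the class work with `clone` and inside a `Pipeline`.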

Maximizing efficiency with direct encoding

astype('category').cat.codes from Pandas can be an ace up your sleeve for larger categorical datasets:

    # "col1, you've just been officially labeled!"
    df['col1'] = df['col1'].astype('category').cat.codes
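Since the topic is multiple columns, here is a small sketch (toy data assumed) that applies the same pandas trick to several columns at once and keeps the category lists so the codes can be decoded later. It also shows that missing values come out as -1:

```python
import pandas as pd

# Hypothetical data with one missing value in col1.
df = pd.DataFrame({
    "col1": ["b", "a", None, "b"],
    "col3": ["x", "y", "x", "x"],
})
cat_cols = ["col1", "col3"]

# Convert all selected columns to the category dtype in one call.
as_cat = df[cat_cols].astype("category")

# Keep the categories: a label's position in the list is its code.
categories = {col: list(as_cat[col].cat.categories) for col in cat_cols}

# Extract the integer codes column by column; missing values become -1.
codes = as_cat.apply(lambda s: s.cat.codes)
```

Unlike a fitted scikit-learn encoder, this does not remember the mapping for new data by itself, which is why the `categories` dictionary is worth holding on to.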

Advanced encoding techniques

Ensuring accurate inverse transformation

To prevent loss in translation during inverse_transform, keep one fitted LabelEncoder per column in a dictionary keyed by column name:

    from sklearn.preprocessing import LabelEncoder

    # One fitted LabelEncoder per column, keyed by column name
    encoders = {col: LabelEncoder().fit(df[col]) for col in df.columns}  # "Talk about being in two places at the same time!"
    encoded_df = df.apply(lambda col: encoders[col.name].transform(col))  # "Life's a column, and then you DF.apply"

    # For inverse transformation
    decoded_col = encoders['col1'].inverse_transform(encoded_df['col1'])  # "Oh, Col1, how I missed you!"
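To decode every column rather than just one, the dictionary can be walked in the other direction. This is a minimal sketch on assumed toy data, rebuilding the whole frame and checking that the round trip is lossless:

```python
import pandas as pd
from sklearn.preprocessing import LabelEncoder

# Hypothetical all-categorical frame.
df = pd.DataFrame({"col1": ["a", "b", "a"], "col3": ["x", "y", "z"]})

# One fitted encoder per column, as in the dictionary approach above.
encoders = {col: LabelEncoder().fit(df[col]) for col in df.columns}
encoded_df = pd.DataFrame(
    {col: encoders[col].transform(df[col]) for col in df.columns},
    index=df.index,
)

# Decode each column with the encoder that produced it.
decoded_df = pd.DataFrame(
    {col: encoders[col].inverse_transform(encoded_df[col]) for col in encoded_df.columns},
    index=encoded_df.index,
)

round_trip_ok = decoded_df.values.tolist() == df.values.tolist()
```

The round trip only works if each column is decoded by its own encoder, which is the whole point of keying the dictionary by column name.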

Handling missing values: a necessary evil

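Missing values are like unwanted party crashers. Customize your encoders to handle them gracefully, for example by giving them a dedicated code, so that the encoding stays valid and consistent between your training data and anything you encode later.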
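One possible approach, sketched here with made-up data, is to replace missing values with a sentinel label before encoding, and to let `OrdinalEncoder` assign -1 to labels it never saw during fitting. The sentinel string `"__missing__"` is an arbitrary choice for illustration:

```python
import numpy as np
import pandas as pd
from sklearn.preprocessing import OrdinalEncoder

# Hypothetical train/new data; "c" never appears in the training set.
train = pd.DataFrame({"col1": ["a", "b", np.nan, "a"]})
new = pd.DataFrame({"col1": ["b", np.nan, "c"]})

MISSING = "__missing__"  # arbitrary sentinel label, an assumption of this sketch

# Unseen labels are encoded as -1 instead of raising an error.
encoder = OrdinalEncoder(handle_unknown="use_encoded_value", unknown_value=-1)

# Missing values become an ordinary category with its own code.
train_codes = encoder.fit_transform(train.fillna(MISSING))
new_codes = encoder.transform(new.fillna(MISSING))
```

Whether "missing" deserves its own category or should be imputed instead depends on the data; this sketch only shows the mechanics of the first option.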

Optimizing your encoding solutions

Quick and easy encoding with pandas.get_dummies

When dealing with classification problems, convert categorical variables into dummy/indicator variables using the trusty pandas.get_dummies method:

    import pandas as pd

    # "col1, col3 - you're about to have your binary moment of fame!"
    encoded_df = pd.get_dummies(df, columns=['col1', 'col3'])
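One thing to watch with `get_dummies` is that it only creates columns for the categories present in the frame it is given. A common workaround, sketched below on assumed toy data, is to reindex new data to the training columns so both frames line up:

```python
import pandas as pd

# Hypothetical data: the new batch happens to contain no "b" in col1.
train = pd.DataFrame({"col1": ["a", "b"], "col2": [1, 2]})
new = pd.DataFrame({"col1": ["a", "a"], "col2": [3, 4]})

# dtype=int gives 0/1 columns regardless of the pandas version's default.
train_enc = pd.get_dummies(train, columns=["col1"], dtype=int)
new_enc = pd.get_dummies(new, columns=["col1"], dtype=int)

# Align the new frame to the training layout; absent categories are filled with 0.
new_enc = new_enc.reindex(columns=train_enc.columns, fill_value=0)
```

The reindex also silently drops dummy columns for categories that were not in the training data, so it is worth checking for those separately if they matter.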

Numerical preprocessing using RobustScaler

RobustScaler from scikit-learn gracefully handles outliers when scaling numerical columns:

    from sklearn.preprocessing import RobustScaler

    scaler = RobustScaler()
    # fit_transform returns a 2-D array; ravel() flattens it to fit a single column
    df['col2'] = scaler.fit_transform(df[['col2']]).ravel()  # "I've just put some robust into your col2. You're welcome."
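To make "gracefully handles outliers" concrete: with default settings, RobustScaler subtracts the median and divides by the interquartile range, so a single extreme value barely affects how the other values are scaled. A small sketch with invented numbers:

```python
import numpy as np
import pandas as pd
from sklearn.preprocessing import RobustScaler

# Hypothetical numeric column; 100 is an obvious outlier.
df = pd.DataFrame({"col2": [1.0, 2.0, 3.0, 4.0, 100.0]})

# Defaults: center on the median, scale by the IQR (25th to 75th percentile).
scaler = RobustScaler()
scaled = scaler.fit_transform(df[["col2"]]).ravel()

# The same result computed by hand from the fitted statistics.
manual = (df["col2"].to_numpy() - scaler.center_[0]) / scaler.scale_[0]
```

Here the median is 3 and the IQR is 2, so the four ordinary values land between -1 and 0.5 while the outlier stays far away at 48.5 instead of squashing everything else toward zero.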

Non-repetitive encoding with mapping dictionary

Tired of repetitive labeling? Take control with a mapping dictionary: build one value-to-code dictionary per column, then apply them all in a single pass:

    # One {value: code} dictionary per column; codes start at 1, in order of first appearance
    mappings = {c: {k: i for i, k in enumerate(df[c].unique(), 1)} for c in df.columns}  # "Because who likes repeating themselves?"
    df = df.apply(lambda col: col.map(mappings[col.name]))  # "Dear columns, prepare to be mapped!"
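A short sketch on assumed toy data shows what the mappings look like and what happens when a value outside the dictionary turns up later: `map` returns NaN for it rather than raising an error.

```python
import pandas as pd

# Hypothetical all-categorical frame.
df = pd.DataFrame({"col1": ["b", "a", "b"], "col3": ["x", "x", "y"]})

# unique() preserves order of first appearance, so "b" gets 1 and "a" gets 2.
mappings = {c: {k: i for i, k in enumerate(df[c].unique(), 1)} for c in df.columns}
encoded = df.apply(lambda col: col.map(mappings[col.name]))

# Reusing a mapping on new data: "zzz" is not in the dictionary and maps to NaN.
new_codes = pd.Series(["a", "zzz"]).map(mappings["col1"])
```

Starting the codes at 1 leaves 0 free, so a follow-up `fillna(0)` is one simple way to give unseen values their own code.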