
Multiple functions on multiple groupby columns – The Mastery

python
dataframe
lambda
best-practices
By Alex Kataev · Dec 14, 2024
TLDR

The fastest way to apply multiple functions to different columns after a groupby is to build a dict with columns as keys and lists of functions as values. Feed this dict into agg() and you're done.

import pandas as pd

# Define DataFrame
df = pd.DataFrame({
    'Group': ['A', 'A', 'B', 'B'],
    'Data1': [10, 20, 30, 40],
    'Data2': [100, 200, 300, 400]
})

# Multifaceted aggregation dictionary (the more the merrier!)
multi_agg = {
    'Data1': ['sum', 'mean'],
    'Data2': ['max', 'min']
}

# Group by and aggregate - like a data party
grouped_result = df.groupby('Group').agg(multi_agg)
print(grouped_result)

This script groups the DataFrame by 'Group', calculates the sum and mean for 'Data1', and the highest and lowest values for 'Data2'.

Spice up the game with custom functions

Pre-defined functions are like vanilla ice-cream, pleasant but maybe too plain. Add some toppings by creating custom functions.

def my_custom_agg(series):
    # Return several aggregates at once as a labeled Series
    return pd.Series({
        'CustomSum': series.sum(),
        'CustomMean': series.mean()
    })

grouped_custom = df.groupby('Group')['Data1'].apply(my_custom_agg)

This creates and implements custom aggregation functions, offering you a whole new level of flexibility.

Lambdas on the fly

The beauty of lambda functions is that they put data manipulation on steroids, letting you define one-off logic right where you need it:

grouped_lambda = df.groupby('Group').agg({
    'Data1': lambda x: x.max() - x.min(),  # The highs and the lows
    'Data2': lambda x: (x > 250).sum()     # Counting up like a real winner
})

Say your name: Named aggregations

In pandas 0.25.0 and above, named aggregations can be leveraged for cleaner syntax and clearer column naming:

grouped_named = df.groupby('Group').agg(
    total=('Data1', 'sum'),
    average=('Data1', 'mean'),
    maximum=('Data2', 'max'),
    above_threshold=('Data2', lambda x: (x > 250).sum())
)

Named aggregation helps not only with more readable code but also with a flatter, more predictable output structure.

Best practices

Deprecated .ix indexer

Beware of .ix: it was deprecated long ago and removed in pandas 1.0. Go for .loc (label-based) and .iloc (position-based) instead for trouble-free data manipulation.
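As a quick illustration of the replacements, using the df from the TLDR:

# .ix mixed label-based and position-based access; be explicit instead
rows_a = df.loc[df['Group'] == 'A', 'Data1']  # label-based selection
first_two = df.iloc[0:2, 1]                   # position-based selection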

No more dict of dicts for agg

Avoid passing a dict of dicts to agg() to rename output columns: this "nested renaming" was deprecated and now raises a SpecificationError. Opt for named aggregations or flat structures for clearer and more stable code.
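A before-and-after sketch of that change:

# Old nested-renaming style - raises SpecificationError in modern pandas:
# df.groupby('Group').agg({'Data1': {'total': 'sum', 'average': 'mean'}})

# Flat, stable equivalent:
df.groupby('Group').agg(total=('Data1', 'sum'), average=('Data1', 'mean'))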

When the groups are interdependent

When you have interdependencies within your calculation, this is where things get interesting:

def complex_agg(group):
    max_val = group['Data2'].max()  # Catch you at the peak!
    # assign returns a new frame instead of mutating the group in place
    return group.assign(Modified=group['Data1'] + max_val)  # Climbing up the objectives

enhanced_groups = df.groupby('Group').apply(complex_agg)

It's not just about handling independent operations: apply hands each group to your function as a full DataFrame, so one column's calculation can depend on another's within the same group.

Embrace the MultiIndexes

MultiIndexes can look terrifying, but only until you get used to them. If you apply several functions per column, or return a Series from a custom aggregation, the result will carry one, so it's time to start embracing them for good.
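One common taming trick, a minimal sketch using the grouped_result from the TLDR, is to flatten the MultiIndex columns into plain strings:

# Collapse ('Data1', 'sum') into 'Data1_sum' for easier downstream access
grouped_result.columns = ['_'.join(col) for col in grouped_result.columns]
print(grouped_result)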

functools.partial(), The Savior

When a function needs extra arguments that your GroupBy.agg call won't pass along, remember functools.partial(). It bakes those arguments in ahead of time, leaving a one-argument function that agg can call.
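A minimal sketch, with top_mean as a hypothetical helper that needs an extra argument:

from functools import partial

def top_mean(series, n):
    # Hypothetical helper: mean of the n largest values
    return series.nlargest(n).mean()

# partial bakes in n=2, leaving a one-argument function agg can call
df.groupby('Group')['Data1'].agg(partial(top_mean, n=2))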

Fine-tune your lambda

Work around agg's one-column-at-a-time limitation by reaching into other columns via the group's index inside a lambda, or use methods like pd.Series() to shape richer outputs. Remember, in Python, there is always a workaround!

grouped_trick = df.groupby('Group').agg({
    # x.index lets the lambda peek at 'Data2' for the same rows
    'Data1': lambda x: (x + df.loc[x.index, 'Data2']).mean()
})

Keep updated with documentation

The secret to becoming a Pythonista is not just coding, but also staying current with the documentation; as the deprecations above show, the updates can sometimes feel like plot twists!

Dynamic column naming

Why leave column names static when your functions aren't? When you pass a list of functions to agg, pandas labels each output column with the function's special __name__ attribute, so set it for dynamic column naming.
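A minimal sketch, renaming a lambda so its output column reads well:

range_fn = lambda x: x.max() - x.min()
range_fn.__name__ = 'range'  # agg labels the output column with __name__

df.groupby('Group')['Data1'].agg([range_fn, 'mean'])
# Columns come out as 'range' and 'mean' instead of '<lambda>' and 'mean'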

Easier way around manual iterations

Instead of manually iterating through the groups, opt for .agg() and .apply(): they are more readable, less error-prone, and usually faster.
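For contrast, a minimal sketch of both approaches computing the same sums:

# Manual iteration - verbose and easy to get wrong
manual = {name: group['Data1'].sum() for name, group in df.groupby('Group')}

# The same result in one declarative call
declarative = df.groupby('Group')['Data1'].sum()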