
How do I Pandas group-by to get sum?

python
pandas
dataframe
groupby
by Alex Kataev · Dec 20, 2024
TLDR

If you're in a rush and need results fast, use df.groupby() with .sum(). For a DataFrame df where value_col should be summed per group_col, do:

summed_data = df.groupby('group_col')['value_col'].sum()

To return a DataFrame (rather than a Series) with group_col as a regular column, summing every remaining numeric column:

summed_df = df.groupby('group_col', as_index=False).sum()

This will group and sum by column(s) in just one line of code.
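For a concrete picture, here is a minimal sketch with made-up data, borrowing the 'Fruit' / 'Number' columns used later in this article:

import pandas as pd

# Hypothetical sample data for illustration
df = pd.DataFrame({
    'Fruit': ['Apple', 'Banana', 'Apple', 'Banana'],
    'Number': [3, 5, 2, 7],
})

# Series indexed by 'Fruit': Apple -> 5, Banana -> 12
summed_data = df.groupby('Fruit')['Number'].sum()

# Same totals, but as a DataFrame with 'Fruit' as a regular column
summed_df = df.groupby('Fruit', as_index=False)['Number'].sum()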

Dealing with multiple columns

You got more than one column to group by? No worries, just list them all like this:

# Grouping by 'Name' and 'Fruit', summing 'Number'. "Yes, we can!"
df_summary = df.groupby(['Name', 'Fruit'])['Number'].sum().reset_index()
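Several value columns can be summed in the same pass by selecting them as a list before calling sum(). 'Price' here is a hypothetical extra column, just for illustration:

# Sum 'Number' and a hypothetical 'Price' column per ('Name', 'Fruit') pair
df_multi_sum = df.groupby(['Name', 'Fruit'])[['Number', 'Price']].sum().reset_index()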

Using the aggregate function

For a more flexible approach, use the agg() function which supports several aggregations:

# Looks like the only thing 'agg' can't do is make you a cup of coffee
df_summary = df.groupby(['Name', 'Fruit']).agg({'Number': 'sum'}).reset_index()
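As a sketch of how far agg() can stretch, named aggregation (pandas 0.25+) lets you compute several summaries in one pass and pick your own output column names; 'total' and 'average' below are arbitrary labels:

# One pass, several summaries, friendlier column names
df_stats = df.groupby(['Name', 'Fruit']).agg(
    total=('Number', 'sum'),
    average=('Number', 'mean'),
).reset_index()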

Ever heard of a pivot table?

Pivot tables offer a nice cross-tabulation format, and it's quite straightforward to create one:

# "Pivot" is not just a Friends reference pivot_table = df.groupby(['Name', 'Fruit'])['Number'].sum().unstack(fill_value=0)

This creates a table with unique Names as rows and Fruits as columns while summing 'Number' for each combination, and filling missing values with zero.
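The same cross-tabulation can be written with pd.pivot_table, which bundles the grouping, summing and unstacking into a single call; this sketch assumes the same 'Name' / 'Fruit' / 'Number' columns:

# Equivalent to groupby + unstack, in one call
pivot_table = pd.pivot_table(
    df, index='Name', columns='Fruit', values='Number',
    aggfunc='sum', fill_value=0,
)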

Advanced tweaks and potential pitfalls

Let's move to some use cases that need a bit more than just basic syntax:

Summing with conditions

Looking to sum based on a certain condition? Use mask:

# Bananas have feelings too, you know!
df['Number'] = df['Number'].mask(df['Fruit'] == 'Banana', 0)
grouped_conditional_sum = df.groupby('Fruit')['Number'].sum()

This example sets 'Number' of all Bananas to 0 before summing.
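If you'd rather not overwrite the original column, a boolean filter before grouping gives a similar result. Note the difference: masking keeps the Banana group with a sum of 0, while filtering removes it entirely.

# Leave df untouched: keep only non-Banana rows, then sum
filtered_sum = df[df['Fruit'] != 'Banana'].groupby('Fruit')['Number'].sum()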

Handling those pesky "NA"s

When NA values are interfering with your data, decide whether or not to exile them with fillna() before summing:

# NA, more like "not applicable" amirite?
df['Number'] = df['Number'].fillna(0)
grouped_na_handled_sum = df.groupby('Fruit')['Number'].sum()

This replaces NA values with zeroes before summing. Note that groupby().sum() already skips NaN values by default, so an explicit fillna(0) mainly makes that choice visible and keeps downstream code predictable.
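NAs in the grouping column are a separate concern: by default, groupby() silently drops rows whose key is NaN. Since pandas 1.1 you can keep them as their own group, as sketched below:

# Keep rows whose 'Fruit' is NaN as an explicit NaN group
grouped_keep_na_keys = df.groupby('Fruit', dropna=False)['Number'].sum()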

Performance hacks

For large DataFrames, it pays to optimize memory usage with a dtype conversion before grouping:

# Because "less is more" is also true in Python df['Number'] = df['Number'].astype('int32') grouped_memory_optimized_sum = df.groupby('Fruit')['Number'].sum()

Downcasting the column to int32 halves its memory footprint compared with the usual int64 default, which pays off when grouping and summing very large DataFrames.
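Another common trick, sketched here as an optional extra rather than a requirement: convert a repetitive grouping column to the category dtype, which stores each distinct label only once and can speed up repeated groupbys.

# Store each distinct fruit label once instead of per row
df['Fruit'] = df['Fruit'].astype('category')
# observed=True skips categories that never appear in the data
grouped_by_category = df.groupby('Fruit', observed=True)['Number'].sum()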