How do I use Pandas group-by to get the sum?
If you're in a rush and need results fast, use df.groupby() with .sum(). For a DataFrame df where value_col is summed based on group_col, do:
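A minimal sketch of that one-liner, assuming a small made-up DataFrame (the sample values are hypothetical; only the names df, group_col and value_col come from the text above):

```python
import pandas as pd

# Hypothetical sample data, for illustration only.
df = pd.DataFrame({
    "group_col": ["a", "b", "a", "b"],
    "value_col": [1, 2, 3, 4],
})

# Group rows by group_col and sum value_col within each group.
# Selecting a single column returns a Series indexed by group_col.
totals = df.groupby("group_col")["value_col"].sum()
print(totals)
```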
To keep the other columns in the summed result and get a DataFrame back rather than a Series:
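One way this might look, assuming "keep other columns" means summing every remaining numeric column and keeping the group key as a regular column (other_col and the sample values are hypothetical):

```python
import pandas as pd

# Hypothetical sample data; other_col is an invented extra numeric column.
df = pd.DataFrame({
    "group_col": ["a", "b", "a", "b"],
    "value_col": [1, 2, 3, 4],
    "other_col": [10, 20, 30, 40],
})

# as_index=False keeps group_col as an ordinary column instead of the index,
# and calling sum() without selecting a column sums all remaining columns.
summed = df.groupby("group_col", as_index=False).sum()
print(summed)
```

If the frame also holds text columns you don't want in the result, selecting the numeric columns explicitly before calling sum() is the safer route.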
This will group and sum by column(s) in just one line of code.
Dealing with multiple columns
Got more than one column to group by? No worries, just pass them all as a list like this:
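A short sketch, assuming the Name, Fruit and Number columns used later in this article (the rows themselves are made up):

```python
import pandas as pd

# Hypothetical sample data.
df = pd.DataFrame({
    "Name": ["Bob", "Bob", "Tom", "Tom", "Bob"],
    "Fruit": ["Apples", "Oranges", "Apples", "Bananas", "Apples"],
    "Number": [7, 3, 5, 2, 1],
})

# Pass a list of column names to group by every (Name, Fruit) combination.
grouped = df.groupby(["Name", "Fruit"], as_index=False)["Number"].sum()
print(grouped)
```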
Using the aggregate function
For a more flexible approach, use the agg() method, which can apply several aggregations at once:
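As a rough illustration, assuming the same hypothetical Name and Number columns, a list of aggregation names can be passed to agg():

```python
import pandas as pd

# Hypothetical sample data.
df = pd.DataFrame({
    "Name": ["Bob", "Bob", "Tom", "Tom", "Bob"],
    "Number": [7, 3, 5, 2, 1],
})

# Each name in the list becomes one column of the result.
stats = df.groupby("Name")["Number"].agg(["sum", "mean", "max"])
print(stats)
```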
Ever heard of a pivot table?
Pivot tables offer a nice cross-tabulation format, and it's quite straightforward to create one:
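A minimal sketch with pivot_table(), assuming made-up rows for the Name, Fruit and Number columns:

```python
import pandas as pd

# Hypothetical sample data.
df = pd.DataFrame({
    "Name": ["Bob", "Bob", "Tom", "Tom", "Bob"],
    "Fruit": ["Apples", "Oranges", "Apples", "Bananas", "Apples"],
    "Number": [7, 3, 5, 2, 1],
})

# Rows are Names, columns are Fruits, cells hold the summed Number.
# fill_value=0 replaces combinations that never occur in the data.
table = df.pivot_table(
    index="Name",
    columns="Fruit",
    values="Number",
    aggfunc="sum",
    fill_value=0,
)
print(table)
```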
This creates a table with unique Names as rows and Fruits as columns while summing 'Number' for each combination, and filling missing values with zero.
Advanced tweaks and potential pitfalls
Let's move to some use cases that need a bit more than just basic syntax:
Summing with conditions
Looking to sum based on a certain condition? Use mask:
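A possible sketch using Series.mask(), which replaces values wherever the condition is True (the sample rows are hypothetical, and grouping by Name is an assumption):

```python
import pandas as pd

# Hypothetical sample data.
df = pd.DataFrame({
    "Name": ["Bob", "Bob", "Tom", "Tom", "Bob"],
    "Fruit": ["Apples", "Oranges", "Apples", "Bananas", "Apples"],
    "Number": [7, 3, 5, 2, 1],
})

# Replace Number with 0 wherever Fruit is "Bananas", leaving df itself untouched.
masked = df.assign(Number=df["Number"].mask(df["Fruit"] == "Bananas", 0))

# Sum the masked values per Name.
result = masked.groupby("Name")["Number"].sum()
print(result)
```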
This example sets 'Number' of all Bananas to 0 before summing.
Handling those pesky "NA"s
When NA values are interfering with your data, decide whether or not to exile them with fillna() before summing:
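A small sketch, assuming one missing Number in otherwise made-up data. Note that a grouped sum() already skips NaN by default, so here the explicit fill mostly makes the intent visible; it matters more for aggregations such as mean or count:

```python
import pandas as pd

# Hypothetical sample data with one missing value.
df = pd.DataFrame({
    "Name": ["Bob", "Bob", "Tom", "Tom", "Bob"],
    "Number": [7.0, float("nan"), 5.0, 2.0, 1.0],
})

# Replace missing values with 0, then group and sum.
filled = df.assign(Number=df["Number"].fillna(0))
totals = filled.groupby("Name")["Number"].sum()
print(totals)
```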
This will fill your 'NA' values with zeroes before summing.
Performance hacks
For large DataFrames, cut memory usage by converting dtypes before grouping:
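A rough sketch of the idea, assuming the column starts out as int64 and that every value fits comfortably in int32 (values beyond roughly 2.1 billion would overflow, so check your data first):

```python
import pandas as pd

# Hypothetical sample data, created as int64 explicitly for the comparison.
df = pd.DataFrame({
    "Name": ["Bob", "Bob", "Tom", "Tom", "Bob"],
    "Number": pd.Series([7, 3, 5, 2, 1], dtype="int64"),
})

bytes_before = df["Number"].nbytes

# Downcast to int32: 4 bytes per value instead of 8.
df["Number"] = df["Number"].astype("int32")
bytes_after = df["Number"].nbytes

totals = df.groupby("Name")["Number"].sum()
print(bytes_before, bytes_after)
print(totals)
```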
Storing the column as int32 instead of int64 halves its memory footprint, which can make grouping and summing large frames cheaper.