
How to group dataframe rows into list in pandas groupby

python
dataframe
groupby
pandas
by Anton Shumikhin · Nov 1, 2024
TLDR

Grouping rows into lists with pandas' groupby can be done using the agg() function with list:

import pandas as pd

# Assuming 'df' is your DataFrame and 'group_col' is the column you're grouping by
# Transform rows to lists; it's magic, but it works!
grouped_lists = df.groupby('group_col').agg(lambda x: list(x))
print(grouped_lists)

You will get a DataFrame where each cell is a list of the grouped values. As easy as pie.
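To make that concrete, here is a minimal, self-contained sketch; the sample DataFrame and its column names are invented purely for demonstration:

```python
import pandas as pd

# A tiny demonstration DataFrame (invented data)
df = pd.DataFrame({
    'group_col': ['x', 'x', 'y', 'y', 'y'],
    'value': [1, 2, 3, 4, 5],
})

# Collect each group's values into a Python list
grouped_lists = df.groupby('group_col').agg(list)

# 'x' -> [1, 2], 'y' -> [3, 4, 5]
print(grouped_lists)
```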

More efficient aggregations

For larger datasets, efficiency is crucial. This is where the pd.Series.tolist method comes in handy:

# You'd never believe how quickly this works
grouped_lists_efficient = df.groupby('a')['b'].apply(pd.Series.tolist).reset_index()

Or try grouping multiple columns with a dictionary:

# Make sure you give each column its own ride
grouped_lists_multicol = df.groupby('a').agg({'b': list, 'c': list}).reset_index()

Grouping: beyond groupby

For a small, known set of categories, np.unique or array splitting with a list comprehension are quick alternatives:

import numpy as np

# Who needs groupby when you've got unique?
unique_groups = np.unique(df['a'])
grouped = [{group: df[df['a'] == group]['b'].tolist()} for group in unique_groups]

Advanced grouping methods

For more complex aggregations, apply multiple functions to the same column:

# One column, so many possibilities
grouped_complex = df.groupby('a')['b'].agg([list, sum, 'mean']).reset_index()

You can also go one step further with custom aggregation functions:

# Your data, your rules
def custom_agg(series):
    # Your custom processing here
    return series.to_list()

grouped_custom = df.groupby('a').agg({'b': custom_agg}).reset_index()

Pro tip: Enhance performance

In large groupby operations, small changes can notably affect performance. Try sorting before grouping:

# Sort before you group. It's like lining up before you board the school bus!
df_sorted = df.sort_values('group_col')
grouped_lists_sorted = df_sorted.groupby('group_col').agg(list)

Benchmark different methods. For small datasets, a plain dictionary comprehension over the groups might even be quicker:

# Loops may be vintage, but sometimes they're faster
grouped_values = {k: g['b'].tolist() for k, g in df.groupby('a')}
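To see which approach wins on your data, a quick timeit comparison settles it; this sketch uses an invented sample DataFrame, so the numbers will differ on your real data:

```python
import timeit

import pandas as pd

# Invented sample data for the comparison
df = pd.DataFrame({'a': list('xyz') * 1000, 'b': range(3000)})

# Time the agg(list) approach vs the dict comprehension
agg_time = timeit.timeit(lambda: df.groupby('a')['b'].agg(list), number=50)
loop_time = timeit.timeit(
    lambda: {k: g['b'].tolist() for k, g in df.groupby('a')}, number=50
)
print(f"agg(list): {agg_time:.4f}s, dict comprehension: {loop_time:.4f}s")
```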

Also, don't forget that grouping data is not exclusive to pandas:

import numpy as np

# Don't be a Pandas snob, numpy can group too
# Note: this only works correctly if df is already sorted by 'a'
grouped_array = np.split(df['b'].values, np.unique(df['a'], return_index=True)[1][1:])
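Because np.split cuts the array at positional indices, it assumes df is already sorted by 'a'. A small sketch with invented, pre-sorted data shows the idea:

```python
import numpy as np
import pandas as pd

# Invented data, already sorted by 'a' -- np.split depends on that
df = pd.DataFrame({'a': ['x', 'x', 'y', 'y', 'y'], 'b': [1, 2, 3, 4, 5]})

# Indices where each new group starts (dropping the first, which is 0)
split_points = np.unique(df['a'], return_index=True)[1][1:]
grouped_array = np.split(df['b'].values, split_points)
print(grouped_array)  # [array([1, 2]), array([3, 4, 5])]
```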

Deploy the power of lambda

For more tailored data conversion, introduce lambda functions to your groupby operations:

# Lambda: for when built-in functions just won't cut it
grouped_lambda = df.groupby('a')['b'].apply(lambda x: [val for val in x if val > 0]).reset_index()

It opens up possibilities like chaining operations:

# Lambdas love chains, they're edgy like that
grouped_chain_lambda = (df.groupby('a')['b']
                        .apply(lambda x: sorted(set(x)) if x.any() else [])
                        .reset_index())