
How to group dataframe rows into list in pandas groupby

python
dataframe
groupby
pandas
by Anton Shumikhin · Nov 1, 2024
TLDR

Grouping rows into lists with pandas' groupby can be done using the agg() function with list:

import pandas as pd

# Assuming 'df' is your DataFrame and 'group_col' is the column you're grouping by
# Transform rows to lists; it's magic, but it works!
grouped_lists = df.groupby('group_col').agg(lambda x: list(x))
print(grouped_lists)

You will get a DataFrame where each cell is a list of the grouped values. As easy as pie.
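To make that concrete, here is a minimal, self-contained sketch; the sample DataFrame and its column names are invented purely for demonstration:

```python
import pandas as pd

# A tiny demonstration DataFrame (invented data)
df = pd.DataFrame({
    'group_col': ['x', 'x', 'y', 'y', 'y'],
    'value': [1, 2, 3, 4, 5],
})

# Collect each group's values into a Python list
grouped_lists = df.groupby('group_col').agg(list)

# 'x' -> [1, 2], 'y' -> [3, 4, 5]
print(grouped_lists)
```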

More efficient aggregations

For larger datasets, efficiency is crucial. This is where the pd.Series.tolist method comes in handy:

# You'd never believe how quickly this works
grouped_lists_efficient = df.groupby('a')['b'].apply(pd.Series.tolist).reset_index()

Or try grouping multiple columns with a dictionary:

# Make sure you give each column its own ride
grouped_lists_multicol = df.groupby('a').agg({'b': list, 'c': list}).reset_index()

Grouping: beyond groupby

For a small, known set of categories, np.unique or array splitting with a list comprehension are quick alternatives:

import numpy as np

# Who needs groupby when you've got unique?
unique_groups = np.unique(df['a'])
grouped = [{group: df[df['a'] == group]['b'].tolist()} for group in unique_groups]

Advanced grouping methods

For more complex aggregations, apply multiple functions to the same column:

# One column, so many possibilities
grouped_complex = df.groupby('a')['b'].agg([list, sum, 'mean']).reset_index()

You can also go one step further with custom aggregation functions:

# Your data, your rules
def custom_agg(series):
    # Your custom processing here
    return series.to_list()

grouped_custom = df.groupby('a').agg({'b': custom_agg}).reset_index()

Pro tip: Enhance performance

In large groupby operations, small changes can notably affect performance. Try sorting before grouping:

# Sort before you group. It's like lining up before you board the school bus!
df_sorted = df.sort_values('group_col')
grouped_lists_sorted = df_sorted.groupby('group_col').agg(list)

Benchmark different methods. For small datasets, a plain dictionary comprehension over the groups might even be quicker:

# Loops may be vintage, but sometimes they're faster
grouped_values = {k: g['b'].tolist() for k, g in df.groupby('a')}
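To see which approach wins on your data, a quick timeit comparison settles it; this sketch uses an invented sample DataFrame, so the numbers will differ on your real data:

```python
import timeit

import pandas as pd

# Invented sample data for the comparison
df = pd.DataFrame({'a': list('xyz') * 1000, 'b': range(3000)})

# Time the agg(list) approach vs the dict comprehension
agg_time = timeit.timeit(lambda: df.groupby('a')['b'].agg(list), number=50)
loop_time = timeit.timeit(
    lambda: {k: g['b'].tolist() for k, g in df.groupby('a')}, number=50
)
print(f"agg(list): {agg_time:.4f}s, dict comprehension: {loop_time:.4f}s")
```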

Also, don't forget that grouping data is not exclusive to pandas:

import numpy as np

# Don't be a Pandas snob, numpy can group too
# Note: this only works correctly if df is already sorted by 'a'
grouped_array = np.split(df['b'].values, np.unique(df['a'], return_index=True)[1][1:])
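Because np.split cuts the array at positional indices, it assumes df is already sorted by 'a'. A small sketch with invented, pre-sorted data shows the idea:

```python
import numpy as np
import pandas as pd

# Invented data, already sorted by 'a' -- np.split depends on that
df = pd.DataFrame({'a': ['x', 'x', 'y', 'y', 'y'], 'b': [1, 2, 3, 4, 5]})

# Indices where each new group starts (dropping the first, which is 0)
split_points = np.unique(df['a'], return_index=True)[1][1:]
grouped_array = np.split(df['b'].values, split_points)
print(grouped_array)  # [array([1, 2]), array([3, 4, 5])]
```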

Deploy the power of lambda

For more tailored data conversion, introduce lambda functions to your groupby operations:

# Lambda: for when built-in functions just won't cut it
grouped_lambda = df.groupby('a')['b'].apply(lambda x: [val for val in x if val > 0]).reset_index()

It opens up possibilities like chaining operations:

# Lambdas love chains, they're edgy like that
grouped_chain_lambda = (df.groupby('a')['b']
                        .apply(lambda x: sorted(set(x)) if x.any() else [])
                        .reset_index())