
Split (explode) pandas dataframe string entry to separate rows

by Nikita Barsukov · Oct 20, 2024
TLDR

Want to split a column of comma-separated strings into multiple rows swiftly? Use DataFrame.explode() in pandas. Given a DataFrame df with a 'Values' column, use .str.split(',') to turn each string into a list, then call .explode() to give every list element its own row:

df['Values'] = df['Values'].str.split(',')
df = df.explode('Values')  # Now your DataFrame has more rows than a corn farmer!

This morphs:

   Values
0  a,b,c
1  d,e,f

Into these separate rows:

  Values
0      a
0      b
0      c
1      d
1      e
1      f

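You'll need pandas 0.25.0 or later for this trick, and exploding several columns in one call requires pandas 1.3.0 or later. Note that the exploded rows keep the index label of the row they came from (0, 0, 0, 1, 1, 1 above). On an older pandas version you'd need to get creative, for example by splitting with expand=True and stacking the result.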
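To see the repeated index labels and one way to renumber them, here is a minimal sketch using the same sample data as above. The variable names are illustrative only. From pandas 1.1.0 onward, passing ignore_index=True to explode() should give the same renumbering in one step.

```python
import pandas as pd

df = pd.DataFrame({"Values": ["a,b,c", "d,e,f"]})

# Split each string into a list, then give every list element its own row.
exploded = df.assign(Values=df["Values"].str.split(",")).explode("Values")

# Each new row keeps the label of the row it came from.
print(exploded.index.tolist())

# Renumber the rows 0..n-1 once the original labels are no longer needed.
renumbered = exploded.reset_index(drop=True)
print(renumbered)
```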

Digging deeper: Advanced techniques and pitfalls

From singularity to multiverse: Single vs Multi-column explode

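From version 1.3.0 onward, pandas lets you explode multiple columns simultaneously. The lists in those columns must have the same length within each row, because the elements are paired up position by position: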

df = df.explode(['Column1', 'Column2']) # pandas just went from single to ready-to-mingle.
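A small runnable sketch of the multi-column case, assuming pandas 1.3.0 or later. The id column and the sample values are made up for illustration; only Column1 and Column2 come from the snippet above.

```python
import pandas as pd

df = pd.DataFrame({
    "id": [1, 2],
    "Column1": ["a,b", "c"],
    "Column2": ["x,y", "z"],
})

# Turn both string columns into lists first.
for col in ["Column1", "Column2"]:
    df[col] = df[col].str.split(",")

# Elements are paired by position, so each row's lists must be equally long.
paired = df.explode(["Column1", "Column2"]).reset_index(drop=True)
print(paired)
```

If the lists in a row differ in length, pandas raises an error rather than guessing how to pair them, so it can be worth checking the lengths before exploding.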

Don't mix up: Handling non-string columns

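Exploding a DataFrame with non-string columns without exploding your sanity is easy. Apply .str.split() only to the string column and leave the others alone. If a column's dtype does change along the way (the exploded column itself always comes back as object), cast it back afterwards. Here original_dtype is a placeholder for the dtype you want:

df['NonStringCol'] = df['NonStringCol'].astype(original_dtype)  # Just like your unchanging love for pandas.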
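As an illustration of the dtype point, the sketch below explodes a column of numeric strings and casts the result to integers. The column names and data are assumptions for the example, not part of the original snippet.

```python
import pandas as pd

df = pd.DataFrame({"id": [10, 20], "Values": ["1,2,3", "4,5"]})

out = df.assign(Values=df["Values"].str.split(",")).explode("Values")

# The exploded column holds Python strings in an object column.
exploded_dtype = out["Values"].dtype

# Cast the pieces to the dtype you actually want; "id" is left untouched.
out = out.astype({"Values": "int64"})
print(out.dtypes)
```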

The need for speed: Account for efficiency

Performance is key with large datasets, so time the candidates with %timeit (an IPython/Jupyter magic) and pick the approach that won't have you growing old waiting:

%timeit myDataFrame.explode('Values') # I've seen glaciers move faster than this function!
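Outside IPython or Jupyter, the standard-library timeit module does a similar job. This is a rough sketch on synthetic data; the frame size and the number of runs are arbitrary choices, and the timings depend entirely on your machine and pandas version.

```python
import timeit

import pandas as pd

# Synthetic data: 10,000 rows of three comma-separated items each.
df = pd.DataFrame({"Values": ["a,b,c"] * 10_000})

def split_then_explode():
    return df.assign(Values=df["Values"].str.split(",")).explode("Values")

runs = 5
seconds = timeit.timeit(split_then_explode, number=runs)
print(f"{seconds / runs:.4f} s per run")
```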

Popping the stack and the concat can

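When reshape is the game and several columns hold delimited strings, move the columns that should stay fixed into the index, split and explode everything else, then bring the index back. This assumes every split column yields the same number of pieces in each row, otherwise the exploded columns won't line up:

split_df = (
    df.set_index(['Fixed1', 'Fixed2'])
      .apply(lambda x: x.str.split(',').explode())
      .reset_index()
)  # Transforming data into a malleable form since...right now.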
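For pandas versions that predate explode(), one commonly used alternative is to split with expand=True, stack the pieces into long form, and join them back. The sketch below is one possible arrangement under that assumption; the Key column and the data are made up.

```python
import pandas as pd

df = pd.DataFrame({"Key": ["x", "y"], "Values": ["a,b,c", "d,e"]})

pieces = (
    df["Values"]
    .str.split(",", expand=True)      # one column per piece, padded with missing values
    .stack()                          # long format, indexed by (row label, piece position)
    .dropna()                         # discard padding if stack() kept it
    .reset_index(level=1, drop=True)  # keep only the original row label
    .rename("Values")
)

# Joining on the index repeats each Key once per piece.
result = df.drop(columns="Values").join(pieces)
print(result)
```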

iterrows are bad, m'kay?

Although iterrows may seem like a good idea, it's a trap. It builds a new Series for every row, which makes it slow, and the pandas documentation recommends avoiding it on larger datasets. When in doubt, vectorize!
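To make the contrast concrete, here is a sketch that builds the same result twice, once with an iterrows loop and once with split plus explode. The data and names are illustrative; on a frame this small the speed difference is invisible, but the loop scales poorly.

```python
import pandas as pd

df = pd.DataFrame({"Key": ["x", "y"], "Values": ["a,b", "c"]})

# Row-by-row version: easy to read, slow on large frames.
rows = []
for _, row in df.iterrows():
    for piece in row["Values"].split(","):
        rows.append({"Key": row["Key"], "Values": piece})
looped = pd.DataFrame(rows)

# Vectorized version: let pandas do the looping internally.
vectorized = (
    df.assign(Values=df["Values"].str.split(","))
      .explode("Values")
      .reset_index(drop=True)
)
```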

Explode like a pro

Tidy data and explode: Best buddies

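In tidy dataframe land, each variable is a column and each observation is a row. After the explode command, make sure your dataframe is still tidy. If an intermediate step left the column under a temporary name, rename it back and, if you like, sort the columns alphabetically (new_col_name is a placeholder for that temporary name):

df = df.rename(columns={new_col_name: 'Original_Col_Name'}).sort_index(axis=1)  # Gotta keep it clean and tidy.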
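Real-world strings are rarely as clean as 'a,b,c'. The sketch below shows one way to tidy up after exploding, by stripping stray whitespace, dropping empty pieces left by trailing commas, and renumbering the rows. The sample data is invented to show those cases.

```python
import pandas as pd

df = pd.DataFrame({"Key": ["x", "y"], "Values": ["a, b,", " c"]})

tidy = (
    df.assign(Values=df["Values"].str.split(","))
      .explode("Values")
      .assign(Values=lambda d: d["Values"].str.strip())  # remove stray spaces
      .loc[lambda d: d["Values"] != ""]                  # drop empty pieces
      .reset_index(drop=True)
)
print(tidy)
```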

Beyond pandas: tidy_split

Community-written helpers such as tidy_split, which is not part of pandas itself, wrap the split-and-explode pattern in a single function that retains column order and handles missing values.
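The sketch below shows what such a helper could look like when built on explode(). It is a hypothetical stand-in named tidy_split_sketch, not the original tidy_split implementation, and its handling of missing values (keeping them as a single row) is a design choice made here for illustration.

```python
import pandas as pd

def tidy_split_sketch(df, column, sep=","):
    """Hypothetical helper: split `column` on `sep`, one row per piece.

    Column order is unchanged, and rows whose value is missing are
    kept as a single row with a missing value.
    """
    out = df.copy()
    out[column] = out[column].str.split(sep)
    return out.explode(column).reset_index(drop=True)

df = pd.DataFrame({"Values": ["a,b", None, "c"], "Key": [1, 2, 3]})
result = tidy_split_sketch(df, "Values")
print(result)
```

If you would rather discard rows with missing values, calling dropna(subset=[column]) on the result is one option.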

Picking the right tool

Explore the Python community's wisdom to choose the solution that's best for your specific case. Consider data structure, dataframe size, and the Pandas version.