Explain Codes LogoExplain Codes Logo

How to split a dataframe string column into two columns?

python
dataframe
regex
extract
Anton ShumikhinbyAnton Shumikhin·Nov 18, 2024
TLDR

str.split() is the basic tool in your toolbox, used with expand=True to split a dataframe column.

# For those who skipped their breakfast and want quick bites # Assuming 'data' is your dataframe and 'column' is the complex sandwich to split data[['bread', 'filling']] = data['column'].str.split(' ', 1, expand=True)

Here, data is now served with two fresh columns bread and filling. Enjoy your meal!

Taking splits seriously: Using regex and extract()

Splitting is not always as easy as halving a sandwich. Sometimes you need to surgically extract parts with precision. Harness the power of str.extract() with regex:

# Do you remember doing surgery in Operation game? Bringing those childhood skills back! df[['first', 'second']] = df['column'].str.extract(r'(\w+)\W+(\w+)')

Your regex skills matter here! Make sure patterns match the desired content. Here's a power-up: named groups

df[['first', 'second']] = df['column'].str.extract(r'(?P<first>\w+)\W+(?P<second>\w+)')

Handling missing values

Uneven splits throw NaN messy pie at you. Make sure you have your apron on!

df['column'].str.split(' ', expand=True).notnull().all(axis=1)

This wipes the mess and checks for rows without missing values in the new split columns.

Naming and moving on: Rename, join, drop

After splitting, move on like a pro. Chain rename() and join() to your dataframe:

# The weirdest conga line ever happening here! df = df.join(df['column'].str.split(' ', 1, expand=True) .rename(columns={0: 'first', 1: 'second'}))

Once separated, break up officially with your original column with df.drop():

# It's not you, it's me. Goodbye 'column'... df.drop('column', axis=1, inplace=True)

Tricky splits? No problem!

Don’t panic when your splitting task looks like a puzzle. Regex is the master key!

df['column'].str.split('\s+', expand=True)

This searches for hidden spaces inside your data!

Maintain the sanity of your data. During splitting dates and times, double-check that you aren’t distorting any formats:

df[['date', 'time']] = df['datetime'].str.split('T', expand=True)

Using .str.get() or .str[index] accesses elements after split, like opening an easter egg to find the inside treats!

# Want a surprise egg? Here you go... df['first_word'] = df['column'].str.split().str.get(0)

Creative extraction & the power of extractall()

For those who love to dig deeper, using .str.extractall() opens up a world of possibilities:

matches = df['column'].str.extractall(r'(\b\w+\b)')

This method comes handy when your column is like a rabbit hole filled with surprises!