
Change column type in pandas

python
pandas
dataframe
type-conversion
by Nikita Barsukov · Aug 6, 2024
TLDR

For a quick type conversion in a pandas DataFrame, use the astype() method. To convert a column to strings:

df['col'] = df['col'].astype(str) # Now the column is string, it can't do math!

To keep missing values in an integer column, use the nullable Int64 dtype (note the capital I):

df['col'] = df['col'].astype('Int64') # Int64 not 65, sorry it's not old enough!

For better memory efficiency, switch to categorical:

df['col'] = df['col'].astype('category') # Cat lover but doggos are welcome too.

And if you have multiple columns to wrangle:

df[['col1', 'col2']] = df[['col1', 'col2']].astype('float64') # Swap in your target dtype. This code is a multi-tasker, like you!
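As a runnable sketch of the conversions above: astype() also accepts a dict that maps column names to dtypes, so several columns can each get a different type in one call. The DataFrame and its column contents are made up for illustration.

```python
import pandas as pd

# Hypothetical sample data: numbers stored as text, plus a float column with a gap
df = pd.DataFrame({
    "col1": ["1", "2", "3"],
    "col2": ["4.5", "5.5", "6.5"],
    "col3": [1.0, None, 3.0],
})

# One astype() call, one target dtype per column
df = df.astype({"col1": "int64", "col2": "float64", "col3": "Int64"})

print(df.dtypes)
print(df["col3"].isna().sum())  # the missing value is still missing after the cast
```

The dict form leaves every column you don't mention untouched.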

Step-by-step conversions

Surviving the Jungle of Numeric Conversion

If your data is like a jungle with numerical data hiding in strings, pd.to_numeric() is your machete:

df['col'] = pd.to_numeric(df['col'], errors='coerce', downcast='float') # Float like a butterfly, sting like a bee.

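To get the input back unchanged when it can't be parsed, there is errors='ignore'. Note that this option is deprecated as of pandas 2.2, so prefer errors='coerce' or catch the exception yourself:

df['col'] = pd.to_numeric(df['col'], errors='ignore') # Keep calm and ignore errors (deprecated in pandas 2.2+).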
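A minimal sketch of the two behaviours, assuming a made-up column of text values. Because errors='ignore' is deprecated in pandas 2.2 and later, the "leave it alone on failure" behaviour is written here as a small hypothetical helper, to_numeric_or_keep, which is not part of pandas.

```python
import pandas as pd

# Made-up data: two parseable values and one that is not
raw = pd.Series(["1", "2.5", "oops"], dtype="object")

# errors='coerce': entries that cannot be parsed become NaN
coerced = pd.to_numeric(raw, errors="coerce")


def to_numeric_or_keep(series):
    """Hypothetical helper: convert if every value parses, else return the input unchanged."""
    try:
        return pd.to_numeric(series)
    except (ValueError, TypeError):
        return series


kept = to_numeric_or_keep(raw)            # "oops" blocks the conversion
clean = to_numeric_or_keep(raw.iloc[:2])  # every value parses, so this converts

print(coerced.tolist())
print(kept.tolist())
print(clean.tolist())
```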

The Gentle Giant of Object Conversion

df.infer_objects() is the gentle giant that can upgrade 'object' columns to more specific types when the values already are numbers or booleans. It does not parse strings such as '1':

df = df.infer_objects() # Like finding a needle in a haystack...but easier.
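To see what infer_objects() does and does not do, here is a small sketch with a hypothetical frame whose columns were both loaded as 'object'. One holds real integers, the other holds digits stored as text.

```python
import pandas as pd

# Hypothetical frame where every column was loaded as generic 'object'
df = pd.DataFrame({"a": [1, 2, 3], "b": ["1", "2", "3"]}, dtype="object")

inferred = df.infer_objects()

print(df.dtypes)
print(inferred.dtypes)
# 'a' held real integers, so it becomes an integer column;
# 'b' holds text, so it is not turned into numbers
```

For column 'b' you would still need pd.to_numeric().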

Convert dtypes: Your trusty toolbox

df.convert_dtypes() is your trusty toolbox that identifies the right tool (type) and uses it:

df = df.convert_dtypes() # Work smarter, not harder!

Skip the object-inference step when you don't need it:

df = df.convert_dtypes(infer_objects=False) # Because sometimes, you want to drive the car yourself.
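A short sketch of convert_dtypes() on made-up data. Missing values force the first column to float64 and the second to object, and convert_dtypes() moves both to nullable dtypes that can hold the gap.

```python
import pandas as pd

df = pd.DataFrame({
    "ints": [1, 2, None],          # stored as float64 because of the missing value
    "flags": [True, False, None],  # stored as object for the same reason
})

converted = df.convert_dtypes()

print(df.dtypes)
print(converted.dtypes)
```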

Diving the depths of type casting

Avoid sinking your data: check that the values fit the target dtype before you cast.

if df['col'].min() < 0:
    raise ValueError("Column contains negative values, cannot use unsigned type")
df['col'] = df['col'].astype('uint32') # Unsigned and proud!
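The check can be wrapped in a function that also guards the upper bound. This is a sketch only: to_uint32 is a hypothetical helper name, and it assumes an integer column with no missing values.

```python
import numpy as np
import pandas as pd


def to_uint32(series):
    """Hypothetical helper: cast to uint32 only if every value fits the range."""
    limits = np.iinfo(np.uint32)
    if series.min() < limits.min or series.max() > limits.max:
        raise ValueError("Values fall outside the uint32 range")
    return series.astype("uint32")


ok = to_uint32(pd.Series([0, 5, 10]))
print(ok.dtype)

try:
    to_uint32(pd.Series([-1, 5, 10]))
except ValueError as exc:
    print("refused:", exc)
```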

Knowing Your dtypes

Before any type conversion, take a quick glance at your current dtypes with df.dtypes:

print(df.dtypes) # Like reading the ingredients before cooking.

Then convert only the necessary columns:

df['col'] = df['col'].astype('float64') # Swap 'float64' for the dtype you need. Like changing the flavor of your data, yum!

There are 'hard' and 'soft' conversions; understand the difference:

# soft conversion, smooth as butter: returns the column unchanged if the cast fails
df['col'] = df['col'].astype('uint32', errors='ignore')
# hard conversion, like your workout: raises an exception if the cast fails
df['col'] = df['col'].astype('uint32')
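A sketch of the difference on made-up data, where one value cannot become an integer. The soft cast hands back the original values without complaint, which is why it is worth checking the dtype afterwards.

```python
import pandas as pd

# Made-up data; "abc" cannot become an integer
s = pd.Series(["1", "abc"], dtype="object")

# Soft: the failed cast is swallowed and the original values come back
soft = s.astype("uint32", errors="ignore")

# Hard: the same cast raises
try:
    s.astype("uint32")
    hard_failed = False
except ValueError:
    hard_failed = True

print(soft.tolist())
print(soft.dtype)   # still not uint32
print(hard_failed)
```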

Handling mixed columns

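Have columns with mixed types or numbers stored as text? pd.to_numeric() with errors='coerce' clears the clutter by turning anything it can't parse into NaN. Call it on the whole column rather than through apply(), which converts one value at a time:

df['col'] = pd.to_numeric(df['col'], errors='coerce') # Clutter? What clutter?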
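Since coercion replaces bad values silently, it can help to list what was lost. The sketch below uses a made-up mixed column and compares the missing-value masks before and after.

```python
import pandas as pd

# Made-up mixed column: ints, numeric text, junk text, a float, and a gap
mixed = pd.Series([1, "2", "three", 4.5, None], dtype="object")

numeric = pd.to_numeric(mixed, errors="coerce")

# Rows that had a value before but are NaN now are the ones that failed to parse
failed = mixed[numeric.isna() & mixed.notna()]

print(numeric.tolist())
print(failed.tolist())
```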

Memory optimization

To save memory, use pd.to_numeric(df['col'], downcast='integer'):

df['col'] = pd.to_numeric(df['col'], downcast='integer') # Because downcast is the new cool!
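To check what downcasting buys you, compare memory_usage() before and after. A small sketch with made-up values that all fit in a single byte:

```python
import pandas as pd

s = pd.Series([1, 2, 3, 100], dtype="int64")

# downcast='integer' picks the smallest signed integer dtype that holds every value
small = pd.to_numeric(s, downcast="integer")

print(s.dtype, s.memory_usage(index=False))
print(small.dtype, small.memory_usage(index=False))
```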

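Converting repetitive, low-cardinality columns to categoricals can save memory too. Converting to str usually does not, because text columns tend to take more space than numeric ones, so do it only when you need strings:

df['col'] = df['col'].astype('category') # Like a zoo but for your data.
df['col'] = df['col'].astype(str) # 'cos words have power!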
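A sketch of the categorical saving on a made-up column with only three distinct values repeated many times. The exact byte counts depend on your pandas version and string backend, so none are quoted here.

```python
import pandas as pd

# Made-up column: 3,000 rows but only three distinct labels
colors = pd.Series(["red", "green", "blue"] * 1000)

as_category = colors.astype("category")

print(colors.memory_usage(deep=True))
print(as_category.memory_usage(deep=True))
print(list(as_category.cat.categories))
```

The saving shrinks as the number of distinct values approaches the number of rows.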

Pitfalls of bad casting

Avoid data corruption by not forcing the wrong dtype:

# if it's a negative, it's a no-go for casting to unsigned numbers!
if df['col'].min() < 0:
    raise ValueError("Column contains negative values, cannot use unsigned type")
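Negative values are one way a cast can go wrong; a float-to-integer cast that drops fractions is another. One possible guard, sketched here as a hypothetical helper named cast_if_lossless, is to cast and then confirm the values still compare equal. It assumes a numeric column with no missing values.

```python
import pandas as pd


def cast_if_lossless(series, dtype):
    """Hypothetical helper: cast, then confirm the values still compare equal."""
    converted = series.astype(dtype)
    if not (converted == series).all():
        raise ValueError(f"Casting to {dtype} would change some values")
    return converted


whole = cast_if_lossless(pd.Series([1.0, 2.0, 3.0]), "int64")
print(whole.dtype)

try:
    cast_if_lossless(pd.Series([1.5, 2.0]), "int64")  # 1.5 would be truncated to 1
except ValueError as exc:
    print("refused:", exc)
```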

Powerful performance strategies

Prioritize which columns to cast. Converting only the few columns that matter most is faster than converting the whole DataFrame.
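One way to decide which columns to handle first, sketched under the assumption that memory footprint is what you care about: rank the columns by memory_usage(deep=True) and convert the heaviest one. The frame and column names are made up.

```python
import pandas as pd

# Made-up frame: one repetitive text column, one integer column, one boolean column
df = pd.DataFrame({
    "city": ["Copenhagen", "Montevideo"] * 1000,
    "visits": range(2000),
    "active": [True, False] * 1000,
})

# Rank columns by memory footprint, largest first
usage = df.memory_usage(deep=True, index=False).sort_values(ascending=False)
heaviest = usage.index[0]
before = usage.sum()

# Convert only the heaviest column and leave the rest alone
df = df.astype({heaviest: "category"})
after = df.memory_usage(deep=True, index=False).sum()

print(heaviest)
print(before, after)
```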