Creating an empty Pandas DataFrame, and then filling it
You can create an empty DataFrame with pd.DataFrame(columns=['A', 'B', 'C']) and populate it one row at a time with df.loc[...]. You can also add a row with df = df.append(row, ignore_index=True), but DataFrame.append is deprecated and was removed in pandas 2.0, so pd.concat is the better way to add multiple rows at once.
Example:
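A minimal sketch of both approaches, with placeholder values standing in for real data:

```python
import pandas as pd

# Start with an empty DataFrame that only declares the columns
df = pd.DataFrame(columns=['A', 'B', 'C'])

# Add rows one at a time by label with .loc
df.loc[0] = [1, 2, 3]
df.loc[1] = [4, 5, 6]

# Add several rows at once by concatenating another DataFrame
more = pd.DataFrame([{'A': 7, 'B': 8, 'C': 9}, {'A': 10, 'B': 11, 'C': 12}])
df = pd.concat([df, more], ignore_index=True)

print(df)
```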
Building your data collection framework
Accumulate Data in Lists
Begin by collecting data in simple list structures. Why? Appending to a Python list is cheap and memory-efficient, so you can keep appending until you're done and create the DataFrame in one go at the end.
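A minimal sketch of the list-first pattern (the field names and values are made up for illustration):

```python
import pandas as pd

# Collect rows as plain dicts (or tuples) in a Python list first;
# appending to a list is cheap compared to growing a DataFrame row by row.
rows = []
for i in range(1000):
    rows.append({'id': i, 'value': i * 2})  # hypothetical fields

# Build the DataFrame once, at the end
df = pd.DataFrame(rows)
```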
Unleash the power of numpy
When you're dealing with numerical time series data, lean on numpy for its raw mathematical power: perform the computations there, then transform the result into a pandas DataFrame.
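For example, assuming a synthetic series of measurements, the number-crunching can happen entirely in NumPy before the DataFrame is built:

```python
import numpy as np
import pandas as pd

# Hypothetical numeric time series, generated for illustration
values = np.random.default_rng(0).normal(size=10_000)

# Do the heavy math in NumPy...
cumulative = np.cumsum(values)
smoothed = np.convolve(values, np.ones(5) / 5, mode='same')  # simple moving average

# ...then wrap the results in a DataFrame once
df = pd.DataFrame({'raw': values, 'cumulative': cumulative, 'smoothed': smoothed})
```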
Split the workload with batch processing
For gigantic datasets, split your data and use batching. It's all about portion control: working in manageable chunks goes easier on memory and bulks up performance.
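One way to sketch this, assuming the data arrives as an iterable of records and using a made-up batch size:

```python
import pandas as pd

def records():  # hypothetical source of many records
    for i in range(1_000_000):
        yield {'id': i, 'value': i % 7}

batch_size = 100_000
batch, frames = [], []

for record in records():
    batch.append(record)
    if len(batch) == batch_size:
        frames.append(pd.DataFrame(batch))  # convert each batch separately
        batch = []

if batch:  # don't forget the final partial batch
    frames.append(pd.DataFrame(batch))

df = pd.concat(frames, ignore_index=True)
```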
Structuring your DataFrame for the long haul
Begin with the end in mind
When initializing your DataFrame, simply state the column names using pd.DataFrame(columns=...). Steer clear of initializing with NaNs unless you really need placeholder rows.
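A quick sketch of declaring the schema up front; the column names and dtypes here are purely illustrative:

```python
import pandas as pd

# Declaring only the columns gives you an empty frame with the right shape
df = pd.DataFrame(columns=['timestamp', 'price', 'volume'])

# If you already know the dtypes, pin them too, so later inserts don't
# silently end up in catch-all object columns
df = df.astype({'timestamp': 'datetime64[ns]', 'price': 'float64', 'volume': 'int64'})
```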
Life is too short for slow code
Avoid iteratively appending rows to a DataFrame inside a loop. Also, remember that df.append() was deprecated in pandas 1.4 and removed in pandas 2.0. So, it's good to update your approach!
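A rough before-and-after, with made-up row data:

```python
import pandas as pd

rows = [{'A': i, 'B': i ** 2} for i in range(5)]  # hypothetical rows

# Slow and deprecated: df.append inside a loop (removed in pandas 2.0)
# df = pd.DataFrame(columns=['A', 'B'])
# for row in rows:
#     df = df.append(row, ignore_index=True)

# Preferred: accumulate first, then build the DataFrame once
df = pd.DataFrame(rows)
```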
Run Forrest, run!
Keep an eye on your DataFrame's memory usage, especially for the larger ones. df.memory_usage(deep=True) shows the per-column cost, and choosing efficient data types can help you break the wall of inefficiency.
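A small sketch of checking memory and downcasting; the columns are hypothetical:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    'id': np.arange(1_000_000),                      # int64 by default
    'flag': np.random.randint(0, 2, 1_000_000),
    'group': np.random.choice(['a', 'b', 'c'], 1_000_000),
})

print(df.memory_usage(deep=True))  # bytes per column

# Downcast numerics and use categoricals for low-cardinality strings
df['id'] = pd.to_numeric(df['id'], downcast='unsigned')
df['flag'] = df['flag'].astype('int8')
df['group'] = df['group'].astype('category')

print(df.memory_usage(deep=True))  # noticeably smaller
```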
Time is money
Always remember to benchmark your methods. Time them with %timeit and enjoy your coffee while your code runs faster.
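For instance, in an IPython or Jupyter session you could compare the list-first build against per-row concatenation on toy data:

```python
import pandas as pd

# %timeit is an IPython/Jupyter magic, so run these in a notebook or IPython shell
%timeit pd.DataFrame([{'A': i, 'B': i * 2} for i in range(1_000)])
%timeit pd.concat([pd.DataFrame([{'A': i, 'B': i * 2}]) for i in range(1_000)], ignore_index=True)
```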
Tips to fill up your DataFrame
Zeroes first, questions later
If you want to begin with a DataFrame pre-filled with zeroes rather than NaNs, simply create one using df = pd.DataFrame(np.zeros((rows, cols))).
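For example, assuming you know the shape and want named columns:

```python
import numpy as np
import pandas as pd

rows, cols = 5, 3
df = pd.DataFrame(np.zeros((rows, cols)), columns=['A', 'B', 'C'])  # zero-filled, float64
```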
Concatenation is your friend
Using pd.concat keeps your indexes in check; throwing ignore_index=True into the mix resets the index and gets rid of the index confetti (duplicate labels carried over from the pieces).
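A minimal sketch with two toy frames:

```python
import pandas as pd

part1 = pd.DataFrame({'A': [1, 2]})
part2 = pd.DataFrame({'A': [3, 4]})

# Without ignore_index the result keeps index 0, 1, 0, 1; with it, a clean 0..3
df = pd.concat([part1, part2], ignore_index=True)
```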
Loop like a pro
Iterating over a DataFrame can be a necessary evil. For such cases, positional slicing with .iloc is your best bet for speed. Then you can take these thrifty frames and concatenate them in bulk.
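One way to picture this, with a hypothetical chunking of an existing frame:

```python
import numpy as np
import pandas as pd

source = pd.DataFrame({'value': np.arange(10)})  # hypothetical input frame

chunks = []
for start in range(0, len(source), 4):
    chunk = source.iloc[start:start + 4]               # positional slicing with .iloc
    chunks.append(chunk.assign(doubled=chunk['value'] * 2))

result = pd.concat(chunks, ignore_index=True)          # combine the small frames in one go
```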
Documentation is your guiding light
The latest pandas documentation is your go-to guide for wisdom. It's updated regularly to flag deprecated methods and document freshly optimized ones.