Creating an empty Pandas DataFrame, and then filling it
You can create an empty DataFrame with pd.DataFrame(columns=['A', 'B', 'C']) and populate it one row at a time with df.loc[...]. You can also add a row with df = df.append(row, ignore_index=True), but DataFrame.append is deprecated and was removed in pandas 2.0, so pd.concat is the better way to add multiple rows at once.
Example:
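A minimal sketch of both approaches, with placeholder values standing in for real data:

```python
import pandas as pd

# Start with an empty DataFrame that only declares the columns
df = pd.DataFrame(columns=['A', 'B', 'C'])

# Add rows one at a time by label with .loc
df.loc[0] = [1, 2, 3]
df.loc[1] = [4, 5, 6]

# Add several rows at once by concatenating another DataFrame
more = pd.DataFrame([{'A': 7, 'B': 8, 'C': 9}, {'A': 10, 'B': 11, 'C': 12}])
df = pd.concat([df, more], ignore_index=True)

print(df)
```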
Building your data collection framework
Accumulate Data in Lists
Begin by collecting data in simple list structures. Why? Appending to a Python list is cheap and memory-efficient, so you can keep appending until you're done and create the DataFrame in one go at the end.
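A minimal sketch of the list-first pattern (the field names and values are made up for illustration):

```python
import pandas as pd

# Collect rows as plain dicts (or tuples) in a Python list first;
# appending to a list is cheap compared to growing a DataFrame row by row.
rows = []
for i in range(1000):
    rows.append({'id': i, 'value': i * 2})  # hypothetical fields

# Build the DataFrame once, at the end
df = pd.DataFrame(rows)
```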
Unleash the power of numpy
When you're dealing with numerical time series data, lean on numpy for its raw mathematical power: perform the computations there, then transform the result into a pandas DataFrame.
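For example, assuming a synthetic series of measurements, the number-crunching can happen entirely in NumPy before the DataFrame is built:

```python
import numpy as np
import pandas as pd

# Hypothetical numeric time series, generated for illustration
values = np.random.default_rng(0).normal(size=10_000)

# Do the heavy math in NumPy...
cumulative = np.cumsum(values)
smoothed = np.convolve(values, np.ones(5) / 5, mode='same')  # simple moving average

# ...then wrap the results in a DataFrame once
df = pd.DataFrame({'raw': values, 'cumulative': cumulative, 'smoothed': smoothed})
```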
Split the workload with batch processing
For gigantic datasets, split your data and use batching. It's all about portion control: working in manageable chunks goes easier on memory and bulks up performance.
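One way to sketch this, assuming the data arrives as an iterable of records and using a made-up batch size:

```python
import pandas as pd

def records():  # hypothetical source of many records
    for i in range(1_000_000):
        yield {'id': i, 'value': i % 7}

batch_size = 100_000
batch, frames = [], []

for record in records():
    batch.append(record)
    if len(batch) == batch_size:
        frames.append(pd.DataFrame(batch))  # convert each batch separately
        batch = []

if batch:  # don't forget the final partial batch
    frames.append(pd.DataFrame(batch))

df = pd.concat(frames, ignore_index=True)
```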
Structuring your DataFrame for the long haul
Begin with the end in mind
When initializing your DataFrame, simply state the column names using pd.DataFrame(columns=...). Steer clear of initializing with NaNs unless you really need placeholder rows.
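A quick sketch of declaring the schema up front; the column names and dtypes here are purely illustrative:

```python
import pandas as pd

# Declaring only the columns gives you an empty frame with the right shape
df = pd.DataFrame(columns=['timestamp', 'price', 'volume'])

# If you already know the dtypes, pin them too, so later inserts don't
# silently end up in catch-all object columns
df = df.astype({'timestamp': 'datetime64[ns]', 'price': 'float64', 'volume': 'int64'})
```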
Life is too short for slow code
Avoid iteratively appending rows to a DataFrame inside a loop. Also, remember that df.append() was deprecated in pandas 1.4 and removed in pandas 2.0. So, it's good to update your approach!
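A rough before-and-after, with made-up row data:

```python
import pandas as pd

rows = [{'A': i, 'B': i ** 2} for i in range(5)]  # hypothetical rows

# Slow and deprecated: df.append inside a loop (removed in pandas 2.0)
# df = pd.DataFrame(columns=['A', 'B'])
# for row in rows:
#     df = df.append(row, ignore_index=True)

# Preferred: accumulate first, then build the DataFrame once
df = pd.DataFrame(rows)
```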
Run Forrest, run!
Keep an eye on your DataFrame's memory usage, especially for the larger ones. df.memory_usage(deep=True) shows the per-column cost, and choosing efficient data types can help you break the wall of inefficiency.
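A small sketch of checking memory and downcasting; the columns are hypothetical:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    'id': np.arange(1_000_000),                      # int64 by default
    'flag': np.random.randint(0, 2, 1_000_000),
    'group': np.random.choice(['a', 'b', 'c'], 1_000_000),
})

print(df.memory_usage(deep=True))  # bytes per column

# Downcast numerics and use categoricals for low-cardinality strings
df['id'] = pd.to_numeric(df['id'], downcast='unsigned')
df['flag'] = df['flag'].astype('int8')
df['group'] = df['group'].astype('category')

print(df.memory_usage(deep=True))  # noticeably smaller
```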
Time is money
Always remember to benchmark your methods. Time them with %timeit and enjoy your coffee while your code runs faster.
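For instance, in an IPython or Jupyter session you could compare the list-first build against per-row concatenation on toy data:

```python
import pandas as pd

# %timeit is an IPython/Jupyter magic, so run these in a notebook or IPython shell
%timeit pd.DataFrame([{'A': i, 'B': i * 2} for i in range(1_000)])
%timeit pd.concat([pd.DataFrame([{'A': i, 'B': i * 2}]) for i in range(1_000)], ignore_index=True)
```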
Tips to fill up your DataFrame
Zeroes first, questions later
If you want to begin with a DataFrame pre-filled with zeroes rather than NaNs, simply create one using df = pd.DataFrame(np.zeros((rows, cols))).
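For example, assuming you know the shape and want named columns:

```python
import numpy as np
import pandas as pd

rows, cols = 5, 3
df = pd.DataFrame(np.zeros((rows, cols)), columns=['A', 'B', 'C'])  # zero-filled, float64
```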
Concatenation is your friend
Using pd.concat keeps your indexes in check; throwing ignore_index=True into the mix resets the index and gets rid of the index confetti (duplicate labels carried over from the pieces).
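A minimal sketch with two toy frames:

```python
import pandas as pd

part1 = pd.DataFrame({'A': [1, 2]})
part2 = pd.DataFrame({'A': [3, 4]})

# Without ignore_index the result keeps index 0, 1, 0, 1; with it, a clean 0..3
df = pd.concat([part1, part2], ignore_index=True)
```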
Loop like a pro
Iterating over a DataFrame can be a necessary evil. For such cases, positional slicing with .iloc is your best bet for speed. Then you can take these thrifty frames and concatenate them in bulk.
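One way to picture this, with a hypothetical chunking of an existing frame:

```python
import numpy as np
import pandas as pd

source = pd.DataFrame({'value': np.arange(10)})  # hypothetical input frame

chunks = []
for start in range(0, len(source), 4):
    chunk = source.iloc[start:start + 4]               # positional slicing with .iloc
    chunks.append(chunk.assign(doubled=chunk['value'] * 2))

result = pd.concat(chunks, ignore_index=True)          # combine the small frames in one go
```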
Documentation is your guiding light
The latest pandas documentation is your go-to guide for wisdom. It's updated regularly to flag deprecated methods and document freshly optimized ones.