
How do I read a large csv file with pandas?

python
dataframe
pandas
performance
by Anton Shumikhin · Feb 13, 2025
TLDR

Marvel at the efficiency and memory thrift of pandas when chunksize is used to read large CSV files. Even gigantic line counts are tamed by this mighty chunk-by-chunk processing.

import pandas as pd

# Don't bite off more than you can chew
chunk_size = 10000

# Your CSV reader just went on a diet
chunks = pd.read_csv('large_file.csv', chunksize=chunk_size)

# Time for some chunk-by-chunk magic
for chunk in chunks:
    pass  # This is where the magic happens...

With the added efficiency of incremental loading and processing, your memory gets to have its cake and eat it too.

Primary strategies for efficient processing

Context manager for memory-conscious processing

For pandas version 1.2 and beyond, you can automate the clean-up after processing chunks by using the reader as a context manager:

with pd.read_csv('large_file.csv', chunksize=chunk_size) as reader:
    for chunk in reader:
        pass  # Any magical operation you want to perform

Cut down unnecessary load: usecols and dtype usage

Minimize the memory footprint by loading only the columns you need and telling pandas their data types before it digests them:

selected_types = {'id': 'int32', 'value': 'float32'}
selected_cols = ['id', 'value', 'stamp']

chunks = pd.read_csv('large_file.csv',
                     usecols=selected_cols,
                     dtype=selected_types,
                     chunksize=chunk_size)

Harness the power of categorical data types to treat memory like royalty.
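
As a sketch, assuming a low-cardinality 'category' column like the one used in the grouping example further below, you can declare it categorical straight in read_csv:

import pandas as pd

chunk_size = 10000

# 'category' is assumed to be a low-cardinality text column in your file
memory_friendly_types = {'id': 'int32', 'value': 'float32', 'category': 'category'}

chunks = pd.read_csv('large_file.csv',
                     usecols=['id', 'value', 'category'],
                     dtype=memory_friendly_types,
                     chunksize=chunk_size)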

Collective wisdom: Grouping and aggregation

Pandas groupby comes to the rescue for distilling meaningful data through aggregation:

aggregated_chunks = []

for chunk in pd.read_csv('large_file.csv', chunksize=chunk_size):
    aggregated = chunk.groupby(['category']).agg('sum')
    aggregated_chunks.append(aggregated)

# Re-aggregate, since the same category can appear in several chunks
final_df = pd.concat(aggregated_chunks).groupby(level=0).agg('sum')

Scale-up with distributed frameworks

Don't shy away from exploring distributed dataframe libraries like dask.dataframe and modin. They are pandas' doppelgängers, but they spread the computational work across your CPU cores or even multiple machines.
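
A minimal sketch with dask, assuming the dask package is installed; it mirrors the groupby-sum above, and nothing is actually computed until .compute() is called:

import dask.dataframe as dd

# Dask reads the CSV lazily and splits it into partitions under the hood
ddf = dd.read_csv('large_file.csv')

# Same groupby-sum as above, but spread across CPU cores
result = ddf.groupby('category')['value'].sum().compute()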

Advanced strategies and optimizations for large CSV files

Pickle as your sidekick

For multi-stage operations, save your chunks as pickle files (crisp and savoury data files, mind you) using pickle.HIGHEST_PROTOCOL for swift processing:

import pickle

for i, chunk in enumerate(chunks):
    # Pickles aren't just for burgers!
    chunk.to_pickle(f'tasty_chunk_{i}.pkl', protocol=pickle.HIGHEST_PROTOCOL)

Fetch these pickles for a comprehensive data analysis with the glob module.
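
One way to fetch them back, assuming the file names from the snippet above:

import glob
import pandas as pd

# Gather every pickled chunk and stitch them back into one dataframe
pickle_files = sorted(glob.glob('tasty_chunk_*.pkl'))
full_df = pd.concat([pd.read_pickle(f) for f in pickle_files], ignore_index=True)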

Your code speed-o-meter: Runtime tracking

Keep your eye on the runtime to hunt down and fix those sneaky performance bottlenecks.
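
A simple sketch of per-chunk timing with time.perf_counter:

import time
import pandas as pd

chunk_size = 10000
start = time.perf_counter()

for i, chunk in enumerate(pd.read_csv('large_file.csv', chunksize=chunk_size)):
    # ... your per-chunk processing goes here ...
    print(f"chunk {i} finished at {time.perf_counter() - start:.2f}s")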

Date-time rescue

Load your CSV file with date parsing up front and save yourself the later processing hassle by using the parse_dates option:

chunks = pd.read_csv('large_file.csv', parse_dates=['timestamp'], chunksize=chunk_size)

Direct read from S3

No need for middlemen; pandas can fetch CSV files straight from an S3 bucket. Just pass an 's3://.../path/to/file.csv' path to pd.read_csv.
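
A sketch, assuming the s3fs package is installed and that 'my-bucket' and the key below are placeholders for your own bucket and file:

import pandas as pd

# Requires the s3fs package; bucket name and key are placeholders
chunks = pd.read_csv('s3://my-bucket/path/to/large_file.csv', chunksize=10000)

for chunk in chunks:
    pass  # process each chunk as usual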

Time-bound indexing

Setting an index early (for example, on a timestamp column) leaves tardiness in the dust and fast-tracks operations like sorting and time-based slicing.
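
For example, indexing each chunk on the parsed timestamp column (a sketch reusing the parse_dates reader above; the date in the slice is just an example):

import pandas as pd

for chunk in pd.read_csv('large_file.csv', parse_dates=['timestamp'], chunksize=10000):
    # A sorted DatetimeIndex makes label-based time slicing fast
    chunk = chunk.set_index('timestamp').sort_index()
    recent = chunk.loc['2024-01':]  # example: everything from January 2024 onwards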

Lambda as your data janitor

Utilize the might of your custom lambda functions for per-row calculations:

chunk['new_column'] = chunk.apply(lambda row: my_func(row), axis=1)
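
Here my_func stands in for your own per-row logic; a placeholder definition, assuming the id and value columns from earlier, might look like this (and chunk.apply(my_func, axis=1) works just as well without the lambda wrapper):

def my_func(row):
    # Placeholder per-row calculation on the example columns
    return row['value'] * 2 if row['id'] % 2 == 0 else row['value']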

Meet pd.read_table

Not all files are comma-separated. Don't sweat it if you have a tab-separated file: pd.read_table lends you the same superpowers of chunking and column naming:

chunks = pd.read_table('large_file.tsv', names=['col1', 'col2'], chunksize=chunk_size)