
How do I read a large csv file with pandas?

python
dataframe
pandas
performance
by Anton Shumikhin · Feb 13, 2025
TLDR

Marvel at the efficiency and memory thrift of pandas when chunksize is used to read large CSV files. Even gigantic line counts are tamed by this mighty chunk-by-chunk processing.

import pandas as pd

# Don't bite off more than you can chew
chunk_size = 10000

# Your CSV reader just went on a diet
chunks = pd.read_csv('large_file.csv', chunksize=chunk_size)

# Time for some chunk-by-chunk magic
for chunk in chunks:
    pass  # This is where the magic happens...

With the added efficiency of incremental loading and processing, your memory gets to have its cake and eat it too.

Primary strategies for efficient processing

Context manager for memory-conscious processing

For pandas version 1.2 and beyond, you can automate the clean-up after processing chunks by using the reader as a context manager:

with pd.read_csv('large_file.csv', chunksize=chunk_size) as reader:
    for chunk in reader:
        pass  # Any magical operation you want to perform

Cut down unnecessary load: usecols and dtype usage

Minimize the memory footprint by loading only the columns you need and telling pandas their data types before it digests them:

selected_types = {'id': 'int32', 'value': 'float32'}
selected_cols = ['id', 'value', 'stamp']

chunks = pd.read_csv('large_file.csv',
                     usecols=selected_cols,
                     dtype=selected_types,
                     chunksize=chunk_size)

Harness the power of categorical data types to treat memory like royalty.
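
As a sketch, assuming a low-cardinality 'category' column like the one used in the grouping example further below, you can declare it categorical straight in read_csv:

import pandas as pd

chunk_size = 10000

# 'category' is assumed to be a low-cardinality text column in your file
memory_friendly_types = {'id': 'int32', 'value': 'float32', 'category': 'category'}

chunks = pd.read_csv('large_file.csv',
                     usecols=['id', 'value', 'category'],
                     dtype=memory_friendly_types,
                     chunksize=chunk_size)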

Collective wisdom: Grouping and aggregation

Pandas groupby comes to the rescue for distilling meaningful data through aggregation:

aggregated_chunks = []

for chunk in pd.read_csv('large_file.csv', chunksize=chunk_size):
    aggregated = chunk.groupby(['category']).agg('sum')
    aggregated_chunks.append(aggregated)

# Re-aggregate, since the same category can appear in several chunks
final_df = pd.concat(aggregated_chunks).groupby(level=0).agg('sum')

Scale-up with distributed frameworks

Don't shy away from exploring distributed dataframe libraries like dask.dataframe and modin. They are pandas' doppelgängers, but they spread the computational work across your CPU cores or even multiple machines.
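
A minimal sketch with dask, assuming the dask package is installed; it mirrors the groupby-sum above, and nothing is actually computed until .compute() is called:

import dask.dataframe as dd

# Dask reads the CSV lazily and splits it into partitions under the hood
ddf = dd.read_csv('large_file.csv')

# Same groupby-sum as above, but spread across CPU cores
result = ddf.groupby('category')['value'].sum().compute()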

Advanced strategies and optimizations for large CSV files

Pickle as your sidekick

For multi-stage operations, save your chunks as pickle files (crisp and savoury data files, mind you) using pickle.HIGHEST_PROTOCOL for swift processing:

import pickle

for i, chunk in enumerate(chunks):
    # Pickles aren't just for burgers!
    chunk.to_pickle(f'tasty_chunk_{i}.pkl', protocol=pickle.HIGHEST_PROTOCOL)

Fetch these pickles for a comprehensive data analysis with the glob module.
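
One way to fetch them back, assuming the file names from the snippet above:

import glob
import pandas as pd

# Gather every pickled chunk and stitch them back into one dataframe
pickle_files = sorted(glob.glob('tasty_chunk_*.pkl'))
full_df = pd.concat([pd.read_pickle(f) for f in pickle_files], ignore_index=True)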

Your code speed-o-meter: Runtime tracking

Keep your eye on the runtime to hunt down and fix those sneaky performance bottlenecks.
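
A simple sketch of per-chunk timing with time.perf_counter:

import time
import pandas as pd

chunk_size = 10000
start = time.perf_counter()

for i, chunk in enumerate(pd.read_csv('large_file.csv', chunksize=chunk_size)):
    # ... your per-chunk processing goes here ...
    print(f"chunk {i} finished at {time.perf_counter() - start:.2f}s")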

Date-time rescue

Load your CSV file with date parsing up front and save yourself the later processing hassle by using the parse_dates option:

chunks = pd.read_csv('large_file.csv', parse_dates=['timestamp'], chunksize=chunk_size)

Direct read from S3

No need for middlemen; pandas can fetch CSV files straight from an S3 bucket. Just pass an 's3://.../path/to/file.csv' path to pd.read_csv.
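
A sketch, assuming the s3fs package is installed and that 'my-bucket' and the key below are placeholders for your own bucket and file:

import pandas as pd

# Requires the s3fs package; bucket name and key are placeholders
chunks = pd.read_csv('s3://my-bucket/path/to/large_file.csv', chunksize=10000)

for chunk in chunks:
    pass  # process each chunk as usual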

Time-bound indexing

Setting an index early (for example, on a timestamp column) leaves tardiness in the dust and fast-tracks operations like sorting and time-based slicing.
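
For example, indexing each chunk on the parsed timestamp column (a sketch reusing the parse_dates reader above; the date in the slice is just an example):

import pandas as pd

for chunk in pd.read_csv('large_file.csv', parse_dates=['timestamp'], chunksize=10000):
    # A sorted DatetimeIndex makes label-based time slicing fast
    chunk = chunk.set_index('timestamp').sort_index()
    recent = chunk.loc['2024-01':]  # example: everything from January 2024 onwards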

Lambda as your data janitor

Utilize the might of your custom lambda functions for per-row calculations:

chunk['new_column'] = chunk.apply(lambda row: my_func(row), axis=1)
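
Here my_func stands in for your own per-row logic; a placeholder definition, assuming the id and value columns from earlier, might look like this (and chunk.apply(my_func, axis=1) works just as well without the lambda wrapper):

def my_func(row):
    # Placeholder per-row calculation on the example columns
    return row['value'] * 2 if row['id'] % 2 == 0 else row['value']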

Meet pd.read_table

Not all files are comma-separated. Don't sweat it if you have a tab-separated file: pd.read_table lends you the same superpowers of chunking and column naming:

chunks = pd.read_table('large_file.tsv', names=['col1', 'col2'], chunksize=chunk_size)