
Lazy Method for Reading Big File in Python?

python
performance
best-practices
tools
by Alex Kataev · Dec 22, 2024
TLDR

Read large files efficiently using with and for in Python:

with open('huge_file.txt') as file:
    for line in file:
        # It's a bird! It's a plane! No, it's your processing function!
        process(line)

Replace process(line) with your processing function. This code snippet keeps memory usage minimal by loading only one line into memory at a time.
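If you want something concrete to drop in while experimenting, here is a toy stand-in for process() (purely illustrative, not part of the original snippet):

def process(line):
    # Hypothetical placeholder: flag unusually long lines
    if len(line) > 120:
        print(f"Long line ({len(line)} chars): {line[:40]}...")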

For even more control over chunk sizes and the ability to include additional logic, you can modify this to include a generator using the yield keyword:

def read_in_chunks(file_path, chunk_size=1024*1024):
    with open(file_path, 'r') as file:
        while True:
            data = file.read(chunk_size)
            if not data:
                break
            yield data

Use it like:

for chunk in read_in_chunks('huge_file.txt'):
    # This is where the magic happens
    process(chunk)

This approach is beneficial when working with binary files or when processing text files whose records don't align with line boundaries.

Big Binary Files? No problem

For binary data or images, where line-by-line processing doesn't work, read in fixed-size chunks instead. Tune the chunk size to your system's capabilities so you stay efficient without risking memory overload.
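As a rough sketch (the file name, chunk size, and process() handler below are placeholders), a binary-friendly reader can look like this:

from functools import partial

def read_binary_chunks(file_path, chunk_size=64 * 1024):
    # Yield fixed-size byte chunks until f.read() returns b'' at EOF
    with open(file_path, 'rb') as f:
        yield from iter(partial(f.read, chunk_size), b'')

for chunk in read_binary_chunks('image.bin'):
    process(chunk)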

Dealing with Custom Delimiters in Text Files

If your text files use a non-standard row separator, you might need a custom function that reads and yields each 'line' based on this custom delimiter.

def custom_readline(f, delimiter='\n'):
    buffer = ''
    while True:
        chunk = f.read(4096)
        if not chunk:
            # Give me your last words!
            if buffer:
                yield buffer
            break
        buffer += chunk
        # On a hunt for the elusive delimiter!
        while True:
            position = buffer.find(delimiter)
            if position == -1:
                break
            yield buffer[:position]
            buffer = buffer[position + len(delimiter):]
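Usage mirrors the plain line-by-line loop; the file name and '|' delimiter here are just placeholders:

with open('records.txt') as f:
    for record in custom_readline(f, delimiter='|'):
        process(record)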

Tapping into mmap for File Access

64-bit systems can benefit from the mmap module, especially when files are too big to fit in memory. Memory mapping a file avoids copies, speeding up parsing for large files — but be alert to addressing issues on 32-bit systems:

import mmap

with open('huge_file.txt', mode='rb') as f:
    mm = mmap.mmap(f.fileno(), length=0, access=mmap.ACCESS_READ)
    for line in iter(mm.readline, b''):
        # Hello line, you're mine now
        process(line)
    # Bye for now!
    mm.close()
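Since Python 3.2, mmap objects are also context managers, so an equivalent variant (same placeholders as above) can let the with block handle cleanup for you:

import mmap

with open('huge_file.txt', mode='rb') as f:
    with mmap.mmap(f.fileno(), length=0, access=mmap.ACCESS_READ) as mm:
        for line in iter(mm.readline, b''):
            process(line)  # mm is closed automatically on exit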

Control the Flow with Buffers

Adjust buffer size in the open() function to control the amount of data per read. A smaller buffer can prove beneficial when dealing with slow networks or large data streams:

buffer_size = 64  # Travel light!

with open('big_file.txt', 'rb', buffering=buffer_size) as file:
    for bite in iter(lambda: file.read(buffer_size), b''):
        # Every byte you take, every read you make, I'll be watching you!
        process(bite)

Taking the Short Route with Assignment Expressions

Python 3.8's assignment expressions, dubbed the "walrus operator", let you create readable loops:

with open('huge_file.txt', 'r') as file:
    while (line := file.readline()):
        # If it exists, we can process it.
        process(line)

Testing the Waters with Chunk Sizes

Experimenting with chunk sizes can help strike the right balance between performance and memory management. Here’s a small code snippet to conduct this experiment:

import time

chunk_sizes = [1024, 2048, 4096, 8192, 16384]

for size in chunk_sizes:
    # Time to test the waters!
    start_time = time.time()
    for chunk in read_in_chunks('huge_file.txt', chunk_size=size):
        process(chunk)
    print(f"Chunk size {size} took {time.time() - start_time} seconds")

Storing the Processed Data Safely

Store each processed chunk in a separate file or a database to reduce the risk of data loss in case of a failure, and to allow the processing job to be interrupted and resumed in a controlled way.
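One simple pattern (a sketch that reuses read_in_chunks() from above and assumes a hypothetical transform() step) is to append each processed chunk to an output file as you go, so a failure only costs the chunk in flight:

def transform(chunk):
    # Hypothetical processing step; swap in your real logic
    return chunk.upper()

with open('processed_output.txt', 'a') as out:
    for chunk in read_in_chunks('huge_file.txt'):
        out.write(transform(chunk))
        out.flush()  # push each chunk to disk before reading the next one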
