How to cheaply count lines in a large file in Python?

by Anton Shumikhin · Jan 11, 2025

A quick, memory-efficient way to count lines in large files with Python:

with open('large_file.txt', 'r') as file:
    line_count = sum(1 for _ in file)  # Lazy iteration: only one line is held in memory at a time
print(line_count)

It opens the file and counts lines on the fly as the file iterator yields them. Because only one line is in memory at a time, memory use stays flat no matter how large the file is.

Enhancing the basic approach: Binary mode and beyond

The simplest solutions don't always meet the demands of the real world. Here's how to tweak your approach further:

Binary mode for optimal performance

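Opening files in binary mode ('rb') skips text decoding and newline translation, which usually makes counting faster. Memory use stays about the same, since lines are still read one at a time. Note that binary mode splits on '\n' only: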

with open('large_file.txt', 'rb') as file:
    line_count = sum(1 for _ in file)  # Each item is a bytes line ending in b'\n'; nothing gets decoded
print(line_count)

Counting lines at the speed of `mmap`

For an extra kick of speed, consider memory-mapped files via the mmap module. The OS pages the file in on demand instead of copying it through Python's read buffers, though the actual gain depends on your platform and file:

import mmap

with open('large_file.txt', 'rb') as file:  # Read-only access is enough; no need for 'r+'
    # Length 0 maps the whole file; mmap raises ValueError if the file is empty
    with mmap.mmap(file.fileno(), 0, access=mmap.ACCESS_READ) as mmapped_file:
        line_count = 0
        while mmapped_file.readline():  # readline() returns b'' at end of file
            line_count += 1
print(line_count)

Leverage Unix with subprocess

Flex your Unix muscle with subprocess and wc -l for the heavyweight champion of line counting:

import subprocess

proc = subprocess.run(['wc', '-l', 'large_file.txt'], capture_output=True)
if proc.returncode == 0:
    # split() tolerates the leading spaces that some wc implementations print
    line_count = int(proc.stdout.split()[0])
    print(line_count)
else:
    print("Error:", proc.stderr.decode().strip())  # Houston, we have a problem

Taking your line counting to the next level

Of course, there's always another level to beat. Let's turbocharge our line counting.

Improving efficiency through buffer management

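By fine-tuning the buffer size, you can cut the number of read system calls, so that disk I/O rather than Python overhead becomes the limiting factor. Whether a larger buffer actually helps depends on the system, so measure it: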

with open('large_file.txt', 'r', buffering=1<<16) as file:  # 1<<16 bytes = 64 KiB buffer
    line_count = sum(1 for _ in file)  # Same lazy count, fewer trips to the OS
print(line_count)
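The buffer idea can be pushed a step further by reading fixed-size binary chunks and counting newline bytes, which skips line splitting altogether. This is a minimal sketch: the function name count_lines_chunked and the 1 MiB default chunk size are illustrative choices, not something prescribed above.

```python
def count_lines_chunked(path, chunk_size=1 << 20):
    """Count lines by reading binary chunks and counting newline bytes."""
    newlines = 0
    last_byte = b""
    with open(path, "rb") as f:
        # iter() with a sentinel keeps calling f.read() until it returns b"" (EOF)
        for chunk in iter(lambda: f.read(chunk_size), b""):
            newlines += chunk.count(b"\n")
            last_byte = chunk[-1:]
    # A final line without a trailing newline still counts as a line
    if last_byte and last_byte != b"\n":
        newlines += 1
    return newlines
```

Calling count_lines_chunked('large_file.txt') should give the same number as the line-by-line loop in binary mode; only the way the bytes are fetched differs.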

Profiting from mmap

The mmap module maps the file into your process's address space, so you can search its bytes directly without copying them through read() calls:

# mmap for the win!
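One way to put that into practice is to search the mapped bytes for newlines with mmap.find instead of calling readline() repeatedly. The sketch below is one possible approach; count_lines_mmap is a hypothetical helper name, and the empty-file guard is there because mmap refuses to map a zero-length file.

```python
import mmap
import os


def count_lines_mmap(path):
    """Count lines by scanning a read-only memory map for newline bytes."""
    # mmap cannot map an empty file, so handle that case up front
    if os.path.getsize(path) == 0:
        return 0
    with open(path, "rb") as f:
        with mmap.mmap(f.fileno(), 0, access=mmap.ACCESS_READ) as mm:
            count = 0
            pos = mm.find(b"\n")
            while pos != -1:
                count += 1
                pos = mm.find(b"\n", pos + 1)
            # A final line without a trailing newline still counts as a line
            if mm[len(mm) - 1:] != b"\n":
                count += 1
            return count
```

Whether this beats plain iteration depends on the file and the platform, so treat it as a candidate to benchmark, not a guaranteed win.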

Playing the profiler game

Profiling with cProfile or timeit on your own files reveals the true king of the counting hill:

# Profile, tweak, repeat – the Pythonista's mantra for code prowess
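As a rough illustration of that loop, the sketch below times the text-mode and binary-mode counters with timeit. The sample file, its size, and the function names count_text and count_binary are made up for the demo; real measurements should use your own data and more repetitions.

```python
import os
import tempfile
import timeit


def count_text(path):
    with open(path, "r") as f:
        return sum(1 for _ in f)


def count_binary(path):
    with open(path, "rb") as f:
        return sum(1 for _ in f)


# Build a throwaway sample file so the demo does not depend on existing data
with tempfile.NamedTemporaryFile("wb", delete=False) as tmp:
    tmp.write(b"some sample line\n" * 100_000)
    sample_path = tmp.name

try:
    runs = 5
    for func in (count_text, count_binary):
        seconds = timeit.timeit(lambda: func(sample_path), number=runs)
        print(f"{func.__name__}: {seconds / runs:.4f} s per run")
finally:
    os.remove(sample_path)
```

The absolute numbers matter less than the ranking on your machine, and the ranking can change with file size, line length, and storage.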

Watch your step, coder: Pitfalls and hiccups

As you march on this treacherous path, watch out for these quirks and bottlenecks:

Portability matters

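Remember, methods like subprocess with wc -l only work where wc is installed, which usually means Unix-like systems; for code that runs everywhere, stick to Python-native approaches. Also note that wc -l counts newline characters, so it reports one fewer than the Python loops when the last line has no trailing newline.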
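If you want the speed of wc where it exists without giving up portability, one option is to check for it at runtime and fall back to pure Python. This is a sketch under that assumption; count_lines_portable is a hypothetical name, and the two branches can disagree by one when the file does not end with a newline.

```python
import shutil
import subprocess


def count_lines_portable(path):
    """Use wc -l when it is on PATH, otherwise count in pure Python."""
    wc = shutil.which("wc")  # None when no wc executable is found
    if wc is not None:
        # "--" stops option parsing, so odd file names are not read as flags
        result = subprocess.run([wc, "-l", "--", path], capture_output=True)
        if result.returncode == 0:
            # wc counts newline characters, not lines
            return int(result.stdout.split()[0])
    # Fallback: iterate over the file in binary mode
    with open(path, "rb") as f:
        return sum(1 for _ in f)
```

For files that end with a newline, both branches return the same number.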

Mind your memory

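Note that mmap, while good for speed, might falter for astronomical files. A mapping consumes virtual address space rather than physical RAM, so the practical limit mostly bites on 32-bit Python builds, where a multi-gigabyte file cannot be mapped in one piece.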

Guided by the Pythonic North Star

Python changes over time. For instance, the 'U' (universal newlines) mode flag was deprecated for years and removed in Python 3.11; universal newline handling is simply the default in text mode:

# Keep Python version in mind, lest "'U've got to move it, move it."
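To see what universal newlines mean for line counts today, the small demo below writes a file with mixed line endings and counts it in both modes. The file contents are invented for the illustration.

```python
import os
import tempfile

# Three lines, each with a different ending: \r\n, a lone \r, and \n
with tempfile.NamedTemporaryFile("wb", delete=False) as tmp:
    tmp.write(b"one\r\ntwo\rthree\n")
    path = tmp.name

# Default text mode treats \n, \r\n and \r all as line endings
with open(path, "r") as f:
    text_count = sum(1 for _ in f)

# Binary mode splits on b"\n" only, so the lone \r does not end a line
with open(path, "rb") as f:
    binary_count = sum(1 for _ in f)

os.remove(path)
print(text_count, binary_count)  # prints "3 2"
```

So if your files may contain old Mac-style '\r' endings, text mode and binary mode can legitimately disagree.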
