Python recursive folder read

python
file-handling
pathlib
io-best-practices
by Nikita Barsukov · Nov 28, 2024
TLDR

os.walk() is a powerful directory tree traversal function in Python. Below is a quick sample that will list all files beneath a given path:

    import os

    for dirpath, _, filenames in os.walk('./target_dir'):
        for filename in filenames:
            print(os.path.join(dirpath, filename))

Replace './target_dir' with your target directory. The loop prints the path of every file under that directory, descending recursively into all subdirectories. Paths are built from the starting point you passed in, so pass an absolute path if you need absolute results.
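If you want the paths back as a list, or want to skip certain folders, here is a minimal sketch built on the same os.walk() loop. The function name list_files and the default skip_dirs values are made up for illustration; the in-place edit of dirnames is the documented way to stop os.walk() from descending into a directory.

```python
import os


def list_files(root, skip_dirs=(".git", "__pycache__")):
    """Return sorted paths of all files under root, skipping some directories."""
    found = []
    for dirpath, dirnames, filenames in os.walk(root):
        # Editing dirnames in place tells os.walk() not to descend into
        # those directories (this relies on the default topdown=True).
        dirnames[:] = [d for d in dirnames if d not in skip_dirs]
        for filename in filenames:
            found.append(os.path.join(dirpath, filename))
    return sorted(found)
```

Calling list_files('./target_dir') returns the same paths the loop above prints, minus anything inside the skipped folders.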

Optimization tips

os.walk() is great, but dealing with filesystem operations often requires a little more thought. Here's some practical advice:

  • Dynamic paths: Use os.path.abspath() to convert relative paths into absolute ones consistently.
  • File handling: Use the with statement when dealing with files to take care of their lifecycle.
  • Avoid naming conflicts: Don't name variables after built-ins such as dir, list, or type. (file was a built-in only in Python 2, but filename or file_path is still the clearer choice.)
  • Path concatenation: Favor os.path.join() over string concatenation when dealing with paths.
  • Filtering files: Use glob.iglob() with a '**' pattern and recursive=True to filter files by extension (available in Python 3.5+).

Check out this optimized example:

    import glob
    import os

    # Transformers, roll out...to the absolute path!
    start_dir = os.path.abspath('./target_dir')

    # os.path.join() supplies the separator that plain '+' would leave out
    pattern = os.path.join(start_dir, '**', '*.txt')
    for file_path in glob.iglob(pattern, recursive=True):
        with open(file_path, 'r') as f:
            # Here be dragons (or files, actually)
            print(f'Processing {file_path}')

The resulting code is easier to maintain, more readable, and just looks cooler.

Pathlib fun

Made available in Python 3.4, the pathlib module provides object-oriented filesystem paths. Here's how you'd rewrite the previous snippet using pathlib:

    from pathlib import Path

    # What's absolute, helpful, and makes a path a path? That's right, it's Path.resolve()!
    start_dir = Path('./target_dir').resolve()

    # Look, no need for glob.iglob!
    for file_path in start_dir.rglob('*.txt'):
        with file_path.open('r') as f:
            # More file processing. Just my type! 🤣
            print(f'Processing {file_path}')

Are you feeling pathos towards pathlib yet?

How to handle I/O like a pro

While crawling directories is fun, you'll probably need to handle files at some point. Here are some best practices:

  • File modes: When opening a file, specify the mode explicitly ('r', 'w', etc.).
  • Error handling: Use try-except blocks to catch I/O errors and handle them gracefully.
  • Resource management: Context managers (with blocks) automatically clean up resources, such as closing the file, as soon as the block exits.

Try the following approach:

    from pathlib import Path

    try:
        with Path('./target_dir/file.txt').open('r') as f:
            # In space, no one can hear you read data
            data = f.read()
    except OSError as e:
        # IOError has been an alias of OSError since Python 3.3
        print(f'This does not compute! An I/O error occurred: {e.strerror}')
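The same pattern scales to a whole directory tree. The sketch below is one possible way to combine a recursive walk with per-file error handling, so that a single unreadable file does not abort the run. The name read_text_files, the default '*.txt' pattern, and the choice of UTF-8 are assumptions for this example, not requirements.

```python
from pathlib import Path


def read_text_files(root, pattern="*.txt"):
    """Map each matching file under root to its text, skipping unreadable files."""
    contents = {}
    for file_path in Path(root).rglob(pattern):
        if not file_path.is_file():
            continue  # rglob() can also match directories
        try:
            with file_path.open("r", encoding="utf-8") as f:
                contents[file_path] = f.read()
        except (OSError, UnicodeDecodeError) as e:
            # Report the problem and carry on with the next file
            print(f"Skipping {file_path}: {e}")
    return contents
```

Catching UnicodeDecodeError alongside OSError matters here because a file that opens fine can still fail to decode, and that error is not an I/O error.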

Safe, efficient traversals

For more guidance, consider:

  • Absolute paths: Safely handle paths passed as command-line arguments with os.path.abspath().
  • Check existence: Before proceeding, verify that paths exist using os.path.exists() or Path.exists().
  • Performance: For deep directory structures, benchmark os.walk() (which has used os.scandir() internally since Python 3.5), a hand-written os.scandir() loop, and pathlib, then pick the fastest for your workload.

By following these steps, your code will not just be functionally correct, but also effective and swift!
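To make these tips concrete, here is a rough sketch of a traversal that normalizes its input, checks it before walking, and uses os.scandir() directly. The function name iter_files is hypothetical, and the sketch checks os.path.isdir() rather than os.path.exists() so that a plain file passed by mistake is rejected too. Symbolic links are deliberately not followed, which avoids loops.

```python
import os


def iter_files(root):
    """Yield absolute paths of regular files under root using os.scandir()."""
    root = os.path.abspath(root)
    if not os.path.isdir(root):
        # Raised on first iteration, because this function is a generator
        raise NotADirectoryError(f"Not a directory: {root}")
    stack = [root]  # explicit stack instead of recursion
    while stack:
        current = stack.pop()
        with os.scandir(current) as entries:
            for entry in entries:
                if entry.is_dir(follow_symlinks=False):
                    stack.append(entry.path)
                elif entry.is_file(follow_symlinks=False):
                    yield entry.path
```

A typical call would be `for path in iter_files(sys.argv[1]): print(path)`, with the directory taken from the command line.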

Advanced glob patterns

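If you need to filter files by extension, glob.glob() and glob.iglob() accept a '**' pattern with recursive=True from Python 3.5 onward. Path.rglob() gives you the same recursive match without the flag: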

    from pathlib import Path

    # A wild glob appears. It uses Pattern Matching. It's super effective!
    for file_path in Path('.').rglob('*.py'):
        print(file_path)  # Print after landing a critical hit 🎮

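Keep in mind that if you build a glob pattern by string concatenation, the directory part must end with a separator ('/') before the '**'. Without it, the '**' is glued to the directory name and stops acting as a recursive wildcard. os.path.join() and pathlib add the separator for you.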
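A small helper can take the separator question off the table. This is only a sketch, with find_by_extension as an invented name: it builds the pattern with os.path.join(), so it should behave the same whether or not the caller includes a trailing slash.

```python
import glob
import os


def find_by_extension(directory, extension):
    """Return files under directory whose names end in extension, e.g. '.txt'."""
    # os.path.join() inserts the separator between the directory and '**',
    # so a trailing '/' on directory is optional.
    pattern = os.path.join(directory, "**", "*" + extension)
    return sorted(glob.glob(pattern, recursive=True))
```

Both find_by_extension('./target_dir', '.txt') and find_by_extension('./target_dir/', '.txt') search the same tree, including files that sit directly in the top-level directory.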