Import multiple CSV files into pandas and concatenate into one DataFrame

python
pandas
dataframe
vectorized-operations
by Anton Shumikhin · Feb 23, 2025
TLDR

pandas and glob together let you read multiple CSV files and combine them into a single DataFrame. The *.csv pattern matches every CSV file in the current directory, pd.read_csv() reads each file, and pd.concat() stacks the results:

import pandas as pd
import glob

# Fetch all CSV files (Feels like a CSV treasure hunt!)
csv_files = glob.glob('*.csv')

# Concatenate into one DataFrame (One DataFrame to rule them all)
combined_df = pd.concat((pd.read_csv(f) for f in csv_files), ignore_index=True)
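
To try the pattern without touching real data, here is a minimal sketch that writes two small CSV files to a temporary directory first. The file names, column names, and values are made up for illustration. Because glob returns matches in arbitrary order, the sketch sorts them so the row order is reproducible.

```python
import glob
import os
import tempfile

import pandas as pd

# Hypothetical sample data: two small CSVs written to a temporary directory.
tmp_dir = tempfile.mkdtemp()
pd.DataFrame({"id": [1, 2], "value": [10, 20]}).to_csv(
    os.path.join(tmp_dir, "part_a.csv"), index=False
)
pd.DataFrame({"id": [3], "value": [30]}).to_csv(
    os.path.join(tmp_dir, "part_b.csv"), index=False
)

# glob returns matches in arbitrary order, so sort for a reproducible result.
csv_files = sorted(glob.glob(os.path.join(tmp_dir, "*.csv")))

combined_df = pd.concat((pd.read_csv(f) for f in csv_files), ignore_index=True)

print(combined_df.shape)           # (3, 2)
print(combined_df.index.tolist())  # [0, 1, 2]
```

With ignore_index=True the result gets a fresh 0..n-1 index instead of repeating each file's own row numbers.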

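The real world isn't always tidy, and neither are file directories. Build the search pattern with os.path.join(), which inserts the correct path separator on every platform, and write the directory as a raw string (r prefix) so that backslashes in Windows paths are not interpreted as escape sequences: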

import os

# Full path ahead! (Unlike my career...)
file_path = os.path.join(r'your_directory', '*.csv')
csv_files = glob.glob(file_path)
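
The effect of the r prefix is easiest to see with a Windows-style path whose folder names happen to start with n or t. The directory name below is hypothetical and is only compared as a string, never opened.

```python
import os

# Without the r prefix, "\n" and "\t" inside the literal become a newline and a tab.
plain = "C:\new_data\tables"
# With the r prefix, every backslash is kept exactly as typed.
raw = r"C:\new_data\tables"

print(len(plain))     # 16
print(len(raw))       # 18
print("\n" in plain)  # True: the path now contains a newline character

# Build the glob pattern from the raw string.
pattern = os.path.join(raw, "*.csv")
print(pattern.endswith("*.csv"))  # True
```

The two lengths differ because each two-character escape sequence in the plain literal collapses into a single character.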

To track the source of data in the final DataFrame, use assign to add a new identifier column:

combined_df = pd.concat(
    (pd.read_csv(f).assign(filename=os.path.basename(f)) for f in csv_files),
    ignore_index=True
)
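
Once the identifier column is in place, it can be used to check how many rows each file contributed. The following is a small sketch with made-up file names (jan.csv, feb.csv) and a hypothetical amount column.

```python
import glob
import os
import tempfile

import pandas as pd

# Hypothetical monthly files written to a temporary directory.
tmp_dir = tempfile.mkdtemp()
for name, rows in {"jan.csv": [1, 2], "feb.csv": [3]}.items():
    pd.DataFrame({"amount": rows}).to_csv(os.path.join(tmp_dir, name), index=False)

csv_files = sorted(glob.glob(os.path.join(tmp_dir, "*.csv")))

# Tag every row with the basename of the file it came from.
combined_df = pd.concat(
    (pd.read_csv(f).assign(filename=os.path.basename(f)) for f in csv_files),
    ignore_index=True,
)

# Count rows per source file.
rows_per_file = combined_df.groupby("filename").size().to_dict()
print(rows_per_file)
```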

For path handling bliss, consider pathlib to turn paths into easy-to-handle objects:

from pathlib import Path

# pathlib handling File paths. (It's ridiculously easy. Trust me!)
p = Path(r'your_directory')
csv_files = p.glob('*.csv')
combined_df = pd.concat((pd.read_csv(f) for f in csv_files), ignore_index=True)
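
One detail worth knowing: Path.glob() returns a generator, so it can be iterated only once. A sketch that wraps it in sorted() to get a reusable, ordered list, and uses the Path object's name attribute for the source column, might look like this (the temporary directory and file names are placeholders):

```python
import tempfile
from pathlib import Path

import pandas as pd

# Hypothetical data directory with two tiny CSV files.
data_dir = Path(tempfile.mkdtemp())
pd.DataFrame({"x": [1]}).to_csv(data_dir / "a.csv", index=False)
pd.DataFrame({"x": [2]}).to_csv(data_dir / "b.csv", index=False)

# sorted() turns the one-shot generator into a list of Path objects.
csv_files = sorted(data_dir.glob("*.csv"))

# Path.name is the pathlib counterpart of os.path.basename().
combined_df = pd.concat(
    (pd.read_csv(f).assign(filename=f.name) for f in csv_files),
    ignore_index=True,
)

print(combined_df["filename"].tolist())  # ['a.csv', 'b.csv']
```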

Let's dig deeper

Taking Concatenation to the Next Level

Save important metadata with the assign method. Passing f itself, rather than its basename, records the full path exactly as glob returned it:

combined_df = pd.concat(
    (pd.read_csv(f).assign(source_file=f) for f in csv_files),
    ignore_index=True
)

Watch Your Memory

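Reading huge files in one go can blow your memory budget. Pass chunksize to pd.read_csv() to read each file in pieces. With chunksize set, read_csv returns an iterator of DataFrames instead of a single DataFrame, so flatten the chunks with a nested generator expression before concatenating:

combined_df = pd.concat(
    (chunk for f in csv_files for chunk in pd.read_csv(f, chunksize=10000)),
    ignore_index=True
)

The chunksize (rows per chunk) can be tuned according to your system's memory. The concatenated DataFrame still has to fit in memory, so chunking helps most when each chunk is filtered or aggregated before pd.concat().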
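
As a rough sketch of chunked reading that actually reduces memory use, the helper below filters each chunk before it reaches pd.concat(). The read_filtered name, the value column, and the threshold of 50 are all assumptions made for this example.

```python
import os
import tempfile

import pandas as pd

# Hypothetical "large" file: ten rows are enough to show the mechanics.
tmp_dir = tempfile.mkdtemp()
path = os.path.join(tmp_dir, "big.csv")
pd.DataFrame({"id": range(10), "value": range(0, 100, 10)}).to_csv(path, index=False)


def read_filtered(path, chunksize):
    """Yield only the rows with value >= 50 from each chunk (hypothetical filter)."""
    for chunk in pd.read_csv(path, chunksize=chunksize):
        yield chunk[chunk["value"] >= 50]


# Only the filtered rows are ever held together in memory.
combined_df = pd.concat(read_filtered(path, chunksize=4), ignore_index=True)

print(len(combined_df))  # 5
```

A tiny chunksize of 4 is used here only so the ten-row file is split into several chunks.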

Dust off those CSVs before Merging

Sometimes CSV files have to be massaged before they fit well together. Preprocess file data according to your needs:

def preprocess_file(filename):
    # Insert your data massaging magic here
    df = pd.read_csv(filename)
    # ... some more magic ...
    return df

combined_df = pd.concat(
    (preprocess_file(f) for f in csv_files),
    ignore_index=True
)
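
What the "magic" looks like depends on the data. As one possible illustration, the sketch below normalizes header names, parses a date column, and drops duplicate rows. The date and amount columns and the sample files are invented for this example.

```python
import glob
import os
import tempfile

import pandas as pd

# Hypothetical inputs: the first file has untidy headers and a duplicated row.
tmp_dir = tempfile.mkdtemp()
pd.DataFrame(
    {" Date ": ["2025-01-01", "2025-01-01"], "Amount": [5, 5]}
).to_csv(os.path.join(tmp_dir, "first.csv"), index=False)
pd.DataFrame({"date": ["2025-01-02"], "amount": [7]}).to_csv(
    os.path.join(tmp_dir, "second.csv"), index=False
)


def preprocess_file(filename):
    df = pd.read_csv(filename)
    # Normalize header names so the files line up on the same columns.
    df.columns = df.columns.str.strip().str.lower()
    # Parse the (assumed) date column into real datetimes.
    df["date"] = pd.to_datetime(df["date"])
    # Drop exact duplicate rows within the file.
    return df.drop_duplicates()


csv_files = sorted(glob.glob(os.path.join(tmp_dir, "*.csv")))
combined_df = pd.concat((preprocess_file(f) for f in csv_files), ignore_index=True)

print(list(combined_df.columns))  # ['date', 'amount']
print(len(combined_df))           # 2
```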

What's next in the journey?

Let's look into the DataFrame Mirror

Use combined_df.head() to preview the first rows and combined_df.describe() for summary statistics of the numeric columns of your beautiful DataFrame creation.
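
A quick inspection pass could look like the sketch below. It reads two in-memory CSVs (made-up data, with one deliberately empty cell) so that the missing-value check has something to report.

```python
import io

import pandas as pd

# Hypothetical in-memory CSVs; the second row of csv_a has an empty value.
csv_a = io.StringIO("id,value\n1,10\n2,\n")
csv_b = io.StringIO("id,value\n3,30\n")

combined_df = pd.concat((pd.read_csv(f) for f in (csv_a, csv_b)), ignore_index=True)

print(combined_df.head())        # first rows
print(combined_df.describe())    # summary statistics for numeric columns
print(combined_df.shape)         # (3, 2)
print(combined_df.isna().sum())  # missing values per column
```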

Having Trouble? Let's Debug

Here are some starting points to debug common issues:

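  • Mismatched columns: pd.concat() aligns on column names and fills columns missing from a file with NaN, so make sure all CSVs share the same header.
  • Encoding issues: Pass the file's encoding to pd.read_csv() through the encoding parameter, for example encoding='latin-1'.
  • File not found errors: Verify the file path and pattern. glob.glob() returns an empty list when nothing matches, and pd.concat() then raises ValueError.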
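
The three checks above can be sketched as follows. The read_csv_with_fallback helper, the list of encodings it tries, and the sample files are hypothetical and only meant as a starting point.

```python
import glob
import os
import tempfile

import pandas as pd

# Hypothetical files: b.csv uses "amount" where a.csv uses "value".
tmp_dir = tempfile.mkdtemp()
pd.DataFrame({"id": [1], "value": [10]}).to_csv(os.path.join(tmp_dir, "a.csv"), index=False)
pd.DataFrame({"id": [2], "amount": [20]}).to_csv(os.path.join(tmp_dir, "b.csv"), index=False)

# 1. File not found: fail early with a clear message if the pattern matches nothing.
csv_files = sorted(glob.glob(os.path.join(tmp_dir, "*.csv")))
if not csv_files:
    raise FileNotFoundError(f"No CSV files matched in {tmp_dir}")

# 2. Mismatched columns: nrows=0 parses only the header row of each file.
headers = {os.path.basename(f): list(pd.read_csv(f, nrows=0).columns) for f in csv_files}
reference = headers[os.path.basename(csv_files[0])]
mismatched = {name: cols for name, cols in headers.items() if cols != reference}
print(mismatched)  # {'b.csv': ['id', 'amount']}


# 3. Encoding issues: try a short list of candidate encodings in order.
def read_csv_with_fallback(path, encodings=("utf-8", "latin-1")):
    for enc in encodings:
        try:
            return pd.read_csv(path, encoding=enc)
        except UnicodeDecodeError:
            continue
    raise ValueError(f"Could not decode {path} with any of {encodings}")


# A file saved as Latin-1, kept in a separate directory from the files above.
latin_path = os.path.join(tempfile.mkdtemp(), "latin.csv")
with open(latin_path, "wb") as fh:
    fh.write("name\ncaf\u00e9\n".encode("latin-1"))

latin_df = read_csv_with_fallback(latin_path)
```

Guessing encodings in a loop is a convenience, not a guarantee: latin-1 can decode any byte sequence, so it belongs last in the list.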

Strive for Efficiency

Try map or list comprehensions with pd.concat for efficient, tidy code. When you need to add new columns, prefer vectorized operations on whole columns over row-by-row loops.
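
Both suggestions fit in a few lines. In this sketch the price and quantity columns and the in-memory CSVs are invented for illustration.

```python
import io

import pandas as pd

# Hypothetical in-memory CSVs standing in for files on disk.
sources = [
    io.StringIO("price,quantity\n2.5,4\n"),
    io.StringIO("price,quantity\n10.0,3\n"),
]

# map applies pd.read_csv to every source lazily; pd.concat consumes the iterator.
combined_df = pd.concat(map(pd.read_csv, sources), ignore_index=True)

# Vectorized: one multiplication over whole columns instead of a Python loop.
combined_df["total"] = combined_df["price"] * combined_df["quantity"]

print(combined_df["total"].tolist())  # [10.0, 30.0]
```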