
What's the fastest way to do a bulk insert into Postgres?

sql
bulk-insert
postgresql
performance-tuning
by Anton Shumikhin · Jan 15, 2025
TLDR

Experience lightning-fast bulk insertions into Postgres using the COPY command, the optimal tool for managing voluminous data. Execute this directly:

COPY your_table FROM '/path/to/data.csv' WITH (FORMAT csv);
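
Note that a server-side COPY ... FROM reads the file on the database server and needs the corresponding file-read privileges. If the CSV lives on your client machine, psql's \copy meta-command (same options, client-side file) is the usual substitute; the path below is a placeholder:

\copy your_table FROM '/path/to/data.csv' WITH (FORMAT csv)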

Ensure the CSV aligns with your database schema. For in-app data transfer, libraries like Psycopg2 for Python provide copy_from(), facilitating seamless bulk insertion:

import psycopg2

# Stream the file straight into the table via COPY FROM STDIN
with psycopg2.connect("dbname=your_db user=your_user") as conn:
    with conn.cursor() as cur, open('/path/to/data.csv', 'r') as f:
        cur.copy_from(f, 'your_table', sep=",")  # Welcome aboard, fast lane awaits!
        # Note: copy_from() expects text format; for quoted/escaped CSV,
        # copy_expert("COPY your_table FROM STDIN WITH (FORMAT csv)", f) is more robust.

Get ready to surpass the limitations of INSERT operations as your data enters the database at ultimate speed.

Setting up the faster corridors

Before boarding the COPY train, it's beneficial to tweak a few settings to clear the path for a speedier landing:

  • Indexes: Drop them before the mass migration, then recreate them afterwards. Updating index structures row by row is a time-consuming task, so cutting it out of the load path makes the transfer noticeably faster (see the sketch after this list).

  • Foreign Key Constraints: Temporarily dropping them speeds up the load, since each inserted row no longer triggers a referential check. Re-apply them once finished so any data integrity issues surface immediately.
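
A minimal sketch of the drop-and-restore pattern, assuming a placeholder index idx_your_table_col1 and foreign key fk_your_table_other (adapt the names to your schema):

-- Before the load: shed the heavy lifting (placeholder names)
DROP INDEX IF EXISTS idx_your_table_col1;
ALTER TABLE your_table DROP CONSTRAINT IF EXISTS fk_your_table_other;

-- ... COPY / bulk INSERT happens here ...

-- After the load: restore indexes and constraints
CREATE INDEX idx_your_table_col1 ON your_table (col1);
ALTER TABLE your_table
    ADD CONSTRAINT fk_your_table_other
    FOREIGN KEY (other_id) REFERENCES other_table (id);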

All aboard for high speed transit

Employing multi-row INSERTs and batching

Multi-row INSERT syntax offers a reliable alternative to COPY for smaller, less cumbersome loads:

INSERT INTO your_table (columns) VALUES
    (row1_value1, row1_value2),
    (row2_value1, row2_value2),
    ...
    (rowN_value1, rowN_value2); -- Hitching a ride together

To gain momentum, disable autocommit and group your insertions into batches inside a single transaction. PostgreSQL then commits many rows at once, amortizing the per-transaction I/O overhead.
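
A minimal sketch of explicit batching in plain SQL, which is what most drivers do under the hood when you switch autocommit off (table and values are placeholders):

BEGIN; -- one transaction for the whole batch

INSERT INTO your_table (col1, col2) VALUES (1, 'a'), (2, 'b'), (3, 'c');
INSERT INTO your_table (col1, col2) VALUES (4, 'd'), (5, 'e'), (6, 'f');
-- ... more batches ...

COMMIT; -- one WAL flush instead of one per statement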

Turbocharging with array data

For array data, the UNNEST function expands whole arrays into rows, letting one statement do the work of many:

INSERT INTO your_table (col1, col2) SELECT * FROM UNNEST(array1, array2); -- Unleashing the power of UNNEST!

When your data starts out as Python lists, pass them as query parameters: psycopg2 adapts Python lists to Postgres arrays, so a single UNNEST-based insert can ship the whole batch in one round trip.
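
For instance, with literal arrays standing in for the bound parameters (placeholder values and column types):

INSERT INTO your_table (col1, col2)
SELECT * FROM UNNEST(ARRAY[1, 2, 3], ARRAY['a', 'b', 'c']); -- three rows from two arrays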

Fast landing and integration

Temporary tables

Creating a temporary table provides a secure pit-stop for data, especially handy when CSV or JSON input needs cleaning or reshaping before it reaches the real table:

CREATE TEMP TABLE temp_your_table AS SELECT * FROM your_table WITH NO DATA; -- Temporary, but still a part of the family!

Migrate the data into the temporary table using COPY, then drive it into the main table via INSERT INTO ... SELECT.
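
A minimal sketch of the staging workflow, reusing the temp table above (the path and the dedup step are placeholders for whatever cleanup your data needs):

-- Land the raw rows in the staging table
COPY temp_your_table FROM '/path/to/data.csv' WITH (FORMAT csv);

-- Clean, deduplicate, then promote into the real table
INSERT INTO your_table
SELECT DISTINCT * FROM temp_your_table;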

Pgloader - Sound of speed

Pgloader breaks the speedometer, swiftly fetching data from STDIN and effortlessly ingesting compressed files:

pgloader /path/to/data.csv postgresql:///your_db?tablename=your_table # On the Pgloader express now

It also handles gzip-compressed files, saving storage and bandwidth on large data sets.

Tweaking the engine

WAL buffers tuning and prepared statements

Raise wal_buffers toward 16MB (the ceiling of the auto-tuned default) if you're battling an enormous bulk insertion. More room for write-ahead log (WAL) data means fewer forced flushes while the inserts pour in.
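
A hedged sketch of bumping the setting via ALTER SYSTEM; wal_buffers is a startup-time parameter, so the change only takes effect after a server restart:

ALTER SYSTEM SET wal_buffers = '16MB'; -- persisted to postgresql.auto.conf
-- restart PostgreSQL, then confirm:
SHOW wal_buffers;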

Use prepared statements for repeated batch inserts, sized to match the scale of the bulk load. Postgres then parses and plans the INSERT once and recycles that plan across similar batches.
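
A minimal sketch at the SQL level (placeholder table and values); most client drivers do the PREPARE/EXECUTE dance for you when you reuse a parameterized statement:

PREPARE bulk_ins (int, text) AS
    INSERT INTO your_table (col1, col2) VALUES ($1, $2);

EXECUTE bulk_ins(1, 'a'); -- planned once, executed many times
EXECUTE bulk_ins(2, 'b');

DEALLOCATE bulk_ins; -- tidy up once the batch is done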

Overcoming insert hurdles

Data integrity and error management

Safeguard a flawless bulk operation with these practices:

  • Pre-Validate: Ensure your data is in sync with table schema and constraints.
  • Transaction execution: Surround your operations with a transaction (sketched below). If something goes haywire, roll back!
  • Error logging: Implement a tracker for your process. Collect intel on failures, constraint violations, etc.
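
A minimal sketch of the transactional wrapper (placeholder path); any failure before COMMIT lets you roll the entire load back cleanly:

BEGIN;

COPY your_table FROM '/path/to/data.csv' WITH (FORMAT csv);
-- run sanity checks here; if anything looks off: ROLLBACK;

COMMIT; -- only reached when every step succeeded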

Post-insertion must-dos

After a successful landing, don't miss out on these steps (sketched after the list):

  • Revive the indexes: Recreate anything you dropped earlier to restore query performance on the freshly loaded tables.
  • Vacuum the space: Run VACUUM ANALYZE to reclaim dead space and refresh the planner statistics.
  • Data validity check: A final inspection to confirm that all data adheres to the constraints and nothing lost integrity during the bulk operation.
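
A minimal sketch of that cleanup pass, reusing the placeholder names from earlier:

-- Recreate the indexes that were dropped for the load
CREATE INDEX idx_your_table_col1 ON your_table (col1);

-- Reclaim space and refresh planner statistics
VACUUM ANALYZE your_table;

-- Spot-check integrity, e.g. rows whose foreign key has no parent
SELECT count(*) FROM your_table t
LEFT JOIN other_table o ON o.id = t.other_id
WHERE o.id IS NULL;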