
Finding and Deleting Duplicate Rows in PostgreSQL

sql
performance-tuning
database-optimization
postgresql
by Anton Shumikhin · Dec 24, 2024
TLDR

Eliminate duplicates in PostgreSQL using a Common Table Expression (CTE) with ROW_NUMBER(), targeting row numbers greater than 1 for deletion:

WITH duplicates_cte AS (
    SELECT id,
           ROW_NUMBER() OVER (PARTITION BY column_to_dedupe ORDER BY id) AS rn
    FROM table_name  /* This gives us a nice ordered list of potential troublemakers */
)
DELETE FROM table_name
WHERE id IN (SELECT id FROM duplicates_cte WHERE rn > 1);  /* Goodbye duplicates! */

Customize column_to_dedupe, id, and table_name according to your needs.

Performance Tuning for Hefty Tables

When large tables are in play, performance isn't just a nice-to-have — it's mandatory. To fine-tune the deletion process:

  1. Set up an indexed temporary table that holds the IDs of the duplicate rows.
  2. Run the DELETE against that temporary table, so PostgreSQL works from a small, indexed list of IDs instead of re-evaluating the window function (see the sketch below).
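
A minimal sketch of that two-step approach, assuming the same placeholder names (table_name, id, column_to_dedupe) as in the TLDR:

-- Step 1: stash the IDs of the rows to remove in an indexed temp table
CREATE TEMP TABLE duplicate_ids AS
SELECT id
FROM (
    SELECT id,
           ROW_NUMBER() OVER (PARTITION BY column_to_dedupe ORDER BY id) AS rn
    FROM table_name
) ranked
WHERE rn > 1;

CREATE INDEX ON duplicate_ids (id);
ANALYZE duplicate_ids;  -- give the planner fresh statistics on the temp table

-- Step 2: delete using the small, indexed list of IDs
DELETE FROM table_name
WHERE id IN (SELECT id FROM duplicate_ids);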

Upgrading PostgreSQL for Compatibility

Looking to test your solution in a safe sandbox like SQL Fiddle? Don't be shy about upgrading your PostgreSQL to version 9.3 or higher; it's hip, it's cool, and it's fully compatible with SQL Fiddle.

Catered Data Retention

If you're in the business of keeping specific data around, for example the earliest or latest entries, make the ORDER BY clause of the ROW_NUMBER() window your best friend:

ORDER BY created_at DESC  -- keeps the freshest
ORDER BY created_at ASC   -- keeps the pioneers
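
For example, assuming your table has a created_at timestamp column, keeping only the newest row per value could look like this sketch:

WITH ranked AS (
    SELECT id,
           ROW_NUMBER() OVER (PARTITION BY column_to_dedupe
                              ORDER BY created_at DESC) AS rn
    FROM table_name
)
DELETE FROM table_name
WHERE id IN (SELECT id FROM ranked WHERE rn > 1);  -- everything but the freshest row goes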

Composite Key Duplicates

When you're facing duplicates across a combination of fields, such as merchant_id and url, bring them to the PARTITION BY party:

PARTITION BY merchant_id, url /* Inviting merchant_id and url to the party, let's see who's the uninvited guest (duplicate!) */
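
Dropped into the TLDR query, and assuming a hypothetical links table with id, merchant_id, and url columns, that looks like:

WITH duplicates_cte AS (
    SELECT id,
           ROW_NUMBER() OVER (PARTITION BY merchant_id, url ORDER BY id) AS rn
    FROM links
)
DELETE FROM links
WHERE id IN (SELECT id FROM duplicates_cte WHERE rn > 1);  /* one row survives per (merchant_id, url) pair */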

Avoiding Data Oopsy-Daisy Moments

Before playing with the big red button (aka DELETE), back up your dataset. Consider running a harmless SELECT query first with the same logic to preview exactly which rows you're about to remove.
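
A dry run built on the TLDR query, assuming the same placeholder names, selects the victims without touching them:

WITH duplicates_cte AS (
    SELECT id,
           ROW_NUMBER() OVER (PARTITION BY column_to_dedupe ORDER BY id) AS rn
    FROM table_name
)
SELECT *
FROM table_name
WHERE id IN (SELECT id FROM duplicates_cte WHERE rn > 1);  /* preview only, nothing deleted */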

Spotting Distinct Records

Perhaps you're more interested in finding distinct records rather than declaring war on duplicates. The DISTINCT keyword in SELECT statements comes to your rescue:

SELECT DISTINCT column_to_dedupe FROM table_name; /* Oh, look at all these unique snowflakes! */

Preparing for the Unknown

In the unlikely event of things going south, make sure you're prepared. Learn how to use transactions to safely apply changes:

BEGIN;
-- Your DELETE wonders go here
ROLLBACK;  /* Just paranoid: if things look suspicious, act like a time traveler and go back! */
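
A fuller sketch, reusing the TLDR statement and the same placeholder names: run the delete inside a transaction, check the reported row count, and only commit once it looks right.

BEGIN;

WITH duplicates_cte AS (
    SELECT id,
           ROW_NUMBER() OVER (PARTITION BY column_to_dedupe ORDER BY id) AS rn
    FROM table_name
)
DELETE FROM table_name
WHERE id IN (SELECT id FROM duplicates_cte WHERE rn > 1);

-- psql reports how many rows were deleted; if the number looks wrong, ROLLBACK instead
COMMIT;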

Harnessing the Power of Query Planning

PostgreSQL's query planner is no less than an optimization superhero for your deduplication queries. To unveil the magic, turn on your EXPLAIN vision:

EXPLAIN
WITH duplicates_cte AS (
    SELECT id,
           ROW_NUMBER() OVER (PARTITION BY column_to_dedupe ORDER BY id) AS rn
    FROM table_name
)
DELETE FROM table_name
WHERE id IN (SELECT id FROM duplicates_cte WHERE rn > 1);  /* Let's peek into PostgreSQL's master plan! */

Maintaining a Duplicate-Free Environment

To make your database a no-go zone for future duplicates, consider database constraints such as UNIQUE constraints or unique indexes. They enforce uniqueness on the column combinations that matter to you.
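
Once the existing duplicates are gone, a sketch of locking the door behind them (again with placeholder names; the constraint and index names here are made up):

ALTER TABLE table_name
    ADD CONSTRAINT table_name_column_to_dedupe_key UNIQUE (column_to_dedupe);

-- or, for the composite case from earlier, without blocking writes while it builds:
CREATE UNIQUE INDEX CONCURRENTLY links_merchant_id_url_key
    ON links (merchant_id, url);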