Explain Codes LogoExplain Codes Logo

How to find duplicate records in PostgreSQL

sql
prompt-engineering
best-practices
join
Anton ShumikhinbyAnton Shumikhin·Sep 30, 2024
TLDR

To highlight duplicates in PostgreSQL, combine GROUP BY with HAVING to single out rows that occur more than once:

SELECT column, COUNT(*) FROM table GROUP BY column HAVING COUNT(*) > 1;

Swap column and table with your specific field and data source. This SQL line detects and flags duplicates efficiently.

Expanding the basic query

Our fast answer works well, but what if duplicates are spread across multiple columns? Adjust your GROUP BY to include all relevant fields:

SELECT column1, column2, COUNT(*) FROM yourTable GROUP BY column1, column2 HAVING COUNT(*) > 1;

For a clearer picture, add unique identifiers to each duplicate group. Bring into play the ROW_NUMBER() window function, in tandem with PARTITION BY:

SELECT *, ROW_NUMBER() OVER(PARTITION BY column ORDER BY id) AS Row FROM yourTable WHERE Row > 1;

Efficient duplicate extermination

Sometimes, you not only need to spot duplicates, but also to wipe them out. Here's how to siege the castle of duplication and keep the oldest prisoner:

DELETE FROM yourTable USING ( SELECT MIN(id) as id, column FROM yourTable GROUP BY column HAVING COUNT(*) > 1 ) AS t1 WHERE yourTable.id > t1.id AND yourTable.column = t1.column;

"Off with their heads!" — Keep in mind this method keeps the earliest record (i.e., the "head" and let go of the rest).

Dealing with multi-field duplicates

Duplication can often occur across multiple pillars of your database—namely, multiple fields. Here is how to tackle this Duplicate Hydra with your SELECT sword:

SELECT id, year, COUNT(*) FROM yourTable GROUP BY id, year HAVING COUNT(*) > 1 AND year = '2023';

In this example, duplicates are grouped both by id and year, slicing only into current year records. Now you have a sharp blade to chop off any multi-field duplicate heads.

Advanced maneuvers for duplicate detection

Situations may arise when you need more precision. You might want a lineup of the culprits or manage them in a set order. Here's your SQL flashlight in the darkness of duplicate dungeons:

WITH DuplicateRecords AS ( SELECT * , ROW_NUMBER() OVER (PARTITION BY column ORDER BY id) AS rn FROM yourTable ) SELECT * FROM DuplicateRecords WHERE rn > 1;

The rn column is now your duplicate detective, catching them in the act and numbering them. Now you can prosecute these dupers individually or in bulk.

Framing the duplicate findings

Lastly, it may prove useful to display your findings. Here's an example of how to do just that:

SELECT column, ARRAY_AGG(id) AS duplicate_ids FROM ( SELECT column, id, ROW_NUMBER() OVER (PARTITION BY column ORDER BY id) AS rn FROM yourTable ) AS t WHERE rn > 1 GROUP BY column;

With ARRAY_AGG, you can assemble all duplicate ids into a neat little line-up, linked to the prime 'id' deputy. Now you're ready to display or interrogate them as you see fit.