Explain Codes LogoExplain Codes Logo

How to delete duplicate rows without unique identifier

sql
delete-duplicates
postgresql
ctid
Nikita BarsukovbyNikita Barsukov·Oct 27, 2024
TLDR

Get rid of duplicates swiftly using a CTE (Common Table Expression) and ROW_NUMBER(). Here's the magic SQL spell:

WITH CTE AS ( SELECT *, ROW_NUMBER() OVER (PARTITION BY dup_cols ORDER BY (SELECT NULL)) AS rn FROM your_table ) DELETE FROM CTE WHERE rn > 1;

In dup_cols, specify the columns that identify duplicates and replace your_table with your table's name. This code will reserve only one row for each set of duplicates by deleting rows with a row number above 1, which the CTE allots within groups of duplicates.

Different battle tactics against the Evil Duplicates

The secret weapon in PostgreSQL: ctid

In the PostgreSQL realm, there's a concealed column called ctid that can act as a pseudo-unique identifier:

DELETE FROM your_table WHERE ctid NOT IN ( SELECT MAX(ctid) -- the "chosen one" of each duplicate set FROM your_table GROUP BY your_columns_here );

Can you imagine if every SQL table is a Hogwarts class? The ctid would be the "chosen one".

Sneak peek at the enemy before the battle

To preview your duplicates before making them vanish, exchange DELETE with SELECT:

SELECT * FROM your_table WHERE ctid NOT IN ( SELECT MAX(ctid) FROM your_table GROUP BY your_columns_here );

Just to make sure you're not about to vanish your favorite rows.

Dealing with the sneaky NULL values using COALESCE

When dealing with NULL values in columns, don't forget to call upon COALESCE() so that they are taken into account equivalently:

DELETE FROM your_table WHERE ctid IN ( SELECT ctid FROM ( SELECT ctid, COALESCE(column_with_nulls, 'default_value') OVER (PARTITION BY column_with_nulls ORDER BY (SELECT NULL)) AS rn FROM your_table ) dups WHERE rn > 1 );

This way, NULL values can't play hide and seek with us anymore.

Perform like a SQL superhero

For performance, when dealing with a small army of duplicates, using the ctid or JOIN with USING is more efficient:

DELETE FROM your_table USING your_table yt2 WHERE your_table.ctid > yt2.ctid AND your_table.duplicate_column = yt2.duplicate_column;

Sometimes, less is more, and using USING over NOT IN (...) subqueries proves it.

Pro tips: Avoid stepping on the rakes

The power of unique identifiers

For safeguarding your data integrity in the future, consider adding a unique, auto-incrementing id to tables lacking a natural unique identifier. This brings easy tracking and managing duplicates to another level.

Using CTEs: The good, the bad, and the risky

CTEs are marvellous for weeding out duplicates, but remember that DELETE operations aren't allowed in the secret clubhouse of WITH in all databases. A notable exception is: Netezza.

DELETE and WITH clause: Not a Love Story

In certain databases, like Netezza, you just can't use DELETE immediately after a WITH clause, though PostgreSQL is cool with it. Always check compatibility with your SQL database version.