How to delete duplicate rows without unique identifier

sql

delete-duplicates

postgresql

ctid

byNikita Barsukov·Oct 27, 2024

Get rid of duplicates swiftly using a CTE (Common Table Expression) and ROW_NUMBER(). Here's the magic SQL spell:

WITH CTE AS (
  SELECT *, ROW_NUMBER() OVER (PARTITION BY dup_cols ORDER BY (SELECT NULL)) AS rn
  FROM your_table
)
DELETE FROM CTE WHERE rn > 1;

In dup_cols, specify the columns that identify duplicates and replace your_table with your table's name. This code will reserve only one row for each set of duplicates by deleting rows with a row number above 1, which the CTE allots within groups of duplicates.

Different battle tactics against the Evil Duplicates

The secret weapon in PostgreSQL: ctid

In the PostgreSQL realm, there's a concealed column called ctid that can act as a pseudo-unique identifier:

DELETE FROM
  your_table
WHERE
  ctid NOT IN (
    SELECT MAX(ctid)  -- the "chosen one" of each duplicate set
    FROM your_table
    GROUP BY your_columns_here
  );

Can you imagine if every SQL table is a Hogwarts class? The ctid would be the "chosen one".

Sneak peek at the enemy before the battle

To preview your duplicates before making them vanish, exchange DELETE with SELECT:

SELECT *
FROM your_table
WHERE ctid NOT IN (
  SELECT MAX(ctid)
  FROM your_table
  GROUP BY your_columns_here
);

Just to make sure you're not about to vanish your favorite rows.

Dealing with the sneaky NULL values using COALESCE

When dealing with NULL values in columns, don't forget to call upon COALESCE() so that they are taken into account equivalently:

DELETE FROM your_table
WHERE ctid IN (
  SELECT ctid
  FROM (
    SELECT ctid, COALESCE(column_with_nulls, 'default_value') OVER (PARTITION BY column_with_nulls ORDER BY (SELECT NULL)) AS rn
    FROM your_table
  ) dups
  WHERE rn > 1
);

This way, NULL values can't play hide and seek with us anymore.

Perform like a SQL superhero

For performance, when dealing with a small army of duplicates, using the ctid or JOIN with USING is more efficient:

DELETE FROM your_table
USING your_table yt2
WHERE your_table.ctid > yt2.ctid
AND your_table.duplicate_column = yt2.duplicate_column;

Sometimes, less is more, and using USING over NOT IN (...) subqueries proves it.

Pro tips: Avoid stepping on the rakes

The power of unique identifiers

For safeguarding your data integrity in the future, consider adding a unique, auto-incrementing id to tables lacking a natural unique identifier. This brings easy tracking and managing duplicates to another level.

Using CTEs: The good, the bad, and the risky

CTEs are marvellous for weeding out duplicates, but remember that DELETE operations aren't allowed in the secret clubhouse of WITH in all databases. A notable exception is: Netezza.

DELETE and WITH clause: Not a Love Story

In certain databases, like Netezza, you just can't use DELETE immediately after a WITH clause, though PostgreSQL is cool with it. Always check compatibility with your SQL database version.

explain-codes / Sql / How to delete duplicate rows without unique identifier

Linked

Delete duplicate rows from small table



Delete duplicate records from a SQL table without a primary key



How to find duplicate records in PostgreSQL



How to delete from select in MySQL?



Delete duplicate records in SQL Server?

