Delete all Duplicate Rows except for One in MySQL?

sql

delete-duplicates

mysql-error-1093

data-integrity

by Anton Shumikhin · Aug 21, 2024

To eliminate duplicates while keeping one row per group, you can use a self-join deletion against a grouping subquery. Here's a concise solution:

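DELETE dup FROM your_table dup
JOIN (
    -- The subquery must expose unique_column so the outer join can match on it
    SELECT unique_column, MIN(id) AS keepId
    FROM your_table
    GROUP BY unique_column
) AS original ON dup.unique_column = original.unique_column
WHERE dup.id > original.keepId;  -- everything above the keeper goes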

Replace your_table with the name of your table, unique_column with the column (or columns) whose repeated values define a duplicate, and id with the table's unique identifier. The subquery selects the lowest id (the keeper) for each group, and the join deletes every row in the same group with a higher id, leaving exactly one row per group. Rows whose unique_column is NULL never satisfy the = comparison, so they are left untouched.
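Before running the delete, it can help to preview exactly which rows it would remove. The following is a minimal sketch, assuming the same placeholder names as above (your_table, unique_column, id); it is the same join expressed as a SELECT, so it changes nothing.

```sql
-- Dry run: list the rows that the self-join DELETE would remove.
SELECT dup.*
FROM your_table dup
JOIN (
    -- One keeper (the lowest id) per group of equal unique_column values
    SELECT unique_column, MIN(id) AS keepId
    FROM your_table
    GROUP BY unique_column
) AS original ON dup.unique_column = original.unique_column
WHERE dup.id > original.keepId
ORDER BY dup.unique_column, dup.id;
```

If the row count or the listed rows look wrong, adjust the grouping column before deleting anything.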

Which Approach to Pick for Large Tables?

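For bulky datasets, execution time matters. An alternative that can be both quicker and safer is to copy the rows you want to keep into a new table with CREATE TABLE ... AS SELECT DISTINCT and then swap the tables. Two caveats apply: DISTINCT * treats rows as duplicates only when every column matches, so it will not collapse rows that differ in an auto-increment id, and the new table does not inherit the original's indexes or primary key. With that in mind, this creates a new table with unique records only: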

CREATE TABLE your_table_unique AS
SELECT DISTINCT * FROM your_table;

Swap the old table with the new distinct one in a single atomic rename:

RENAME TABLE your_table TO your_table_old, your_table_unique TO your_table;

Remember, always test this approach on a backup copy first, and keep your_table_old until you have verified the row counts and recreated any indexes.
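When the table has a unique id, a variation on the copy-and-swap idea is to clone the structure and copy only the keeper rows. This is a sketch under assumptions, not the only way to do it: your_table_unique is the name used above, while the keepers alias and the uq_unique_column index name are made up for illustration.

```sql
-- Copy the column definitions and indexes of the original table.
CREATE TABLE your_table_unique LIKE your_table;

-- Copy only the row with the lowest id from each group of duplicates.
INSERT INTO your_table_unique
SELECT t.*
FROM your_table t
JOIN (
    SELECT MIN(id) AS keepId
    FROM your_table
    GROUP BY unique_column
) AS keepers ON t.id = keepers.keepId;

-- Optional: stop duplicates from coming back.
-- Assumes unique_column is an indexable type (a TEXT/BLOB column would need a prefix length).
ALTER TABLE your_table_unique
    ADD UNIQUE INDEX uq_unique_column (unique_column);
```

The swap itself is unchanged: the same RENAME TABLE statement puts the new table in place.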

Version Matters - Efficiency Variation in MySQL

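MySQL versions can impact the efficiency of de-duplication operations, because the optimizer's handling of subqueries and derived tables has changed between releases. For example, MySQL 5.7 began merging simple derived tables into the outer query, and MySQL 8.0 added window functions such as ROW_NUMBER(). Test in a staging environment running the same version as production to prevent performance surprises.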
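One way to see how a particular server will treat a de-duplication statement is to check its version and ask for the execution plan. The sketch below assumes MySQL 5.6.3 or later, where EXPLAIN accepts DELETE statements; EXPLAIN only reports the plan and does not delete anything.

```sql
-- Which server version are we actually running against?
SELECT VERSION();

-- Show the plan for the delete without executing it.
EXPLAIN
DELETE dup FROM your_table dup
JOIN (
    SELECT unique_column, MIN(id) AS keepId
    FROM your_table
    GROUP BY unique_column
) AS original ON dup.unique_column = original.unique_column
WHERE dup.id > original.keepId;
```

Comparing the plan on staging and production is a cheap way to spot a missing index on unique_column before the statement runs for real.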

Handling MySQL Error 1093

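You may encounter MySQL error 1093 ("You can't specify target table ... for update in FROM clause") when a DELETE uses a subquery that reads from the same table it is deleting from. The workaround is to wrap the subquery in an extra SELECT layer, so that MySQL builds the grouped result as a derived table before it starts deleting: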

DELETE FROM your_table WHERE id NOT IN (
    SELECT * FROM (
        SELECT MIN(id) FROM your_table GROUP BY unique_column
    ) AS subquery
);

This structure keeps the row with the lowest id in each unique_column group and deletes every other row.
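For contrast, here is the direct form that triggers the error, followed by the wrapped form with an explicit column alias instead of SELECT *. The keepId and keepers names are illustrative only.

```sql
-- Direct form: MySQL rejects this with
-- ERROR 1093 (HY000): You can't specify target table 'your_table' for update in FROM clause
DELETE FROM your_table WHERE id NOT IN (
    SELECT MIN(id) FROM your_table GROUP BY unique_column
);

-- Wrapped form: the inner grouped query becomes a derived table,
-- so the DELETE no longer reads directly from its own target.
DELETE FROM your_table WHERE id NOT IN (
    SELECT keepId FROM (
        SELECT MIN(id) AS keepId
        FROM your_table
        GROUP BY unique_column
    ) AS keepers
);
```

Naming the column makes the intent clearer than SELECT * and keeps the statement valid if the inner query ever selects more than one column.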

Don't Just Delete! Always Test First

Before running any deletion against the main table, test it on a cloned table first. Remember, prevention is better than cure...or rather data retrieval in this case!
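A rehearsal along those lines might look like the sketch below. It assumes a transactional storage engine such as InnoDB (so that ROLLBACK actually undoes the delete), and your_table_clone is a hypothetical name for the copy.

```sql
-- Build a structural copy (columns and indexes), then copy the data.
CREATE TABLE your_table_clone LIKE your_table;
INSERT INTO your_table_clone SELECT * FROM your_table;

-- Rehearse the delete inside a transaction.
START TRANSACTION;

DELETE FROM your_table_clone WHERE id NOT IN (
    SELECT keepId FROM (
        SELECT MIN(id) AS keepId
        FROM your_table_clone
        GROUP BY unique_column
    ) AS keepers
);

-- The two numbers should match when unique_column contains no NULLs
-- (COUNT(DISTINCT ...) ignores NULL values).
SELECT COUNT(*) AS rows_left,
       COUNT(DISTINCT unique_column) AS distinct_values
FROM your_table_clone;

ROLLBACK;  -- or COMMIT once the numbers look right
```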

More Ways to De-duplicate

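Forget the Inner Join, Try `GROUP BY` with a `LEFT JOIN`

For those who would rather keep the newest row in each group, grouping and an anti-join come to the rescue. Every row that is not its group's MAX(id) finds no match in the grouped subquery and is deleted: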

DELETE your_table FROM your_table
LEFT JOIN (
    SELECT MAX(id) as lastId FROM your_table GROUP BY unique_column
) AS filtered ON your_table.id = filtered.lastId
WHERE filtered.lastId IS NULL;  -- This NULL is having serious identity problems.

This command preserves the row with the highest id in each unique_column group and deletes the rest.
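A quick way to confirm the result is a GROUP BY with HAVING that looks for any group still holding more than one row. This is a small sketch using the same placeholder names; the copies alias is illustrative.

```sql
-- Any row returned here is a group that still contains duplicates.
SELECT unique_column, COUNT(*) AS copies
FROM your_table
GROUP BY unique_column
HAVING COUNT(*) > 1;
```

After a successful de-duplication this query returns an empty result. Run before the delete, it doubles as a report of which values are duplicated and how often.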

Staging Area with Temporary Tables

Break a leg and stage the keepers in a temporary table first, then delete every row that shares a keeper's value but has a higher id:

CREATE TEMPORARY TABLE temp_table AS
SELECT MIN(id) as minId, unique_column FROM your_table GROUP BY unique_column;

DELETE your_table FROM your_table
JOIN temp_table ON your_table.id > temp_table.minId
AND your_table.unique_column = temp_table.unique_column; -- "I'm not a clone!"
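Two optional additions can make the staging approach friendlier on larger tables: an index on the staging table, created after the CREATE TEMPORARY TABLE step and before the DELETE, and an explicit cleanup afterwards. This is a sketch; idx_unique_column is a hypothetical index name.

```sql
-- Run after CREATE TEMPORARY TABLE and before the DELETE:
-- index the staging table so the join can look up each group quickly.
-- Assumes unique_column is an indexable type (a TEXT/BLOB column would need a prefix length).
ALTER TABLE temp_table ADD INDEX idx_unique_column (unique_column);

-- Run after the DELETE:
-- temporary tables otherwise stay around until the session ends.
DROP TEMPORARY TABLE IF EXISTS temp_table;
```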

User Variables - The Unconventional Way

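Get innovative and number the rows within each group using user variables, then delete every row numbered above 1. The numbering relies on MySQL evaluating the assignments in ORDER BY order, which is not guaranteed, and assigning user variables inside expressions is deprecated as of MySQL 8.0, so treat this as a legacy technique:

DELETE your_table FROM your_table
JOIN (
    SELECT id,
           @rownum := IF(@prev = unique_column, @rownum + 1, 1) AS rownum,
           @prev := unique_column AS prev_value
    FROM your_table, (SELECT @rownum := 0, @prev := NULL) r
    ORDER BY unique_column, id
) dup ON your_table.id = dup.id  -- join back to the real table so there is something to delete from
WHERE dup.rownum > 1;  -- "I'm feeling so lonely without my duplicates."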
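On MySQL 8.0 and later, the same row-numbering idea can be written with the ROW_NUMBER() window function instead of user variables. The following is a sketch under that version assumption; the ranked alias and the rn column name are illustrative.

```sql
-- Requires MySQL 8.0+ (window functions).
DELETE your_table FROM your_table
JOIN (
    -- Number the rows inside each group, lowest id first.
    SELECT id,
           ROW_NUMBER() OVER (PARTITION BY unique_column ORDER BY id) AS rn
    FROM your_table
) AS ranked ON your_table.id = ranked.id
WHERE ranked.rn > 1;  -- keep row 1 of every group, delete the rest
```

Because PARTITION BY groups NULL values together, this form also collapses duplicate rows whose unique_column is NULL.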