
Remove duplicate rows in MySQL

sql
best-practices
join
by Alex Kataev · Oct 10, 2024
TLDR

Dealing with duplicate rows in MySQL doesn't have to be a hassle. Use a deduplication query keyed on the column whose values should be unique. An INNER JOIN combined with a subquery removes the extra entries, keeping only the row with the highest id per value:

DELETE dup
FROM your_table dup
INNER JOIN (
    SELECT column_to_deduplicate, MAX(id) AS latest_id
    FROM your_table
    GROUP BY column_to_deduplicate
    HAVING COUNT(*) > 1
) AS grp
    ON dup.column_to_deduplicate = grp.column_to_deduplicate
    AND dup.id < grp.latest_id;

Substitute your_table and column_to_deduplicate with your table and column names. This eliminates duplicates, preserving the row with the highest ID for each unique column value.
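Before running the delete, it can be reassuring to preview which groups are affected. A minimal check, using the same placeholder names, lists each duplicated value along with the id that would survive:

SELECT column_to_deduplicate,
       COUNT(*) AS copies,
       MAX(id) AS surviving_id
FROM your_table
GROUP BY column_to_deduplicate
HAVING COUNT(*) > 1;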

Preemptive measures: Avoiding future duplicates

To block future duplicates, use ALTER TABLE ... ADD UNIQUE INDEX to establish a unique index on the relevant columns (this only succeeds once existing duplicates have been removed):

ALTER TABLE your_table ADD UNIQUE INDEX index_name (column_to_deduplicate);

Remember to back up your database before making structural changes. It's also good practice to review the release notes for your MySQL version so you don't rely on deprecated features (for example, ALTER IGNORE TABLE was removed in 5.7).
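If a full dump isn't at hand, a quick in-database copy makes a serviceable safety net; note that CREATE TABLE ... AS SELECT copies the data but not indexes or constraints, and your_table_backup is just an illustrative name:

CREATE TABLE your_table_backup AS
SELECT * FROM your_table;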

Handling special cases: Advanced de-duplication strategies

Getting real with NULL values

When duplicates involve NULL values (which a plain = never treats as equal), the NULL-safe equality operator <=> comes to the rescue:

DELETE t1
FROM your_table t1
INNER JOIN your_table t2
    ON t1.column_to_deduplicate <=> t2.column_to_deduplicate
    AND t1.id < t2.id;

It ensures that duplicates, including rows with NULL values, are booted out of your data, keeping only the highest id in each group.

Clean slate approach: Swapping tables

Start fresh by creating a new table with the same structure and adding a UNIQUE index:

CREATE TABLE new_table LIKE your_table;
ALTER TABLE new_table ADD UNIQUE INDEX index_name (column_to_deduplicate); -- Coding magic happening, please do not disturb

Transfer the rows, letting the unique index weed out duplicates (a plain SELECT * ... GROUP BY would be rejected under the default ONLY_FULL_GROUP_BY mode):

INSERT IGNORE INTO new_table
SELECT * FROM your_table; -- duplicate key? IGNORE quietly skips the row

Swap and drop is the final step:

RENAME TABLE your_table TO old_table, new_table TO your_table;
DROP TABLE old_table;

This leaves you with a duplicate-free table that already carries the unique constraint.

Tidy up: Managing temporary tables

Using temporary tables proves beneficial for complex deduplication tasks:

CREATE TEMPORARY TABLE temp_ids AS
SELECT MIN(id) AS keep_id
FROM your_table
GROUP BY column_to_deduplicate; -- Creating a 'HALL OF FAME' for 'keep_id's
-- No HAVING COUNT(*) > 1 here: every group, even a single row, must nominate its keeper,
-- otherwise the NOT IN delete below would wipe out the non-duplicated rows too.

Then let's use this table to get rid of duplicates:

DELETE FROM your_table
WHERE id NOT IN (SELECT keep_id FROM temp_ids); -- "I'm sorry, id, you're not in the list. You've got to go!"

Remember to dispose of these temporary tables after use to reclaim space and keep the database environment sparkling clean.
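They vanish on their own when the session ends, but dropping them explicitly is a one-liner:

DROP TEMPORARY TABLE IF EXISTS temp_ids;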

Mastering scale: Optimizing for large datasets

When dealing with sizable tables, a few scalability tricks can save the day:

Revving up delete operations

For a turbo-boost, swap the NOT IN subquery for a multi-table DELETE ... JOIN against a table that lists the ids slated for removal (a sketch of building such a table follows below):

DELETE your_table FROM your_table
JOIN temp_dup_ids ON your_table.id = temp_dup_ids.duplicate_id; -- "Join the party, or leave the table!"

This can significantly speed up the de-duplication operation.
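For completeness, here is one way the id list assumed above could be put together; temp_dup_ids and duplicate_id are illustrative names, not anything MySQL prescribes:

CREATE TEMPORARY TABLE temp_dup_ids AS
SELECT t.id AS duplicate_id
FROM your_table t
JOIN (
    SELECT column_to_deduplicate, MAX(id) AS keep_id
    FROM your_table
    GROUP BY column_to_deduplicate
) g
    ON t.column_to_deduplicate <=> g.column_to_deduplicate
    AND t.id < g.keep_id; -- every row below the keeper gets flagged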

On-the-fly duplicate resolution

Handle duplicates in real time during inserts; this relies on a PRIMARY KEY or the unique index defined earlier to detect the clash:

INSERT INTO your_table (column1, column2, ...)
VALUES (value1, value2, ...)
ON DUPLICATE KEY UPDATE column1 = value1, column2 = value2, ...; -- "Duplicate row detected. Aborting mission. Initiating update."

This automatically keeps the dataset tidy — fewer dishes to wash after dinner!
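As a concrete sketch that reuses the article's placeholder column plus a hypothetical other_column, VALUES() refers to the value the insert attempted to write:

INSERT INTO your_table (column_to_deduplicate, other_column)
VALUES ('some value', 'fresh data')
ON DUPLICATE KEY UPDATE other_column = VALUES(other_column); -- clash on the unique key? update instead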

Fresh start with IDs

When you want ids reassigned after cleanup, insert the deduplicated data into a freshly emptied copy of the table and leave the auto-increment column out of the column list so MySQL numbers the rows anew (this assumes old_table still holds the original data):

INSERT INTO your_table (column_to_deduplicate)
SELECT DISTINCT column_to_deduplicate FROM old_table; -- "Love means never having to say 'you're a duplicate'."

This comes in handy when ids need to be resequenced after a cleanup.
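If the counter itself should restart too, the auto-increment value can be reset; InnoDB will silently raise it to one past the highest id still present:

ALTER TABLE your_table AUTO_INCREMENT = 1;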

Non-duplicate charm

Rather than leaning on IGNORE and its limitations, build deduplication into your MySQL workflow with strategies that fit your data: GROUP BY staging queries, intermediate tables set up with unique constraints, and performance-tuned joins.