
MySQL: remove duplicates from a big database, fast

sql
deduplication
performance-optimization
mysql-queries
by Anton Shumikhin · Sep 19, 2024
TLDR

To purge duplicates from your MySQL database, transfer the unique records to a new table, then swap it in for the original. Check out this condensed process:

CREATE TABLE table_no_dupes AS SELECT DISTINCT * FROM original_table;
RENAME TABLE original_table TO old_table, table_no_dupes TO original_table;

This approach uses MySQL's DISTINCT clause to eliminate duplicates and RENAME for a swift swap with minimal downtime. Take a backup first, adjust the DISTINCT column list as required, and remember that CREATE TABLE ... AS SELECT copies data but not indexes or constraints, so re-create those on the new table.

In-depth strategies for efficient deduplication

Using unique constraints to avoid GROUP BY and DISTINCT pitfalls

On a large dataset, GROUP BY or SELECT DISTINCT can take unnecessarily long to execute. Instead, we're gunning for a method that handles big data efficiently.
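
For reference, this is the conventional duplicate check we're steering around; on millions of rows, this full-table aggregation is where the pain starts (text1 and text2 are assumed column names):

SELECT text1, text2, COUNT(*) AS copies
FROM original_table
GROUP BY text1, text2
HAVING copies > 1;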

Harnessing unique constraints for quick deduplication

The key here is using unique constraints. Slap a unique index on the offending columns, and MySQL makes sure that only unique data survives. Here's the classic example:

ALTER IGNORE TABLE original_table ADD UNIQUE INDEX idx_unique (text1, text2);

Now ALTER says, "I can't ignore you because you're UNIQUE!" 😉 Heads-up: ALTER IGNORE TABLE only works through MySQL 5.6; it was removed in 5.7, so on newer servers use the INSERT IGNORE approach below.

Retaining non-NULL values with INSERT ON DUPLICATE KEY UPDATE

Sometimes important data is mixed with NULLs across duplicate rows, so non-NULL values should win over NULL ones. Here's a method to do exactly that with our friend the IFNULL() function and INSERT INTO ... ON DUPLICATE KEY UPDATE (the target table needs a unique key, or the ON DUPLICATE KEY branch never fires):

INSERT INTO table_no_dupes (text1, text2, text3)
SELECT text1, text2, text3
FROM original_table
ON DUPLICATE KEY UPDATE
    -- keep the existing non-NULL value, otherwise take the incoming one
    -- (on MySQL 8.0.20+, row aliases are preferred over VALUES())
    text3 = IFNULL(text3, VALUES(text3));

This way, the NULL values step aside, and we're like - "Thank you NULL, but your services are no longer required!" 😅

Temporary tables and performance considerations

For larger tables, MySQL may fall back to its filesort algorithm to process the sizeable temporary table. The sort itself may be unavoidable, but adding indexes on the relevant columns eases the work considerably, as sketched below.
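
A quick way to confirm the suspicion is to inspect the plan first. A minimal sketch, assuming the duplicate key spans text1 and text2:

-- "Using temporary" / "Using filesort" in the Extra column signals costly work
EXPLAIN SELECT DISTINCT text1, text2 FROM original_table;

-- an index over the deduplication columns lets MySQL walk sorted data instead
CREATE INDEX idx_dedupe ON original_table (text1, text2);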

Advanced tactics for optimized deduplication

Keeping it unique with INSERT IGNORE

The INSERT IGNORE command quietly discards any row that would collide with a unique index on the target table, keeping your operation smooth:

INSERT IGNORE INTO temp_table SELECT * FROM original_table;

Just like that, duplicates magically disappear! Now where did I put my magic 🎩 and 🐇?
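
For INSERT IGNORE to have anything to ignore, temp_table must carry the unique index; put together, the full pass mirrors the TL;DR swap. A sketch, assuming duplicates are defined by text1 and text2:

-- clone the structure, then add the unique index that defines "duplicate"
CREATE TABLE temp_table LIKE original_table;
ALTER TABLE temp_table ADD UNIQUE INDEX idx_unique (text1, text2);

-- copy everything; rows colliding on the unique index are silently skipped
INSERT IGNORE INTO temp_table SELECT * FROM original_table;

-- the same atomic swap as in the TL;DR
RENAME TABLE original_table TO old_table, temp_table TO original_table;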

Deleting in batches for efficiency

Instead of picking off duplicates one by one, consider deleting duplicate records in a batch. Especially when the join runs on an indexed column, this improves efficiency:

DELETE original FROM original_table AS original
JOIN (
    SELECT MAX(id) AS max_id
    FROM original_table
    GROUP BY unique_column
    HAVING COUNT(unique_column) > 1
) dup ON original.id = dup.max_id;

Think of it as a one-click “delete duplicates” button for your database. One caveat: each pass removes only the highest id per duplicate group, so when a group holds more than two copies, rerun the statement until it affects zero rows.
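
When even one pass would touch too many rows at once, chunking keeps transactions small. Multi-table DELETE doesn't accept LIMIT, so here's a single-table sketch (the id primary key and the chunk size are assumptions; rerun until zero rows are affected):

-- the extra derived table sidesteps MySQL error 1093
-- ("You can't specify target table for update in FROM clause")
DELETE FROM original_table
WHERE id IN (
    SELECT max_id FROM (
        SELECT MAX(id) AS max_id
        FROM original_table
        GROUP BY unique_column
        HAVING COUNT(*) > 1
    ) AS dup
)
LIMIT 10000;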

Indexes: Adding and subtracting for optimal performance

It's important to manage your indexes deliberately. Creating effective indexes before you start expunging duplicates speeds up the queries above; conversely, helper indexes that normal operations don't need should be dropped once deduplication is done, so writes don't keep paying for them.
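
A minimal before/after sketch; idx_dedupe_helper and the column names are illustrative assumptions:

-- before: a helper index so duplicate hunting doesn't scan the whole table
CREATE INDEX idx_dedupe_helper ON original_table (text1, text2);

-- ... run your deduplication of choice from the sections above ...

-- after: drop the helper so normal writes stop paying for an unused index
DROP INDEX idx_dedupe_helper ON original_table;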