Explain Codes LogoExplain Codes Logo

Deleting duplicate rows from sqlite database

sql
database-normalization
performance-optimization
data-modeling
Nikita BarsukovbyNikita Barsukov·Aug 25, 2024
TLDR

Let's delete duplicates from an SQLite table using a subquery with GROUP BY. This identifies the first instance of each duplicate:

DELETE FROM your_table WHERE rowid NOT IN ( SELECT MIN(rowid) FROM your_table GROUP BY column_to_deduplicate );

This command omits rows that contain the lowest rowid for each unique value in your column_to_deduplicate. It leaves behind only one version of each record per unique value. Remember to replace column_to_deduplicate with the actual column containing duplicate values you wish to reduce.

Effective handling of duplicate data

Preventing duplicates: The proactive approach

Remember the saying, "Prevention is better than cure"? SQLite has got you covered! To avoid duplicates in the future, you can attach a unique constraint to the columns you want to avoid duplicating.

CREATE UNIQUE INDEX idx_column_uniq ON your_table(column_to_deduplicate);

This unique index won't allow you to add new duplicate rows. If tried, SQLite will stop you dead in your tracks. No more late-night sweep-ups!

Cleaning large-scale mess: New table creation

Working with a large dataset and worried about performance? Why not create a new table and populate it with the unique rows from your old table?

CREATE TABLE new_table AS SELECT DISTINCT * FROM your_table; DROP TABLE your_table; ALTER TABLE new_table RENAME TO your_table;

Performing these operations all at once can minimize I/O operations and lock-up times, preventing any transaction hangovers. SQLite is efficient - only if used wisely!

Guided handling of rowid and autoincrement

Understanding rowid behavior is crucial. SQLite gives every row an intrinsic rowid, that is unless you want a specific PRIMARY KEY. Be careful not to sprinkle AUTOINCREMENT everywhere as it can hamper performance. Don't take my word for it - check the SQLite documentation on rowid.

Sage tips for duplicate identification

Kick-starting with GROUP BY and a solitary column is a good start, but many times correct identification of duplicates might require a bit more elbow grease. Combining fields becomes your go-to method when unique identification can't rely on a single column:

DELETE FROM your_table WHERE rowid NOT IN ( SELECT MIN(rowid) FROM your_table GROUP BY column1, column2 );

With GROUP BY incorporating multiple columns, you're all set to respect the combination of values for determining uniqueness.

Performance optimization tricks

On larger tables, take a peek at EXPLAIN QUERY PLAN before running deletion queries to be aware of the performance costs. For cases when performance prefers being elusive, adopting incremental deletion strategies or specific optimizations to steer clear of the ever-so-annoying long-running transaction can come in handy.

Looking into the crystal ball: Future-proofing your data model

A well-normalized data model can help keep duplicates at bay and transform your hairy bug-ridden database into a lean, mean and green data machine! Take some time to learn about database normalization to ensure a future filled with clean and efficient databases.