Delete duplicate records in SQL Server?

sql

data-integrity

database-management

sql-queries

byAlex Kataev·Oct 25, 2024

To swiftly deal with duplicates in an SQL Server table, CTEs (Common Table Expressions) and ROW_NUMBER() are your best pals. Assign row numbers to rows, grouped by the duplicated columns, then exterminate those with a row_num > 1.

Here's a quick run-down:

-- Grab me a biohazard suit! Time to purge some duplicates!
WITH Dupes AS (
  SELECT
    ROW_NUMBER() OVER (PARTITION BY DuplicateColumn ORDER BY Id) AS row_num
  FROM
    TableName
)
DELETE FROM Dupes WHERE row_num > 1;

Just replace TableName with your table name, DuplicateColumn with the duplicate field, and Id with your unique identifier. To infinity and beyond!

Quick glimpse: What are you deleting?

Don't just dive in! Always preview the records you're about to zap. Trade the DELETE command with a SELECT statement:

-- Can we get a quick roll call of the duplicates?
SELECT * FROM Dupes WHERE row_num > 1;

Let your eyes feast on the duplicates before they wave goodbye. If only we could've got to know them better!

Tackling complex data types

Got binary data or GUID columns at hand? When using our good ol' pal MIN(), don't forget to cast the binary data:

-- Binary data? Who you gonna call? Ghostbusters!
SELECT MIN(CAST(GuidColumn AS BINARY(16))) FROM TableName GROUP BY DuplicateColumn;

This avoids freaky mishaps that we all hate!

Maintaining data integrity

A nameless king leads a doomed kingdom! Always know your data as well as Sansa Stark knows the North. Apply constraints or unique indexes to keep duplicates as far away as the Wall:

-- Uniqueness shall pass!
ALTER TABLE TableName ADD CONSTRAINT UC_TableName UNIQUE (UniqueColumn);

Managing data the efficient way

Managing large tables is harder than taming dragons. CTEs combined with deletion are the Dragonglass that works wonders against massive data tables:

-- CTE and delete, the Game of Thrones power couple!
WITH CTE AS ( /* your magical code here */ )
DELETE FROM CTE WHERE /* here be dragons */;

Mastering the self-joins and NOT IN clause

Another wrench in your tools is a self-join or NOT IN clause with a subquery to fend off those pesky duplicates:

-- "NOT IN" to the rescue!
DELETE FROM TableName
WHERE Id NOT IN (
  SELECT MIN(Id) FROM TableName GROUP BY DuplicateColumn
);

This strategy fits like a glove when you have a unique identifier for grouping duplicates.

Maximizing efficiency: Deleting with conditionals

When deleting, using MAX() can keep the latest entry, assuming your dataset is a drama queen who loves conditions:

-- MAX(): Because sometimes, size does matter!
WITH RankedDuplicates AS (
  SELECT
    MAX(Id) OVER (PARTITION BY DuplicateColumn) AS KeepId,
    Id
  FROM
    TableName
)
DELETE FROM TableName
WHERE Id NOT IN (SELECT KeepId FROM RankedDuplicates);