Delete duplicate rows from small table

sql

delete-duplicates

postgresql

data-management

byAnton Shumikhin·Oct 6, 2024

To eliminate duplicate entries in a small SQL table like ninja swipes out enemies, consider using CTE (Common Table Expressions) with ROW_NUMBER(). This approach assigns a numeric rank to every row in duplicate groups, permitting targeted removal:

-- Quick solution for small tables; like a ninja with shurikens
WITH RankedDuplicates AS (
  SELECT *, ROW_NUMBER() OVER (PARTITION BY unique_column ORDER BY id) AS rn
  FROM table_name
)
DELETE FROM RankedDuplicates WHERE rn > 1;  -- There can be only one! (Highlander reference 🗡️)

unique_column should be replaced with the column you're deduplicating, and swap table_name with your actual table name. This process retains the first occurrence of duplicates and removes the rest.

Approach breakdown: How-to guides for various scenarios

Take pre-emptive action with UNIQUE constraints

To actively discourage new duplicates like a schoolyard monitor, add UNIQUE constraints to keep your data clean:

-- Adding unique constraints are like strict teachers in a schoolyard; no funny business allowed!
ALTER TABLE table_name
ADD CONSTRAINT constraint_name UNIQUE(unique_column1, unique_column2);

This exercise is highly effective in warding off new duplicates.

Utilise CTID to efficiently handle duplicates

When dealing with PostgreSQL, consider CTID - the pole vault stick that leaps over duplicates. Use it this way:

-- CTID is like your personal pole vaulter that easily leaps over duplicates
DELETE FROM table_name
WHERE ctid NOT IN (
  SELECT MIN(ctid)
  FROM table_name
  GROUP BY unique_column
);

The ctid field is a unique row identifier and a particularly handy tool in PostgreSQL for deleting duplicates.

Large dataset? No worries

On encountering a large table, consider creating a new one housing distinct entries - like vacuuming only the wanted particles:

-- For large tables, creating a new one with distinct entries is like Thanos' snap; half of the duplicates disappear
CREATE TABLE new_table AS
SELECT DISTINCT ON (unique_column) *
FROM table_name;

After which, rename the new_table to replace the old. It's efficient, especially if your original table is a hot mess of duplicates.

Hot potato: dealing with cases without keys

Addressing a table with no keys

If your table started life sans keys, panic not. EXIST operators or NOT EXISTS can come to your rescue in DELETE statements:

-- This is like using a metal detector to find hidden treasures
DELETE FROM table_name t1
WHERE EXISTS (
  SELECT 1
  FROM table_name t2
  WHERE t2.unique_column = t1.unique_column
  AND t2.id < t1.id
);

Adding keys post-design

Missing primary keys from the onset? No sweat! Add them later for smoother data manipulation:

-- Adding primary key later is like giving a name to a pet; they feel part of the family now!
ALTER TABLE table_name ADD COLUMN id SERIAL PRIMARY KEY;