Finding duplicate values in a SQL table

sql

database-systems

group-by-clause

data-integrity

byAlex Kataev·Oct 26, 2024

Detect duplicate rows based on name and email using straightforward GROUP BY:

-- We're on a quest to find the clones...
SELECT name, email, COUNT(*)
FROM your_table
GROUP BY name, email
HAVING COUNT(*) > 1;  -- "There's only room for one in this town..."

To remove duplicates while preserving one copy, create a ROW_NUMBER() within a CTE and delete accordingly:

-- CTE to the rescue!
WITH CTE AS (
    SELECT *, ROW_NUMBER() OVER(PARTITION BY name, email ORDER BY id) AS rn
    FROM your_table
)
DELETE FROM CTE WHERE rn > 1;  -- "Sayonara, duplicates!"

Always backup your data first. Wrapping deletions within transactions is a smart move to maintain a healthy database avoiding unintentional data loss.

Surviving the `GROUP BY` Maze

When you dive into different database systems, each comes with its nuances while dealing with the GROUP BY clause and handling duplicates. Here are key takeaways for major players:

PostgreSQL: It's forgiving, a GROUP BY on just the dependent columns will do.
MySQL: You need to be more precise. The ONLY_FULL_GROUP_BY setting ensures all selected fields appear in the GROUP BY.
SQL Server: It insists all non-aggregated selected columns be in the GROUP BY.
Oracle: A bit unconventional, it may require a slightly different approach.

Mastering the Art of Deletion

Sometimes blanket removal of duplicates can be risky. Check below to tailor your deletion strategies:

Safe Deletion with Primary Key Avoid collateral damage - always use the primary key to identify rows for deletion.

-- Very specific, like a hired assassin...
DELETE FROM your_table
WHERE id IN (
    SELECT id FROM (
        SELECT id, ROW_NUMBER() OVER(PARTITION BY name, email ORDER BY id) AS rn
        FROM your_table
    ) temp
    WHERE rn > 1
);

Selective Deletion Not all duplicates are created equal. Carefully review them before deciding which to strike off.
Multi-Field Duplicates If duplicates exist across multiple fields, just widen the PARTITION BY clause.

Fortifying Data Integrity

Preventing duplicates before they creep in is better than chasing them down later. Let's arm ourselves for that:

Unique Constraints: Erect barriers at the outset. Use unique constraints or indexes to bar duplicate entries.
Normalization: Adhere to database normalization principles to structure your tables so they deter redundancies.
Insertion Logic: Cook up logic in your data layer to prevent feeding duplicates.