
MySQL Select Records for Duplicates Using Multiple Columns

sql
prompt-engineering
join
performance
by Alex Kataev · Nov 20, 2024
TLDR

A handy way to identify duplicate rows across col1 and col2 is to GROUP BY those columns and filter the result with HAVING COUNT(*) > 1, keeping only the combinations that occur more than once:

SELECT col1, col2, COUNT(*) AS duplicates
FROM my_table
GROUP BY col1, col2
HAVING duplicates > 1;

This query yields the duplicate count for each (col1, col2) pair.

Handling NULL values like a pro

Dealing with NULL values is an art. When writing SQL queries, it's important to handle NULL values wisely, as they can distort comparison results. MySQL's GROUP BY does treat NULLs as a single group, but ordinary equality comparisons do not: NULL = NULL evaluates to NULL, so rows with NULLs slip through joins and WHERE clauses. To treat NULLs as matching values, use the NULL-safe equal-to operator <=> for comparisons, or the COALESCE function to substitute a placeholder before grouping:

-- Dealing with NULLs: SQL edition. (Also, who named it NULL and not "ghost value"? Missed opportunity!)
SELECT COALESCE(col1, 'constant') AS col1_val,
       COALESCE(col2, 'constant') AS col2_val,
       COUNT(*)
FROM my_table
GROUP BY col1_val, col2_val
HAVING COUNT(*) > 1;

The COALESCE function steps in to replace NULL with a constant value, making NULL comparisons a breeze. Just pick a placeholder that cannot appear in your real data.

Beating duplicates at their own game: Get all duplicate rows

To catch all duplicate rows sneaking around, self-join the original table with the GROUP BY query results:

-- SQL: Join me in capturing sneaky duplicates!
SELECT a.*
FROM my_table a
JOIN (
    SELECT col1, col2
    FROM my_table
    GROUP BY col1, col2
    HAVING COUNT(*) > 1
) b ON a.col1 = b.col1 AND a.col2 = b.col2;

This reveals all the duplicate rows, and not just the grouped counts.
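If col1 or col2 can contain NULL, the plain = join above silently skips those rows. A minimal variation, using the NULL-safe <=> operator mentioned earlier, also matches pairs where both sides are NULL:

-- Same self-join, but NULL <=> NULL counts as a match
SELECT a.*
FROM my_table a
JOIN (
    SELECT col1, col2
    FROM my_table
    GROUP BY col1, col2
    HAVING COUNT(*) > 1
) b ON a.col1 <=> b.col1 AND a.col2 <=> b.col2;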

Performance tuning: Make your queries fly!

When hunting down duplicates, speed is your best ally. Prefer joins over correlated subqueries and UNION-based workarounds, because joins are faster and more scalable on large datasets. Additionally, indexing the columns used in GROUP BY can supercharge your query performance.
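A composite index covering both grouping columns lets MySQL resolve the GROUP BY from the index alone. A minimal sketch, assuming the my_table, col1, and col2 names from the examples above (the index name is made up):

-- Composite index so the GROUP BY can be satisfied from the index
CREATE INDEX idx_my_table_col1_col2 ON my_table (col1, col2);

-- Check that the duplicate query now uses the index
EXPLAIN
SELECT col1, col2, COUNT(*) AS duplicates
FROM my_table
GROUP BY col1, col2
HAVING duplicates > 1;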

Advanced tricks: Taming edge cases

Identical vs. Similar: Not all duplicates wear capes

Sometimes you need to treat similar records as duplicates. Fret not: range-based grouping or fuzzy matching can help. Use functions like DATEDIFF for dates, or ROUND for numeric values:

-- Who needs precision when you've got SQL!
SELECT col1, ROUND(col2, 1) AS rounded_col2, COUNT(*)
FROM my_table
GROUP BY col1, ROUND(col2, 1)
HAVING COUNT(*) > 1;
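The same idea works for dates: group on a truncated value so near-identical timestamps land in the same bucket. A minimal sketch, assuming a hypothetical created_at datetime column:

-- created_at is a hypothetical datetime column; rows from the same calendar day group together
SELECT col1, DATE(created_at) AS created_day, COUNT(*)
FROM my_table
GROUP BY col1, created_day
HAVING COUNT(*) > 1;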

Complex duplicates - Sherlock mode

When there's more to duplicates than meets the eye, or the criteria go beyond direct column comparisons, functions or expressions in the GROUP BY clause come to the rescue.

SELECT CONCAT(col1, '-', col2) AS complex_criteria, COUNT(*)
FROM my_table
GROUP BY complex_criteria
HAVING COUNT(*) > 1;

Post-processing: What's next?

Once the duplicates are caught, you may need to flag or de-duplicate them. SQL provides tools like window functions (e.g., ROW_NUMBER, available in MySQL 8.0+) for such post-processing.

SELECT a.*,
       CASE WHEN ROW_NUMBER() OVER (PARTITION BY col1, col2 ORDER BY id) = 1
            THEN 'original'
            ELSE 'duplicate'
       END AS dupe_status
FROM my_table a;
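If the goal is to actually remove the extras rather than just flag them, a multi-table DELETE with a self-join keeps the row with the lowest id in each (col1, col2) group. A minimal sketch, assuming the same id primary key used above:

-- Delete every duplicate except the row with the smallest id in each group
DELETE a
FROM my_table a
JOIN my_table b
  ON a.col1 <=> b.col1
 AND a.col2 <=> b.col2
 AND a.id > b.id;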