
How to select records without duplicate on just one field in SQL?

sql
distinct
aggregate-functions
subqueries
by Nikita Barsukov · Dec 26, 2024
TLDR

In SQL, to eliminate duplicates based on a single field, you can use a window function such as ROW_NUMBER(). Window functions are available in PostgreSQL, SQL Server, Oracle, SQLite 3.25+, and MySQL 8.0+:

SELECT *
FROM (
    SELECT *,
           ROW_NUMBER() OVER (PARTITION BY name ORDER BY id) AS rn
    FROM users
) AS UniqueUsers
WHERE rn = 1;

This query partitions the data using the name field and orders it by id to maintain result consistency. Then it only includes the first row from each group, effectively removing duplicates based on the name.
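To try the query without setting up a database server, here is a minimal sketch that runs it against an in-memory SQLite database through Python's sqlite3 module. The email column and the sample rows are made up for illustration, and the sketch assumes the bundled SQLite is version 3.25 or newer (needed for window functions).

```python
import sqlite3

# Hypothetical sample data: "alice" and "bob" each appear twice.
SAMPLE_USERS = [
    (1, "alice", "a1@example.com"),
    (2, "bob", "b1@example.com"),
    (3, "alice", "a2@example.com"),
    (4, "bob", "b2@example.com"),
    (5, "carol", "c1@example.com"),
]


def first_row_per_name(rows):
    """Return one (id, name, email) row per name, keeping the smallest id."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE users (id INTEGER PRIMARY KEY, name TEXT, email TEXT)"
    )
    conn.executemany("INSERT INTO users VALUES (?, ?, ?)", rows)
    query = """
        SELECT id, name, email
        FROM (
            SELECT id, name, email,
                   ROW_NUMBER() OVER (PARTITION BY name ORDER BY id) AS rn
            FROM users
        ) AS UniqueUsers
        WHERE rn = 1
        ORDER BY id
    """
    result = conn.execute(query).fetchall()
    conn.close()
    return result


print(first_row_per_name(SAMPLE_USERS))
# [(1, 'alice', 'a1@example.com'), (2, 'bob', 'b1@example.com'), (5, 'carol', 'c1@example.com')]
```

The outer ORDER BY id is only there to make the printed order predictable; it plays no part in the deduplication itself.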

Aggregate functions: the uniqueness champions

For a table with several records per name, GROUP BY combined with an aggregate function returns one row per name:

SELECT MIN(id) AS id, name -- selecting the "least Id-ed" hero from the group
FROM users
GROUP BY name; -- band together by name, Avengers style!

By using MIN(id), we get the smallest id for each distinct name. Every selected column that is not in the GROUP BY clause has to be wrapped in an aggregate function; otherwise most databases reject the query, and the ones that accept it return an arbitrary value from the group.
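The GROUP BY query only returns id and name. If you need the remaining columns of the surviving row, one common option is to join the grouped ids back to the table. The sketch below shows that idea with Python's sqlite3 module and an in-memory database; the email column and the sample rows are assumptions for illustration.

```python
import sqlite3

# Hypothetical sample data with an extra email column.
SAMPLE_USERS = [
    (1, "alice", "a1@example.com"),
    (2, "bob", "b1@example.com"),
    (3, "alice", "a2@example.com"),
    (4, "bob", "b2@example.com"),
    (5, "carol", "c1@example.com"),
]


def first_full_row_per_name(rows):
    """Join MIN(id) per name back to users to recover the full row."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE users (id INTEGER PRIMARY KEY, name TEXT, email TEXT)"
    )
    conn.executemany("INSERT INTO users VALUES (?, ?, ?)", rows)
    query = """
        SELECT u.id, u.name, u.email
        FROM users AS u
        JOIN (
            SELECT MIN(id) AS id   -- smallest id in each name group
            FROM users
            GROUP BY name
        ) AS firsts ON firsts.id = u.id
        ORDER BY u.id
    """
    result = conn.execute(query).fetchall()
    conn.close()
    return result


print(first_full_row_per_name(SAMPLE_USERS))
# [(1, 'alice', 'a1@example.com'), (2, 'bob', 'b1@example.com'), (5, 'carol', 'c1@example.com')]
```

Because this version uses no window functions, it should also work on older database versions that lack them.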

DISTINCT: an efficient yet tricky tool

The DISTINCT keyword gets you unique values, but it cannot deduplicate on one field while still returning the others. Here's why:

SELECT DISTINCT name, OTHER_FIELD -- like ordering a combo meal, but you're getting a distinct combo
FROM users;

In this scenario, DISTINCT operates on the collective uniqueness of all the selected fields, not individual ones. If OTHER_FIELD has varied values for the same name, expect all those combinations in your results.
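A small experiment makes the difference visible. The sketch below, again using Python's sqlite3 module with an in-memory table, compares DISTINCT on one column with DISTINCT on two. The city column stands in for OTHER_FIELD and, like the sample rows, is purely hypothetical.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE users (id INTEGER PRIMARY KEY, name TEXT, city TEXT)")
conn.executemany(
    "INSERT INTO users VALUES (?, ?, ?)",
    [
        (1, "alice", "Paris"),
        (2, "alice", "Paris"),  # exact repeat of the (name, city) pair
        (3, "alice", "Rome"),   # same name, different city
        (4, "bob", "Oslo"),
    ],
)

# DISTINCT over a single column: one row per name.
names_only = conn.execute(
    "SELECT DISTINCT name FROM users ORDER BY name"
).fetchall()

# DISTINCT over two columns: one row per (name, city) combination.
name_city_pairs = conn.execute(
    "SELECT DISTINCT name, city FROM users ORDER BY name, city"
).fetchall()
conn.close()

print(names_only)       # [('alice',), ('bob',)]
print(name_city_pairs)  # [('alice', 'Paris'), ('alice', 'Rome'), ('bob', 'Oslo')]
```

Only the exact repeat (alice, Paris) collapses in the second query, so alice still shows up twice.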

The power of subqueries multiplied with CASE

A correlated subquery can keep only the row with the smallest id for each name:

SELECT *
FROM users u1
WHERE id = (
    SELECT MIN(u2.id) -- only, and I repeat, ONLY the "min" id gets to be in the cool squad
    FROM users u2
    WHERE u2.name = u1.name
);

Additionally, a CASE expression can be tucked inside a subquery to flag duplicates, so that the outer query can steer them away from the final result.
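The article does not spell out what the CASE variant looks like, so the following is only one possible reading: a CASE expression compares each row's id with the smallest id for its name and writes the outcome to a flag column, and the outer query keeps the unflagged rows. The is_duplicate column name and the sample rows are assumptions, and the sketch runs on an in-memory SQLite database via Python's sqlite3 module.

```python
import sqlite3

# Inner query: 0 for the first row of each name, 1 for every later duplicate.
FLAG_QUERY = """
    SELECT u1.id, u1.name,
           CASE
               WHEN u1.id = (SELECT MIN(u2.id)
                             FROM users u2
                             WHERE u2.name = u1.name)
               THEN 0
               ELSE 1
           END AS is_duplicate
    FROM users u1
"""

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE users (id INTEGER PRIMARY KEY, name TEXT)")
conn.executemany(
    "INSERT INTO users VALUES (?, ?)",
    [(1, "alice"), (2, "bob"), (3, "alice"), (4, "bob"), (5, "carol")],
)

# All rows with their flag, to see what the CASE expression produces.
flagged = conn.execute(FLAG_QUERY + " ORDER BY u1.id").fetchall()

# Outer query: keep only the rows that were not flagged as duplicates.
keepers = conn.execute(
    "SELECT id, name FROM (" + FLAG_QUERY + ") AS flagged "
    "WHERE is_duplicate = 0 ORDER BY id"
).fetchall()
conn.close()

print(flagged)  # [(1, 'alice', 0), (2, 'bob', 0), (3, 'alice', 1), (4, 'bob', 1), (5, 'carol', 0)]
print(keepers)  # [(1, 'alice'), (2, 'bob'), (5, 'carol')]
```

Keeping the flag as a column can be handy when you want to inspect or count the duplicates before discarding them.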

Say no to redundancy, say yes to efficiency

Avoiding redundant rows keeps your results accurate and leaves less data for later steps to process. The ROW_NUMBER() function coupled with PARTITION BY can be your trusty tool:

SELECT id, name
FROM (
    SELECT id, name,
           ROW_NUMBER() OVER (PARTITION BY name ORDER BY id) AS rn -- assigning roll numbers to avoid chaos in class!
    FROM users
) AS RankedUsers
WHERE rn = 1; -- only the top performers get the medal

Each row gets a number within its name partition, and keeping only rn = 1 leaves exactly one row per name. The ORDER BY id inside OVER matters: without it, which row ends up numbered 1 is not guaranteed, and SQL Server rejects ROW_NUMBER() without an ORDER BY altogether.
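The ordering inside the OVER clause decides which row of each group survives. As a sketch of that, the example below keeps the most recently updated row per name instead of the one with the smallest id. The updated_at column and the sample rows are hypothetical, and the code assumes Python's sqlite3 module is backed by SQLite 3.25 or newer.

```python
import sqlite3


def latest_row_per_name(rows):
    """Return one (id, name, updated_at) row per name, keeping the newest."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE users (id INTEGER PRIMARY KEY, name TEXT, updated_at TEXT)"
    )
    conn.executemany("INSERT INTO users VALUES (?, ?, ?)", rows)
    query = """
        SELECT id, name, updated_at
        FROM (
            SELECT id, name, updated_at,
                   ROW_NUMBER() OVER (
                       PARTITION BY name
                       ORDER BY updated_at DESC, id DESC  -- id breaks date ties
                   ) AS rn
            FROM users
        ) AS RankedUsers
        WHERE rn = 1
        ORDER BY id
    """
    result = conn.execute(query).fetchall()
    conn.close()
    return result


# Hypothetical rows; ISO dates stored as text sort correctly as strings.
SAMPLE_USERS = [
    (1, "alice", "2024-01-01"),
    (2, "alice", "2024-03-01"),
    (3, "bob", "2024-02-01"),
    (4, "bob", "2024-02-01"),  # same date as id 3, so id DESC decides
]

print(latest_row_per_name(SAMPLE_USERS))
# [(2, 'alice', '2024-03-01'), (4, 'bob', '2024-02-01')]
```

Adding id as a second sort key is a design choice to keep the result deterministic when two rows share the same timestamp.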

The EXISTential approach: because duplicates don't Exist, or do they?

The EXISTS clause can keep only the rows whose name occurs exactly once:

SELECT *
FROM users u1
WHERE EXISTS ( -- if it exists, it matters
    SELECT 1
    FROM users u2
    WHERE u1.name = u2.name
    GROUP BY u2.name
    HAVING COUNT(*) = 1 -- because we believe in oneness!
);

This returns only the rows whose name appears exactly once in the table. Names that occur more than once are dropped entirely rather than reduced to a single row.
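Since that behaviour is easy to confuse with "keep one row per name", here is a sketch that runs both side by side: the EXISTS query from above, and a NOT EXISTS variant (not from the original article) that keeps a row only when no row with the same name has a smaller id. It uses Python's sqlite3 module with made-up sample rows.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE users (id INTEGER PRIMARY KEY, name TEXT)")
conn.executemany(
    "INSERT INTO users VALUES (?, ?)",
    [(1, "alice"), (2, "bob"), (3, "alice"), (4, "bob"), (5, "carol")],
)

# EXISTS + HAVING COUNT(*) = 1: only names that occur exactly once survive.
singletons = conn.execute("""
    SELECT id, name
    FROM users u1
    WHERE EXISTS (
        SELECT 1
        FROM users u2
        WHERE u1.name = u2.name
        GROUP BY u2.name
        HAVING COUNT(*) = 1
    )
    ORDER BY id
""").fetchall()

# NOT EXISTS: keep a row unless an earlier row (smaller id) shares its name.
first_per_name = conn.execute("""
    SELECT id, name
    FROM users u1
    WHERE NOT EXISTS (
        SELECT 1
        FROM users u2
        WHERE u2.name = u1.name
          AND u2.id < u1.id
    )
    ORDER BY id
""").fetchall()
conn.close()

print(singletons)      # [(5, 'carol')]
print(first_per_name)  # [(1, 'alice'), (2, 'bob'), (5, 'carol')]
```

Pick the first form when duplicated names should disappear completely, and the second when each name should be represented once.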

Pacing and spacing: tackling sequential duplicates

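For datasets where duplicates follow each other, the LAG or LEAD functions can be your best buddies. They let you compare each row with its neighbour in a given order and keep a row only when the value changes: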

SELECT id, name
FROM (
    SELECT id, name,
           LAG(name) OVER (ORDER BY id) AS prev_name -- tracking the one who led
    FROM users
) AS RaceTrack
WHERE name <> prev_name OR prev_name IS NULL; -- only unique racers allowed, no twins please!
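Keep in mind that this collapses only consecutive runs: a name that comes back later in the sequence is kept again. The sketch below illustrates that with Python's sqlite3 module and an in-memory table; the sample rows are invented, and window function support (SQLite 3.25 or newer) is assumed.

```python
import sqlite3


def collapse_consecutive_names(rows):
    """Keep a row only when its name differs from the previous row's name."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE users (id INTEGER PRIMARY KEY, name TEXT)")
    conn.executemany("INSERT INTO users VALUES (?, ?)", rows)
    query = """
        SELECT id, name
        FROM (
            SELECT id, name,
                   LAG(name) OVER (ORDER BY id) AS prev_name
            FROM users
        ) AS RaceTrack
        WHERE name <> prev_name      -- value changed compared to previous row
           OR prev_name IS NULL      -- very first row has no predecessor
        ORDER BY id
    """
    result = conn.execute(query).fetchall()
    conn.close()
    return result


# Hypothetical sequence: alice, alice, bob, alice, alice, carol
SAMPLE_USERS = [
    (1, "alice"),
    (2, "alice"),
    (3, "bob"),
    (4, "alice"),
    (5, "alice"),
    (6, "carol"),
]

print(collapse_consecutive_names(SAMPLE_USERS))
# [(1, 'alice'), (3, 'bob'), (4, 'alice'), (6, 'carol')]
```

Here alice appears twice in the output because her two runs are separated by bob; if each name should appear only once overall, the ROW_NUMBER() or GROUP BY approaches are the better fit.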