SQL query for finding records where count > 1

by Anton Shumikhin · Jan 24, 2025

Find column values that appear in more than one row using:

SELECT column_name, COUNT(*) AS occurrences
FROM table_name
GROUP BY column_name
HAVING COUNT(*) > 1;

The GROUP BY clause collects rows that share the same value in column_name into one group, and the HAVING clause keeps only the groups with more than one row. Note that the result lists each duplicated value with its count, not the individual rows. This approach is usually sufficient for small to medium-sized tables. Next, let's look at query variants for larger datasets and the pitfalls you may encounter.
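To try the pattern end to end, here is a minimal sketch using Python's built-in sqlite3 module. The users table, its columns, and the sample rows are hypothetical and exist only for illustration. The second query is one possible way to pull the full rows behind each duplicated value, by reusing the grouped query as a subquery.

```python
import sqlite3

# Hypothetical table and data, used only for illustration.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE users (id INTEGER PRIMARY KEY, email TEXT)")
conn.executemany(
    "INSERT INTO users (email) VALUES (?)",
    [
        ("a@example.com",),
        ("b@example.com",),
        ("a@example.com",),
        ("c@example.com",),
        ("a@example.com",),
    ],
)

# Values that occur more than once, with their counts.
duplicate_values = conn.execute(
    """
    SELECT email, COUNT(*) AS occurrences
    FROM users
    GROUP BY email
    HAVING COUNT(*) > 1
    ORDER BY email
    """
).fetchall()

# Full rows behind those values, using the grouped query as a subquery.
duplicate_rows = conn.execute(
    """
    SELECT id, email
    FROM users
    WHERE email IN (
        SELECT email FROM users GROUP BY email HAVING COUNT(*) > 1
    )
    ORDER BY id
    """
).fetchall()

print(duplicate_values)
print(duplicate_rows)
```

The first result contains one entry per duplicated value, while the second contains every row that carries such a value.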

Two-step approach for large datasets

With large tables, performance matters. One option is a two-step query that first reduces the data to distinct rows and then groups the result:

-- the "WINNING" strategy for efficiency, *wink*
WITH DistinctRecords AS (
  SELECT DISTINCT account, user_id, ZIP, date
  FROM payment
)
SELECT account, user_id, COUNT(*)
FROM DistinctRecords
GROUP BY account, user_id
HAVING COUNT(*) > 1;

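Here the common table expression (CTE) first reduces the table to distinct (account, user_id, ZIP, date) combinations, so the outer query counts distinct combinations per account and user rather than raw rows. An exact repeat of the same payment therefore no longer counts as a duplicate. Whether this is faster than a single GROUP BY depends on the engine and the available indexes, so check the execution plan.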
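The effect of the DISTINCT step is easiest to see on a small table. The sketch below runs the plain grouped query and the CTE version side by side in SQLite; the sample rows are made up for illustration, and the payment columns simply follow the query above.

```python
import sqlite3

# Hypothetical sample data; the payment columns follow the query above.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE payment (account TEXT, user_id TEXT, ZIP TEXT, date TEXT)")
conn.executemany(
    "INSERT INTO payment VALUES (?, ?, ?, ?)",
    [
        ("A1", "u1", "10001", "2023-01-01"),
        ("A1", "u1", "10001", "2023-01-01"),  # exact repeat of the row above
        ("A2", "u2", "20002", "2023-01-01"),
        ("A2", "u2", "20003", "2023-01-01"),  # same account and user, new ZIP
        ("A3", "u3", "30003", "2023-01-01"),
    ],
)

# Single-step version: counts raw rows, so the exact repeat is flagged too.
raw_counts = conn.execute(
    """
    SELECT account, user_id, COUNT(*)
    FROM payment
    GROUP BY account, user_id
    HAVING COUNT(*) > 1
    ORDER BY account
    """
).fetchall()

# Two-step version: exact repeats are collapsed before counting.
deduplicated_counts = conn.execute(
    """
    WITH DistinctRecords AS (
      SELECT DISTINCT account, user_id, ZIP, date
      FROM payment
    )
    SELECT account, user_id, COUNT(*)
    FROM DistinctRecords
    GROUP BY account, user_id
    HAVING COUNT(*) > 1
    ORDER BY account
    """
).fetchall()

print(raw_counts)
print(deduplicated_counts)
```

With this data the single-step query reports both A1 and A2, while the two-step query reports only A2, because the two A1 rows are identical.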

Self-join for intricate scenarios

For more complex conditions, such as different ZIP codes recorded for the same account on the same date, a self-join is useful:

-- That awkward moment when you join a table with itself.
SELECT a.user_id, COUNT(*)
FROM payment a
JOIN payment b ON a.account = b.account AND a.ZIP != b.ZIP AND a.date = b.date
GROUP BY a.user_id
HAVING COUNT(*) > 1;

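A self-join lets us compare each row with the other rows of the same account and date, so we are comparing apples to apples. Keep in mind that COUNT(*) here counts joined pairs, not payment rows. Two rows of the same user with different ZIP codes produce two pairs (one per direction) and pass the > 1 filter, whereas a user with a single row that conflicts with one row of another user gets a count of 1 and is filtered out.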
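Because a self-join counts pairs rather than rows, it helps to run it on a tiny dataset first. The sketch below uses made-up rows in SQLite and writes the inequality with the standard <> operator. The second query, which lists every account with conflicting ZIP codes regardless of the pair count, is an alternative added here for comparison and is not part of the original query.

```python
import sqlite3

# Hypothetical sample data for the payment table used above.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE payment (account TEXT, user_id TEXT, ZIP TEXT, date TEXT)")
conn.executemany(
    "INSERT INTO payment VALUES (?, ?, ?, ?)",
    [
        ("A1", "u1", "10001", "2023-01-01"),
        ("A1", "u1", "10002", "2023-01-01"),  # same user, different ZIP
        ("A2", "u2", "20001", "2023-01-01"),
        ("A2", "u3", "20002", "2023-01-01"),  # different user, different ZIP
    ],
)

SELF_JOIN = """
    FROM payment a
    JOIN payment b
      ON a.account = b.account AND a.ZIP <> b.ZIP AND a.date = b.date
"""

# The article's query: counts joined pairs per user.
pairs_per_user = conn.execute(
    "SELECT a.user_id, COUNT(*)" + SELF_JOIN +
    "GROUP BY a.user_id HAVING COUNT(*) > 1 ORDER BY a.user_id"
).fetchall()

# Alternative (an assumption, not from the article): any surviving pair
# already proves a conflict, so list the affected accounts directly.
conflicting_accounts = conn.execute(
    "SELECT DISTINCT a.account" + SELF_JOIN + "ORDER BY a.account"
).fetchall()

print(pairs_per_user)
print(conflicting_accounts)
```

On this data the pair count flags only u1, while the account-level query lists both A1 and A2, so the choice depends on which question you are asking.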

Accurate counting of distinct entries

To count distinct values rather than rows, use COUNT(DISTINCT ...):

-- Even ZIP codes need their 15 minutes of fame.
SELECT user_id, account, COUNT(DISTINCT ZIP)
FROM payment
WHERE date = '2023-01-01'
GROUP BY user_id, account
HAVING COUNT(DISTINCT ZIP) > 1;

This query returns each user and account pair that used more than one distinct ZIP code on the given date. Repeated payments with the same ZIP code count only once.
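The difference between counting rows and counting distinct values shows up clearly when both are selected together. The following sketch does that in SQLite with made-up rows, and passes the date as a bound parameter instead of a literal; the payments and distinct_zips aliases are names chosen here for readability.

```python
import sqlite3

# Hypothetical sample data for the payment table used above.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE payment (account TEXT, user_id TEXT, ZIP TEXT, date TEXT)")
conn.executemany(
    "INSERT INTO payment VALUES (?, ?, ?, ?)",
    [
        ("A1", "u1", "10001", "2023-01-01"),
        ("A1", "u1", "10001", "2023-01-01"),  # same ZIP twice
        ("A2", "u2", "20001", "2023-01-01"),
        ("A2", "u2", "20002", "2023-01-01"),  # second ZIP on the same date
        ("A2", "u2", "20003", "2023-01-02"),  # excluded by the date filter
    ],
)
target_date = ("2023-01-01",)

# Row count and distinct ZIP count side by side, before any HAVING filter.
counts_per_pair = conn.execute(
    """
    SELECT user_id, account, COUNT(*) AS payments, COUNT(DISTINCT ZIP) AS distinct_zips
    FROM payment
    WHERE date = ?
    GROUP BY user_id, account
    ORDER BY user_id
    """,
    target_date,
).fetchall()

# The article's query: keep only pairs with more than one distinct ZIP.
flagged_pairs = conn.execute(
    """
    SELECT user_id, account, COUNT(DISTINCT ZIP)
    FROM payment
    WHERE date = ?
    GROUP BY user_id, account
    HAVING COUNT(DISTINCT ZIP) > 1
    ORDER BY user_id
    """,
    target_date,
).fetchall()

print(counts_per_pair)
print(flagged_pairs)
```

Both pairs have two payments on the chosen date, but only u2 on A2 has two distinct ZIP codes, so only that pair is flagged.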

Further considerations: Efficiency & Performance

Indexing for speed

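Appropriate indexes can speed up GROUP BY operations considerably. An index on the grouped column, and on any columns used in join conditions, can let the engine read the values in sorted order, or from the index alone, instead of scanning and sorting the whole table.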

Skewed datasets

Data skew, where one value massively dominates the others, can hurt performance. Routinely reviewing and tuning your indexes helps keep queries fast as your data evolves.
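One simple way to check for skew is to look at how the rows are distributed over the values of the grouped column. The sketch below is an illustration under assumptions: it uses SQLite, made-up rows, and the account column of the payment table as the column of interest. What share counts as problematic depends on your data and engine.

```python
import sqlite3

# Hypothetical data: account A1 dominates the table.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE payment (account TEXT, user_id TEXT, ZIP TEXT, date TEXT)")
rows = [("A1", "u1", "10001", "2023-01-01")] * 8
rows += [("A2", "u2", "20001", "2023-01-01"), ("A3", "u3", "30001", "2023-01-01")]
conn.executemany("INSERT INTO payment VALUES (?, ?, ?, ?)", rows)

# Most frequent values with their share of all rows, in percent.
value_distribution = conn.execute(
    """
    SELECT account,
           COUNT(*) AS row_count,
           ROUND(100.0 * COUNT(*) / (SELECT COUNT(*) FROM payment), 1) AS percent
    FROM payment
    GROUP BY account
    ORDER BY row_count DESC, account
    LIMIT 5
    """
).fetchall()

for account, row_count, percent in value_distribution:
    print(account, row_count, percent)
```

A single value holding most of the rows, as A1 does here, is a hint that queries filtering or grouping on that column deserve a closer look.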

Inspect query execution

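Examine your query plan for potential bottlenecks. Most SQL engines provide an EXPLAIN statement (the exact syntax and output vary by engine) that shows how a query will be executed, so you can spot full table scans or expensive sorts before they become a problem.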
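To see whether an index actually changes the plan, compare the plan before and after creating it. The sketch below does this with SQLite's EXPLAIN QUERY PLAN, which is SQLite-specific; other engines use their own EXPLAIN variants. The index name idx_payment_account_user and the plan_details helper are hypothetical names introduced for this example, and the exact plan text depends on the SQLite version.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE payment (account TEXT, user_id TEXT, ZIP TEXT, date TEXT)")

QUERY = """
    SELECT account, user_id, COUNT(*)
    FROM payment
    GROUP BY account, user_id
    HAVING COUNT(*) > 1
"""


def plan_details(connection, sql):
    """Return the human-readable detail column of SQLite's EXPLAIN QUERY PLAN."""
    return [row[3] for row in connection.execute("EXPLAIN QUERY PLAN " + sql)]


# Plan without any index on the grouped columns.
plan_before = plan_details(conn, QUERY)

# Composite index matching the GROUP BY columns (hypothetical name).
conn.execute("CREATE INDEX idx_payment_account_user ON payment (account, user_id)")
plan_after = plan_details(conn, QUERY)

print("before:", plan_before)
print("after: ", plan_after)
```

If the index is picked up, its name appears in the second plan; if it does not, the index may not match the query's grouping or join columns.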
