Explain Codes LogoExplain Codes Logo

How to remove duplicates, which are generated with array_agg postgres function

sql
array_agg
distinct
postgresql
Nikita BarsukovbyNikita Barsukov·Sep 20, 2024
TLDR

Keep duplicates at bay in array_agg by flexing your DISTINCT muscle with a trusty subquery. Here is a simple way to disarm replication:

SELECT id, ARRAY_AGG(DISTINCT value) FROM your_table GROUP BY id;

With DISTINCT tucked inside ARRAY_AGG, you get to enjoy a per id assembly of unique elements. No more, no less.

Need-to-know solutions for different scenarios

Merging columns before aggregation

When your task at hand involves aggregating composite values, you will need to befriend a subquery and team it up with DISTINCT:

-- Because why should names have all the fun? Let's party, full names! SELECT company_id, ARRAY_AGG(DISTINCT full_name) AS distinct_full_names FROM ( SELECT company_id, first_name || ' ' || last_name AS full_name FROM employees ) subquery GROUP BY company_id;

This ensures that only the unique names show up at the party...I mean the aggregated array, avoiding gatecrashers that answer to the same multiple columns.

Adroit use of string functions

With ARRAY_TO_STRING in your PostgreSQL toolbox, you can morph an array into a string, neatly segmented by a given separator. When used on a distinct aggregated array, it whips up a tidy, comma-delineated listing:

-- Who doesn't love a clean string spelling out names with quaint commas? SELECT company_id, ARRAY_TO_STRING(ARRAY_AGG(DISTINCT full_name), ', ') AS distinct_names FROM employees GROUP BY company_id;

It dishes out a string of unique names, hitched together by commas for every company_id.

Staying vigilant during table joins

When your task involves table joins, remember the old wisdom that good fences make good neighbors. In a similar vein, a well-defined join can curb duplicates. Always double-check your joins and choose appropriate columns to group by:

-- IYKYK, the party's in group BY SELECT e.company_id, ARRAY_AGG(DISTINCT e.full_name) FROM employees e JOIN departments d ON e.department_id = d.id GROUP BY e.company_id;

This ensures you are not accumulating duplicates unwittingly planted by the join operation.

Taming wild data with WHERE clauses

Data is like a wild mustang that roams freely in canyons. Sometimes, the duplication issue is a result of failing to rein in your data using WHERE.

-- Show me only the "ACTIVE" ones, pretty please? SELECT company_id, ARRAY_AGG(DISTINCT full_name) FROM employees WHERE active = true GROUP BY company_id;

Adding WHERE conditions (e.g., active = true) ensures you are only saddling up relevant records.

Snags you may encounter and how to dodge them

Overlooking join conditions

Neglecting the fields used in joins can yield duplicates. An ill-suited or unconstrained field can stealthily sneak duplicates into your aggregated array. Give extra-care to your join conditions and always group by unique values.

Missing out the DISTINCT clause

Leaving out DISTINCT while using array_agg is a common slip that creates repeats. Make DISTINCT your constant companion in your quest for uniquness within arrays.

Misuse of the ARRAY_TO_STRING function

Although ARRAY_TO_STRING can be a handy tool, an unnecessary diversion to it can lead to poorer performing queries. Keep it in the back-burner until the need for a string representation of your array arises.