
To ignore duplicate keys during 'copy from' in PostgreSQL

Tags: sql, data-integrity, database-design, postgresql
By Alex Kataev · Dec 30, 2024
⚡ TLDR

Conquer duplicates on COPY FROM in PostgreSQL by first moving data into a temporary table, then selectively inserting it into your target table using ON CONFLICT DO NOTHING:

CREATE TEMP TABLE temp_no_dupes (LIKE actual_table); -- because creating things feels cool! 😎
COPY temp_no_dupes FROM 'source.csv' CSV; -- just like copy-pasting your homework, but getting no penalty 👀
INSERT INTO actual_table
SELECT * FROM temp_no_dupes
ON CONFLICT DO NOTHING; -- it's like dodging country-side traffic ⚠️

This approach makes sure only unique records find their way into your final table, skillfully bypassing any key turbulence.

Early bird catches the worm: Remove duplicates before inserting

Before you launch the assault on the main table, you're better off cleaning duplicates out of your data. Postgres has no built-in 'IGNORE' or 'REPLACE' for COPY, but we can stage a convincing act.

CREATE TEMP TABLE temp_table AS SELECT * FROM main_table WITH NO DATA; -- Isn't it cool we shape-shifted one table into another? 😏
COPY temp_table FROM 'file.csv' CSV; -- Like moving files from one folder to another
DELETE FROM temp_table
USING (
    SELECT MIN(ctid) AS ctid, PK_field
    FROM temp_table
    GROUP BY PK_field
    HAVING COUNT(*) > 1
) sub
WHERE temp_table.PK_field = sub.PK_field
  AND temp_table.ctid != sub.ctid; -- You can't clone yourself. Why let your records do so, huh?
INSERT INTO main_table
SELECT * FROM temp_table
ON CONFLICT (PK_field) DO NOTHING; -- Dodge conflict like a pro again! ⚡️

All problems solved: Upsert to the rescue

In those trickier scenarios where you need to update existing records rather than merely jump over duplicates, make the ON CONFLICT clause your best friend for an upsert operation:

INSERT INTO main_table
SELECT * FROM temp_no_dupes
ON CONFLICT (PK_field) DO UPDATE
SET column_1 = EXCLUDED.column_1,
    column_2 = EXCLUDED.column_2; -- "Mirror mirror on the wall, am I the most updated of them all?" 🪞✨

This strategy basically whispers: "If you run into a duel on the primary key, settle it by updating the existing record with the new values from the temp table." One caveat: the temp table itself must not contain the same key twice, or PostgreSQL aborts with "ON CONFLICT DO UPDATE command cannot affect row a second time" — the SELECT DISTINCT ON trick below takes care of that.

Your choice matters: How to decide what to do with duplicates

To control which duplicate row wins, order your SELECT DISTINCT ON deliberately:

INSERT INTO main_table
SELECT DISTINCT ON (PK_field) *
FROM temp_no_dupes
ORDER BY PK_field, specific_column DESC;

Postgres will thoughtfully keep the row with the highest value in specific_column for any duplicate PK_field.
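
To see which row DISTINCT ON actually keeps, here's a minimal, self-contained sketch (the demo table and its values are made up for illustration):

CREATE TEMP TABLE demo (pk int, val text, updated_at date);
INSERT INTO demo VALUES
    (1, 'old',  '2024-01-01'),
    (1, 'new',  '2024-06-01'),
    (2, 'solo', '2024-03-15');
SELECT DISTINCT ON (pk) *
FROM demo
ORDER BY pk, updated_at DESC; -- keeps (1, 'new', 2024-06-01) and (2, 'solo', 2024-03-15)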

Preventing duplicates: Better safe than sorry

Avoid future hair-pulling by designing your import process and table constraints to nip duplication in the bud (a quick sketch follows the list):

  • Making smart use of unique constraints and indexes.
  • Keeping your data source clean and verified on a regular basis.
  • Opting for incremental loads armed with timestamps or log sequences when applicable.
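
As a rough sketch of the first and third bullets (the constraint name, staging_table, and the imported_at column are all hypothetical):

-- A unique constraint lets the database itself reject duplicates
ALTER TABLE main_table ADD CONSTRAINT main_table_pk_uniq UNIQUE (PK_field);

-- Incremental load: only pull rows newer than anything already imported
INSERT INTO main_table
SELECT * FROM staging_table
WHERE imported_at > (SELECT COALESCE(MAX(imported_at), '-infinity') FROM main_table)
ON CONFLICT (PK_field) DO NOTHING;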

Expert tips: Performing under different scenarios

Different scenarios call for different battle plans for dealing with duplicates. Here are some tips to ensure data safety and smooth operation:

  • Decide on your tactics: Devise specific indexes to accelerate conflict detection and resolution.
  • Be vigilant about the source file: Clean it before the COPY to nip unnecessary workload in the bud.
  • Ready your weapons: Write data cleaning queries to scout for potential issues before the import.
  • Mind the battlefield conditions: Use transactions (BEGIN/COMMIT) to maintain the integrity of your operations, and implement batch processing for ultra-large datasets (see the sketch after this list).
  • Carry out a victory check: Run queries post-import to verify row counts and, where it matters, checksums. A simple SELECT COUNT(*) goes a long way.
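
To make the transaction and victory-check tips concrete, here's a hedged end-to-end sketch (the file path and table names are placeholders; use psql's \copy instead of COPY if the CSV lives on the client machine):

BEGIN;
CREATE TEMP TABLE temp_no_dupes (LIKE main_table); -- plain LIKE copies no unique indexes, so dupes may land here safely
COPY temp_no_dupes FROM '/path/to/source.csv' CSV;
INSERT INTO main_table
SELECT * FROM temp_no_dupes
ON CONFLICT (PK_field) DO NOTHING;
COMMIT;

-- Victory check: compare the row count against what the source promised
SELECT COUNT(*) FROM main_table;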