Explain Codes LogoExplain Codes Logo

Does PostgreSQL support "accent insensitive" collations?

sql
unaccent
full-text-search
collations
Nikita BarsukovbyNikita Barsukov·Aug 14, 2024
TLDR

PostgreSQL rolled out the accent-insensitive search feature from version 12, harnessing the power of the ICU extension. Implement it in your query by defining a customized collation:

CREATE COLLATION ci_ai (provider = icu, locale = 'en-US-x-icu', deterministic = false); SELECT * FROM tbl WHERE col COLLATE ci_ai = 'cafe';

This command creates a new accent-insensitive collation ci_ai, which it then deploys to spot rows in the tbl where col has a match for 'cafe', accents completely disregarded.

Query optimization: unaccent and beyond

Unaccent module: Your sword against accents

Boost your query's flexibility and power with PostgreSQL's unaccent module. Perfect for dealing with user-supplied content with inconsistent application of accents:

CREATE EXTENSION IF NOT EXISTS unaccent; SELECT * FROM tbl WHERE unaccent(col) = unaccent('Café'); -- "No Accents Allowed" policy in effect!

Immutable wrapper: Speedy and safe queries

We can play smarter. For a quicker, more effective search, ensconce the unaccent function inside an immutable SQL function. This lets you neatly use it in indexed expressions:

CREATE OR REPLACE FUNCTION ia_unaccent(text) RETURNS text AS $$ SELECT unaccent('unaccent', $1) $$ IMMUTABLE LANGUAGE sql; CREATE INDEX idx_col_unaccent ON tbl USING btree (ia_unaccent(col)); -- Because who doesn't like a brisk query?

Advanced pattern-matching: Unaccent and pg_trgm, the ultimate tag team

Jointly use unaccent with pg_trgm for improved LIKE/ILIKE pattern-matching:

CREATE INDEX trgm_idx ON tbl USING gin (col gin_trgm_ops); SELECT * FROM tbl WHERE unaccent(col) LIKE unaccent('%cafe%'); -- Playing hide and seek with the data

Deploy trigram indexes and experience a radical boost in the performance of left-anchored pattern searches:

SELECT * FROM tbl WHERE unaccent(col) LIKE unaccent('Café%'); -- Hunting specific patterns, marksman style!

Full Text Searches and nuances

Leveraging Full-Text-Search (FTS)

FTS dictionaries can be tailored to strip accents, vastly improving search results but without changing the stored data:

ALTER TEXT SEARCH CONFIGURATION public.english ADD MAPPING FOR hword, hword_part, word WITH english_stem, unaccent;

Subsequent indexing using FTS dictionaries yields fast, accent-insensitive results, though simpler scenarios might prompt directly using unaccent.

Dealing with customized collations

PostgreSQL's collations run on your operating system's locale settings. Crafting a customized collation for accent insensitivity, though possible, is a complex affair. Rely more on functions and extensions such as those discussed above.

Noteworthy pointers and workarounds

Handling accents: Special cases

With respect to ligatures such as 'æ' and 'œ', the roll out from PostgreSQL 9.6 ensures these are accurately depicted in your data, enhancing accuracy and user-friendliness.

SELECT * FROM words WHERE word COLLATE "und-x-icu" = 'œuvre'; -- We don't discriminate ligatures here!

Collations and their limits

While collations influence sorting, they don't interfere with equality checks in where clauses. For comparisons ignoring accents, unaccent or similar functions or expressions are needed.

SELECT * FROM phrases WHERE unaccent(expression) = unaccent('Résumé'); -- Because résumé and resume are the same, right?

Crafting a powerful system

Creating a robust PostgreSQL system supporting accent-insensitive features demands a well-rounded approach. This involves balancing performance, maintaining data integrity, and ensuring an intuitive user experience.