Explain Codes LogoExplain Codes Logo

How to change the CHARACTER SET (and COLLATION) throughout a database?

sql
database-migration
character-set
collation
Nikita BarsukovbyNikita Barsukov·Sep 24, 2024
TLDR

To promptly convert your entire database to a new CHARACTER SET and COLLATION, here's a handy script using ALTER TABLE commands. This one changes everything to utf8mb4 and utf8mb4_unicode_ci:

SELECT CONCAT('ALTER TABLE `', table_schema, '`.`', table_name, '` CONVERT TO CHARACTER SET utf8mb4 COLLATE utf8mb4_unicode_ci;') FROM information_schema.TABLES WHERE table_schema = 'your_db_name';

Output of the command above needs to be executed on your SQL client. Don't forget to replace 'your_db_name' with the name of your database. Like every film hero would say, "Safety first". Backup your database before diving into the process.

Primer on CHARACTER SET and COLLATION

So you're getting cozy with utf8mb4 and utf8mb4_0900_ai_ci. The utf8mb4 CHARACTER SET takes UTF-8 to its fullest potential, even including emojis and a wide array of Asian languages. It's effectively a power-up, a Super Mario Mushroom if you will, for the older utf8 set, capped at a three-byte Unicode subset.

With respect to COLLATION, utf8mb4_0900_ai_ci implements accent-insensitive and case-insensitive sorting. It's the referee that decides sorting and comparison rules for your text data, and varies from its utf8_general_ci and utf8_unicode_ci precursors.

And like ordering a quadruple cheeseburger, getting more means consuming more - data size requirements could increase with utf8mb4. Don't forget to reevaluate and adjust your table column lengths before proceeding.

Suit Up for the Migration

Batch tackling tables and columns

An ensemble of ALTER TABLE MODIFY commands is needed for each column that'll change. Flex your scripting skills with Perl, or a batch SQL script, to create these commands. And here's a pro tip: bump up the group_concat_max_len session variable to handle lengthy ALTER statements, kind of like loosening your belt before Thanksgiving dinner.

System schemas and default values

Remember the 'sys' and 'mysql' schemas? The lifeblood of MySQL functionalities, they're best left unaltered in this process. And while you're at it, make sure default values are properly escaped to avoid accidental changes, kind of like making sure your coffee doesn't spill on your laptop.

Enum and set data types

Say hello to our special guests: enum and set data types. They tend to demand explicit redefinition during the collation change, like a celebrity with a lengthy rider.

Exclude views, prevent corruption

When changing tables, views must be excluded. Think of them as mirrors reflecting the underlying tables, altering them directly might just shatter the reflection. Once the tables are updated, the changes ripple to the views too.

Before and after the collation changes, ensure you maintain the purity of your data. Consider storing a representation of your data in memory or someplace safe for detailed comparison post-migration. It's like a before and after photo in a weight loss ad.

From Latin1 to UTF-8

If you're moving from latin1 to utf8mb4, brace yourself for a significant character support upgrade. Though not foolproof, there's potential for data loss. As always, secure a backup before proceeding.