Write to UTF-8 file in Python

by Nikita Barsukov · Dec 22, 2024

TLDR

Use open with 'w' and encoding='utf-8':

with open('file.txt', 'w', encoding='utf-8') as f:
    f.write('Some text with special characters: é, ñ, å')

The code above writes file.txt as UTF-8 without a byte order mark (BOM). Need a BOM? That's where 'utf-8-sig' comes to the rescue.

UTF-8 with BOM: friend not foe

To create a UTF-8 file complete with a BOM, use 'utf-8-sig':

with open('file_with_bom.txt', 'w', encoding='utf-8-sig') as f:
    f.write('BOMs away!')  # like bombs away, get it? 😉

'utf-8-sig' quietly prepends the BOM when writing and strips it again when reading, no manual labor necessary.
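
To see what 'utf-8-sig' actually puts on disk, a quick check along these lines can help. This is only a sketch: the temporary directory and the variable names are assumptions added for illustration, so the example cleans up after itself.

```python
import codecs
import tempfile
from pathlib import Path

# Work in a throwaway directory so nothing is left behind.
with tempfile.TemporaryDirectory() as tmp:
    path = Path(tmp) / 'file_with_bom.txt'
    with open(path, 'w', encoding='utf-8-sig') as f:
        f.write('BOMs away!')

    raw = path.read_bytes()                        # bytes exactly as stored
    as_sig = path.read_text(encoding='utf-8-sig')  # BOM is stripped on read
    as_plain = path.read_text(encoding='utf-8')    # BOM survives as U+FEFF

print(raw.startswith(codecs.BOM_UTF8))  # does the file start with EF BB BF?
print(repr(as_sig))
print(repr(as_plain))
```

Reading the file back with plain 'utf-8' leaves a stray '\ufeff' at the start of the string, which is worth remembering when another program consumes the file.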

Why stop at the basics?

Inspecting file encoding

Python's standard library has no general-purpose encoding detector, but on Unix-like systems you can shell out to the file utility in a pinch:

import subprocess

# Relies on the Unix `file` command being available on PATH
result = subprocess.run(
    ['file', '-b', '--mime-encoding', 'file.txt'],
    stdout=subprocess.PIPE,
)
encoding = result.stdout.decode('utf-8').strip()
print(f'File encoding detected: {encoding}... like a pro!')

External commands: Where there's a subprocess, there's a way.
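
If an external command is not an option, a rough guess is possible in pure Python by looking for a BOM and then trying a strict UTF-8 decode. The helper below is hypothetical (guess_encoding and _BOMS are names made up for this sketch) and is a heuristic, not a real detector: anything that is neither BOM-marked nor valid UTF-8 is simply reported as 'unknown'.

```python
import codecs

# BOM signatures, UTF-32 first: its little-endian BOM starts with the
# same two bytes as the UTF-16 little-endian BOM.
_BOMS = [
    (codecs.BOM_UTF32_LE, 'utf-32'),
    (codecs.BOM_UTF32_BE, 'utf-32'),
    (codecs.BOM_UTF8, 'utf-8-sig'),
    (codecs.BOM_UTF16_LE, 'utf-16'),
    (codecs.BOM_UTF16_BE, 'utf-16'),
]


def guess_encoding(data: bytes) -> str:
    """Hypothetical helper: guess a codec name from raw bytes."""
    for bom, name in _BOMS:
        if data.startswith(bom):
            return name
    try:
        data.decode('utf-8')  # strict decode: raises on invalid sequences
        return 'utf-8'
    except UnicodeDecodeError:
        return 'unknown'


print(guess_encoding('é'.encode('utf-8')))
print(guess_encoding('é'.encode('utf-8-sig')))
print(guess_encoding('é'.encode('latin-1')))
```

Pure ASCII data is reported as 'utf-8' here, which is harmless because ASCII is a subset of UTF-8.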

Unicode in disguise

To add the BOM manually, go for:

BOM_UNICODE = u'\ufeff'

Or, better yet, you can summon it by name:

import unicodedata

BOM_UNICODE = unicodedata.lookup('ZERO WIDTH NO-BREAK SPACE')  # BOM's secret identity 🦸
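
As a sanity check, the manual route and 'utf-8-sig' should produce identical bytes. A minimal sketch, assuming you prepend BOM_UNICODE to the text yourself and encode as plain UTF-8 (the names manual and automatic are illustrative):

```python
import unicodedata

BOM_UNICODE = unicodedata.lookup('ZERO WIDTH NO-BREAK SPACE')  # same as '\ufeff'
text = 'BOMs away!'

manual = (BOM_UNICODE + text).encode('utf-8')  # BOM added by hand
automatic = text.encode('utf-8-sig')           # BOM added by the codec

print(manual == automatic)
print(manual[:3])  # the three BOM bytes at the start
```

Only prepend the character once, at the very start of the file; anywhere else U+FEFF is just an invisible character in your text.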

Declaring script encoding: because manners matter

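Python 3 already reads source files as UTF-8, so this is only needed on Python 2 or when the script itself is saved in another encoding. If you do need it, put the declaration on the first or second line. Note that it controls how the source file is decoded, not how the files you open are written: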

# -*- coding: utf-8 -*-

No non-ASCII character left behind!

Remember to clean up

The with statement closes files for you, but if you call codecs.open (or plain open) without it, remember to close():

import codecs

f = codecs.open('file.txt', 'w', 'utf-8')
try:
    f.write('Please close the door when you leave!')
finally:
    f.close()  # because manners matter 👍

Your OS will thank you for not leaving file descriptors hanging.

Venturing into special cases

Keeping it simple: UTF-8 without BOM

A BOM can trip up tools that don't expect it: a shebang line stops working, and strict JSON parsers may reject the file. Keep it simple with encoding='utf-8':

with open('file_no_bom.txt', 'w', encoding='utf-8') as f:
    f.write('Simply UTF-8, no BOMs allowed.')  # Bombs? I meant BOMs! 😅

UnicodeDecodeError: Not on my watch

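A UnicodeDecodeError shows up when bytes are decoded with the wrong codec, typically when a file written in one encoding is read as another, or when bytes and strings get mixed up. Decode input with the encoding it was actually written in, then write it out as UTF-8.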
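
Here is a small illustration of the mismatch, as a sketch rather than a recipe. It assumes the data came from a Latin-1 source; the sample text and variable names are made up for the example.

```python
# Bytes as a Latin-1 program would have written them (assumed source).
data = 'é, ñ, å'.encode('latin-1')

reason = None
try:
    data.decode('utf-8')  # wrong codec: these bytes are not valid UTF-8
except UnicodeDecodeError as exc:
    reason = exc.reason
    print(f'decode failed: {reason}')

recovered = data.decode('latin-1')              # right codec: text intact
lossy = data.decode('utf-8', errors='replace')  # no exception, but data lost

print(recovered)
print(lossy)
```

errors='replace' swaps every undecodable byte for U+FFFD, so it silences the exception at the cost of losing the original characters. Decoding with the correct codec is the real fix.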

Writing exotic characters: Tame the beast

Python isn't fazed by unusual characters. For the truly exotic, use Unicode escape sequences or named characters:

with open('unicorns.txt', 'w', encoding='utf-8') as f:
    f.write('🦄 can be written as \U0001F984 or \N{UNICORN FACE}')  # Unicorn trivia in code, why not? 🌈
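
All three spellings end up as the same character, so the file above contains three unicorns. A quick way to convince yourself (the variable names are illustrative only):

```python
import unicodedata

literal = '🦄'                # typed directly into the source
by_code_point = '\U0001F984'  # 8-digit escape for code points above U+FFFF
by_name = '\N{UNICORN FACE}'  # lookup by official Unicode name

print(literal == by_code_point == by_name)
print(unicodedata.name(literal))
print(literal.encode('utf-8'))  # four bytes in UTF-8
```

The \N{...} form depends on the Unicode database bundled with your Python version, so very new characters may not resolve on an old interpreter.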