Unicode (UTF-8) reading and writing to files in Python

by Anton Shumikhin · Sep 10, 2024

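Dealing with UTF-8 in Python is easy as pie: use open and don't forget encoding='utf-8'. Leave it out and Python may fall back to a platform-dependent default encoding, which isn't always UTF-8.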

Read file:

# Grabbing that sweet, sweet text
with open('file.txt', encoding='utf-8') as f:
    text = f.read()  # Ultimate ledger of knowledge acquired!

Write file:

# Time to pour our soul into the file
with open('file.txt', 'w', encoding='utf-8') as f:
    f.write('Unicode текст')  # Hammering down some Unicode

Remember, 'utf-8' is your best friend when dealing with Unicode characters in files.
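To see both halves working together, here is a minimal round-trip sketch. The temporary directory and the file name sample.txt are assumptions made up for the demo, not anything your project needs:

```python
import os
import tempfile

# Hypothetical demo file inside a throwaway directory.
with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, 'sample.txt')

    # Write text that contains non-ASCII characters.
    with open(path, 'w', encoding='utf-8') as f:
        f.write('Unicode текст')

    # Read it back using the same encoding.
    with open(path, encoding='utf-8') as f:
        round_trip = f.read()

# What came out should match what went in.
print(round_trip == 'Unicode текст')
```

Using the same encoding on both sides is the whole trick: write as UTF-8, read as UTF-8.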

Python 2 and 3: Dealing with differences

Using Python 2? io.open is your lifesaver

Python 2's built-in open() has no encoding parameter. Fear not, io.open (available since Python 2.6) is here to rescue:

# Importing our knight in shining armor
import io
with io.open('file.txt', encoding='utf-8') as f:
    text = f.read()  # Ah! fantastic, we're not dinosaurs anymore!
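If the same script has to run on both Python 2 and Python 3, io.open can be used everywhere. Below is a small sketch under that assumption; the file name legacy.txt and the temporary directory are hypothetical:

```python
# -*- coding: utf-8 -*-
import io
import os
import tempfile

# Hypothetical demo file; io.open accepts encoding= on Python 2.6+ and 3.x.
tmp_dir = tempfile.mkdtemp()
path = os.path.join(tmp_dir, 'legacy.txt')

with io.open(path, 'w', encoding='utf-8') as f:
    # The u prefix keeps this a Unicode string on Python 2 as well.
    f.write(u'Unicode текст')

with io.open(path, encoding='utf-8') as f:
    legacy_text = f.read()

# Clean up the demo file and directory.
os.remove(path)
os.rmdir(tmp_dir)
```

On Python 3, io.open is the very same function as the built-in open, so this costs nothing there.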

Python 3 encoding: Simple and elegant

Python 3 saw encoding and said, "I got this, fam!" The built-in open accepts encoding directly:

# Slice of life for a Python 3 user
with open('file.txt', encoding='utf-8') as f:
    text = f.read()  # Just another day in paradise.

codecs: An old friend in need

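codecs.open is the older alternative to io.open and still turns up in legacy code. In Python 3 the built-in open is generally the better choice, but codecs can still do wonders for you: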

import codecs
with codecs.open('file.txt', 'r', 'utf-8') as f:
    lines = f.readlines()  # A list of lines. Reading lines, not between them!

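One word of caution: codecs.open always opens the underlying file in binary mode, so line endings are not translated, and mixing read() and readline() on the same stream can brew a chaotic concoction because of the reader's internal buffering.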
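One codecs.open quirk that is easy to verify yourself is newline handling. The sketch below writes Windows-style line endings and reads them back both ways; the file name crlf.txt is a made-up example, and the warning filter is only there because newer Python releases may flag codecs.open as deprecated:

```python
import codecs
import os
import tempfile
import warnings

tmp_dir = tempfile.mkdtemp()
path = os.path.join(tmp_dir, 'crlf.txt')  # hypothetical demo file

# Write Windows-style line endings as raw bytes.
with open(path, 'wb') as f:
    f.write(b'one\r\ntwo\r\n')

with warnings.catch_warnings():
    # Silence a possible DeprecationWarning on recent Python versions.
    warnings.simplefilter('ignore', DeprecationWarning)
    with codecs.open(path, 'r', 'utf-8') as f:
        via_codecs = f.read()  # '\r\n' is kept as-is

with open(path, encoding='utf-8') as f:
    via_open = f.read()  # text mode translates '\r\n' to '\n'

# Clean up the demo file and directory.
os.remove(path)
os.rmdir(tmp_dir)
```

If your code splits on '\n' or compares lines, that leftover '\r' is exactly the kind of surprise worth knowing about.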

Encounter of the encoding kind

Handling errors: An art

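The errors parameter in open might just save your day when a file contains bytes that are not valid in the chosen encoding. The default, 'strict', raises UnicodeDecodeError; 'replace' swaps each bad sequence for the U+FFFD replacement character; 'ignore' silently drops it: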

with open('file.txt', 'r', encoding='utf-8', errors='replace') as f:
    text = f.read()  # I see no errors, only "features"
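To make the difference between the handlers concrete, here is a small sketch that reads the same deliberately broken file three ways. The file name broken.txt and its contents are assumptions for the demo:

```python
import os
import tempfile

tmp_dir = tempfile.mkdtemp()
path = os.path.join(tmp_dir, 'broken.txt')  # hypothetical demo file

# b'caf\xe9' is "café" in Latin-1, which is not valid UTF-8.
with open(path, 'wb') as f:
    f.write(b'caf\xe9\n')

# 'replace' substitutes U+FFFD for the undecodable byte.
with open(path, encoding='utf-8', errors='replace') as f:
    replaced = f.read()

# 'ignore' drops the undecodable byte entirely.
with open(path, encoding='utf-8', errors='ignore') as f:
    ignored = f.read()

# 'strict' is the default and raises on the first bad byte.
try:
    with open(path, encoding='utf-8') as f:
        f.read()
    strict_failed = False
except UnicodeDecodeError:
    strict_failed = True

# Clean up the demo file and directory.
os.remove(path)
os.rmdir(tmp_dir)
```

Both 'replace' and 'ignore' lose information, so they suit logs and quick inspection better than data you plan to write back.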

When bytes bite

If you are not sure of a file's encoding, or you want to inspect the raw data first, open the file in binary mode and decode the bytes yourself:

with open('file.bin', 'rb') as f:
    bytes_content = f.read()
text = bytes_content.decode('utf-8')  # Back to strings, because bytes can't bite us here!
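Once you hold the raw bytes, you can also try more than one encoding. The helper below, decode_with_fallback, is a hypothetical name invented for this sketch, and the choice of Latin-1 as the fallback is an assumption rather than a recommendation:

```python
def decode_with_fallback(raw, encodings=('utf-8', 'latin-1')):
    """Hypothetical helper: return raw decoded with the first encoding that fits."""
    for name in encodings:
        try:
            return raw.decode(name)
        except UnicodeDecodeError:
            continue  # try the next candidate
    raise ValueError('none of the candidate encodings matched')


utf8_bytes = 'текст'.encode('utf-8')     # valid UTF-8
latin1_bytes = 'café'.encode('latin-1')  # b'caf\xe9', invalid as UTF-8

from_utf8 = decode_with_fallback(utf8_bytes)
from_latin1 = decode_with_fallback(latin1_bytes)
```

Latin-1 maps every possible byte to a character, so it never fails. That makes it a convenient last resort, but also means it will happily produce nonsense if the real encoding was something else.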

Python way of escape

It's all about escape

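Sometimes text arrives with literal escape sequences in it, such as a backslash followed by n instead of a real newline. Turning those into the characters they stand for looks like a magic trick, but it's a one-liner: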

In Python 2.x:

# Making newlines newline again! (string_escape works on byte strings)
print 'Hello\\nWorld'.decode('string_escape')

In Python 3.x:

# Transforming '\\n' to '\n', because we can!
print('Hello\\nWorld'.encode().decode('unicode_escape'))
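The Python 3 one-liner is fine for plain ASCII, but unicode_escape treats every non-escape byte as Latin-1, so UTF-8 encoded characters get mangled. The sketch below shows the problem and one possible workaround; the variable names are illustrative:

```python
# A literal backslash followed by n, ASCII only: works as expected.
raw = 'Hello\\nWorld'
ascii_result = raw.encode().decode('unicode_escape')

# Non-ASCII input: the two UTF-8 bytes of 'é' are read as two Latin-1 characters.
mangled = 'é\\n'.encode('utf-8').decode('unicode_escape')

# Workaround sketch: encode to Latin-1 and let backslashreplace turn anything
# outside Latin-1 into \uXXXX escapes, which unicode_escape then restores.
safe = 'текст\\n'.encode('latin-1', 'backslashreplace').decode('unicode_escape')
```

Treat the workaround as a sketch rather than a universal fix: input that already mixes real backslashes with escape-like text may still need more careful handling.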

The Great Unicode Divide

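Knowledge is key: Understanding Unicode handling in Python 2.x and 3.x can save a day's worth of headache. In Python 2, a plain string literal is a byte string and you need the u prefix (u'text') to get a Unicode string. But in Python 3, every str is a beautiful unicorn... I mean, Unicode, and raw bytes live in the separate bytes type.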
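In Python 3 terms, the divide comes down to str versus bytes. A quick sketch, with variable names chosen only for illustration:

```python
text = 'текст'               # str: a sequence of Unicode characters
data = text.encode('utf-8')  # bytes: the UTF-8 representation of that text

char_count = len(text)   # counts characters
byte_count = len(data)   # counts bytes; each Cyrillic letter takes two in UTF-8

decoded = data.decode('utf-8')  # back from bytes to str
```

Here char_count is 5 and byte_count is 10, which is why mixing up the two types leads to off-by-a-lot bugs.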