Decode HTML Entities in Python String?

python
html-entities
data-integrity
html-parser
by Alex Kataev · Aug 5, 2024
TLDR

Converting HTML entities in Python? Use html.unescape() from the standard library's html module (Python 3.4+). This turns tricky entities like &amp; or &lt; back into familiar friends like & and <. Dig this snippet:

import html
decoded_str = html.unescape('Hello &amp; world!')
# Output: 'Hello & world!'

No extra libraries needed, just one precious line of Python code for clear-as-day HTML entity decoding.
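As a quick sanity check, here is a minimal sketch showing that html.unescape() treats named, decimal, and hexadecimal references alike. The sample strings and variable names are made up for illustration.

```python
import html

# Three spellings of the same character: named, decimal, and hexadecimal.
samples = ['&gt;', '&#62;', '&#x3e;']
decoded = [html.unescape(s) for s in samples]
print(decoded)  # ['>', '>', '>']

# Entities are decoded wherever they appear in a longer string.
sentence = html.unescape('1 &lt; 2 &amp;&amp; 3 &gt; 2')
print(sentence)  # 1 < 2 && 3 > 2
```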

What else can I do?

HTML entities - those strange symbols - have a knack for showing up when you least expect them. Sometimes they're simple, sometimes not so much. Let's dive into some alternative approaches depending on your Python environment and task complexity.

These are not the entities you're looking for (before Python 3.4)

With older, and dare I say wiser, Python versions - before 3.4 - HTMLParser.unescape() was the go-to guy. You may come across him when wrangling legacy code. The import below is the Python 2 spelling; on Python 3.0 to 3.3 the class lives in html.parser instead:

from HTMLParser import HTMLParser  # Python 2 only; on Python 3.0-3.3 use: from html.parser import HTMLParser
parser = HTMLParser()
decoded_str = parser.unescape('Hello &amp; world!')
# Or, as we say at Hogwarts, 'Hello & world!'

Alert: Hitchhiker's guide to the Python galaxy tells us that this method has been deprecated since Python 3.4 and was removed entirely in Python 3.9! Time to meet his younger, trendier cousin - html.unescape().

Fancy Soup, anyone?

If you're one of the cool kids using HTML parsing libraries like Beautiful Soup 4, it's got your back. Brewing a Soup object is like a magic wand gesture for entities:

from bs4 import BeautifulSoup
soup = BeautifulSoup('Hello &amp; world!', 'html.parser')
decoded_str = soup.text
# So shiny! 'Hello & world!'

Third-party power-up

Fancy a trip off the beaten track? w3lib.html's replace_entities is the indie artist of HTML entity decoding. This exotic tool processes even the stringiest of strings:

from w3lib.html import replace_entities
decoded_str = replace_entities('Hello &amp; world!')
# Like having your own HTML-English dictionary. 'Hello & world!'

Say my name, say my Unicode name

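Sometimes entities decode to look-alike Unicode characters rather than the plain ones you expected, such as the non-breaking space behind &nbsp;. unicodedata does not decode entities by itself, so run html.unescape() first, then use unicodedata.normalize() to fold those compatibility characters into their everyday form:

import html
import unicodedata
decoded_str = html.unescape('Hello&nbsp;&amp;&nbsp;world!')  # 'Hello\xa0&\xa0world!'
unicode_str = unicodedata.normalize('NFKC', decoded_str)
# We're all just people, and entities are all just characters. 'Hello & world!'

But beware! Decoded text can contain non-ASCII characters, so writing it out with a narrow encoding such as ASCII can raise UnicodeEncodeError. Pick an encoding like UTF-8 explicitly when exporting decoded data, and never face the wrath of a messed-up web page.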
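To make the export hazard concrete, here is a minimal sketch with a made-up sample string: a strict ASCII encode of decoded text fails, while UTF-8 round-trips it intact.

```python
import html

decoded = html.unescape('Caf&eacute; costs &euro;5')

# ASCII cannot represent 'é' or '€', so a strict encode fails.
try:
    decoded.encode('ascii')
    ascii_ok = True
except UnicodeEncodeError:
    ascii_ok = False

# UTF-8 can represent every Unicode character, so this round-trips cleanly.
utf8_bytes = decoded.encode('utf-8')
restored = utf8_bytes.decode('utf-8')

print(ascii_ok)             # False
print(restored == decoded)  # True
```

The same idea applies to files: passing encoding='utf-8' to open() avoids depending on the platform's default encoding.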

Compatibility wizardry!

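Manage Python 2 and 3 with panache using six. This library is a compatibility lifesaver, and six.moves.html_parser points at the right HTMLParser module on either version. Since HTMLParser.unescape() is gone in Python 3.9+, prefer html.unescape() where it exists and fall back to six only on older interpreters:

try:
    from html import unescape  # Python 3.4+
except ImportError:
    from six.moves import html_parser  # Python 2 and Python 3.0-3.3
    unescape = html_parser.HTMLParser().unescape
decoded_str = unescape('Hello &amp; world!')  # Just like mom's home-cooked Python. 'Hello & world!'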

Mean, mean entities

HTML entities aren't always easy and simple. They can bring numerical codes, special symbols, or accented letters to your Python party. Let's not let these tough customers ruin the fun.

Number and symbol crunching

Entities can be numeric or named. Roll up your sleeves and take them head-on:

import html
decoded_num = html.unescape('&#123;')  # '{' - just uncovered a secret curly brace!
decoded_sym = html.unescape('&euro;')  # '€' - found some spare change!
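A few edge cases are worth knowing about. The sketch below uses made-up inputs (the name `bogus` is deliberately not a real entity) to show a hexadecimal reference, an unknown name, and a legacy name written without its semicolon.

```python
import html

# Hexadecimal spelling of the same curly brace as '&#123;'.
hex_brace = html.unescape('&#x7B;')

# A name that is not a known entity is passed through unchanged.
unknown = html.unescape('&bogus;')

# A few legacy names, such as 'amp', are accepted without the trailing semicolon.
no_semicolon = html.unescape('Tom &amp Jerry')

print(hex_brace)     # {
print(unknown)       # &bogus;
print(no_semicolon)  # Tom & Jerry
```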

Accented letter wrangling

Entities don't shy away from non-ASCII characters, like accents. Keep your text's integrity intact:

import html
decoded_accent = html.unescape('Caf&eacute;')  # 'Café' - coffee break, anyone?
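One integrity wrinkle with accents: &eacute; decodes to a single precomposed character, while text from other sources may spell the same letter as 'e' plus a combining accent. The following is a sketch, with an illustrative `typed` string assumed for comparison, of how NFC normalization makes the two compare equal.

```python
import html
import unicodedata

decoded = html.unescape('Caf&eacute;')  # precomposed 'é' (U+00E9)
typed = 'Cafe\u0301'                     # 'e' followed by COMBINING ACUTE ACCENT

# The strings look identical on screen but differ code point by code point.
same_raw = decoded == typed

# Normalizing both sides to NFC composes the accent, so they match.
same_nfc = unicodedata.normalize('NFC', decoded) == unicodedata.normalize('NFC', typed)

print(same_raw)  # False
print(same_nfc)  # True
```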

Data integrity police

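Keep your encoded vs. decoded data in check. Decode each string exactly once, and re-encode it with html.escape() before putting it back into HTML; decoding twice silently corrupts text that legitimately contains entity-like sequences. Avoid that unsolicited data corruption and win the round-tripping battle.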
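Here is a minimal sketch of round-tripping with the standard library's html.escape() and html.unescape(), plus what goes wrong when text is decoded one time too many. The sample sentence is invented for illustration.

```python
import html

original = 'Use &lt; for the < sign & friends'

# Escape once, unescape once: the text comes back unchanged.
encoded = html.escape(original)
round_tripped = html.unescape(encoded)

# Unescaping a second time also decodes the literal '&lt;' the author wrote.
double_decoded = html.unescape(round_tripped)

print(encoded)         # Use &amp;lt; for the &lt; sign &amp; friends
print(round_tripped)   # Use &lt; for the < sign & friends
print(double_decoded)  # Use < for the < sign & friends
```

Keeping track of whether a given string is currently "HTML-encoded" or "plain text" is what prevents the second, destructive decode.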