How do I unescape HTML entities in a string in Python 3.1?

Tags: python, unescape, html-parser, regex
by Nikita Barsukov · Feb 13, 2025
TLDR

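Use the unescape() function from Python's html module. It turns HTML entities back into the characters they represent. Note that html.unescape() was added in Python 3.4, so on Python 3.1 itself you need the approach in the "Legacy extension" section: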

    # Python says: "Hold my beer, got this!"
    from html import unescape

    print(unescape('This &amp; that'))  # prints: This & that
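As a rough sketch of what that call covers: html.unescape() decodes numeric character references (decimal and hexadecimal) as well as named entities. The sample strings below are made up for illustration, and the snippet assumes Python 3.4 or newer:

```python
from html import unescape

# Illustrative inputs (hypothetical, not from the original article)
samples = {
    "named": "&lt;b&gt;bold&lt;/b&gt;",
    "decimal": "&#60;b&#62;",
    "hex": "&#x3C;b&#x3E;",
}

# Decode each sample and keep the results for inspection
decoded = {kind: unescape(text) for kind, text in samples.items()}

for kind, text in decoded.items():
    print(kind, "->", text)
```

All three spellings of the angle brackets should come back as the same literal characters.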

Legacy extension

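If you're still partying in the world of Python 3.1, the HTMLParser class from the html.parser module has your back. Its unescape() method was deprecated in Python 3.4 and removed in Python 3.9, so reach for it only on old interpreters: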

    # Good ol' Python 3.1 still has some tricks up its sleeve!
    from html.parser import HTMLParser

    parser = HTMLParser()
    print(parser.unescape('This &amp; that'))  # prints: This & that

The HTMLParser().unescape() method converts the entities in your strings, from the ever-present &amp; to others such as &quot;.
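If the same code has to run on both old and new interpreters, one possible approach is an import fallback. This is only a sketch: it assumes that a failed import of html.unescape means an interpreter older than 3.4, where HTMLParser().unescape() is still present.

```python
try:
    # Python 3.4+: the documented function
    from html import unescape
except ImportError:
    # Older Python 3 (e.g. 3.1): fall back to the parser method
    from html.parser import HTMLParser
    unescape = HTMLParser().unescape

# Illustrative input (hypothetical)
result = unescape('Fish &amp; chips &quot;to go&quot;')
print(result)
```

The rest of the program can then call unescape() without caring which branch supplied it.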

Alternative and helpful methods

Legend of xml.sax.saxutils

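There's a lesser-known hero too: the xml.sax.saxutils module, which has its own unescape(). It is less powerful, though. By default it only handles &amp;, &lt;, and &gt;; any other entity must be supplied through its optional entities dictionary: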

    from xml.sax.saxutils import unescape

    print(unescape('This &amp; that'))  # prints: This & that

For those who prefer to stay within the standard library, this is another built-in option, best suited to XML-style text that uses only those three entities.
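To make the difference concrete, here is a small sketch of saxutils.unescape() with and without the optional entities dictionary. The input string and the extra mappings are illustrative choices, not anything the module ships with:

```python
from xml.sax.saxutils import unescape

# Illustrative input (hypothetical)
text = '&quot;Fish&quot; &amp; chips &lt;hot&gt;'

# Default behaviour: only &amp;, &lt; and &gt; are decoded
default_result = unescape(text)

# Extra entities are passed as a {entity: replacement} dictionary
extended_result = unescape(text, {'&quot;': '"', '&apos;': "'"})

print(default_result)
print(extended_result)
```

The first call leaves &quot; untouched, while the second decodes it because the mapping was supplied explicitly.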

Craft your own regex hero

For situations where you need full control over which entities get replaced, or if you are just a regex maestro, here's a way to forge your own function:

    import re
    from html.entities import name2codepoint

    def unescape_html(text):
        def substitute_entity(match):
            # "For every lock, there is someone out there trying to pick it."
            name = match.group(1)
            if name in name2codepoint:
                return chr(name2codepoint[name])
            return match.group(0)  # unknown name: leave the text untouched

        return re.sub(r'&(\w+);', substitute_entity, text)

    print(unescape_html('The &lt;em&gt;quick&lt;/em&gt; brown &amp; fox'))
    # prints: The <em>quick</em> brown & fox

This function uses a regex pattern to hunt down named entities and replace them with the corresponding Unicode characters. It covers named entities only; numeric references such as &#60; are left as they are.
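The function above deals with named entities only. A possible extension, sketched here under the assumption that anything unrecognised should be left exactly as written, also decodes decimal and hexadecimal references. The names ENTITY_RE and unescape_all are hypothetical:

```python
import re
from html.entities import name2codepoint

# Matches &#xHH; (hex), &#DDD; (decimal) or &name; (named)
ENTITY_RE = re.compile(r'&(#[xX][0-9a-fA-F]+|#[0-9]+|\w+);')

def unescape_all(text):
    def substitute(match):
        body = match.group(1)
        try:
            if body[:2].lower() == '#x':
                return chr(int(body[2:], 16))   # hexadecimal reference
            if body.startswith('#'):
                return chr(int(body[1:]))       # decimal reference
            return chr(name2codepoint[body])    # named entity
        except (KeyError, ValueError, OverflowError):
            # Unknown name or out-of-range code point: keep the original text
            return match.group(0)

    return ENTITY_RE.sub(substitute, text)

result = unescape_all('&lt;p&gt; &#169; &#xA9; &bogus; &amp;')
print(result)
```

Unlike html.unescape(), this sketch does not try to handle entities that are missing their trailing semicolon.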

Unicode and hex escaping: trap for the tricksters!

Strings sometimes arrive with JavaScript-style \uXXXX escapes rather than HTML entities. Even these sneaky escaped Unicode characters cannot hide:

    # Python to sneaky unicode: "I see what you did there!"
    escaped_str = "The quick brown fox \\u003Cem\\u003Ejumps\\u003C/em\\u003E over the lazy dog"
    print(bytes(escaped_str, "ascii").decode("unicode_escape"))
    # prints: The quick brown fox <em>jumps</em> over the lazy dog
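One caveat: converting to bytes with the "ascii" codec raises UnicodeEncodeError as soon as the string also contains real non-ASCII characters. A possible workaround, shown here as a sketch with a made-up input, is to encode with the backslashreplace error handler so those characters become escapes too before decoding:

```python
# Illustrative input mixing real non-ASCII text with \uXXXX escapes
mixed = "caf\u00e9 \\u003Cem\\u003Eol\u00e9\\u003C/em\\u003E"

try:
    bytes(mixed, "ascii")
    strict_ok = True
except UnicodeEncodeError:
    # The plain ASCII conversion cannot represent the accented characters
    strict_ok = False

# backslashreplace turns non-ASCII characters into escapes,
# which unicode_escape then decodes along with the original ones
decoded = mixed.encode("ascii", "backslashreplace").decode("unicode_escape")
print(decoded)
```

Keep in mind that unicode_escape interprets every backslash sequence in the input, so this suits strings whose backslashes are all meant as escapes.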

The same goes for text encoded as a plain hexadecimal string:

    # Python to hexadecimal: "Nice try, but I got you!"
    print(bytes.fromhex('54686520717569636b2062726f776e').decode('utf-8'))
    # prints: The quick brown
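For completeness, a short round-trip sketch shows where such a hex string can come from. It assumes Python 3.5 or newer, where bytes objects have a hex() method:

```python
text = "The quick brown"

# Encode to UTF-8 bytes, then render those bytes as hexadecimal digits
hex_str = text.encode("utf-8").hex()

# Reverse the process: hex digits -> bytes -> text
restored = bytes.fromhex(hex_str).decode("utf-8")

print(hex_str)
print(restored)
```

The encoding passed to decode() has to match the one used to produce the bytes; UTF-8 is assumed on both sides here.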