How do I unescape HTML entities in a string in Python 3.1?

by Nikita Barsukov · Feb 13, 2025

On Python 3.4 and later, use the unescape() function from the html module. It turns HTML entities back into the characters they represent:

# Python says: "Hold my beer, got this!"
from html import unescape

print(unescape('This &amp; that'))  # prints: This & that
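As a quick illustration of what html.unescape() covers, here is a minimal sketch that feeds it a named entity, a decimal reference, and a hexadecimal reference. The sample strings are made up for this example and assume Python 3.4 or newer:

```python
from html import unescape

# Hypothetical sample inputs, one per kind of reference
samples = [
    'caf&eacute;',               # named entity
    '&#169; 2025',               # decimal numeric reference
    '&#x2603;',                  # hexadecimal numeric reference
    '&lt;b&gt;bold&lt;/b&gt;',   # markup characters
]

for s in samples:
    # Show the escaped input next to its decoded form
    print(repr(s), '->', repr(unescape(s)))
```

All three forms go through the same call, so there is usually no need to pre-process numeric references yourself.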

Legacy extension

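If you're still partying in the world of Python 3.1, html.unescape() is not available yet (it arrived in Python 3.4). The HTMLParser class from the html.parser module has your back instead. Note that HTMLParser.unescape() was deprecated in Python 3.4 and removed in Python 3.9, so this only works on older interpreters: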

# Good ol' Python 3.1 still has some tricks up its sleeve!
from html.parser import HTMLParser
parser = HTMLParser()
print(parser.unescape('This &amp; that'))  # prints: This & that

HTMLParser().unescape() converts entities such as &amp; and &quot; in your strings back into the characters they stand for.
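If the same code has to run on both old and new interpreters, one possible approach is an import fallback. This is only a sketch: it assumes html.unescape() is preferred whenever it exists and that the HTMLParser method is used only when the import fails.

```python
try:
    # Python 3.4 and later
    from html import unescape
except ImportError:
    # Older Python 3 releases (such as 3.1): fall back to the parser method
    from html.parser import HTMLParser
    unescape = HTMLParser().unescape

print(unescape('This &amp; that'))  # prints: This & that
```

Either way, the rest of the program can call unescape() without caring which branch supplied it.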

Alternative and helpful methods

Legend of xml.sax.saxutils

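There's a lesser-known hero too: the xml.sax.saxutils module. It has its own unescape(), but its powers are narrower. By default it only handles &amp;, &lt; and &gt;; any other entity has to be supplied through its optional entities dictionary: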

from xml.sax.saxutils import unescape

print(unescape('This &amp; that'))  # prints: This & that

It ships with the standard library, so it is a reasonable choice when your input is XML-style text that only uses those three entities.
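To make the scope of xml.sax.saxutils.unescape() concrete, the sketch below runs the same made-up string with and without an extra entities mapping. The mapping name `extra` is just an illustration:

```python
from xml.sax.saxutils import unescape

text = '&quot;This&quot; &amp; &lt;that&gt;'

# Default behaviour: only &amp;, &lt; and &gt; are replaced
print(unescape(text))

# Additional entities are passed as a dict of entity -> replacement
extra = {'&quot;': '"', '&apos;': "'"}
print(unescape(text, extra))
```

Without the mapping, &quot; survives untouched, which is easy to miss if you expect full HTML entity support.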

Craft your own regex hero

If you need full control over which entities get replaced, or you are just a regex maestro, here's a way to forge your own function:

import re
from html.entities import name2codepoint

def unescape_html(text):
    def substitute_entity(match):
        # Look up the entity name; leave unknown names (e.g. &foo;) untouched
        # instead of raising KeyError
        codepoint = name2codepoint.get(match.group(1))
        return chr(codepoint) if codepoint is not None else match.group(0)
    return re.sub(r'&(\w+);', substitute_entity, text)

print(unescape_html('The &lt;em&gt;quick&lt;/em&gt; brown &amp; fox'))  # prints: The <em>quick</em> brown & fox

This baby uses its own regex pattern to hunt down named entities and replace them with the corresponding Unicode characters. Numeric references such as &#38; or &#x26; are not matched, because \w does not include the # character.
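A regex-based function can be extended to cover numeric character references as well. The following is a sketch under stated assumptions rather than a drop-in replacement for the standard library: the names `ENTITY_RE` and `unescape_entities` are invented here, and anything that cannot be decoded is deliberately left as it was.

```python
import re
from html.entities import name2codepoint

# Matches &#x26; (hex), &#38; (decimal) or &amp; (named)
ENTITY_RE = re.compile(r'&(#[xX][0-9a-fA-F]+|#[0-9]+|\w+);')

def unescape_entities(text):
    def substitute(match):
        body = match.group(1)
        try:
            if body[:2] in ('#x', '#X'):
                return chr(int(body[2:], 16))   # hexadecimal reference
            if body.startswith('#'):
                return chr(int(body[1:]))       # decimal reference
            return chr(name2codepoint[body])    # named entity
        except (KeyError, ValueError, OverflowError):
            # Unknown name or out-of-range code point: keep the original text
            return match.group(0)
    return ENTITY_RE.sub(substitute, text)

print(unescape_entities('&#38; &#x26; &amp; &bogus;'))  # prints: & & & &bogus;
```

Leaving undecodable input untouched is a design choice; raising an error instead may suit stricter pipelines.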

Unicode and hex escaping: trap for the tricksters!

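HTML entities are not the only escapes you will meet. Sneaky \uXXXX sequences, as used in JavaScript and JSON strings, cannot hide either. The unicode_escape codec decodes them, provided the input is pure ASCII:

# Python to sneaky unicode: "I see what you did there!"
escaped_str = "The quick brown fox \\u003Cem\\u003Ejumps\\u003C/em\\u003E over the lazy dog"
print(bytes(escaped_str, "ascii").decode("unicode_escape"))  # prints: The quick brown fox <em>jumps</em> over the lazy dog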
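Encoding to ASCII first fails with UnicodeEncodeError as soon as the string already contains a non-ASCII character. One possible workaround, sketched here with a hypothetical helper named `decode_u_escapes`, is to replace only the \uXXXX sequences with a regex and leave everything else alone. It assumes four-digit escapes and does not combine surrogate pairs.

```python
import re

def decode_u_escapes(text):
    # Replace each literal \uXXXX sequence with the character it encodes
    return re.sub(
        r'\\u([0-9a-fA-F]{4})',
        lambda m: chr(int(m.group(1), 16)),
        text,
    )

mixed = 'caf\u00e9 \\u003Cem\\u003Ejumps\\u003C/em\\u003E'
print(decode_u_escapes(mixed))
```

The already-decoded é passes through unchanged, while the escaped angle brackets are restored.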

Same goes for hexadecimal pranksters, where every pair of hex digits encodes one byte:

# Python to hexadecimal: "Nice try, but I got you!"
print(bytes.fromhex('54686520717569636b2062726f776e').decode('utf-8'))  # prints: The quick brown
