What's the easiest way to escape HTML in Python?

by Nikita Barsukov · Sep 30, 2024

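Get on the safe(r) side: use Python's html.escape() to escape text before it lands in your HTML. It swaps &, < and > (plus both quote characters, by default) for entities, so markup injected into element content or quoted attribute values gets shown as text instead of being executed.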

import html
safe_html = html.escape("<script>alert('x')</script>")
# safe_html: &lt;script&gt;alert(&#x27;x&#x27;)&lt;/script&gt;
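
To see where the escaping step usually sits, here is a minimal sketch that drops two user-supplied values into a snippet of HTML. The render_comment helper and its sample inputs are made up for illustration; only html.escape() comes from the standard library.

```python
import html

def render_comment(author: str, text: str) -> str:
    """Hypothetical helper: escape every user-supplied value before templating."""
    return "<p><b>{}</b>: {}</p>".format(html.escape(author), html.escape(text))

snippet = render_comment("Tom & Jerry", "<script>alert('x')</script>")
# snippet: <p><b>Tom &amp; Jerry</b>: &lt;script&gt;alert(&#x27;x&#x27;)&lt;/script&gt;</p>
```

The tags that belong to the template stay intact; only the values that came from outside are escaped.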

Leveling up: Dealing with double quotes and ASCII rebels

In the HTML attributes universe, quote=True (already the default) makes double quotes (") and single quotes (') safe(r) by turning them into &quot; and &#x27;. Welcome to &quot;!

import html
safe_attribute = html.escape('"Hello, world!"', quote=True)
# safe_attribute: &quot;Hello, world!&quot;
# "Hello, world!" *enters witness protection*
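
Why quotes matter shows up as soon as a value goes inside an attribute. The sketch below uses an invented attack string and an invented input tag; it compares the default behavior with quote=False.

```python
import html

# Invented input that tries to close the attribute and add an event handler.
user_input = '" onmouseover="alert(1)'

# quote=True is the default, so html.escape(user_input) behaves the same way.
safe_value = html.escape(user_input, quote=True)
safe_tag = '<input value="{}">'.format(safe_value)
# safe_tag: <input value="&quot; onmouseover=&quot;alert(1)">

# quote=False leaves quotes alone: fine for text nodes, risky inside attributes.
unsafe_value = html.escape(user_input, quote=False)
unsafe_tag = '<input value="{}">'.format(unsafe_value)
# unsafe_tag: <input value="" onmouseover="alert(1)">
```

In the second tag the attacker's quote ends the value early, and onmouseover becomes a real attribute.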

Working with rebel non-ASCII characters? Foil their plotting by encoding to ASCII with numeric character references:

data = "© 2023 StackExchange"
safe_data = data.encode('ascii', 'xmlcharrefreplace')
# safe_data: b'&#169; 2023 StackExchange'  (a bytes object, not a str)
# "© 2023 StackExchange" smoothly turns into "&#169; 2023 StackExchange"

And if your content arrives as bytes, clean up (decode to str) before launching the escape plan: html.escape() works on str, not bytes.
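
A minimal sketch of that decode-then-escape order, assuming the bytes are UTF-8. The escape_bytes helper is hypothetical, not part of the standard library.

```python
import html

def escape_bytes(raw: bytes, encoding: str = "utf-8") -> str:
    """Hypothetical helper: decode incoming bytes, then escape the resulting str."""
    return html.escape(raw.decode(encoding))

# Bytes as they might arrive from a file or a socket.
raw = "<b>Crème</b>".encode("utf-8")
escaped = escape_bytes(raw)
# escaped: &lt;b&gt;Crème&lt;/b&gt;
```

The accented letter passes through untouched, because html.escape() only replaces &, <, > and the two quote characters.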

Digging deeper: Escaping vs Encoding

Understand the game before playing it:

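Escaping: You're a spy, changing identities for safety. Characters that mean something in HTML (&, <, >, quotes) are swapped for entities so the browser reads them as plain text.
Encoding: You're a shape-shifter, changing forms for convenience. A str becomes bytes (UTF-8, ASCII, ...) so it can be stored or sent over the wire.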

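html.escape() has no encoding parameter: it takes a str and hands back a str. What has to match is the encoding you use when turning the escaped string into bytes and the charset your document declares. That's the match made in techie heaven.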
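
The two steps can be lined up in a short round trip. This is only a sketch with an invented sample string, and UTF-8 is an assumed choice of encoding.

```python
import html

original = "Fish & <Chips>"

escaped = html.escape(original)    # escaping: str -> str
encoded = escaped.encode("utf-8")  # encoding: str -> bytes
# escaped: Fish &amp; &lt;Chips&gt;
# encoded: b'Fish &amp; &lt;Chips&gt;'

# Undo both steps in reverse order: decode the bytes, then unescape the entities.
restored = html.unescape(encoded.decode("utf-8"))
```

Escaping changed which characters are in the string; encoding only changed how they are stored.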

Python 3.2+ and the "deprecated" cgi.escape

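From Python 3.2 on, stick with html.escape(). cgi.escape() was deprecated in 3.2 and removed in Python 3.8, and the cgi module itself is gone as of Python 3.13, so leave it in the past where it belongs. Here's the difference: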

# cgi.escape() - Old school (removed in Python 3.8, so left commented out)
# import cgi
# escaped_str = cgi.escape("<b>deprecated</b>")

# html.escape() - The cool kid
import html
escaped_str = html.escape("<b>recommended</b>")
# escaped_str: &lt;b&gt;recommended&lt;/b&gt; *drops mic*
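
One thing to watch when migrating: the defaults differ. cgi.escape() left quotes alone unless asked, while html.escape() escapes both kinds by default. The legacy_escape helper below is a hypothetical shim for code that depends on the old output, not something the standard library ships.

```python
import html

def legacy_escape(text: str, quote: bool = False) -> str:
    """Hypothetical helper: approximate the old cgi.escape() output."""
    escaped = html.escape(text, quote=False)  # handles &, < and > only
    if quote:
        # The old function escaped double quotes on request, never single quotes.
        escaped = escaped.replace('"', "&quot;")
    return escaped

sample = '<a href="x">it\'s</a>'
old_style = legacy_escape(sample)
# old_style: &lt;a href="x"&gt;it's&lt;/a&gt;
new_style = html.escape(sample)
# new_style: &lt;a href=&quot;x&quot;&gt;it&#x27;s&lt;/a&gt;
```

For new code the html.escape() defaults are the safer pick; a shim like this only makes sense when stored output has to stay byte-for-byte identical.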

URL Escaping: urllib trumps all

For URLs, the urllib library is your squad. urllib.parse.quote() percent-encodes characters that are unsafe in a URL, which is a different job from HTML entity escaping. Code never lies:

from urllib.parse import quote
safe_url = quote("<Hello & World>")
# safe_url: %3CHello%20%26%20World%3E
# "<Hello & World>" pulls a Count Dracula
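
When a URL ends up inside an href, both kinds of escaping apply, one after the other. A minimal sketch, assuming a placeholder host (example.com) and invented path and query values:

```python
import html
from urllib.parse import quote, urlencode

# Step 1: percent-encode the URL parts ('/' is left alone by quote() by default).
path = quote("/search/<Hello & World>")
query = urlencode({"q": "a&b", "lang": "en"})
url = "https://example.com" + path + "?" + query
# url: https://example.com/search/%3CHello%20%26%20World%3E?q=a%26b&lang=en

# Step 2: HTML-escape the finished URL before placing it in an attribute.
link = '<a href="{}">results</a>'.format(html.escape(url))
# link: <a href="https://example.com/search/%3CHello%20%26%20World%3E?q=a%26b&amp;lang=en">results</a>
```

The & that separates the query parameters is legal in a URL but still needs to become &amp; in the HTML source.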

Embrace a "safety first" approach with MarkupSafe

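When robustness is your priority, bet on MarkupSafe. Its escape() returns a Markup object that counts as already safe, so it won't be escaped a second time, and it respects objects that define an __html__() method. Template engines such as Jinja build on it.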

from markupsafe import escape
escaped_html = escape("<This is safe & sound>")
# escaped_html: &lt;This is safe &amp; sound&gt;
# "<This is safe & sound>" *buckles up*

MarkupSafe is a third-party package, so install it first (pip install markupsafe). Tailor it to your needs by diving into its documentation, which covers the Markup class and the __html__ protocol.

Non-ASCII chars: Correct encoding is key

One rule to remember: correct encoding. It's the Minas Tirith for your non-ASCII characters:

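import html
data = "<Crème brûlée>"
safe_html = html.escape(data).encode('ascii', 'xmlcharrefreplace')
# safe_html: b'&lt;Cr&#232;me br&#251;l&#233;e&gt;'
# With .encode('utf-8') instead, the accented letters stay as raw UTF-8 bytes
# "<Crème brûlée>" *gets a facelift*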

Accuracy is essential. Check the charset declared in the Content-Type header (or in the page's <meta charset> tag) to make sure it matches the encoding you actually used.
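
As a rough sketch of keeping those declarations in step, the snippet below builds a tiny page, encodes it as UTF-8 and states the same charset in both places. The page template, the title and the headers dict are illustrative assumptions and not tied to any particular web framework.

```python
import html

title = "Crème brûlée <recipe>"

# Escape the dynamic part, then build the page as a str.
page = (
    '<!DOCTYPE html>\n'
    '<html><head><meta charset="utf-8"><title>{0}</title></head>'
    '<body><h1>{0}</h1></body></html>'
).format(html.escape(title))

# Encode once, and declare the same charset wherever the response is described.
body_bytes = page.encode("utf-8")
headers = {"Content-Type": "text/html; charset=utf-8"}
```

If the bytes were encoded as UTF-8 but declared as something else, the accented letters would come out garbled even though the escaping itself was correct.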
