What's the easiest way to escape HTML in Python?

python
html-escape
python-3
encoding
by Nikita Barsukov · Sep 30, 2024
TLDR

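Get on the safe(r) side: use html.escape() from Python's standard library to escape text before dropping it into HTML. It replaces &, < and > with entities and, by default (quote=True), " and ' as well. Poof! Most of your HTML-injection troubles in element content and quoted attributes are gone.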

    import html

    safe_html = html.escape("<script>alert('x')</script>")
    # safe_html: &lt;script&gt;alert(&#x27;x&#x27;)&lt;/script&gt;
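To show where the call usually sits, here is a minimal sketch that escapes untrusted text right before it is placed inside a tag. The helper name render_comment is hypothetical, made up for this illustration.

```python
import html


def render_comment(user_text: str) -> str:
    """Wrap untrusted text in a <p> tag, escaping it first (hypothetical helper)."""
    return "<p>{}</p>".format(html.escape(user_text))


# The markup in the input ends up as visible text, not as a live <script> tag.
print(render_comment("<script>alert('x')</script>"))
```

Escaping at the point of output, rather than when the data is stored, keeps the original text intact for other uses.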

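Leveling up: Dealing with quotes and non-ASCII rebels

In the HTML attributes universe, quotes are the troublemakers. html.escape() already defaults to quote=True, which turns double quotes (") into &quot; and single quotes (') into &#x27;. Passing quote=True explicitly just makes the intent visible; quote=False leaves both untouched. Welcome to &quot;!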

    import html

    safe_attribute = html.escape('"Hello, world!"', quote=True)
    # safe_attribute: &quot;Hello, world!&quot;
    # "Hello, world!" *enters witness protection*
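A small sketch of why the quote flag matters inside an attribute value. The helper link_title and the sample input are assumptions chosen for illustration only.

```python
import html


def link_title(title: str) -> str:
    """Build an anchor whose title attribute holds untrusted text (hypothetical helper)."""
    # With quote=True the " and ' characters are escaped, so the value
    # cannot close the attribute and smuggle in a new one.
    return '<a href="#" title="{}">link</a>'.format(html.escape(title, quote=True))


risky = '" onmouseover="alert(1)'

print(link_title(risky))                 # quotes arrive as &quot;
print(html.escape(risky, quote=False))   # quotes left as-is: unsafe in an attribute
```

This only covers quoted attribute values; unquoted attributes and script or style contexts need different handling.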

Working with rebel non-ASCII characters and an ASCII-only output? Convert them to numeric character references (this is encoding, not encryption, and the result is bytes):

    data = "© 2023 StackExchange"
    safe_data = data.encode('ascii', 'xmlcharrefreplace')
    # safe_data: b'&#169; 2023 StackExchange'
    # "© 2023 StackExchange" smoothly turns into "&#169; 2023 StackExchange"

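And if your content arrives as bytes, decode it to str before launching the escape plan: html.escape() only accepts str. Run html.escape() before the xmlcharrefreplace step, or the & in &#169; gets escaped too.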
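The order of operations can be sketched as follows, assuming the incoming bytes are UTF-8 (an assumption; use whatever encoding your source really has).

```python
import html

raw = "© <b>2023</b>".encode("utf-8")   # bytes, as they might arrive from a file or socket

text = raw.decode("utf-8")               # 1. decode bytes to str
escaped = html.escape(text)              # 2. escape the HTML-special characters
ascii_safe = escaped.encode("ascii", "xmlcharrefreplace")  # 3. optional ASCII-only output

print(escaped)
print(ascii_safe)

# Passing bytes straight to html.escape() raises TypeError.
try:
    html.escape(raw)
except TypeError:
    print("html.escape() needs str, not bytes")
```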

Digging deeper: Escaping vs Encoding

Understand the game before playing it:

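  • Escaping: You're a spy, changing identities for safety. Characters with special meaning in HTML (&, <, >, " and ') are swapped for entities so the browser treats them as text, not markup.
  • Encoding: You're a shape-shifter, changing forms for convenience. A str is turned into bytes (UTF-8, ASCII, ...) so it can be stored or sent.

html.escape() works on str and has no encoding parameter. What has to match is the encoding you use when writing the bytes and the charset your document declares: a match made in techie heaven.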
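The two steps can be kept apart in code. This is a minimal sketch, assuming the page is served as UTF-8; the CHARSET constant and the tiny page template are illustrative, not part of any library.

```python
import html

CHARSET = "utf-8"  # assumed document encoding

# Escaping: str -> str, makes the text safe to place in markup.
body = html.escape("Tom & Jerry <café>")

# The declared charset and the one used for encoding come from the same constant.
page = '<!DOCTYPE html><meta charset="{}"><p>{}</p>'.format(CHARSET, body)

# Encoding: str -> bytes, the form that is written to disk or sent over the wire.
payload = page.encode(CHARSET)

print(body)
print(type(payload).__name__)
```

Driving both the meta tag and the encode() call from one constant is a simple way to keep them from drifting apart.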

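Python 3.2+ and the deprecated cgi.escape

From Python 3.2 on, stick with html.escape(). cgi.escape() was deprecated in 3.2 and removed in Python 3.8, so leave it in the past where it belongs. One behavioural difference: cgi.escape() left quotes alone unless asked, while html.escape() escapes them by default. Here's the before and after:

    # cgi.escape() - Old school (fails on Python 3.8+, where it no longer exists)
    # import cgi
    # escaped_str = cgi.escape("<b>deprecated</b>")

    # html.escape() - The cool kid
    import html

    escaped_str = html.escape("<b>recommended</b>")
    # escaped_str: &lt;b&gt;recommended&lt;/b&gt;
    # "<b>recommended</b>" *drops mic*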
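When migrating old code, the main thing to watch is the quote default. A short sketch, using an arbitrary sample string, of how to keep the old behaviour where output must stay byte-for-byte identical:

```python
import html

sample = "<a title=\"it's\">"

# Mirrors the old cgi.escape() default: only &, < and > are replaced.
legacy_style = html.escape(sample, quote=False)

# html.escape() default: " and ' are replaced as well.
current_default = html.escape(sample)

print(legacy_style)
print(current_default)
```

Unless something depends on the old output, the default (quote=True) is the safer choice.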

URL Escaping: urllib trumps all

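For URLs, urllib.parse is your squad. Its quote() function percent-encodes reserved characters (%3C instead of &lt;); this is URL encoding, not HTML entity escaping. If the finished URL then lands inside an HTML attribute, it still needs html.escape(). Code never lies:

    from urllib.parse import quote

    safe_url = quote("<Hello & World>")
    # safe_url: %3CHello%20%26%20World%3E
    # "<Hello & World>" pulls a Count Dracula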
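Here is a sketch of both layers working together: percent-encoding for the URL itself, then HTML escaping for the attribute that carries it. The example.com host, path and query parameters are placeholders.

```python
import html
from urllib.parse import quote, urlencode

# Percent-encode the path; quote() keeps "/" by default.
path = quote("/docs/a b&c")

# urlencode() builds a query string and encodes each value (spaces become "+").
query = urlencode({"q": "<Hello & World>", "page": 2})

url = "https://example.com" + path + "?" + query

# The "&" separating query parameters must become &amp; inside an attribute.
href = html.escape(url)

print(url)
print('<a href="{}">search</a>'.format(href))
```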

Embrace a "safety first" approach with MarkupSafe

When robustness is your priority, bet on MarkupSafe, a third-party package (pip install MarkupSafe) that Jinja builds on. Its escape() returns a Markup object, a str subclass marked as safe so it is not escaped twice, and any object that defines an __html__() method is trusted to supply its own HTML.

    from markupsafe import escape

    escaped_html = escape("<This is safe & sound>")
    # escaped_html: &lt;This is safe &amp; sound&gt;
    # "<This is safe & sound>" *buckles up*

MarkupSafe is a champion wherever templates are involved. Dive into its documentation to tailor it to your needs.

Non-ASCII chars: Correct encoding is key

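One rule to remember: correct encoding. It's the Minas Tirith for your non-ASCII characters. html.escape() leaves non-ASCII characters untouched, so the encoding step decides what they look like on the wire:

    import html

    data = "<Crème brûlée>"
    safe_html = html.escape(data).encode('utf-8')
    # safe_html: b'&lt;Cr\xc3\xa8me br\xc3\xbbl\xc3\xa9e&gt;'
    ascii_html = html.escape(data).encode('ascii', 'xmlcharrefreplace')
    # ascii_html: b'&lt;Cr&#232;me br&#251;l&#233;e&gt;'
    # "Crème brûlée" *gets a facelift*

Accuracy is essential. Check the charset declared in the Content-Type header or the <meta charset> tag and make sure it matches the encoding you actually used.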
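If the target charset cannot represent every character, one option is to let xmlcharrefreplace pick up the leftovers. A minimal sketch; the helper name encode_for is hypothetical.

```python
import html


def encode_for(text: str, charset: str) -> bytes:
    """Escape text, then encode it for the given charset (hypothetical helper).

    Characters the charset cannot represent become numeric character
    references instead of raising UnicodeEncodeError.
    """
    return html.escape(text).encode(charset, "xmlcharrefreplace")


sample = "<Crème brûlée>"

print(encode_for(sample, "utf-8"))    # accented letters as UTF-8 bytes
print(encode_for(sample, "latin-1"))  # accented letters as single Latin-1 bytes
print(encode_for(sample, "ascii"))    # accented letters as &#NNN; references
```

Whichever charset is passed in should be the same one announced in the response header or meta tag.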