Strip HTML from strings in Python


by Alex Kataev · Aug 17, 2024

To quickly strip HTML markup from a string in Python, use re.sub() to remove the tags and html.unescape() to decode the entities:

import re
from html import unescape

def strip_html(text):
    # Remove anything that looks like a tag, then decode entities such as &amp;.
    return unescape(re.sub(r'<[^<]+?>', '', text))

print(strip_html("<h1>Hello, <b>World</b>!</h1>"))
# Output: Hello, World!

This short function removes everything that looks like a tag and then decodes the entities, which is enough for small, well-formed snippets.

Use html.parser for robust handling

If you need more precise handling than a regex can offer, Python's built-in html.parser module comes to the rescue. It tokenizes the markup and lets you collect only the text between the tags.

from html.parser import HTMLParser

class MLStripper(HTMLParser):
    def __init__(self):
        # convert_charrefs=True (the default since Python 3.5) decodes entities
        # such as &amp; before handle_data() is called.
        super().__init__(convert_charrefs=True)
        self.fed = []

    def handle_data(self, d):
        self.fed.append(d)

    def get_data(self):
        return ''.join(self.fed)

def strip_tags(html):
    s = MLStripper()
    s.feed(html)
    s.close()  # flush any text the parser is still buffering
    return s.get_data()

print(strip_tags("<h1>Hello, <b>World</b>!</h1>"))
# Output: Hello, World!

BeautifulSoup for the simplicity lovers

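BeautifulSoup is your ease-of-use champion that does not compromise on flexibility. It is a third-party package (pip install beautifulsoup4), and stripping HTML with it takes two steps: parse the markup, then call get_text().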

from bs4 import BeautifulSoup

def strip_html_bs(html):
    soup = BeautifulSoup(html, "html.parser")
    return soup.get_text()

print(strip_html_bs("<h1>Hello, <b>World</b>!</h1>"))
# Output: Hello, World!

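BeautifulSoup has your back when you need more control as well: get_text() accepts separator and strip arguments, so you can decide how the text fragments are joined and whether surrounding whitespace is trimmed.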

The cautionary tale of regex

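While regex is fantastic for simple patterns, it falls flat on its face when the markup stops being simple: a > inside an attribute value, a bare < in ordinary text, comments, script and style bodies, or malformed HTML all trip it up. Be mindful of these limitations before reaching for regex.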
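To see where the pattern breaks, here is a small sketch that runs the same regex as the quick version (without the entity decoding) over three awkward inputs. The helper name strip_html_regex is made up for this illustration.

```python
import re

TAG_RE = re.compile(r"<[^<]+?>")

def strip_html_regex(text):
    """Naive tag removal using the same pattern as the quick regex version."""
    return TAG_RE.sub("", text)

# A ">" inside a quoted attribute value ends the match too early.
print(repr(strip_html_regex('<a title="a > b">link</a>')))
# Output: ' b">link'

# Comparison operators in plain text are mistaken for a tag.
print(repr(strip_html_regex("if 1 < 2 and 3 > 2")))
# Output: 'if 1  2'

# Script bodies are just text to the regex, so the code leaks into the result.
print(repr(strip_html_regex("<script>alert(1)</script>Hi")))
# Output: 'alert(1)Hi'
```

In all three cases the regex either leaves markup debris behind or eats real text, which is the cue to switch to a parser.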

Safeguarding against common issues

Defending against XSS attacks

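Note the crucial difference between HTML stripping and HTML sanitizing. Stripping removes tags to produce plain text, but the result is not automatically safe to put back into a page: once entities are decoded, &lt;script&gt; turns into <script> again. MarkupSafe escapes markup so that it renders as literal text, while Bleach keeps only an allowlist of tags and attributes. Reach for tools like these when the output goes back into HTML and you need to safeguard your application against XSS attacks.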
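A minimal sketch of why stripped text still needs escaping, assuming the result is going to be placed back into an HTML page. It reuses the regex approach from the top of the article and the standard library's html.escape(); the sample input is invented for the demo.

```python
import re
from html import escape, unescape

def strip_html(text):
    # Same approach as the quick regex version: drop tags, then decode entities.
    return unescape(re.sub(r"<[^<]+?>", "", text))

# Entity-encoded markup survives tag removal and is revived by unescape().
user_input = "<b>Hi</b> &lt;script&gt;alert(1)&lt;/script&gt;"

stripped = strip_html(user_input)
print(stripped)
# Output: Hi <script>alert(1)</script>

# Escape at output time before the text goes into an HTML page.
safe_for_html = escape(stripped)
print(safe_for_html)
# Output: Hi &lt;script&gt;alert(1)&lt;/script&gt;
```

The stripped string is fine for plain-text uses such as logs or search indexes; the escaped one is what belongs inside markup.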

Decoding entities for a cleaner look

Remember to convert HTML entities back to their respective characters so that the result reads like normal text:

clean_text = strip_html("&lt;HTML&gt; entities like &amp; are decoded&excl;")
print(clean_text)
# Output: <HTML> entities like & are decoded!
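For a closer look at what html.unescape() does on its own, here is a short sketch covering named, decimal, and hexadecimal references. The normalize_spaces helper is a hypothetical extra for this demo, not part of the standard library.

```python
from html import unescape

# Named, decimal, and hexadecimal references decode to the same character.
print(unescape("&amp; &#38; &#x26;"))
# Output: & & &

print(unescape("caf&eacute;"))
# Output: café

# &nbsp; becomes U+00A0 (a non-breaking space), not a regular space.
decoded = unescape("a&nbsp;b")
print(repr(decoded))
# Output: 'a\xa0b'

def normalize_spaces(text):
    # Hypothetical helper: swap non-breaking spaces for ordinary ones.
    return text.replace("\xa0", " ")

print(repr(normalize_spaces(decoded)))
# Output: 'a b'
```

The non-breaking space is worth remembering, because it looks like a space on screen but will not match one in comparisons or splits.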

Paying attention to ASCII control characters

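Sometimes, along with HTML, you might also want to remove ASCII control characters for an even cleaner extraction:

def strip_control_chars(text):
    # str.isprintable() is False for control characters, including \n, \t and \r.
    return ''.join(ch for ch in text if ch.isprintable())

This deletes those naughty, control-freak characters that mess with your text. Be aware that it is a blunt tool: newlines and tabs go too, and so does any non-ASCII character that Python classes as non-printable, such as the non-breaking space U+00A0 that &nbsp; decodes to.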
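If you would rather keep line breaks and tabs, one possible variant filters on the Unicode category instead. This is a sketch under that assumption; the function name and the keep parameter are invented for the example.

```python
import unicodedata

def strip_control_chars_keep_whitespace(text, keep="\n\t"):
    """Drop control characters (Unicode category Cc) except those in `keep`."""
    return "".join(
        ch for ch in text
        if ch in keep or unicodedata.category(ch) != "Cc"
    )

sample = "Hello\x00,\tWorld\x07!\n"
cleaned = strip_control_chars_keep_whitespace(sample)
print(repr(cleaned))
# Output: 'Hello,\tWorld!\n'
```

Carriage returns are also category Cc, so add "\r" to keep if your input uses Windows line endings.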

Designing responsive solutions

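You should choose your HTML stripping method based on your specific use case and environment: the regex one-liner for short, trusted snippets, html.parser when you want a dependency-free solution, and BeautifulSoup when you already depend on it or need to work with the document beyond extracting its text.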

Getting dynamic with object-oriented solutions

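Not every solution fits into a one-line function. The MLStripper class, which subclasses HTMLParser from html.parser, can be customized for more complex operations by overriding further handler methods such as handle_starttag() and handle_endtag().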
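As one example of such a customization, the sketch below ignores the contents of script and style elements, which a plain text collector would otherwise include. The class and function names are hypothetical, and the skip list is an assumption you can adjust.

```python
from html.parser import HTMLParser

class VisibleTextStripper(HTMLParser):
    """Collects text but ignores the contents of <script> and <style>."""

    SKIPPED_TAGS = {"script", "style"}

    def __init__(self):
        super().__init__(convert_charrefs=True)
        self.parts = []
        self.skip_depth = 0  # > 0 while inside a skipped element

    def handle_starttag(self, tag, attrs):
        if tag in self.SKIPPED_TAGS:
            self.skip_depth += 1

    def handle_endtag(self, tag):
        if tag in self.SKIPPED_TAGS and self.skip_depth > 0:
            self.skip_depth -= 1

    def handle_data(self, data):
        if self.skip_depth == 0:
            self.parts.append(data)

    def get_data(self):
        return "".join(self.parts)

def strip_visible_text(html):
    parser = VisibleTextStripper()
    parser.feed(html)
    parser.close()  # flush any buffered text
    return parser.get_data()

page = "<style>p {color: red}</style><p>Hello</p><script>alert(1)</script>"
print(strip_visible_text(page))
# Output: Hello
```

The same pattern extends to other needs, for instance appending a newline in handle_endtag() for block-level tags.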

Navigating through file inputs

The code examples provided here can be adapted to read from files, which gives you the flexibility to process large documents or to work through many files in bulk.
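A rough sketch of that adaptation with html.parser follows. It relies on feed() accepting input in pieces; the function names, the UTF-8 default, the chunk size, and the *.html pattern are all assumptions made for the illustration.

```python
import tempfile
from html.parser import HTMLParser
from pathlib import Path

class TextCollector(HTMLParser):
    """Minimal text-only parser, similar in spirit to MLStripper."""

    def __init__(self):
        super().__init__(convert_charrefs=True)
        self.parts = []

    def handle_data(self, data):
        self.parts.append(data)

def strip_tags_from_file(path, encoding="utf-8", chunk_size=64 * 1024):
    parser = TextCollector()
    with open(path, encoding=encoding) as handle:
        # feed() can be called repeatedly, so the file is parsed chunk by chunk
        # instead of being read into memory in one go.
        for chunk in iter(lambda: handle.read(chunk_size), ""):
            parser.feed(chunk)
    parser.close()
    return "".join(parser.parts)

def strip_tags_from_directory(directory, pattern="*.html"):
    # Maps each matching file name to its extracted text.
    return {
        path.name: strip_tags_from_file(path)
        for path in sorted(Path(directory).glob(pattern))
    }

# Demo on a throwaway file; the tiny chunk_size only exercises the chunking.
with tempfile.TemporaryDirectory() as tmp:
    sample = Path(tmp) / "sample.html"
    sample.write_text("<h1>Hello, <b>World</b>!</h1>", encoding="utf-8")
    single = strip_tags_from_file(sample, chunk_size=8)
    bulk = strip_tags_from_directory(tmp)

print(single)
# Output: Hello, World!
print(bulk)
# Output: {'sample.html': 'Hello, World!'}
```

The extracted text is still collected in memory, so for very large inputs you may prefer to write each piece out from handle_data() instead.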

Being resilient with broken HTML

Given that malformed HTML is a reality, it helps that both BeautifulSoup and html.parser tolerate improper markup such as unclosed or mismatched tags, and still extract the text where a regex may falter.
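Here is a quick check of that tolerance using html.parser only, so it runs without extra packages. The TextCollector class and the broken sample strings are made up for the demo.

```python
from html.parser import HTMLParser

class TextCollector(HTMLParser):
    """Minimal text-only parser, similar in spirit to MLStripper."""

    def __init__(self):
        super().__init__(convert_charrefs=True)
        self.parts = []

    def handle_data(self, data):
        self.parts.append(data)

def strip_tags(html):
    parser = TextCollector()
    parser.feed(html)
    parser.close()  # flush any buffered text
    return "".join(parser.parts)

# Unclosed <p> and <b>, plus a <br> without a closing slash.
unclosed = "<div><p>Unclosed paragraph<br>Second line</div><b>Stray bold"
print(strip_tags(unclosed))
# Output: Unclosed paragraphSecond lineStray bold

# Closing tags in the wrong order.
mismatched = "<b><i>text</b></i>"
print(strip_tags(mismatched))
# Output: text
```

Note that the text fragments are joined without separators, so "paragraph" and "Second" run together; insert your own whitespace in the handler methods if that matters.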
