How to get HTML from a BeautifulSoup object

by Alex Kataev · Feb 18, 2025

To get the HTML out of a BeautifulSoup object, call soup.prettify() for indented, readable markup, or str(soup) for the markup as a compact, unformatted string. For instance:

from bs4 import BeautifulSoup

# Assuming 'soup' is your BeautifulSoup object...
print(soup.prettify())  # Tidy HTML
print(str(soup))        # Plain HTML string
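
The snippet above assumes a soup object already exists. Here is a minimal self-contained sketch of the same two calls; the html string is a made-up sample, and the built-in 'html.parser' is used so no extra parser needs to be installed:

```python
from bs4 import BeautifulSoup

# Hypothetical sample markup, used only for illustration
html = "<div><p>Hello, <b>soup</b>!</p></div>"
soup = BeautifulSoup(html, "html.parser")

compact = str(soup)       # the markup as one unformatted string
pretty = soup.prettify()  # the same markup, indented across several lines

print(compact)
print(pretty)
```

For well-formed input like this, str(soup) should give back the original markup unchanged, while prettify() spreads it over multiple indented lines.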

Detailed breakdown

Let's dive deeper into the nuts and bolts of getting HTML content from BeautifulSoup.

Extracting tags and/or text

To fetch only the text, or only certain tags, use the following methods:

# Grabbing text without tags
print(soup.get_text())

# Grabbing a particular tag
print(soup.find('span')) # Who doesn't love some span in their soup?
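
As a rough illustration of how these calls behave, the sketch below runs them on a small made-up document. The markup and variable names are assumptions for the example only:

```python
from bs4 import BeautifulSoup

# Hypothetical sample markup, used only for illustration
html = '<div><span class="note">first</span><p>body text</p><span>second</span></div>'
soup = BeautifulSoup(html, "html.parser")

# Text only; separator and strip keep adjacent strings from running together
text_only = soup.get_text(separator=" ", strip=True)

first_span = soup.find("span")     # first matching tag
all_spans = soup.find_all("span")  # every matching tag
missing = soup.find("table")       # None when nothing matches

print(text_only)
print(first_span)
print(len(all_spans))
```

Because find() returns None when there is no match, it is worth checking the result before chaining further calls onto it.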

Knowing your output

Keep in mind that these methods render different outputs:

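soup.prettify() adds newlines and indentation so the structure is easy to read. It's the neater twin, meant for viewing: the extra whitespace becomes part of the output, so don't use it to reformat HTML you intend to process further.
str(soup) gives you the HTML without any additional formatting. It's the less fussy twin, and the one to use when the markup is passed on to other code.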

Dealing with special characters

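str(soup) and soup.prettify() return ordinary Unicode strings, so special characters are preserved as they are. Encode to bytes only when you need them, for example to write a binary file or send the markup over the network:

# Encoding the output to UTF-8 (this yields bytes, so print shows a b'...' literal)
print(soup.prettify().encode('utf-8')) # Going international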
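
To make the str versus bytes distinction concrete, here is a small sketch assuming a made-up snippet containing non-ASCII characters. It uses soup.encode() and the encoding argument of prettify(), both of which return bytes:

```python
from bs4 import BeautifulSoup

# Hypothetical sample markup with non-ASCII characters
html = "<p>café ☕</p>"
soup = BeautifulSoup(html, "html.parser")

as_text = str(soup)                                # Unicode str
as_bytes = soup.encode("utf-8")                    # UTF-8 encoded bytes
pretty_bytes = soup.prettify(encoding="utf-8")     # indented output, as bytes

# Decoding the bytes gives back the same text
round_trip = as_bytes.decode("utf-8")

print(as_text)
print(as_bytes)
```

When writing to disk, one option is to open the file in text mode with encoding="utf-8" and write str(soup), which avoids handling bytes by hand.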

Tips and tricks

Traversing the HTML tree

Use .contents or .children to reach a node's direct children, and .descendants to walk everything nested beneath it:

# Direct children of the soup object, as a list
print(soup.contents)

# Everything nested within the soup object, recursively
for child in soup.descendants:
    print(child) # Who knew soup could have offspring?
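
The difference between direct children and all nested nodes is easier to see on a small tree. The sketch below uses a made-up list; the markup and variable names are assumptions for illustration:

```python
from bs4 import BeautifulSoup, Tag

# Hypothetical sample markup, used only for illustration
html = "<ul><li>one</li><li><b>two</b></li></ul>"
soup = BeautifulSoup(html, "html.parser")
ul = soup.ul

# .contents is a list of direct children; .children iterates over the same nodes
direct = [child.name for child in ul.children]

# .descendants also visits grandchildren; text nodes are filtered out here
nested = [node.name for node in ul.descendants if isinstance(node, Tag)]

print(len(ul.contents))
print(direct)
print(nested)
```

Note that .descendants yields text nodes as well as tags, which is why the example filters with isinstance(node, Tag).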

Obtaining attributes

Fetch attributes like id, class, or data-attributes using dictionary-like syntax:

# Access 'id' of the first div
print(soup.find('div')['id']) # Seeking ID for a div in soup? Sure!
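
Square-bracket access raises KeyError when the attribute is absent, so a more defensive variant may be useful. This is a sketch on made-up markup; the attribute names and values are assumptions for illustration:

```python
from bs4 import BeautifulSoup

# Hypothetical sample markup, used only for illustration
html = '<div id="main" class="box wide" data-role="page"></div><div>no id</div>'
soup = BeautifulSoup(html, "html.parser")
first, second = soup.find_all("div")

div_id = first["id"]           # plain string
classes = first["class"]       # class is multi-valued, so this is a list
role = first["data-role"]      # data-* attributes work the same way
no_id = second.get("id")       # .get() returns None instead of raising KeyError
all_attrs = dict(first.attrs)  # every attribute as a dictionary

print(div_id, classes, role, no_id)
```

Using .get() with an optional default keeps a scraper from crashing on pages where an attribute is only sometimes present.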

Handling incomplete tags

Real-world HTML often has unclosed or missing tags. BeautifulSoup relies on its underlying parser to repair these where it can:

# Parsing incomplete HTML
soup = BeautifulSoup("<div><p>Missing closing tags", 'html.parser')
print(soup.prettify()) # Patchwork performed! The missing closing tags are added.
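
To check what the repair actually produces, the sketch below compares the broken input with the serialized result. It assumes the built-in 'html.parser'; other parsers such as lxml or html5lib may repair the same input differently, for example by adding html and body wrappers:

```python
from bs4 import BeautifulSoup

broken = "<div><p>Missing closing tags"
soup = BeautifulSoup(broken, "html.parser")

# Serializing the tree writes out the closing tags the input lacked
repaired = str(soup)

print(repaired)
```

If consistent output across machines matters, it helps to name the parser explicitly, as done here, rather than rely on whichever one happens to be installed.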

Handling comments and scripts

Remove or access comments and script tags:

from bs4 import Comment

# Find and remove all comments
comments = soup.find_all(string=lambda text: isinstance(text, Comment))
for comment in comments:
    comment.extract() # Off with their heads!

# Find all script tags, then remove them from the tree
script_tags = soup.find_all('script')
for script in script_tags:
    script.decompose() # Sparks joy? No? Script tags begone!
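
Put together, a cleanup pass might look like the sketch below. The markup is made up for illustration, and extract() and decompose() are used here as two of the available ways to remove nodes:

```python
from bs4 import BeautifulSoup, Comment

# Hypothetical sample markup, used only for illustration
html = "<div><!-- tracking note --><p>Keep me</p><script>alert('x');</script></div>"
soup = BeautifulSoup(html, "html.parser")

# Remove comment nodes; extract() detaches a node and returns it
for comment in soup.find_all(string=lambda s: isinstance(s, Comment)):
    comment.extract()

# Remove script tags; decompose() detaches a tag and destroys it
for script in soup.find_all("script"):
    script.decompose()

cleaned = str(soup)
print(cleaned)
```

Use extract() when the removed node is still needed afterwards, and decompose() when it can simply be discarded.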