Extracting text from an HTML file using Python

by Alex Kataev · Sep 26, 2024

Here's the quick way of extracting text from HTML using Python with the BeautifulSoup library:

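from bs4 import BeautifulSoup

html_content = "<h1>Title</h1><p>Some text.</p>"  # sample input; use your own HTML string

soup = BeautifulSoup(html_content, 'html.parser')
# separator=" " keeps text from adjacent tags from running together
text = soup.get_text(separator=" ", strip=True)

print(text) # Voila! Magic soup serving hot text!

Just use pip install beautifulsoup4 to get the BeautifulSoup magic happening. The get_text() method drops the HTML tags, and strip=True trims the whitespace around each piece of text. The stripped pieces would otherwise be joined with nothing between them, so separator=" " is passed to stop words from neighbouring tags fusing together, leaving you with clean, readable text.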
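
The quick version works on a string that is already in memory. When the HTML lives in a file on disk, a small helper can read it first. The sketch below is one way to do that; the helper name extract_text_from_file is hypothetical, and UTF-8 is an assumption about how your file is encoded.

```python
from pathlib import Path

from bs4 import BeautifulSoup


def extract_text_from_file(path, encoding="utf-8"):
    """Read an HTML file and return its text content (hypothetical helper)."""
    # Assumption: the file is encoded as UTF-8 unless told otherwise.
    html_content = Path(path).read_text(encoding=encoding)
    soup = BeautifulSoup(html_content, "html.parser")
    # separator=" " keeps words from adjacent tags apart.
    return soup.get_text(separator=" ", strip=True)
```

Calling extract_text_from_file("page.html"), with page.html standing in for your own file, returns the page's text as a single string.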

Stripping down script and style content

Sometimes your HTML can have some unwanted layers, like <script> or <style> tags, hiding the core information you're interested in. Remember, Python is like your personal HTML stylist, helping you get rid of unwanted scripts and style tags:

for script_or_style in soup(["script", "style"]):
    script_or_style.decompose()  # "You're not my style", says Python

clean_text = soup.get_text(separator=" ", strip=True)  # separator keeps words from adjacent tags apart

Your text will emerge, free from encumbering scripts and styles - a fashion statement in text extraction!
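
Putting the pieces together, here is a self-contained sketch of the cleanup on a small made-up page. The sample markup and the visible_text helper name are assumptions for illustration. Depending on your BeautifulSoup version and parser, get_text() may already skip script and style contents; removing the tags explicitly keeps the result the same either way.

```python
from bs4 import BeautifulSoup

# Hypothetical sample page; substitute your own HTML string.
html_content = """
<html>
  <head>
    <style>p { color: red; }</style>
    <script>console.log("tracking");</script>
  </head>
  <body>
    <h1>Heading</h1>
    <p>Visible paragraph.</p>
  </body>
</html>
"""


def visible_text(html):
    """Return the text of an HTML string with <script> and <style> removed."""
    soup = BeautifulSoup(html, "html.parser")
    # Calling soup(...) is shorthand for soup.find_all(...).
    for script_or_style in soup(["script", "style"]):
        script_or_style.decompose()  # remove the tag and everything inside it
    return soup.get_text(separator=" ", strip=True)


clean_text = visible_text(html_content)
print(clean_text)
```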

Tackling intricate HTML

While HTML can occasionally show its more complicated side, with HTML entities and character encodings popping up, BeautifulSoup interprets these correctly without any post-cleanup. For instance, &amp;amp; is automatically converted into &amp;.
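
A quick way to convince yourself of the entity handling is to feed the parser a string with a few named entities. The sample string below is made up for illustration.

```python
from bs4 import BeautifulSoup

# Hypothetical sample containing three named entities.
html_content = "<p>Fish &amp; chips, caf&eacute; &copy; 2024</p>"

text = BeautifulSoup(html_content, "html.parser").get_text()

print(text)  # entities come back as the characters they stand for
```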

But for rough and tough HTML structure, you need a sturdy tool like html2text:

import html2text

text_maker = html2text.HTML2Text()
text_maker.ignore_links = True
text = text_maker.handle(html_content)

print(text) # tough HTML? html2text is tougher!

Robust extraction with html2text

In the war against tangled HTML, html2text is your trusty sidekick. It converts HTML into Markdown-formatted plain text, preserving structure such as line breaks, paragraphs, headings, and lists, so the result reads much like text copy-pasted from a browser. Install it with pip install html2text, and remember to check the repository on GitHub for the freshest updates.

Post-extraction refinement

Have the urge to further refine your text and trim the fat? This is where the power of regular expressions comes in handy:

import re

# Example: Removing pesky URLs from the extracted text
cleaner_text = re.sub(r'http\S+', '', clean_text)

print(cleaner_text) # URLs? Not on my watch, says Python!

Remember, regular expressions are not an all-in-one solution for parsing HTML but can be a powerful tool for a bit of post-workout shaping!
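
As a sketch of what that shaping might look like in practice, the helper below chains a few substitutions: it drops URLs with the same pattern as above, then tidies the whitespace left behind. The function name tidy and the sample string are assumptions for illustration.

```python
import re


def tidy(text):
    """Remove URLs and normalise whitespace in already-extracted text."""
    text = re.sub(r'http\S+', '', text)      # drop anything that looks like a URL
    text = re.sub(r'[ \t]+', ' ', text)      # collapse runs of spaces and tabs
    text = re.sub(r' ?\n ?', '\n', text)     # trim spaces around line breaks
    text = re.sub(r'\n{3,}', '\n\n', text)   # allow at most one blank line
    return text.strip()


sample = "Docs: https://example.com/a?b=1  here.\n\n\n\nNext   line. "
print(tidy(sample))
```

Note that http\S+ swallows everything up to the next whitespace, so punctuation glued to the end of a URL disappears with it.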

Custom extraction with _DeHTMLParser

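If you're in need of a specialized workout routine for your text extraction efforts, a custom parser class might be your personal trainer. _DeHTMLParser is not a package you install with pip; the name refers to a small hand-written subclass of the standard library's html.parser.HTMLParser, usually wrapped by a dehtml() helper function. Because you own the code, you can customize the extraction according to your specific needs. Assuming you have saved such a class and helper in a local module named _DeHTMLParser.py, you call the helper directly:

# Assumes a local module _DeHTMLParser.py that defines the parser class and a dehtml() helper
from _DeHTMLParser import dehtml

extracted_text = dehtml(html_content)  # dehtml() takes an HTML string and returns text

print(extracted_text) # "I don't break a sweat", says _DeHTMLParser!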
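
For reference, here is a minimal sketch of what such a parser could look like, built only on the standard library's html.parser.HTMLParser. The tag sets, the text() method, and the dehtml() helper are illustrative assumptions, not a published API; adjust them to taste.

```python
from html.parser import HTMLParser


class _DeHTMLParser(HTMLParser):
    """Collect text, skipping script/style and breaking lines at block tags."""

    _SKIP = {"script", "style"}
    _BREAK = {"p", "br", "div", "li", "tr", "h1", "h2", "h3", "h4", "h5", "h6"}

    def __init__(self):
        super().__init__()  # convert_charrefs defaults to True, so entities are decoded
        self._parts = []
        self._skip_depth = 0

    def handle_starttag(self, tag, attrs):
        if tag in self._SKIP:
            self._skip_depth += 1
        elif tag in self._BREAK:
            self._parts.append("\n")

    def handle_endtag(self, tag):
        if tag in self._SKIP and self._skip_depth:
            self._skip_depth -= 1
        elif tag in self._BREAK:
            self._parts.append("\n")

    def handle_data(self, data):
        # Ignore anything found inside <script> or <style>.
        if not self._skip_depth:
            self._parts.append(data)

    def text(self):
        # Normalise spaces within each line and drop empty lines.
        lines = (" ".join(line.split()) for line in "".join(self._parts).splitlines())
        return "\n".join(line for line in lines if line)


def dehtml(html_content):
    """Return the text content of an HTML string."""
    parser = _DeHTMLParser()
    parser.feed(html_content)
    parser.close()  # flush any buffered text
    return parser.text()


print(dehtml("<h1>Title</h1><script>var x = 1;</script><p>Fish &amp; chips<br>to go</p>"))
```

The appeal of this route is that it has no third-party dependencies; the trade-off is that you decide (and maintain) every rule about which tags break lines or get skipped.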