How do you convert HTML to plain text?

By Alex Kataev · Jan 11, 2025
TLDR

In JavaScript, you use a DOM element's textContent property to transform HTML to text:

    var html = "<p>Turn <em>this</em> HTML to Text!</p>";
    var temp = document.createElement("div");
    temp.innerHTML = html;
    var text = temp.textContent || temp.innerText || "";

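The textContent property returns the text of the element and all its descendants with the tags left out, delivering the pure text: "Turn this HTML to Text!". innerText is only a fallback for very old browsers where textContent is not available. Use this trick with trusted HTML only: assigning untrusted markup to innerHTML can still fire event handlers such as an img onerror, even on a detached element.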

HtmlAgilityPack: A robust .NET solution

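To convert HTML into plain text in .NET, consider HtmlAgilityPack, an HTML parsing library. The ConvertToPlainText method from HtmlUtilities.cs handles the conversion, keeping the text inside inline tags like <b> and <i> while dropping the markup itself. HtmlUtilities.cs appears to ship as sample code with the project rather than in the core library, so check what your version includes. The library is free under the MIT license, and you can call it the Swiss Army Knife of HTML parsing.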

Regular expressions: Basic but beware

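Using Regex to strip HTML tags and handle line breaks is an alternative. But beware: edge cases such as a ">" inside an attribute value, HTML comments, or the contents of <script> and <style> blocks can break your patterns like a bull in a china shop.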

    using System.Text.RegularExpressions;

    // This Tra-la-la removes pesky HTML tags. It hates HTML. Like cats hate water.
    // [^>]* (instead of .*?) also matches tags that span several lines.
    var plainText = Regex.Replace(htmlContent, "<[^>]*>", String.Empty);
    // And this Ignac converts "&nbsp;" to simple spaces. It doesn't like complicated stuff.
    plainText = Regex.Replace(plainText, "&nbsp;", " ");

Regex is quick for simple tasks. But like a cheap perfume, it's not your best choice for an evening event.
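To see where a tag-stripping pattern holds up and where it falls over, here is a minimal sketch of the same idea in JavaScript, matching the TLDR snippet. The name stripTagsNaive is made up for this illustration and does not come from any library.

```javascript
// Hypothetical helper for illustration only: a naive regex-based tag stripper.
function stripTagsNaive(html) {
  return html
    .replace(/<[^>]*>/g, "")   // drop anything that looks like a tag
    .replace(/&nbsp;/g, " ");  // turn non-breaking-space entities into plain spaces
}

// Simple markup comes out clean.
const simple = stripTagsNaive("<p>Turn <em>this</em> HTML&nbsp;to Text!</p>");

// Edge case: a ">" inside an attribute value ends the match too early,
// so part of the attribute leaks into the output.
const broken = stripTagsNaive('<a title="a > b">link</a>');

console.log(JSON.stringify(simple)); // "Turn this HTML to Text!"
console.log(JSON.stringify(broken)); // " b\">link"
```

The second call is the bull in the china shop: the pattern has no idea that the ">" sits inside a quoted attribute, which is exactly the kind of input a real parser handles for you.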

Surviving in the wild: Handle user-generated HTML

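With user-supplied HTML, security comes first. A real parser like HtmlAgilityPack helps here, akin to a lifejacket in a stormy sea: it reads malformed or hostile markup as data instead of relying on patterns that crafted input can slip past. Parsing alone does not stop XSS attacks, though. If the extracted text is written back into a page, encode it again (for example with WebUtility.HtmlEncode), because decoding turns "&lt;script&gt;" back into a literal "<script>". For data integrity, decode entities with methods like WebUtility.HtmlDecode from the System.Net namespace.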
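One step that is easy to forget is encoding the text again before it goes back into a page. The sketch below shows that step in JavaScript; escapeHtml is a hypothetical helper written for this example, and in .NET the counterpart would be WebUtility.HtmlEncode.

```javascript
// Hypothetical helper: encode the characters that are special in HTML text and attributes.
function escapeHtml(text) {
  return text
    .replace(/&/g, "&amp;")   // must run first so the entities added below are not re-encoded
    .replace(/</g, "&lt;")
    .replace(/>/g, "&gt;")
    .replace(/"/g, "&quot;")
    .replace(/'/g, "&#39;");
}

// Plain text that came out of a decode step and now looks like live markup.
const decoded = '<script>alert("hi")</script>';
const safe = escapeHtml(decoded);

console.log(safe);
// &lt;script&gt;alert(&quot;hi&quot;)&lt;/script&gt;
```

Treat the converted text as untrusted until the moment it is rendered, and encode at that point rather than somewhere upstream.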

Testing needed: Handle different HTML inputs

Battle-test your functions with a wide array of HTML inputs (nested and unclosed tags, comments, script and style blocks, entities, empty strings) to ensure they stand firm against all forms of HTML weather. Clear code comments guide other developers in the right usage of the conversion functions.
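A table of inputs and expected outputs is a cheap way to do this. The sketch below is one possible shape in JavaScript; htmlToText is a stand-in regex converter included only so the example runs on its own, and the cases are illustrative rather than exhaustive.

```javascript
// Stand-in converter (hypothetical, regex-based) so the sketch is self-contained.
function htmlToText(html) {
  return html
    .replace(/<(script|style)\b[^>]*>[\s\S]*?<\/\1>/gi, "") // drop script/style blocks with their content
    .replace(/<[^>]*>/g, "")                                 // drop the remaining tags
    .replace(/\s+/g, " ")                                    // collapse whitespace
    .trim();
}

// Each case names the kind of HTML weather it covers.
const cases = [
  { name: "plain text",   input: "hello",                           expected: "hello" },
  { name: "nested tags",  input: "<div><p>a <b>b</b></p></div>",    expected: "a b" },
  { name: "unclosed tag", input: "<p>open",                         expected: "open" },
  { name: "script block", input: "<script>var x = 1;</script>text", expected: "text" },
  { name: "empty input",  input: "",                                expected: "" },
];

// Collect the names of the cases whose output differs from the expectation.
const failures = cases
  .filter((c) => htmlToText(c.input) !== c.expected)
  .map((c) => c.name);

console.log(failures.length === 0 ? "all cases pass" : "failing: " + failures.join(", "));
```

Whenever a real-world input breaks the converter, add it to the table so the fix stays fixed.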

Performance: Choose wisely between HtmlAgilityPack and Regex

The decision between HtmlAgilityPack and Regex is a question of robustness vs speed. Regex can be the faster option if you're dealing with basic, predictable HTML and performance is crucial, though it is worth measuring on your own inputs before committing. For complex or precision-focused tasks, stick with HtmlAgilityPack.

Advanced strategies for HTML to Text conversion

Let's decode what's encoded

HTML is notorious for sprinkling entities such as &amp; and &nbsp; through the text, so you need to decode them back to the original characters. Here WebUtility.HtmlDecode (in the System.Net namespace) is your friend:

    using System.Net;

    // presto-zap-change-o! End the tyranny of HTML entities.
    var decodedText = WebUtility.HtmlDecode(plainText);
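For a feel of what a decoder does under the hood, here is a rough JavaScript sketch. decodeEntities and the tiny NAMED table are assumptions made for this example; a production decoder knows the full list of named entities, so prefer the library call when one is available.

```javascript
// Tiny lookup of named entities; the real HTML list is far longer.
const NAMED = { amp: "&", lt: "<", gt: ">", quot: '"', apos: "'", nbsp: "\u00A0" };

// Hypothetical helper: decode named, decimal, and hexadecimal character references.
function decodeEntities(text) {
  return text.replace(/&(#[xX][0-9a-fA-F]+|#[0-9]+|[a-zA-Z][a-zA-Z0-9]*);/g, (match, body) => {
    if (body[0] === "#") {
      const isHex = body[1] === "x" || body[1] === "X";
      const code = parseInt(body.slice(isHex ? 2 : 1), isHex ? 16 : 10);
      // Leave out-of-range references untouched instead of throwing.
      return code <= 0x10ffff ? String.fromCodePoint(code) : match;
    }
    // Unknown names are kept as written.
    return Object.prototype.hasOwnProperty.call(NAMED, body) ? NAMED[body] : match;
  });
}

const decodedSample = decodeEntities("Fish &amp; chips &lt;3 &#8364;5 &#x41; &unknown;");
console.log(decodedSample); // Fish & chips <3 €5 A &unknown;
```

Because the replacement runs in a single pass, "&amp;lt;" decodes once to "&lt;" and not all the way to "<", which is the behavior you want from a decoder.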

Get every bit of the text nodes

For lists and tables, a single sweep might not do. You need to traverse the document tree and aggregate the text nodes in order, so the reading order survives. Like picking up every bit from a lunch buffet, HtmlAgilityPack lets you walk every node.
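The traversal itself is a depth-first walk. The JavaScript sketch below runs it over a made-up node shape (a string is a text node, an object is an element with tag and children); this is not HtmlAgilityPack's object model, just the smallest structure that shows the idea.

```javascript
// Hypothetical node shape for illustration: a string is a text node,
// an object is an element { tag, children }. A real parser has its own types.
const BLOCK_TAGS = new Set(["p", "div", "li", "tr", "br"]);

// Depth-first walk that appends text nodes in document order.
function collectText(node, parts = []) {
  if (typeof node === "string") {
    parts.push(node);
    return parts;
  }
  for (const child of node.children || []) {
    collectText(child, parts);
  }
  if (node.tag === "td" || node.tag === "th") parts.push("\t"); // separate table cells
  if (BLOCK_TAGS.has(node.tag)) parts.push("\n");                // end the line after a block element
  return parts;
}

const tree = {
  tag: "div",
  children: [
    { tag: "ul", children: [
      { tag: "li", children: ["First"] },
      { tag: "li", children: ["Second ", { tag: "b", children: ["bold"] }] },
    ] },
    { tag: "table", children: [
      { tag: "tr", children: [
        { tag: "td", children: ["A1"] },
        { tag: "td", children: ["B1"] },
      ] },
    ] },
  ],
};

const listAndTableText = collectText(tree).join("").trim();
console.log(JSON.stringify(listAndTableText)); // "First\nSecond bold\nA1\tB1"
```

Each list item and table row lands on its own line, and the inline <b> stays inside its sentence. Rows in the middle of a table keep a trailing tab in this sketch, which a later whitespace pass can tidy up.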

Normalize for better readability

After conversion, you could end up with more whitespace or line breaks than you want. Trimming unnecessary spaces and collapsing runs of whitespace makes the text easier to read. Consider it as tidying up after a wild party:

    // "You can't handle the whitespace!" - Trim
    plainText = Regex.Replace(plainText, @"\s+", " ").Trim();
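Collapsing every run of whitespace to one space also flattens paragraphs into a single line. If you want to keep paragraph breaks, a gentler variant might look like the JavaScript sketch below; normalizeWhitespace is a hypothetical helper, and the "at most one blank line" rule is just one reasonable choice.

```javascript
// Hypothetical helper: tidy whitespace but keep paragraph breaks.
function normalizeWhitespace(text) {
  return text
    .replace(/\r\n?/g, "\n")        // unify line endings
    .replace(/[ \t\u00A0]+/g, " ")  // collapse runs of spaces, tabs, and non-breaking spaces
    .replace(/ ?\n ?/g, "\n")       // strip spaces around line breaks
    .replace(/\n{3,}/g, "\n\n")     // allow at most one blank line in a row
    .trim();
}

const messy = "  Title \r\n\r\n\r\n\r\nFirst   line\t here \n second\u00A0line  ";
const tidy = normalizeWhitespace(messy);

console.log(JSON.stringify(tidy)); // "Title\n\nFirst line here\nsecond line"
```

Which variant fits depends on where the text ends up: a single line suits search indexes and previews, while preserved breaks read better in emails and logs.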