
How can I strip HTML tags from a string in ASP.NET?

By Alex Kataev · Jan 17, 2025
TLDR

For simple, trusted markup, stripping HTML tags can be as easy as this Regex.Replace magic trick (from System.Text.RegularExpressions) in ASP.NET:

string cleanText = Regex.Replace(dirtyHtml, "<[^>]+>", string.Empty);

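The charm <[^>]+> matches a "<", then one or more characters that aren't ">", then the closing ">". Every match is swapped for string.Empty, which banishes the tags and leaves you with the text between them.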

Limitations of regex in HTML tag stripping

Now, interrupting our regularly scheduled programming: regex is great, but it has its moments of tripping over shoelaces. It leaves HTML character entities such as &amp; untouched, and it keeps the text inside script and style elements, because it only removes the tags around them. And when it sees ">" inside an attribute value or a comment, it's like a deer in headlights: the match ends early and the rest of the tag leaks into your text. So, for reliable stripping of user-supplied HTML, consider calling in the big guns, aka a library with parsing power such as HtmlAgilityPack.
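To see those stumbles in action, here is a minimal sketch that wraps the TLDR pattern in a helper. The class name RegexStripDemo and the method name StripTags are made up for this illustration; only Regex.Replace and the pattern come from the snippet above.

```csharp
using System.Text.RegularExpressions;

// Hypothetical demo class, not part of any library.
public static class RegexStripDemo
{
    // Same pattern as the TLDR: "<", one or more non-">" characters, then ">".
    public static string StripTags(string html) =>
        Regex.Replace(html, "<[^>]+>", string.Empty);
}
```

With this helper, StripTags("<p>Hello <b>world</b></p>") returns "Hello world", which is the happy path. StripTags("<a title=\"a > b\">link</a>") returns " b\">link", because the match stops at the first ">" inside the attribute. StripTags("<script>alert(1)</script>Hi") returns "alert(1)Hi", and StripTags("<p>Fish &amp; Chips</p>") returns "Fish &amp; Chips" with the entity still encoded.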

Using HtmlAgilityPack for robust HTML parsing

HtmlAgilityPack, a .NET library, is for HTML what a Swiss Army knife is for... well, everything. It makes removing HTML tags feel like snapping fingers. To use it:

  1. Whisper "HtmlAgilityPack" to NuGet (for example, dotnet add package HtmlAgilityPack); it's basically the candy store for .NET goodies.
  2. Load your HTML into an HtmlDocument object as if it's the guest of honor.
  3. Use the InnerText property to retrieve text without tags; no autographs, please.

Here's how to enchant with code:

var htmlDoc = new HtmlAgilityPack.HtmlDocument(); // Behold, the magical parchment
htmlDoc.LoadHtml(dirtyHtml); // Take that, dirty HTML!
string cleanText = htmlDoc.DocumentNode.InnerText; // And voila! Clean as a newborn unicorn
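One caveat worth sketching: InnerText concatenates the text of every node, so script and style bodies (and, depending on the library version, comments) may still show up in the result. The sketch below removes those nodes first. It assumes the HtmlAgilityPack NuGet package is installed; the class name HapStripDemo and the method name StripTags are hypothetical.

```csharp
using System.Linq;
using HtmlAgilityPack; // NuGet package: HtmlAgilityPack

// Hypothetical demo class, not part of HtmlAgilityPack.
public static class HapStripDemo
{
    public static string StripTags(string html)
    {
        var doc = new HtmlDocument();
        doc.LoadHtml(html);

        // Descendants() is lazy, so materialize the list before mutating the tree.
        var unwanted = doc.DocumentNode
            .Descendants()
            .Where(n => n.NodeType == HtmlNodeType.Comment
                     || n.Name == "script"
                     || n.Name == "style")
            .ToList();

        foreach (var node in unwanted)
        {
            node.Remove(); // detach the node (and its children) from the document
        }

        // Text of whatever is left; entities are still encoded at this point.
        return doc.DocumentNode.InnerText;
    }
}
```

Whether you also want to drop other elements (noscript, template, hidden content) is a judgment call that depends on where the HTML comes from.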

Converting the ancient HTML entities

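Tags or not, the stripped text still carries HTML character entities like &amp;. To convert them back to recognizable characters, use HttpUtility.HtmlDecode (System.Web) or, in projects that don't reference System.Web, WebUtility.HtmlDecode (System.Net):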

cleanText = HttpUtility.HtmlDecode(cleanText); // Bibbidi-bobbidi-boo!
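As a small illustration of the decoding step, the sketch below uses WebUtility.HtmlDecode from System.Net, swapped in here so the example runs without a System.Web reference. The class name DecodeDemo and the method name Decode are hypothetical.

```csharp
using System.Net;

// Hypothetical demo class; WebUtility.HtmlDecode is the real framework call.
public static class DecodeDemo
{
    // Turns entities such as &amp; and &lt; back into the characters they stand for.
    public static string Decode(string strippedText) =>
        WebUtility.HtmlDecode(strippedText);
}
```

Decode("Fish &amp; Chips &lt;3") returns "Fish & Chips <3". Note the order: decode after stripping, not before. Decoding can bring back "<" and ">" characters, so the decoded result is plain text and should be encoded again if it is ever written into a page.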

Things to consider when using regex and HtmlAgilityPack

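  • Performance: It's not a simple snail vs cheetah scenario. A single regex pass is usually cheap, while a parser builds a whole document tree; on the other hand, patterns that backtrack heavily can crawl on large strings. Measure with your own inputs before declaring a winner.
  • Safety first: Regex and XSS can be an explosive combination. Tag stripping is not sanitization: a slip-up in the pattern, or a tag with no closing ">", can let markup through, so HTML-encode the result before writing it back into a page.
  • Keep it clean: A growing pile of regex special cases is like a dirty room. HtmlAgilityPack is your cleaning service; it keeps your code more readable and maintainable.
  • The fantastic four: Scripts, comments, CDATA sections and edge cases can make you trip in regex. HtmlAgilityPack parses them into nodes, which you can remove explicitly before reading InnerText.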
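The safety point can be sketched with one awkward input: a tag that never closes. The TLDR pattern needs a ">" to match, so it leaves such a fragment untouched, and encoding on output is what keeps it harmless. The class name OutputEncodingDemo and both method names are hypothetical; Regex.Replace and WebUtility.HtmlEncode are the real framework calls.

```csharp
using System.Net;
using System.Text.RegularExpressions;

// Hypothetical demo class, not part of any library.
public static class OutputEncodingDemo
{
    // The TLDR pattern: only matches tags that have a closing ">".
    public static string StripTags(string html) =>
        Regex.Replace(html, "<[^>]+>", string.Empty);

    // Encode text right before it is written into an HTML page.
    public static string ToSafeHtml(string text) =>
        WebUtility.HtmlEncode(text);
}
```

StripTags("<img src=x onerror=alert(1) ") returns the input unchanged, since there is no ">" to complete a match. ToSafeHtml on that same string returns "&lt;img src=x onerror=alert(1) ", which a browser renders as text instead of parsing as a tag.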

HtmlAgilityPack vs regex

While regex can be quick and dirty, parsers like HtmlAgilityPack offer reliability and maintainability:

  • Sleek handling of complex HTML structures.
  • More predictable behavior with sizable or complex HTML inputs.
  • No false alarms when a ">" shows up inside a tag's attribute values.

A checklist for HTML tag stripping

  1. Sanitize: Always clean the veggies (user input) before cooking, and treat the stripped text as untrusted until it is encoded for output.
  2. Normalize: After stripping, don't forget to collapse runs of whitespace and trim the text.
  3. Decode: Replace HTML character entities with their original characters.
  4. Test: Because who knows what might break your function! Test your method with various pitfall strings to ensure its survival.
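The decode and normalize steps of the checklist can be sketched as one small helper that runs on text whose tags are already gone. The class name TextCleanup and the method name DecodeAndNormalize are made up for this example, and WebUtility.HtmlDecode stands in for HttpUtility.HtmlDecode so that the sketch has no System.Web dependency.

```csharp
using System.Net;
using System.Text.RegularExpressions;

// Hypothetical helper class, not part of any library.
public static class TextCleanup
{
    // Expects text that has already been stripped of tags.
    public static string DecodeAndNormalize(string strippedText)
    {
        // Decode: entities back to their original characters.
        string decoded = WebUtility.HtmlDecode(strippedText);

        // Normalize: collapse every run of whitespace to one space, then trim.
        return Regex.Replace(decoded, @"\s+", " ").Trim();
    }
}
```

DecodeAndNormalize("  Fish &amp;\n   Chips  ") returns "Fish & Chips". For the test step, strings like the ones discussed earlier (a ">" inside an attribute, a script block, an unclosed tag, stray entities) make a reasonable starting set of pitfalls.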