
Remove HTML tags from string including &nbsp; in C#

by Anton Shumikhin · Jan 2, 2025
TLDR

Strip HTML tags and &nbsp; from a string in C# using Regex.Replace:

using System.Text.RegularExpressions;

// Here's the magical charm to expel those nasty HTML tags!
// It deletes every tag-like span and every literal &nbsp;, then trims the ends.
string CleanHtml(string html) => Regex.Replace(html, "<[^>]+>|&nbsp;", "").Trim();

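Call CleanHtml(yourHtmlString) to make the tags and the non-breaking spaces vanish. Note that each &nbsp; is deleted rather than turned into a regular space, so words it separated end up joined.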
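To see what the one-liner does with a concrete input, here is a small usage sketch. The sample string and the CleanHtmlKeepSpaces variant are illustrative assumptions, not part of the original snippet; the variant swaps each &nbsp; for a regular space instead of deleting it.

```csharp
using System;
using System.Text.RegularExpressions;

// The TLDR one-liner: delete tag-like spans and literal &nbsp; entities.
string CleanHtml(string html) =>
    Regex.Replace(html, "<[^>]+>|&nbsp;", "").Trim();

// Hypothetical variant: strip tags, then turn each &nbsp; into a regular space.
string CleanHtmlKeepSpaces(string html) =>
    Regex.Replace(html, "<[^>]+>", "").Replace("&nbsp;", " ").Trim();

string sample = "<p>Hello,&nbsp;<b>world</b>!</p>";

Console.WriteLine(CleanHtml(sample));           // Hello,world!
Console.WriteLine(CleanHtmlKeepSpaces(sample)); // Hello, world!
```

Which behavior you want depends on whether the &nbsp; entities in your input separate words or merely pad the layout.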

Handling the edge cases

After a vanilla HTML cleanup, edge cases might still lurk around. Let's bring them to light and take care of them one by one.

Normalizing the white spaces

Stripped all the HTML, but the result is riddled with irregular whitespace? Just focus and utter another spell:

string NormalizeWhitespace(string text) => Regex.Replace(text, @"\s{2,}", " ");

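This replaces every run of two or more whitespace characters (spaces, tabs, line breaks) with a single space; a lone tab or line break is left untouched. It's like Hermione's spell for neatly organizing books!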
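The difference between collapsing runs of two or more and collapsing every run is easy to miss, so here is a short comparison. NormalizeAllWhitespace is a hypothetical stricter variant introduced only for this sketch; the sample string is made up.

```csharp
using System;
using System.Text.RegularExpressions;

// From the article: collapse runs of two or more whitespace characters.
string NormalizeWhitespace(string text) => Regex.Replace(text, @"\s{2,}", " ");

// Hypothetical stricter variant: collapse every run (even a single tab or
// line break) into one space, then trim the ends.
string NormalizeAllWhitespace(string text) =>
    Regex.Replace(text, @"\s+", " ").Trim();

string messy = "Hello \t\n  world\tagain ";

// The lone tab and the trailing space survive the {2,} version.
Console.WriteLine(NormalizeWhitespace(messy) == "Hello world\tagain ");  // True
Console.WriteLine(NormalizeAllWhitespace(messy) == "Hello world again"); // True
```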

Decoding all entities

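Why deal only with &nbsp; when we can decode all entities for a true clean sweep? HttpUtility.HtmlDecode (in System.Web) or WebUtility.HtmlDecode (in System.Net) turns every entity into its character, so we miss nothing. Two things to watch: &nbsp; decodes to the non-breaking space character U+00A0 rather than a regular space, and if you decode before stripping tags, escaped text such as &lt;b&gt; turns into something that looks like a tag and gets stripped along with the real ones.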
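A minimal sketch of entity decoding, assuming WebUtility.HtmlDecode from System.Net stands in for HttpUtility.HtmlDecode so that no reference to System.Web is needed. StripTagsAndDecode is a hypothetical helper. It strips tags first and decodes second so that escaped text like &lt;3 survives; that ordering is a choice made for this sketch, and decoding first behaves differently for escaped angle brackets.

```csharp
using System;
using System.Net;
using System.Text.RegularExpressions;

// Hypothetical helper: strip tags first, then decode whatever entities remain.
string StripTagsAndDecode(string html)
{
    string withoutTags = Regex.Replace(html, "<[^>]+>", "");
    string decoded = WebUtility.HtmlDecode(withoutTags);
    // &nbsp; decodes to U+00A0; swap it for a regular space.
    return decoded.Replace('\u00A0', ' ').Trim();
}

Console.WriteLine(StripTagsAndDecode("<p>Fish&nbsp;&amp;&nbsp;chips &lt;3</p>"));
// Fish & chips <3
```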

Handling script and style tags

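There is always the danger of <script> and <style> tags ruining your textual feast: the tag-stripping regex removes the tags themselves but leaves the JavaScript and CSS between them in your output. Remove these elements explicitly, contents included, before stripping the remaining tags to ensure a trouble-free dining experience.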
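One way to do this, sketched under the assumption that script and style elements are not nested and are properly closed. CleanHtmlWithoutScripts and its pattern are hypothetical, not taken from the original article.

```csharp
using System;
using System.Text.RegularExpressions;

// Hypothetical helper: drop whole <script>/<style> elements (content included),
// then strip the remaining tags and &nbsp; entities.
string CleanHtmlWithoutScripts(string html)
{
    string noBlocks = Regex.Replace(
        html,
        @"<(script|style)\b[^>]*>.*?</\1\s*>",
        "",
        // Singleline lets '.' cross line breaks inside the element body.
        RegexOptions.IgnoreCase | RegexOptions.Singleline);
    return Regex.Replace(noBlocks, "<[^>]+>|&nbsp;", "").Trim();
}

string page = "<style>p { color: red; }</style><p>Hi</p><script>alert('x');</script>";
Console.WriteLine(CleanHtmlWithoutScripts(page)); // Hi
```

The lazy `.*?` stops at the first matching closing tag, so several script blocks on one page are removed one by one instead of being swallowed together with the text between them.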

The power of StringBuilder

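For large data cleanup, you may need to buckle up the StringBuilder armor. Every chained Replace call allocates a fresh copy of the string, which adds up on big inputs; a single pass that appends the surviving characters to a StringBuilder avoids those intermediate copies. It is like Goliath's sword, slaying strings with ease and efficiency.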
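A single-pass sketch of that idea. StripTagsWithBuilder is a hypothetical helper, and it is not a drop-in replacement for the regex: it treats everything after an unmatched '<' as part of a tag, and it turns &nbsp; into a regular space rather than deleting it.

```csharp
using System;
using System.Text;

// Hypothetical helper: copy only the characters that sit outside <...> spans.
string StripTagsWithBuilder(string html)
{
    var sb = new StringBuilder(html.Length);
    bool insideTag = false;

    foreach (char c in html)
    {
        if (c == '<')
            insideTag = true;           // a tag starts here
        else if (c == '>' && insideTag)
            insideTag = false;          // the tag ends here
        else if (!insideTag)
            sb.Append(c);               // plain text: keep it
    }

    // StringBuilder.Replace edits the buffer in place.
    return sb.Replace("&nbsp;", " ").ToString().Trim();
}

Console.WriteLine(StripTagsWithBuilder("<div>Big&nbsp;<em>data</em></div>")); // Big data
```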

Advancing your HTML cleansing

For those pesky HTML strings that slipped through the initial defenses, let's put on our invisibility cloaks and sneak around them.

Repeating until squeaky clean

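Sometimes, you need to scrub twice to get all the dirt: removing one match can leave behind a new one, for example when an &nbsp; is split by a tag. Keep repeating the process until the string stops changing: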

string iterativeCleanHtml(string html)
{
    string prevHtml;
    // Let's DJ Khaled this - Another one, and another one, till we're done!
    do
    {
        prevHtml = html;
        html = Regex.Replace(html, @"<[^>]+>|&nbsp;", "").Trim();
    } while (html != prevHtml);
    return html;
}
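To show when a second pass actually matters, this sketch feeds the loop an input in which an entity only appears after the tags inside it are gone. The sample string is made up for illustration.

```csharp
using System;
using System.Text.RegularExpressions;

string iterativeCleanHtml(string html)
{
    string prevHtml;
    do
    {
        prevHtml = html;
        html = Regex.Replace(html, @"<[^>]+>|&nbsp;", "").Trim();
    } while (html != prevHtml);   // stop once a pass changes nothing
    return html;
}

// The <i></i> pair splits "&nbsp;" so the first pass cannot see it.
string tricky = "Hello&nb<i></i>sp;<b>world</b>";
string onePass = Regex.Replace(tricky, @"<[^>]+>|&nbsp;", "").Trim();

Console.WriteLine(onePass);                    // Hello&nbsp;world
Console.WriteLine(iterativeCleanHtml(tricky)); // Helloworld
```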

Custom defense spells

Each HTML document has its own peculiarities and extraneous tags. You might need to create your own variety of regex spells.
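As one example of such a custom spell, the sketch below keeps <b> and <i> and removes every other tag by putting a negative lookahead in front of the tag name. The allow-list and the name StripAllExceptBoldItalic are assumptions chosen for illustration.

```csharp
using System;
using System.Text.RegularExpressions;

// Hypothetical custom pattern: remove every tag except <b>, </b>, <i>, </i>.
// (?!/?(?:b|i)\b) refuses to match when the tag name is exactly b or i.
string StripAllExceptBoldItalic(string html) =>
    Regex.Replace(html, @"<(?!/?(?:b|i)\b)[^>]+>", "", RegexOptions.IgnoreCase);

string input =
    "<p>Keep <b>bold</b> and <i>italic</i>, drop <span class=\"x\">this</span>.</p>";

Console.WriteLine(StripAllExceptBoldItalic(input));
// Keep <b>bold</b> and <i>italic</i>, drop this.
```

The `\b` after the name keeps tags such as <br> or <img> from slipping through just because they start with an allowed letter.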

The mystery of edge cases

A regex that handles every HTML input can be as elusive as a golden snitch. Simple patterns fail on inputs such as an attribute value that contains a > character or plain text that uses < and > as comparison operators. The important thing is - Never stop practicing your broomstick skills (testing!).
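A tiny table-driven check makes these failures visible. The cases below are illustrative inputs chosen for this sketch; the second and third are known weak spots of the simple pattern.

```csharp
using System;
using System.Text.RegularExpressions;

string CleanHtml(string html) =>
    Regex.Replace(html, "<[^>]+>|&nbsp;", "").Trim();

// Each pair is (input, the text a reader would expect back).
var cases = new (string Input, string Expected)[]
{
    ("<p>plain</p>", "plain"),                // simple markup
    ("<a title=\"a > b\">link</a>", "link"),  // '>' inside an attribute value
    ("1 < 2 and 3 > 2", "1 < 2 and 3 > 2"),   // bare comparison operators
};

foreach (var (input, expected) in cases)
{
    string actual = CleanHtml(input);
    string verdict = actual == expected ? "ok  " : "FAIL";
    Console.WriteLine($"{verdict} \"{input}\" -> \"{actual}\"");
}
```

With the pattern from the TLDR, only the first case passes: the attribute case leaves the tail of the tag in the text, and the comparison case loses everything between < and >.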