Remove HTML tags from string including &nbsp; in C#
Strip HTML tags and &nbsp; from a string in C# using Regex.Replace, wrapped in a small CleanHtml helper.
Call CleanHtml(yourHtmlString) to make the tags and the non-breaking spaces vanish.
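The article's own listing for CleanHtml is not reproduced here, so what follows is only a minimal sketch of what such a helper might look like. The class name HtmlCleaner and both patterns are assumptions; only the CleanHtml name and the use of Regex.Replace come from the text.

```csharp
using System;
using System.Text.RegularExpressions;

public static class HtmlCleaner
{
    // Hypothetical helper: removes anything shaped like a tag and turns
    // the literal entity "&nbsp;" into an ordinary space.
    public static string CleanHtml(string html)
    {
        if (string.IsNullOrEmpty(html)) return string.Empty;

        // '<', then any run of characters that are not '>', then '>'.
        string withoutTags = Regex.Replace(html, @"<[^>]*>", string.Empty);

        // Only the literal text "&nbsp;" is handled here; other entities are left alone.
        return Regex.Replace(withoutTags, @"&nbsp;", " ");
    }
}
```

Under these assumptions, HtmlCleaner.CleanHtml("<p>Hello&nbsp;<b>world</b></p>") returns "Hello world".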
Handling the edge cases
After the vanilla HTML cleanup, edge cases might still lurk around. Let's bring them to light and take care of them one by one.
Normalizing the white spaces
Stripped all the HTML, but the result is riddled with irregular spaces? Just focus and utter another spell: a second Regex.Replace pass that collapses each run of whitespace into a single space. It's like Hermione's spell for neatly organizing books!
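A minimal sketch of that whitespace spell, assuming the usual \s+ pattern. The WhitespaceNormalizer class and Collapse method are hypothetical names chosen for illustration, and the final Trim is an addition beyond what the text describes.

```csharp
using System;
using System.Text.RegularExpressions;

public static class WhitespaceNormalizer
{
    // Collapses every run of whitespace (spaces, tabs, newlines) into one space.
    public static string Collapse(string text)
    {
        if (string.IsNullOrEmpty(text)) return string.Empty;

        string collapsed = Regex.Replace(text, @"\s+", " ");

        // Assumption: leading and trailing whitespace is unwanted too.
        return collapsed.Trim();
    }
}
```

With this sketch, WhitespaceNormalizer.Collapse("  Hello \t\n  world  ") returns "Hello world".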
Decoding all entities
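Why deal only with &nbsp; when we can decode all entities for a true clean sweep? Using HttpUtility.HtmlDecode, we'll make sure we miss nothing. One caution on ordering: decoding before the tags are stripped turns escaped text such as &lt;b&gt; into real-looking tags, which the tag regex then deletes, so it is usually safer to strip first and decode afterwards.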
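Here is a hedged sketch of the decode step. It uses System.Net.WebUtility.HtmlDecode, which behaves like the HttpUtility.HtmlDecode named above for this purpose but does not need a System.Web reference; swap it back if your project already references System.Web. The sketch strips tags first and decodes second, so that escaped text such as &lt;3 survives as visible text. The EntityDecoder class and StripAndDecode method are hypothetical names.

```csharp
using System;
using System.Net;
using System.Text.RegularExpressions;

public static class EntityDecoder
{
    public static string StripAndDecode(string html)
    {
        if (string.IsNullOrEmpty(html)) return string.Empty;

        // Strip tags while entities are still encoded, so "&lt;" is not mistaken for a tag.
        string withoutTags = Regex.Replace(html, @"<[^>]*>", string.Empty);

        // Decode every entity: "&amp;" becomes "&", "&nbsp;" becomes U+00A0, and so on.
        string decoded = WebUtility.HtmlDecode(withoutTags);

        // A decoded &nbsp; is the non-breaking space U+00A0, not a regular space.
        return decoded.Replace('\u00A0', ' ');
    }
}
```

Under these assumptions, EntityDecoder.StripAndDecode("<p>Fish&nbsp;&amp;&nbsp;chips &lt;3</p>") returns "Fish & chips <3".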
Handling script-style tags
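There is always the danger of <script> and <style> tags ruining your textual feast: a plain tag-stripping regex removes the tags themselves but leaves the JavaScript and CSS between them in the output. Remove these blocks explicitly, contents included, to ensure a trouble-free dining experience.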
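One way to remove those blocks before the general tag stripping runs is sketched below. The ScriptStyleRemover class and the pattern are assumptions, and the pattern only handles well-formed, properly closed blocks.

```csharp
using System;
using System.Text.RegularExpressions;

public static class ScriptStyleRemover
{
    // Matches an opening <script ...> or <style ...> tag, everything up to the
    // nearest matching closing tag, and the closing tag itself.
    // Singleline lets '.' cross line breaks; IgnoreCase covers <SCRIPT> and <Style>.
    private static readonly Regex BlockPattern = new Regex(
        @"<(script|style)\b[^>]*>.*?</\1\s*>",
        RegexOptions.IgnoreCase | RegexOptions.Singleline);

    public static string RemoveScriptAndStyle(string html)
    {
        if (string.IsNullOrEmpty(html)) return string.Empty;
        return BlockPattern.Replace(html, string.Empty);
    }
}
```

Run this first, then strip the remaining tags; otherwise the script and style bodies end up in the cleaned text.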
The power of StringBuilder
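For large data cleanup, you may need to buckle up the StringBuilder armor. Each Regex.Replace pass that finds a match builds a whole new string, while a StringBuilder lets you assemble the result in a single pass over the input. It is like Goliath's sword, slaying strings with ease and efficiency.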
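As an illustration, a single-pass stripper built on StringBuilder might look like the sketch below. This is not a regex at all but a small state machine, and the FastTagStripper name is hypothetical. Whether it actually beats Regex.Replace on your data is something to measure rather than assume.

```csharp
using System;
using System.Text;

public static class FastTagStripper
{
    // Copies every character that sits outside '<' ... '>' into the output.
    public static string StripTags(string html)
    {
        if (string.IsNullOrEmpty(html)) return string.Empty;

        // Pre-size the buffer: the result can never be longer than the input.
        var output = new StringBuilder(html.Length);
        bool insideTag = false;

        foreach (char c in html)
        {
            if (c == '<')
            {
                insideTag = true;           // a tag starts; stop copying
            }
            else if (c == '>' && insideTag)
            {
                insideTag = false;          // the tag ends; resume copying
            }
            else if (!insideTag)
            {
                output.Append(c);           // ordinary text, including a stray '>'
            }
        }

        return output.ToString();
    }
}
```

Note the trade-off: a stray '<' in the text (as in "a < b") makes this sketch swallow everything up to the next '>'.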
Advancing your HTML cleansing
For those pesky HTML strings that slipped through the initial defenses, let's put on our invisibility cloaks and sneak around them.
Repeating until squeaky clean
Sometimes you need to scrub twice to get all the dirt: removing one tag can splice the surrounding fragments into a new one. Keep repeating the process until a pass leaves the string unchanged.
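The scrubbing loop can be sketched as follows. The RepeatedCleaner class is hypothetical, and the pattern <[^<>]*> (which refuses to match across a nested '<') is chosen here only because it makes the need for a second pass easy to see.

```csharp
using System;
using System.Text.RegularExpressions;

public static class RepeatedCleaner
{
    // Re-applies the tag pattern until a pass changes nothing.
    public static string StripUntilStable(string html)
    {
        string current = html ?? string.Empty;
        string previous;

        do
        {
            previous = current;
            current = Regex.Replace(previous, @"<[^<>]*>", string.Empty);
        } while (current != previous);

        return current;
    }
}
```

For the input "<<b>i>text<</b>/i>", the first pass removes the inner <b> and </b> and leaves "<i>text</i>"; the second pass leaves "text"; the third pass changes nothing, so the loop stops.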
Custom defense spells
Each HTML document has its own peculiarities and extraneous tags. You might need to create your own variety of regex spells.
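As one example of such a custom spell, the sketch below removes HTML comments before the general tag stripping runs. A comment containing '>' would otherwise be cut short by a plain <[^>]*> pattern. The CustomRules class is a hypothetical name, and comments are just one of many peculiarities you might need to handle.

```csharp
using System;
using System.Text.RegularExpressions;

public static class CustomRules
{
    // Removes <!-- ... --> blocks, including ones that span lines or contain '>'.
    public static string RemoveComments(string html)
    {
        if (string.IsNullOrEmpty(html)) return string.Empty;
        return Regex.Replace(html, @"<!--.*?-->", string.Empty, RegexOptions.Singleline);
    }
}
```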
The mystery of edge cases
Simple regex solutions can be as elusive as a golden snitch. They might fail on complex tags or intricate HTML, such as an attribute value that contains a '>' character. The important thing: never stop practicing your broomstick skills (testing!).
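To make one such failure concrete, the sketch below runs a naive tag pattern against an attribute value that contains '>'. The EdgeCaseDemo class is hypothetical, and the pattern is the common <[^>]*> idiom, assumed here for illustration.

```csharp
using System;
using System.Text.RegularExpressions;

public static class EdgeCaseDemo
{
    // The naive pattern stops at the first '>', even inside a quoted attribute.
    public static string NaiveStrip(string html)
    {
        return Regex.Replace(html ?? string.Empty, @"<[^>]*>", string.Empty);
    }
}
```

EdgeCaseDemo.NaiveStrip("<a title=\"a>b\">link</a>") returns b">link rather than link, because the match for the opening tag ends at the '>' inside the title attribute. Cases like this are worth keeping as regression tests, and a real HTML parser is the safer tool when they matter.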