Regular expression to remove HTML tags from a string

javascript

regex

html-parsing

dom-parser

by Anton Shumikhin · Jan 2, 2025

Strip HTML tags using JavaScript's `replace()` with a regex pattern:
```javascript
let stripped = '<div>Text</div>'.replace(/<[^>]+>/g, '');
console.log(stripped); // Will output "Text" ... easy peasy!
```

Heads-up: Regex isn't foolproof for intricate HTML; for these scenarios, opt for a DOM parser instead.

Dissecting the regex `/<[^>]+>/g`

In this fast solution, the regex `/<[^>]+>/g` does the heavy lifting. Here's what it means:

`/.../`: The delimiters of a JavaScript regex literal, like a bag for your coding tools.
`<`: Matches the less-than symbol, the start of an HTML tag.
`[^>]+`: A negated character class that matches any character except `>`; the `+` repeats it one or more times.
`>`: Matches the greater-than symbol, the end of an HTML tag.
`g`: Global flag, to match all occurrences, not just the first.
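
To see each piece of the pattern at work, here is a minimal sketch. The sample string and variable names are made up for illustration:

```javascript
// Sample input, invented for this illustration.
const html = '<p>Hello <b>world</b></p>';

// With the g flag, match() returns every substring the pattern matches.
const tags = html.match(/<[^>]+>/g);
console.log(tags); // [ '<p>', '<b>', '</b>', '</p>' ]

// Without g, replace() only removes the first match.
const firstOnly = html.replace(/<[^>]+>/, '');
console.log(firstOnly); // Hello <b>world</b></p>

// With g, every tag is removed.
const allRemoved = html.replace(/<[^>]+>/g, '');
console.log(allRemoved); // Hello world
```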

Remember, this method is like peeling off stickers from an old laptop; most of them come off, but some stubborn ones remain. In this context, those could be malformed tags, or a `>` hiding inside an attribute value, where the regex ends the match too early.
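
A couple of stubborn stickers, as a rough sketch. `stripTags` is a hypothetical helper that just wraps the regex from above, and the inputs are invented:

```javascript
// Hypothetical helper wrapping the article's regex.
function stripTags(html) {
  return html.replace(/<[^>]+>/g, '');
}

// A ">" inside a quoted attribute ends the match too early,
// so part of the attribute leaks into the output.
const attrCase = stripTags('<a title="a > b">link</a>');
console.log(JSON.stringify(attrCase)); // " b\">link"

// Plain text that merely looks like a tag gets eaten.
const mathCase = stripTags('1 < 2 and 3 > 2');
console.log(JSON.stringify(mathCase)); // "1  2"
```

JSON.stringify is only there to make the leading and doubled spaces visible in the console.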

Alternatives: DOMParser and Jsoup

While regex can be handy, a real HTML parser is the safer choice for complex markup: DOMParser is built into browsers for JavaScript, and Jsoup is a library that does the same job in Java.

DOMParser method

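```javascript
const htmlString = '<p>Hello <b>world</b></p>'; // sample input
const parser = new DOMParser(); // available in browsers, not in plain Node.js
const doc = parser.parseFromString(htmlString, 'text/html');
const text = doc.body.textContent || ""; // Even shorter than my grocery list!
```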
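
One thing to watch: `textContent` also returns the source text of any `<script>` or `<style>` elements in the body. A minimal sketch of one way to handle that, assuming a browser environment; the `htmlToText` name and the sample inputs are made up for illustration:

```javascript
// Browser-only sketch: DOMParser is not available in plain Node.js.
function htmlToText(html) {
  const doc = new DOMParser().parseFromString(html, 'text/html');
  // textContent includes the source of <script> and <style> elements,
  // so remove those elements before reading the text.
  doc.querySelectorAll('script, style').forEach((el) => el.remove());
  return doc.body.textContent || '';
}

// Only run the demo where DOMParser exists (i.e. in a browser).
if (typeof DOMParser !== 'undefined') {
  console.log(htmlToText('<a title="a > b">link</a>')); // link
  console.log(htmlToText('<p>Hi<script>alert(1)</script></p>')); // Hi
}
```

Parsing with DOMParser does not execute the scripts it finds, but the result is plain text extraction, not a sanitizer for HTML you plan to render.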

Jsoup method

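With Java's Jsoup, pass Safelist.none() (from org.jsoup.safety) to Jsoup.clean() for a quick cleanup. Older Jsoup releases call the same class Whitelist, which has since been deprecated in favor of Safelist:

```java
// Safelist.none() allows no tags at all, so only the text survives.
String cleanText = Jsoup.clean(htmlContent, Safelist.none()); // Sparkly clean text!
```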

The limitations and efficiency conundrum of regex

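Think of regular expressions as a race car: fast, but liable to skid on complex turns (scripts, comments, or quoted attributes). Here's why:

CPU Throttle: Patterns with nested or overlapping quantifiers can backtrack excessively on long or malformed input, leading to significant delays.
Potholes Ahead: Scripts, comments, and CDATA sections might contain > characters, confusing our simple regex. The text inside a script element also survives, because only the tags themselves are removed.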
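
A few of those potholes, sketched with made-up inputs. `stripWithRegex` is a hypothetical helper that wraps the article's pattern:

```javascript
// Hypothetical helper wrapping the article's regex.
const stripWithRegex = (html) => html.replace(/<[^>]+>/g, '');

// The script tags go away, but the script source stays in the output.
const scriptCase = stripWithRegex('<script>if (a > b) alert(1)</script>Hi');
console.log(JSON.stringify(scriptCase)); // "if (a > b) alert(1)Hi"

// A ">" inside a comment ends the match early; the tail of the comment leaks.
const commentCase = stripWithRegex('<!-- a > b -->text');
console.log(JSON.stringify(commentCase)); // " b -->text"

// Same story for a CDATA section.
const cdataCase = stripWithRegex('<![CDATA[ x > y ]]>');
console.log(JSON.stringify(cdataCase)); // " y ]]>"
```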

In contrast, library solutions like Jsoup's clean() run a real HTML parser built to cope with these quirks. It's like driving a car fitted with GPS, ready for the turns the HTML map throws at you.

Planning for diverse HTML structures

HTML is like a rainbow; you never know which colors you'll encounter next. Ensure your code is adaptable:

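Crowded Tags: Check that your method handles the variety you'll meet: uppercase tags, self-closing tags, and tags loaded with attributes.
Event Traps: Handlers like onclick live inside tags, so stripping every tag removes them; if you keep any tags, strip these attributes too to avoid hidden scripts.
Syntax Goofs: Some web content has malformed markup, such as unclosed tags; tread cautiously.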
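
One way to run through that checklist is to feed a handful of awkward samples to whichever method you pick and eyeball the results. A minimal sketch using the regex approach; the `stripTags` helper and the sample strings are assumptions made up for illustration:

```javascript
// Hypothetical helper wrapping the article's regex.
const stripTags = (html) => html.replace(/<[^>]+>/g, '');

// Sample inputs, invented for this illustration.
const samples = [
  '<DIV CLASS="x">Upper</DIV>',               // uppercase tag and attribute
  '<br/>line',                                // self-closing tag
  '<button onclick="steal()">Click</button>', // inline event handler
  '<p>unclosed <b',                           // malformed: "<b" never closes
];

const results = samples.map(stripTags);
samples.forEach((input, i) => {
  console.log(JSON.stringify(input), '->', JSON.stringify(results[i]));
});
// results: [ 'Upper', 'line', 'Click', 'unclosed <b' ]
```

The last sample shows the regex leaving the unclosed `<b` in place, which is the kind of case worth deciding on up front.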

Regex, DOMParser, and Jsoup are all pieces of armor for this HTML battlefield. Choose wisely!