Regular expression to remove HTML tags from a string

By Anton Shumikhin · Jan 2, 2025
TLDR
Strip HTML tags using JavaScript's `replace()` with a regex pattern:
```javascript
let stripped = '<div>Text</div>'.replace(/<[^>]+>/g, '');
console.log(stripped); // Outputs "Text" ... easy peasy!
```

Heads-up: Regex isn't foolproof for intricate HTML; for these scenarios, opt for a DOM parser instead.

Dissecting the regex /<[^>]+>/g

In this fast solution, the regex /<[^>]+>/g does the heavy lifting. Here's what it means:

  • /.../: The "container" for the regex, like a bag for your coding tools.
  • <: Matches the less-than symbol, the start of an HTML tag.
  • [^>]+: A negated character class that matches any character except >; the + means "one or more of these".
  • >: Matches the greater-than symbol, the end of an HTML tag.
  • g: Global flag, to match all occurrences, not just the first.
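
To watch each piece of the pattern at work, here is a small sketch that lists what `/<[^>]+>/g` matches and what happens when the `g` flag is dropped. The `sample` string is made up for illustration.

```javascript
const sample = '<p class="intro">Hello <b>world</b></p>';

// With the g flag, matchAll yields every tag in the string.
const tags = [...sample.matchAll(/<[^>]+>/g)].map((m) => m[0]);
console.log(tags); // [ '<p class="intro">', '<b>', '</b>', '</p>' ]

// Without the g flag, replace() removes only the first match.
const firstOnly = sample.replace(/<[^>]+>/, '');
console.log(firstOnly); // Hello <b>world</b></p>

// With the g flag, every tag is removed.
const all = sample.replace(/<[^>]+>/g, '');
console.log(all); // Hello world
```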

Remember, this method is like peeling off stickers from an old laptop; most of them come off, but some stubborn ones remain. In this context, those could be non-standard or faulty HTML tags, where regex might fail.

Alternatives: DOMParser and Jsoup

While regex can be handy, a real HTML parser copes better with complex markup: the browser's built-in DOMParser if you are in JavaScript, or a library such as Jsoup if you are working in Java.

DOMParser method

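```javascript
// htmlString is the markup you want to strip; DOMParser is available in browsers.
let parser = new DOMParser();
let doc = parser.parseFromString(htmlString, 'text/html');
let text = doc.body.textContent || ""; // Even shorter than my grocery list!
```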

Jsoup method

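With Java's Jsoup, pass Safelist.none() to clean() for a quick cleanup. (Safelist is the current name of the class; before jsoup 1.14.2 it was called Whitelist.)

```java
// Safelist.none() allows no tags at all, so only text nodes survive.
String cleanText = Jsoup.clean(htmlContent, Safelist.none()); // Sparkly clean text!
```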

The limitations and efficiency conundrum of regex

Think of regular expressions as a race car: fast, but prone to skidding on complex turns such as scripts, comments, or attribute values that contain >. Here's why:

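  • CPU Throttle: More elaborate patterns, especially ones with nested quantifiers, can trigger excessive backtracking and take a very long time on large or malformed input.
  • Potholes Ahead: Scripts, comments, CDATA sections, and quoted attribute values can all contain > characters, which end the match too early and leave fragments behind.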
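
The potholes are easy to reproduce. This is a minimal sketch using made-up inputs; `stripTags` is just a hypothetical wrapper around the regex from the TLDR.

```javascript
// Hypothetical wrapper around the regex from the TLDR.
const stripTags = (html) => html.replace(/<[^>]+>/g, '');

// A ">" inside a quoted attribute value ends the match too early,
// so part of the attribute leaks into the output.
const withAttr = '<a title="a > b" href="#">link</a>';
console.log(stripTags(withAttr)); // ' b" href="#">link'

// The script tags are removed, but the code between them is kept as text.
const withScript = '<p>Hi</p><script>if (a > b) alert(1);</script>';
console.log(stripTags(withScript)); // 'Hiif (a > b) alert(1);'
```

Neither result is what you would want to show a user, which is why a parser is the safer choice for anything beyond trivial markup.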

In contrast, library solutions like Jsoup's clean() actually parse the markup and are built to cope with malformed HTML. It's like driving a car fitted with GPS, ready for the turns the HTML map throws.

Planning for diverse HTML structures

HTML is like a rainbow; you never know which colors you'll encounter next. Ensure your code is adaptable:

  • Crowded Tags: Check if your method can handle numerous different tags.
  • Event Traps: Be wary of inline event handlers such as onclick, which can hide scripts inside attribute values.
  • Syntax Goofs: Some web content has malformed markup (unclosed tags, stray quotes); tread cautiously.
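
If a parser is not available, one possible middle ground is a small character scanner that at least understands quoted attribute values. The sketch below is an illustration only: `stripTagsScanner` is a hypothetical name, not a library function, and it assumes every `<` starts a tag, so a literal `<` in ordinary text is not handled.

```javascript
// Hypothetical helper (not from any library): strips tags with a small
// character scanner that knows about quoted attribute values.
// Assumption: every "<" starts a tag, so a literal "<" in text is not handled.
function stripTagsScanner(html) {
  let out = '';
  let inTag = false; // true between "<" and the ">" that closes the tag
  let quote = null;  // the quote character currently open inside a tag

  for (const ch of html) {
    if (!inTag) {
      if (ch === '<') inTag = true;
      else out += ch;
    } else if (quote !== null) {
      if (ch === quote) quote = null; // end of the attribute value
    } else if (ch === '"' || ch === "'") {
      quote = ch; // start of a quoted attribute value
    } else if (ch === '>') {
      inTag = false; // a ">" outside quotes closes the tag
    }
  }
  return out;
}

// Sample inputs, one per concern in the list above.
const crowded = '<div><ul><li>One</li><li>Two</li></ul></div>';
const handler = '<button onclick="if (a > b) run()">Go</button>';
const unclosed = '<p>Unclosed <b>bold';

console.log(stripTagsScanner(crowded));  // OneTwo
console.log(stripTagsScanner(handler));  // Go
console.log(stripTagsScanner(unclosed)); // Unclosed bold

// For comparison, the plain regex leaks part of the handler into the output.
console.log(handler.replace(/<[^>]+>/g, '')); // ' b) run()">Go'
```

Like the regex, this scanner keeps the text inside script and style elements, so it is a sketch of the idea and not a sanitizer.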

Regex, DOMParser, and Jsoup are all pieces of armor for this HTML battlefield. Choose wisely!
