Strip HTML tags from text using plain JavaScript

by Nikita Barsukov · Oct 15, 2024

The quickest way to strip HTML tags is JavaScript's replace() with a regex that matches anything between angle brackets: /<[^>]*>/g.

Example:

const cleanText = '<div>Hello, World!</div>'.replace(/<[^>]*>/g, '');
console.log(cleanText); // "Hello, World!" with no div in sight

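This approach is fine for simple, trusted HTML, but watch it like a hawk with dodgy or complex input. The regex is not a security measure: it leaves entities such as &amp; encoded, it trips over a > inside an attribute value, and malformed markup can slip through.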
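
To see where the regex falls short, here is a small sketch that runs the same pattern over a few awkward inputs. The helper name stripTagsWithRegex and the sample strings are illustrative assumptions, not part of any library.

```javascript
// Hypothetical helper wrapping the regex from the example above.
function stripTagsWithRegex(html) {
  return html.replace(/<[^>]*>/g, '');
}

// Simple markup: works as expected.
console.log(stripTagsWithRegex('<p>Hello, <b>World</b>!</p>')); // "Hello, World!"

// Entities stay encoded, because nothing decodes them.
console.log(stripTagsWithRegex('<p>Fish &amp; chips</p>')); // "Fish &amp; chips"

// A ">" inside an attribute value ends the match early and leaks attribute text.
console.log(stripTagsWithRegex('<a title="a > b">link</a>')); // ' b">link'

// Plain text that merely contains "<" and ">" gets eaten.
console.log(stripTagsWithRegex('1 < 2 and 3 > 2')); // "1  2"
```

The last two cases are the ones that tend to bite in practice, and they are the reason the DOM-based approaches are usually preferred for anything beyond simple markup.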

Cross-browser tag stripping

Regex giving you different surprises on every page of real-world markup? Let the browser do the parsing instead: put the HTML into a temporary element and read back its textContent (or innerText in very old browsers).

/**
 * The 'I got this' function.
 * Strips tags by letting the browser parse 'html' into a detached element.
 * For trusted input only: assigning innerHTML can still fire inline handlers such as <img onerror>.
 */
function stripHtml(html) {
   const tempHolder = document.createElement("div");
   tempHolder.innerHTML = html;
   // innerText is a fallback for very old browsers that lack textContent.
   return tempHolder.textContent || tempHolder.innerText || "";
}

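DOM methods use the browser's own HTML parser, so they handle nested tags and decode entities where the regex stumbles. The catch is that assigning untrusted markup to innerHTML can still trigger resource loads and inline event handlers, even on an element that is never attached to the page.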

Handling untrusted content

Dealing with potentially hazardous HTML, like user-generated content? Give DOMParser a shot. It parses the string into a separate, inert document, so scripts in the input do not run while you pull out the text.

/**
 * The 'Trust no one' function.
 * Parses 'html' into a separate, inert document and returns its text.
 * Scripts in the input are parsed but never executed.
 */
function safeStripHtml(html) {
   const parser = new DOMParser();
   const doc = parser.parseFromString(html, 'text/html');
   return doc.body.textContent || "";
}

DOMParser is the safer bet for untrusted input because it never executes the scripts it parses. Keep in mind that what you get back is plain text, not sanitized HTML.
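
One caveat worth spelling out: the text that comes back is decoded, so an input like &lt;script&gt; is returned as a literal <script>. If that text is later written into HTML as a string, it needs escaping again. A minimal sketch, assuming a hypothetical escapeHtml helper (it is not a built-in):

```javascript
// Hypothetical helper: escapes the five characters that matter in HTML
// text and attribute contexts. "&" is replaced first so the entities
// produced by the later replacements are not escaped a second time.
function escapeHtml(text) {
  return text
    .replace(/&/g, '&amp;')
    .replace(/</g, '&lt;')
    .replace(/>/g, '&gt;')
    .replace(/"/g, '&quot;')
    .replace(/'/g, '&#39;');
}

// This is what a DOM-based stripper returns for the input
// "&lt;script&gt;alert(1)&lt;/script&gt;": the entities are decoded.
const strippedText = '<script>alert(1)</script>';
console.log(escapeHtml(strippedText)); // "&lt;script&gt;alert(1)&lt;/script&gt;"
```

In the browser, assigning the text with element.textContent instead of building an HTML string avoids the problem without any manual escaping.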

Have it safe and clean: Advanced tips

Scrub-a-dub-dub 🧹: Treat all user input as hostile and sanitize it before it reaches the page to keep XSS attacks far, far away. The 'Trust no one' function has a potential sidekick in the Sanitizer API, but it is still experimental and not available in every browser, so feature-detect before relying on it.
Regex dislikes complexity: The regex approach breaks down on HTML comments, quoted attribute values that contain >, and malformed markup. Be the strict bouncer here and switch to a DOM-based method for anything beyond simple tags.
No extra baggage: The native JavaScript tools shown above strip tags on their own, so you can skip libraries like jQuery for this job.
Case uniformity: After stripping the tags, consider normalizing the text case with toLocaleLowerCase() so that later comparisons and searches are case-insensitive.

Advanced stripping techniques

Need to drop inline styles but keep the markup? Want stripped text in a uniform case? Here's how:

Removing inline styles:

// Who needs fancy clothes? Removes inline style attributes but keeps the tags.
// Note: this returns HTML, not plain text, and does not sanitize it.
function stripStyles(html) {
   const doc = new DOMParser().parseFromString(html, 'text/html');
   doc.body.querySelectorAll('[style]').forEach(node => node.removeAttribute('style'));
   return doc.body.innerHTML;
}

Normalizing text after stripping:

// When you need everything looking the same.
// Strips tags with safeStripHtml (defined earlier), then lower-cases the text.
function normalizeAndStrip(html) {
   const text = safeStripHtml(html);
   return text.toLocaleLowerCase(); // uses the runtime's default locale
}
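
Stripped text often carries leftover indentation and line breaks from the markup, so case is rarely the only thing to normalize. As a rough sketch, a hypothetical normalizeText helper (the name and the optional locale parameter are assumptions for illustration) could collapse whitespace before lower-casing:

```javascript
// Hypothetical helper; `locale` is an optional BCP 47 tag such as "en-US".
// When omitted, toLocaleLowerCase falls back to the runtime's default locale.
function normalizeText(text, locale) {
  return text
    .replace(/\s+/g, ' ') // collapse runs of whitespace, including newlines
    .trim()               // drop leading and trailing whitespace
    .toLocaleLowerCase(locale);
}

console.log(normalizeText('  Hello,\n   World!  ', 'en-US')); // "hello, world!"
```

Passing an explicit locale keeps the result predictable across machines, since some locales lower-case certain letters differently.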

Performance considerations

Bigger isn't always better: For large HTML strings, assigning to innerHTML builds a full DOM tree just to read the text back, which can be slow. Parse once and reuse the result, and skip parsing entirely when a string contains no markup.
Domestication of the DOM: If you're running several operations on a node, detach it from the document first (or work in a detached container) so the browser doesn't recalculate layout after every change, then reattach it.
Locally sourced goodness: Make sure the HTML is already in memory before you start, so the stripping step isn't stuck behind a loading spinner waiting on network requests.
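
One way to avoid needless parsing across large batches is a fast path that returns early when there is nothing to strip. The sketch below assumes a hypothetical stripHtmlFast wrapper that receives the real stripping function (for example safeStripHtml) as a parameter. Treat the shortcut as an approximation: a real HTML parser may also normalize details such as leading whitespace or line endings.

```javascript
// Hypothetical wrapper: `strip` is whichever stripping function you use.
function stripHtmlFast(html, strip) {
  // No "<" means no tags; no "&" means no entities to decode.
  if (!/[<&]/.test(html)) {
    return html;
  }
  return strip(html);
}

// Stand-in for a DOM-based stripper so the sketch runs outside a browser.
// It counts its calls to show when the expensive path is taken.
let parseCount = 0;
function countingStrip(html) {
  parseCount += 1;
  return html.replace(/<[^>]*>/g, '');
}

console.log(stripHtmlFast('plain text', countingStrip));  // "plain text"
console.log(stripHtmlFast('<b>bold</b>', countingStrip)); // "bold"
console.log(parseCount); // 1
```

Whether this pays off depends on how many of your strings are markup-free, so measure before adopting it.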