Stripping out HTML tags from a string

swift

regex-engineering

html-parsing

string-manipulation

by Nikita Barsukov · Oct 1, 2024

Want a quick way to shrug off those pesky HTML tags in browser JavaScript? DOMParser has you covered:

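// Parse the markup into an inert document, then read back only its text.
const stripHTML = html => new DOMParser().parseFromString(html, 'text/html').body.textContent || "";
const result = stripHTML("<div>Sample <b>text</b></div>");
console.log(result); // "Sample text"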

Run the script and result comes out as "Sample text". Simple as folding a paper plane, huh?

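This is a no-nonsense approach, ideal for client-side HTML stripping. DOMParser may be the quieter guy at the party, but it is available in all modern browsers and hands back your text content without running any scripts found in the markup. Two catches: it lives in the browser, so server-side JavaScript doesn't get it for free, and textContent keeps whatever text sits inside <script> and <style> elements in the body.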

Slip-n-slide with Swift

In Swift-powered iOS applications, you have to wear a different cape. Trust me, it's cool! Check out this Swift extension that uses NSRegularExpression:

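import Foundation

extension String {
    var htmlStripped: String {
        // Lazy match: the shortest run from "<" to the next ">".
        let pattern = "<.*?>"
        // Let "." cross newlines so a tag split over several lines still matches.
        guard let regex = try? NSRegularExpression(pattern: pattern, options: [.dotMatchesLineSeparators]) else { return self } // Error? Nah, I got this! Will return the string as it is.
        let range = NSRange(location: 0, length: utf16.count) // NSRange counts UTF-16 code units
        return regex.stringByReplacingMatches(in: self, options: [], range: range, withTemplate: "")
    }
}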

To use it, just call:

let html = "<div>Sample <b>text</b></div>"
let stripped = html.htmlStripped // "Sample text"

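This is how you call in NSRegularExpression for a quick dance-off and show those HTML tags who's boss. Always test your regex patterns against representative input first. Don't want any gatecrashers, right? Keep in mind that this only removes tags: entities such as &amp; stay exactly as written.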
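One way to double-check a pattern is to run it over a few awkward inputs before shipping it. The sketch below is only an illustration: stripTags is a hypothetical free function (not part of the extension), and it takes the regex options as a parameter so the effect of .dotMatchesLineSeparators can be compared side by side.

```swift
import Foundation

// Hypothetical helper for illustration: strips anything matching "<.*?>".
func stripTags(_ html: String, options: NSRegularExpression.Options = []) -> String {
    guard let regex = try? NSRegularExpression(pattern: "<.*?>", options: options) else { return html }
    let range = NSRange(html.startIndex..., in: html)
    return regex.stringByReplacingMatches(in: html, options: [], range: range, withTemplate: "")
}

// A tag whose attributes start on a new line.
let multiline = "<a\nhref=\"#\">link</a>"

// By default "." stops at a newline, so the opening tag survives.
let plain = stripTags(multiline)

// With .dotMatchesLineSeparators both tags are removed, leaving "link".
let dotAll = stripTags(multiline, options: [.dotMatchesLineSeparators])

// Entities are not tags, so they pass through untouched.
let entity = stripTags("Tom &amp; Jerry")

print(plain.debugDescription, dotAll.debugDescription, entity.debugDescription)
```

If the multi-line case matters for your input, passing .dotMatchesLineSeparators (or switching to a pattern like "<[^>]+>") is one way to cover it.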

Regex 101: things to consider

While regular expressions are the djinn of your lamp, carelessly brandishing their powers has consequences:

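Don't let greedy patterns gobble up more than required: "<.*>" swallows everything from the first "<" to the last ">", text included, while the lazy "<.*?>" stops at the nearest ">".
Patterns like "<.*?>" still miss the beat when a ">" shows up inside a comment or a quoted attribute value, because the match ends at that first ">".
Regex doesn't always play nice when parsing HTML: HTML nests to arbitrary depth and a regular expression can't keep count, so it's a sibling rivalry regex tends to lose.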
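To see the difference between greedy and lazy matching, here is a small sketch. The remove(_:from:) helper is a hypothetical name introduced just for this comparison.

```swift
import Foundation

// Hypothetical helper: deletes every match of `pattern` from `text`.
func remove(_ pattern: String, from text: String) -> String {
    guard let regex = try? NSRegularExpression(pattern: pattern, options: []) else { return text }
    let range = NSRange(text.startIndex..., in: text)
    return regex.stringByReplacingMatches(in: text, options: [], range: range, withTemplate: "")
}

let html = "<b>bold</b> and <i>italic</i>"

// Greedy: one match from the first "<" to the last ">", so nothing is left.
let greedy = remove("<.*>", from: html)      // ""

// Lazy: each tag is matched on its own, the text in between survives.
let lazy = remove("<.*?>", from: html)       // "bold and italic"

// A ">" inside a comment ends the lazy match too early.
let comment = remove("<.*?>", from: "<!-- a > b -->kept")   // " b -->kept"

print(greedy.debugDescription, lazy.debugDescription, comment.debugDescription)
```

The last case is the kind of gatecrasher a plain tag pattern lets in: half of the comment leaks into the output.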

Stripping complex HTML: Challenge accepted!

When complex HTML tags taunt you, show them your swift hand and let NSAttributedString's HTML importer do the parsing:

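import UIKit // on macOS, import AppKit instead

extension String {
    // Call this on the main thread: the HTML importer is not safe to use from a background thread.
    func strippingHTML() -> String {
        guard let data = data(using: .utf8) else { return self }
        let options: [NSAttributedString.DocumentReadingOptionKey : Any] = [
            .documentType: NSAttributedString.DocumentType.html,
            .characterEncoding: String.Encoding.utf8.rawValue
        ]
        // If the importer fails, fall back to the original string.
        return (try? NSAttributedString(data: data, options: options, documentAttributes: nil))?.string ?? self
    }
}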

Brace for edge cases

Not all HTML tags play by the rules. Here are some edge cases to watch out for:

Scripts/Style tags: Stripping only the tags leaves the JavaScript and CSS between them in your output. Remove these elements together with their contents before stripping anything else.
Comments/CDATA: A tag-matching regex can't tell whether a ">" closes a tag or sits inside a comment or CDATA section. Consider using parsing libraries when precision is the name of the game.
Broken/Malformed tags: An unclosed "<" or a stray ">" could interrupt your regex groove. Using a parser is a safer bet here.
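If a parser is not an option, a regex pass can at least handle the first two cases by removing comments and whole script/style elements before stripping the remaining tags. This is a rough sketch, not a full solution: strippingMarkup(from:) is a hypothetical helper, the patterns are my own assumptions, and CDATA sections and malformed markup are deliberately left out.

```swift
import Foundation

// Hypothetical helper: applies the patterns in order, deleting every match.
func strippingMarkup(from html: String) -> String {
    let patterns = [
        "<!--.*?-->",                              // comments, including any ">" inside them
        "<(script|style)\\b[^>]*>.*?</\\1\\s*>",   // script/style elements together with their contents
        "<[^>]+>"                                  // whatever tags are left
    ]
    let options: NSRegularExpression.Options = [.caseInsensitive, .dotMatchesLineSeparators]
    var result = html
    for pattern in patterns {
        guard let regex = try? NSRegularExpression(pattern: pattern, options: options) else { continue }
        let range = NSRange(result.startIndex..., in: result)
        result = regex.stringByReplacingMatches(in: result, options: [], range: range, withTemplate: "")
    }
    return result
}

let page = "<style>p { color: red; }</style><p>Hello<!-- a > b --> world</p><script>if (a < b) { run() }</script>"
let text = strippingMarkup(from: page)
print(text) // Hello world
```

The order matters here: comments and script/style blocks go first, so that a "<" or ">" inside them never reaches the generic tag pattern.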