Remove HTML tags from a String
Quickly remove HTML tags in Java with a simple regex replace using String.replaceAll. The regex "<.*?>" matches any block within angle brackets, making it a handy one-liner. However, treating HTML as a plain string has limitations.
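A minimal sketch of that one-liner, using only the standard library. The class and method names here are illustrative, not from any particular codebase:

```java
public class RegexTagStripper {
    // Removes every '<', followed by the shortest possible run of characters, up to the next '>'.
    static String stripTags(String html) {
        return html.replaceAll("<.*?>", "");
    }

    public static void main(String[] args) {
        String html = "<p>Hello, <b>world</b>!</p>";
        System.out.println(stripTags(html)); // prints "Hello, world!"
    }
}
```

The `?` makes the quantifier lazy, so each match stops at the first closing bracket instead of swallowing everything up to the last one.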
To handle HTML content responsibly, use a full-fledged HTML parser such as Jsoup.
Jsoup does more than just removing tags. It handles malformed HTML, deals with special characters, and still produces reliable results.
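As a rough sketch, text extraction with Jsoup could look like the following. It assumes the Jsoup library is on the classpath; the class and method names are hypothetical:

```java
import org.jsoup.Jsoup;

public class JsoupTextExtractor {
    // Parses the input into a document tree and returns only its text content.
    static String toPlainText(String html) {
        return Jsoup.parse(html).text();
    }

    public static void main(String[] args) {
        // The <b> tag is never closed and the ampersand is escaped.
        System.out.println(toPlainText("<p>Tom &amp; <b>Jerry</p>")); // prints "Tom & Jerry"
    }
}
```

Note that `text()` also normalizes whitespace, so the result may not preserve the original line breaks.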
Opting for HTML parsers over regex
While the regex fix may seem clever, a thorough parse is not what regex is designed for. It can struggle with complicated HTML, scripts, and unclosed tags.
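To make those failure modes concrete, here is a small illustrative sketch that runs the same regex against a few awkward inputs. The class name and sample strings are made up for the demonstration:

```java
public class RegexLimitations {
    static String stripTags(String html) {
        return html.replaceAll("<.*?>", "");
    }

    public static void main(String[] args) {
        // Script bodies are not tags, so the code inside survives as text.
        System.out.println(stripTags("<script>alert(1)</script>Hi")); // prints alert(1)Hi

        // A '>' inside an attribute value ends the match too early.
        System.out.println(stripTags("<img alt=\"2>1\">Done")); // prints 1">Done

        // A tag with no closing '>' is never matched.
        System.out.println(stripTags("Hello <b world")); // prints Hello <b world

        // Character references are left undecoded.
        System.out.println(stripTags("<p>Tom &amp; Jerry</p>")); // prints Tom &amp; Jerry
    }
}
```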
By adopting Jsoup, you get a library that parses the markup into a document tree and understands its structure, much as a browser does.
That understanding ensures unambiguous content extraction and removes room for accidental triggering of any malicious scripts.
Custom behavior with Jsoup whitelist
Jsoup comes with the handy clean method, which lets you define acceptable tags via a whitelist (the Whitelist class, renamed Safelist in Jsoup 1.14.1).
HTML: A house with open doors
HTML is complex. Jsoup's whitelist enforces rules that allow only the specific HTML elements and attributes you configure.
When you don't need certain tags but can't afford to lose the information they contain, you adjust your whitelist settings. The result? You keep the content, get rid of the clutter.
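A hedged sketch of how this might look with Jsoup 1.14.1 or later, where the class is called Safelist (older releases use Whitelist with the same methods). The class name, method names, and the choice of allowed tags are assumptions for illustration:

```java
import org.jsoup.Jsoup;
import org.jsoup.safety.Safelist;

public class JsoupCleanExample {
    // Drops every tag but keeps the text that the tags wrapped.
    static String stripAll(String html) {
        return Jsoup.clean(html, Safelist.none());
    }

    // Keeps only <b> and <em>; other tags are removed, their text content stays.
    static String keepEmphasis(String html) {
        Safelist allowed = Safelist.none().addTags("b", "em");
        return Jsoup.clean(html, allowed);
    }

    public static void main(String[] args) {
        String html = "<div>Some <b>bold</b> and <span>plain</span> text</div>";
        System.out.println(stripAll(html));
        System.out.println(keepEmphasis(html));
    }
}
```

Starting from `Safelist.none()` and adding tags one by one keeps the policy explicit. Presets such as `Safelist.basic()` are an alternative when a broader set of formatting tags is acceptable.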
Sanitizing the string and avoiding XSS
For JSP/Servlet applications, where users can submit arbitrary input, sanitizing that input is critical to preventing Cross-Site Scripting (XSS).
OWASP's Java HTML Sanitizer applies an allow-list policy to strip potentially harmful scripts from that input.
HtmlCleaner: Another soap in the rack
If Jsoup isn't the tool that fits you, HtmlCleaner might be. It offers fine control over tag removal and is convenient for HTML cleanups.
Android Developer? Consider HtmlCompat
On Android, use androidx.core.text.HtmlCompat with FROM_HTML_MODE_LEGACY as an alternative way to handle HTML strings.