
Remove HTML tags from a String

by Anton Shumikhin · Jan 8, 2025
TLDR

Quickly remove HTML tags in Java with a simple regex replace:

String plainText = htmlString.replaceAll("<.*?>", "");

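This regex "<.*?>" lazily matches everything between a "<" and the next ">", making it a handy one-liner. However, treating HTML as a flat string has limitations: the pattern misses tags that span several lines, leaves entities such as &amp; undecoded, and keeps the contents of script and style blocks.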
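To make those limitations concrete, here is a small sketch that uses only the standard library. The class and method names (RegexStripDemo, stripLazy, stripNegated) are invented for illustration, and the second pattern, <[^>]*>, is a commonly used variant rather than something taken from the snippet above.

```java
final class RegexStripDemo {
    // The one-liner from the TLDR: by default '.' does not match line terminators.
    static String stripLazy(String html) {
        return html.replaceAll("<.*?>", "");
    }

    // A slightly sturdier variant: a negated character class also spans newlines.
    static String stripNegated(String html) {
        return html.replaceAll("<[^>]*>", "");
    }

    public static void main(String[] args) {
        String multiLine = "<a\nhref=\"x\">link</a>";

        // The opening tag contains a newline, so the lazy pattern skips it.
        System.out.println(stripLazy(multiLine));

        // The negated class removes both tags.
        System.out.println(stripNegated(multiLine)); // link

        // Neither pattern decodes entities.
        System.out.println(stripNegated("<p>Tom &amp; Jerry</p>")); // Tom &amp; Jerry
    }
}
```

Even the sturdier variant is still pattern matching on text, which is why the parser-based approach stays the safer default.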

To handle HTML content responsibly, you would want a full-fledged HTML parser like Jsoup:

String plainText = Jsoup.parse(htmlString).text();

Jsoup does more than remove tags. It parses malformed HTML much as a browser would, decodes entities such as &amp; into the characters they stand for, and normalizes whitespace in the returned text.

Opting for HTML parsers over regex

The regex fix may seem clever, but regular expressions are not designed to parse HTML. They struggle with nested markup, script blocks, attribute values that contain ">", and unclosed tags.
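A few inputs show where pattern matching goes wrong. This is only a sketch: the class name RegexPitfalls and its strip helper are hypothetical, and the pattern is the common <[^>]*> form.

```java
final class RegexPitfalls {
    // Removes anything that looks like "<...>" without understanding what it is.
    static String strip(String html) {
        return html.replaceAll("<[^>]*>", "");
    }

    public static void main(String[] args) {
        // Script source is ordinary text to a regex, so it leaks into the output.
        System.out.println(strip("<script>alert(1)</script>Hello")); // alert(1)Hello

        // A '>' inside a quoted attribute value ends the match too early.
        System.out.println(strip("<a title=\"a>b\">link</a>"));      // b">link

        // An unclosed tag never matches, so it is left in place.
        System.out.println(strip("Hello <b world"));                 // Hello <b world

        // Plain text that merely contains angle brackets gets damaged.
        System.out.println(strip("if a < b and b > c"));             // if a  c
    }
}
```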

By adopting Jsoup, you get a library that parses the markup into a DOM tree, much as a browser does, so it knows which parts are tags and which are text:

String trustworthyOutput = Jsoup.parse(htmlCode).text(); // Crack open a cold one: the HTML is parsed, not pattern-matched!

Because text() returns only text nodes, the contents of script and style elements are left out of the result. Keep in mind that the result is plain text with entities decoded, not escaped HTML, so escape it again before writing it back into a page.

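Custom behavior with a Jsoup safelist

Jsoup also comes with the handy clean method, which lets you define the acceptable tags via a safelist. The class is org.jsoup.safety.Safelist; jsoup releases before 1.14.1 called it Whitelist, and that name has since been removed:

Safelist safelist = Safelist.simpleText(); // Customize to your case; simpleText() allows b, em, i, strong and u
String cleanHtml = Jsoup.clean(htmlCode, safelist); // No junk allowed, only my favorite tags!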

HTML: A house with open doors

HTML is complex. A Jsoup safelist (Whitelist in older releases) enforces rules about which elements and attributes are allowed through.

When you don't need certain tags but can't afford to lose the text they contain, adjust the safelist: Jsoup.clean drops a disallowed tag but keeps the text inside it. The result? You keep the content and get rid of the clutter.

Sanitizing the string and avoiding XSS

For applications like JSP/Servlet, where users have the freedom to input data, sanitization of input is critical to prevent Cross-Site Scripting (XSS).

Utilize OWASP's Java HTML Sanitizer to strip potentially harmful markup from user input, for example by combining two of the library's prebuilt policies:

PolicyFactory policy = Sanitizers.FORMATTING.and(Sanitizers.LINKS); // The "guard" policy: basic formatting and links only
String cleanHTML = policy.sanitize(userHTMLInput); // Yoohoo, the sanitizer checked the club; no unauthorized scripts allowed!
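A sanitizer cleans what comes in; it is also common practice to escape what goes out. The sketch below shows the idea with a hypothetical escapeHtml helper written against the standard library only. It is an illustration of the principle, not a replacement for a maintained encoder, and in a JSP page you would more likely rely on something like JSTL's c:out.

```java
final class EscapeDemo {
    // Hypothetical helper: replaces the five characters that are significant in HTML.
    static String escapeHtml(String input) {
        StringBuilder out = new StringBuilder(input.length());
        for (int i = 0; i < input.length(); i++) {
            char c = input.charAt(i);
            switch (c) {
                case '&':  out.append("&amp;");  break;
                case '<':  out.append("&lt;");   break;
                case '>':  out.append("&gt;");   break;
                case '"':  out.append("&quot;"); break;
                case '\'': out.append("&#39;");  break;
                default:   out.append(c);
            }
        }
        return out.toString();
    }

    public static void main(String[] args) {
        // An unclosed tag slips past a "<...>" regex, but escaping turns it into inert text.
        String payload = "<img src=x onerror=alert(1) ";
        System.out.println(payload.replaceAll("<[^>]*>", "")); // unchanged
        System.out.println(escapeHtml(payload));               // &lt;img src=x onerror=alert(1) 
    }
}
```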

HTMLCleaner: Another soap in the rack

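If Jsoup isn't the tool that fits you, HtmlCleaner might be. It parses the markup into a cleaned-up tag tree, offers fine control over tag removal, and is convenient for HTML cleanups:

TagNode root = new HtmlCleaner().clean(htmlString); // Parse into a tidy tag tree
String plainText = root.getText().toString(); // And remember, don't drop the soap... just the tags!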

Android Developer? Consider HtmlCompat

In the world of Android, use androidx.core.text.HtmlCompat as an alternative way to handle HTML strings. The FROM_HTML_MODE_LEGACY flag keeps the pre-Android N behavior of separating block-level elements with blank lines:

String plainText = HtmlCompat.fromHtml(htmlString, HtmlCompat.FROM_HTML_MODE_LEGACY).toString(); // fromHtml returns a styled Spanned; toString() keeps only the characters