
How can I unescape HTML character entities in Java?

java
html-entities
unescape
jsoup
by Nikita Barsukov · Sep 7, 2024
TLDR

Wondering how to unescape HTML in Java? The magic spell is StringEscapeUtils.unescapeHtml4() from that wizard called Apache Commons Text.

import org.apache.commons.text.StringEscapeUtils;

// You know why Java developers wear glasses? Because they don't C# (See sharp. Get it? 😂)
String unescapedHtml = StringEscapeUtils.unescapeHtml4("&lt;div&gt;Demo&lt;/div&gt;");
System.out.println(unescapedHtml); // <div>Demo</div>

The library is courteous, too: it is published on Maven Central as commons-text, so you can add it to a Maven or Gradle build like any other dependency.

All-round solution: Jsoup

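Jsoup doesn't shy away from wearing multiple hats. Apart from unescaping HTML entities, it parses, manipulates, and cleans HTML. For web scraping and sanitization, it's a virtual Swiss Army knife. Keep in mind that text() returns the text content of the parsed document, so entities are decoded but any real tags in the input are stripped.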

import org.jsoup.Jsoup;

// HTML elements walk into a bar, bartender yells: "Hey! You're not allowed in here, block heads!" 😂
// The input holds escaped text, not real tags, so the parser sees one text node and decodes its entities.
String unescapedHtml = Jsoup.parse("&lt;p&gt;Hello, World!&lt;/p&gt;").text();
System.out.println(unescapedHtml); // <p>Hello, World!</p>

Dealing with HTML ancients: HTML 3.x

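Handling the grandpas of HTML, the HTML 3.x content, needs a special tool. StringEscapeUtils.unescapeHtml3() fits the bill: it recognizes only the HTML 3 entity set (the basic and ISO-8859-1 entities, plus numeric references) and leaves HTML 4 additions such as &euro; untouched.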

// Time to go Back to the Future! 😁
// &#x26; decodes to &, and the result is not unescaped a second time.
String unescapedHtml3 = StringEscapeUtils.unescapeHtml3("&#x26;lt;Old HTML&#x26;gt;");
System.out.println(unescapedHtml3); // &lt;Old HTML&gt;

If you breathe in Spring air: HtmlUtils

If your playground is the Spring framework, then HtmlUtils.htmlUnescape() is meant for you. It ships with the spring-web module, so a Spring web application needs no additional dependency.

import org.springframework.web.util.HtmlUtils;

// Nothing springs a surprise like Spring itself! 😇
String unescapedHtmlSpring = HtmlUtils.htmlUnescape("&lt;span&gt;Spring Power&lt;/span&gt;");
System.out.println(unescapedHtmlSpring); // <span>Spring Power</span>

Another charm in the toolkit: unbescape

The unbescape library is another weapon in your arsenal for unescaping HTML. It does the same job in Java that HttpUtility.HtmlDecode does in .NET.

import org.unbescape.html.HtmlEscape;

// Ahoy! Found the treasure, let's unbescape! 😜
String unescapedHtmlUnbescape = HtmlEscape.unescapeHtml("&lt;ul&gt;&lt;li&gt;unbescape Magic&lt;/li&gt;&lt;/ul&gt;");
System.out.println(unescapedHtmlUnbescape); // <ul><li>unbescape Magic</li></ul>

Your personalised escape plan: Custom HTML entities

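Create a bespoke lookup mechanism with Apache Commons Text. With a customised map you can handle entities that the stock HTML 4 unescaper does not cover, such as &apos; (which is not part of the HTML 4 entity set), or restrict decoding to just the entities you choose: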

import java.util.Collections;
import java.util.HashMap;
import java.util.Map;
import org.apache.commons.text.translate.CharSequenceTranslator;
import org.apache.commons.text.translate.LookupTranslator;

// Map the minefield and teach HTML some manners! 🕵️‍♀️
Map<CharSequence, CharSequence> customMap = new HashMap<>();
customMap.put("&apos;", "'");
customMap.put("&euro;", "€");
// Sky's the limit for custom mappings
CharSequenceTranslator customUnescaper = new LookupTranslator(Collections.unmodifiableMap(customMap));
String customUnescapedHtml = customUnescaper.translate("&apos;&euro;");
System.out.println(customUnescapedHtml); // '€
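If you are curious what a lookup-based unescaper does under the hood, here is a minimal JDK-only sketch. TinyLookupUnescaper is a hypothetical class made up for this illustration; it is not the Apache Commons Text implementation, only a small model of the same idea: scan the input once, try the longest known key at each position, and copy everything else through unchanged.

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical helper for illustration; not part of Apache Commons Text.
final class TinyLookupUnescaper {

    private final Map<String, String> table = new LinkedHashMap<>();
    private int longestKey = 0;

    // Registers one entity and its replacement; returns this for chaining.
    TinyLookupUnescaper add(String entity, String replacement) {
        table.put(entity, replacement);
        longestKey = Math.max(longestKey, entity.length());
        return this;
    }

    // Single left-to-right pass; replaced text is never scanned again.
    String translate(String input) {
        StringBuilder out = new StringBuilder(input.length());
        int i = 0;
        while (i < input.length()) {
            boolean matched = false;
            // Try the longest candidate first so a longer key beats a shorter prefix.
            int max = Math.min(longestKey, input.length() - i);
            for (int len = max; len >= 1 && !matched; len--) {
                String replacement = table.get(input.substring(i, i + len));
                if (replacement != null) {
                    out.append(replacement);
                    i += len;
                    matched = true;
                }
            }
            if (!matched) {
                out.append(input.charAt(i++)); // unknown text passes through as-is
            }
        }
        return out.toString();
    }

    public static void main(String[] args) {
        TinyLookupUnescaper unescaper = new TinyLookupUnescaper()
                .add("&apos;", "'")
                .add("&euro;", "\u20AC"); // euro sign
        // &amp; is not in the table, so it is left untouched.
        System.out.println(unescaper.translate("It&apos;s 5&euro; &amp; up"));
    }
}
```

Because the pass never rescans its own output, an entity that is missing from the table simply survives, which is exactly the behaviour you want when you only trust a hand-picked set of mappings.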

Not just named but numeric too: Scenarios

In the wilderness of HTML, entities lurk in more forms than named ones like &lt;. Keep your guard up for numeric character references too, written with a hash followed by a decimal or hexadecimal code point (e.g., &#60; or &#x3C; for <). Rejoice as Apache Commons Text and Jsoup bravely tackle both kinds:

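// Double agent mission: both named and numeric entities must be decoded. 😎
String namedEntity = "&copy;";     // ©
String numericEntity = "&#169;";   // also ©
System.out.println(StringEscapeUtils.unescapeHtml4(namedEntity));   // ©
System.out.println(StringEscapeUtils.unescapeHtml4(numericEntity)); // ©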
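To see what decoding a numeric reference actually involves, here is a rough JDK-only sketch. NumericEntityDecoder and its decode() method are hypothetical names invented for this example, not an API of any library above. It assumes well-formed references that end with a semicolon, and it ignores named entities entirely.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper for illustration; not part of any library in this article.
final class NumericEntityDecoder {

    // Group 1: hexadecimal digits (&#xA9;), group 2: decimal digits (&#169;).
    private static final Pattern NUMERIC_REF =
            Pattern.compile("&#(?:[xX]([0-9a-fA-F]{1,6})|([0-9]{1,7}));");

    static String decode(String input) {
        Matcher m = NUMERIC_REF.matcher(input);
        StringBuilder out = new StringBuilder(input.length());
        int last = 0;
        while (m.find()) {
            int codePoint = m.group(1) != null
                    ? Integer.parseInt(m.group(1), 16)
                    : Integer.parseInt(m.group(2), 10);
            out.append(input, last, m.start()); // copy text before the reference
            if (Character.isValidCodePoint(codePoint)) {
                out.appendCodePoint(codePoint);
            } else {
                out.append(m.group()); // out-of-range reference stays as written
            }
            last = m.end();
        }
        out.append(input, last, input.length()); // copy the tail
        return out.toString();
    }

    public static void main(String[] args) {
        // Prints the copyright sign twice, followed by <b>
        System.out.println(decode("&#169; &#xA9; &#x3C;b&#x3E;"));
    }
}
```

In real projects the libraries above are the better choice, since they also deal with named entities and edge cases this sketch skips.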

Commit to Open Source: Tools

Apache Commons Text, Jsoup, Spring's HtmlUtils, and unbescape: every tool mentioned in this guide is open source. The community fuels their continuous improvement, and their licenses let you use them freely in your projects.