How can I unescape HTML character entities in Java?

java

html-entities

unescape

jsoup

by Nikita Barsukov · Sep 7, 2024

Wondering how to unescape HTML in Java? The magic spell is StringEscapeUtils.unescapeHtml4() from Apache Commons Text.

import org.apache.commons.text.StringEscapeUtils;
// You know why Java developers wear glasses? Because they don't C# (see sharp, get it? 😂)
String unescapedHtml = StringEscapeUtils.unescapeHtml4("&lt;div&gt;Demo&lt;/div&gt;");
System.out.println(unescapedHtml); // <div>Demo</div>

The library is published on Maven Central as org.apache.commons:commons-text, so adding it to your build tool (e.g., Maven/Gradle) is straightforward.

All-round solution: Jsoup

Jsoup doesn't shy away from wearing multiple hats. Apart from unescaping HTML entities, it parses, manipulates, and cleans HTML. For web scraping and sanitization, it's a virtual Swiss Army knife. Parsing a string decodes its entities into plain text, and text() hands that text back.

import org.jsoup.Jsoup;
// HTML elements walk into a bar, bartender yells: "Hey! You're not allowed in here, block heads!"😂 
String unescapedHtml = Jsoup.parse("&lt;p&gt;Hello, World!&lt;/p&gt;").text();
System.out.println(unescapedHtml); // <p>Hello, World!</p>

Dealing with HTML ancients: HTML 3.x

Handling the grandpas of HTML, the HTML 3.x content, needs a special tool. StringEscapeUtils.unescapeHtml3() fits the bill: it recognizes only the HTML 3.0 entity set and leaves later additions alone.

// Time to go Back to the Future!😁 
// &#x26; is the numeric reference for '&', so a single pass decodes it and stops there.
String unescapedHtml3 = StringEscapeUtils.unescapeHtml3("&#x26;lt;Old HTML&#x26;gt;");
System.out.println(unescapedHtml3); // &lt;Old HTML&gt;

If you breathe in Spring air: HtmlUtils

If your playground is the Spring framework, then HtmlUtils.htmlUnescape() is meant for you. It ships with spring-web, so a Spring web project needs no extra dependency.

import org.springframework.web.util.HtmlUtils;
// Nothing springs a surprise like Spring itself! 😇
String unescapedHtmlSpring = HtmlUtils.htmlUnescape("&lt;span&gt;Spring Power&lt;/span&gt;");
System.out.println(unescapedHtmlSpring); // <span>Spring Power</span>

Another charm in the toolkit: unbescape

The unbescape library is another weapon in your arsenal for unescaping HTML. It fills the same role as .NET's HttpUtility.HtmlDecode.

import org.unbescape.html.HtmlEscape;
// Ahoy! Found the treasure, let's unbescape! 😜
String unescapedHtmlUnbescape = HtmlEscape.unescapeHtml("&lt;ul&gt;&lt;li&gt;unbescape Magic&lt;/li&gt;&lt;/ul&gt;");
System.out.println(unescapedHtmlUnbescape); // <ul><li>unbescape Magic</li></ul>

Your personalised escape plan: Custom HTML entities

Create a bespoke lookup mechanism with Apache Commons Text. With a customised map, you can handle entities that the built-in HTML 4 translator doesn't cover, such as &apos;:

import java.util.Collections;
import java.util.HashMap;
import java.util.Map;

import org.apache.commons.text.translate.CharSequenceTranslator;
import org.apache.commons.text.translate.LookupTranslator;

// Map the minefield and teach HTML some manners! 🕵️‍♀️
Map<CharSequence, CharSequence> customMap = new HashMap<>();
customMap.put("&apos;", "'");
customMap.put("&euro;", "€");
// Sky's the limit for custom mappings

CharSequenceTranslator customUnescaper = new LookupTranslator(Collections.unmodifiableMap(customMap));
String customUnescapedHtml = customUnescaper.translate("&apos;&euro;");
System.out.println(customUnescapedHtml); // '€
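
One pitfall is worth knowing if you ever hand-roll such a lookup table: the input should be scanned once, so that text produced by one replacement is never decoded a second time. The following is a minimal plain-Java sketch of that idea, not the Commons Text implementation. The class SinglePassLookupDemo and its methods are hypothetical names made up for this illustration.

```java
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical illustration; these names do not come from any library.
final class SinglePassLookupDemo {

    private static final Map<String, String> ENTITIES = new LinkedHashMap<>();
    static {
        ENTITIES.put("&amp;", "&");
        ENTITIES.put("&lt;", "<");
        ENTITIES.put("&gt;", ">");
        ENTITIES.put("&apos;", "'");
        ENTITIES.put("&euro;", "\u20AC");
    }

    // Candidate entity: '&', a run of ASCII letters, ';'.
    private static final Pattern NAMED = Pattern.compile("&[A-Za-z]+;");

    // Scans the input once; text produced by a replacement is never rescanned.
    static String unescapeOnce(String input) {
        Matcher m = NAMED.matcher(input);
        StringBuilder out = new StringBuilder();
        int last = 0;
        while (m.find()) {
            out.append(input, last, m.start());
            String replacement = ENTITIES.get(m.group());
            // Unknown entity names are copied through unchanged.
            out.append(replacement != null ? replacement : m.group());
            last = m.end();
        }
        out.append(input, last, input.length());
        return out.toString();
    }

    // Naive alternative: each replace() rescans the output of the previous one.
    static String unescapeChained(String input) {
        String result = input;
        for (Map.Entry<String, String> e : ENTITIES.entrySet()) {
            result = result.replace(e.getKey(), e.getValue());
        }
        return result;
    }

    public static void main(String[] args) {
        String doubleEscaped = "&amp;lt;b&amp;gt;";
        System.out.println(unescapeOnce(doubleEscaped));    // &lt;b&gt;
        System.out.println(unescapeChained(doubleEscaped)); // <b>
    }
}
```

The chained version quietly unescapes twice, which is rarely what you want when the input was escaped twice on purpose.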

Not just named but numeric too: Scenarios

In the wilderness of HTML, predators lurk in forms beyond named entities like &lt;. Keep your guard up for numeric entities too, written with a hash and a number (e.g., &#60; for <). Rejoice as Apache Commons and jsoup bravely tackle them:

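// Double agent mission: Both named and numeric entities must be decoded. 😎
String namedEntity = "&copy;"; // ©
String numericEntity = "&#169;"; // also ©
System.out.println(StringEscapeUtils.unescapeHtml4(namedEntity));   // ©
System.out.println(StringEscapeUtils.unescapeHtml4(numericEntity)); // ©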
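
To see what a numeric entity really is, here is a minimal plain-Java sketch that turns decimal and hexadecimal references into the code points they name. It is an illustration under stated assumptions, not how the libraries above are implemented. The class NumericEntityDemo and the method unescapeNumeric are hypothetical names, and named entities are deliberately left untouched.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper, not part of any library mentioned in this guide.
final class NumericEntityDemo {

    // Matches decimal (&#169;) and hexadecimal (&#xA9;) numeric references.
    private static final Pattern NUMERIC =
            Pattern.compile("&#(?:[xX]([0-9a-fA-F]{1,6})|([0-9]{1,7}));");

    static String unescapeNumeric(String input) {
        Matcher m = NUMERIC.matcher(input);
        StringBuilder out = new StringBuilder();
        int last = 0;
        while (m.find()) {
            out.append(input, last, m.start());
            int codePoint = m.group(1) != null
                    ? Integer.parseInt(m.group(1), 16)
                    : Integer.parseInt(m.group(2), 10);
            if (Character.isValidCodePoint(codePoint)) {
                out.appendCodePoint(codePoint);
            } else {
                out.append(m.group()); // leave out-of-range references untouched
            }
            last = m.end();
        }
        out.append(input, last, input.length());
        return out.toString();
    }

    public static void main(String[] args) {
        System.out.println(unescapeNumeric("&#169; &#xA9; &#60;b&#62;")); // © © <b>
    }
}
```

For real projects the library calls remain the safer choice, since they also cover named entities and the odd corners of the HTML entity tables.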

Commit to Open Source: Tools

Every tool mentioned in this guide (Apache Commons Text, Jsoup, Spring’s HtmlUtils, and unbescape) is open source. The community fuels continuous improvement, and the licenses let you use them freely in your projects.
