How can I unescape HTML character entities in Java?
Wondering how to unescape HTML in Java? The magic spell is StringEscapeUtils.unescapeHtml4() from that wizard called Apache Commons Text.
The library is courteous, too: it is published on Maven Central as org.apache.commons:commons-text, so it drops into a Maven or Gradle build like any other dependency.
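A minimal sketch of the call, assuming commons-text is already on the classpath. The class name, the helper method, and the sample string are made up for illustration:

```java
import org.apache.commons.text.StringEscapeUtils;

class CommonsTextUnescapeDemo {
    // Hypothetical thin wrapper so the call is easy to reuse and test.
    static String unescape(String escaped) {
        return StringEscapeUtils.unescapeHtml4(escaped);
    }

    public static void main(String[] args) {
        // prints: <p>Fish & chips</p>
        System.out.println(unescape("&lt;p&gt;Fish &amp; chips&lt;/p&gt;"));
    }
}
```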
All-round solution: Jsoup
Jsoup doesn't shy away from wearing multiple hats. Apart from unescaping HTML entities, it parses, manipulates, and cleans HTML. For web scraping and sanitization, it's a veritable Swiss Army knife.
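For plain entity unescaping, one option is Jsoup's Parser.unescapeEntities. The sketch below assumes the jsoup jar is on the classpath; the class and helper names are illustrative only:

```java
import org.jsoup.parser.Parser;

class JsoupUnescapeDemo {
    // Hypothetical helper around Jsoup's entity unescaping.
    static String unescape(String escaped) {
        // false = treat the input as text content rather than an attribute value
        return Parser.unescapeEntities(escaped, false);
    }

    public static void main(String[] args) {
        // prints: <p>Fish & chips</p>
        System.out.println(unescape("&lt;p&gt;Fish &amp; chips&lt;/p&gt;"));
    }
}
```

If you are already parsing whole documents with Jsoup, its text extraction decodes entities for you, so this direct call is mostly useful for isolated strings.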
Dealing with HTML ancients: HTML 3.x
Handling the grandpas of HTML, that is, HTML 3.x content, calls for a special tool. StringEscapeUtils.unescapeHtml3() fits the bill: it limits itself to the HTML 3 entity set.
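A small comparison makes the difference visible. This sketch assumes commons-text on the classpath and relies on the understanding that unescapeHtml3() covers the basic and ISO-8859-1 entities while HTML 4-only names such as &euro; are left alone; the class and method names are invented for the example:

```java
import org.apache.commons.text.StringEscapeUtils;

class Html3VersusHtml4Demo {
    // Hypothetical helpers that expose both variants side by side.
    static String html3(String escaped) {
        return StringEscapeUtils.unescapeHtml3(escaped);
    }

    static String html4(String escaped) {
        return StringEscapeUtils.unescapeHtml4(escaped);
    }

    public static void main(String[] args) {
        // &eacute; belongs to the ISO-8859-1 set, so both variants decode it.
        System.out.println(html3("caf&eacute;"));
        // &euro; arrived with HTML 4, so only the HTML 4 variant is expected to decode it.
        System.out.println(html3("&euro;5"));
        System.out.println(html4("&euro;5"));
    }
}
```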
If you breathe in Spring air: HtmlUtils
If your playground is the Spring framework, then HtmlUtils.htmlUnescape() is meant for you. It ships with spring-web (package org.springframework.web.util), so a Spring web application usually needs no extra dependency.
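In code, this could look like the following sketch, assuming spring-web is on the classpath. The wrapper class and method are illustrative names, not part of Spring:

```java
import org.springframework.web.util.HtmlUtils;

class SpringUnescapeDemo {
    // Hypothetical helper delegating to Spring's static utility.
    static String unescape(String escaped) {
        return HtmlUtils.htmlUnescape(escaped);
    }

    public static void main(String[] args) {
        // prints: <p>Fish & chips</p>
        System.out.println(unescape("&lt;p&gt;Fish &amp; chips&lt;/p&gt;"));
    }
}
```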
Another charm in the toolkit: unbescape
The unbescape library is another weapon in your arsenal for unescaping HTML, through HtmlEscape.unescapeHtml(). It plays a role comparable to .NET's HttpUtility.HtmlDecode.
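A minimal sketch with unbescape, assuming the org.unbescape:unbescape artifact is on the classpath; the class and helper names are hypothetical:

```java
import org.unbescape.html.HtmlEscape;

class UnbescapeDemo {
    // Hypothetical helper around unbescape's HTML unescaping entry point.
    static String unescape(String escaped) {
        return HtmlEscape.unescapeHtml(escaped);
    }

    public static void main(String[] args) {
        // prints: <p>Fish & chips</p>
        System.out.println(unescape("&lt;p&gt;Fish &amp; chips&lt;/p&gt;"));
    }
}
```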
Your personalised escape plan: Custom HTML entities
Create a bespoke lookup mechanism with Apache Commons Text. Combine the standard HTML 4 entity tables with a customised map of your own, and the resulting translator caters to non-standard character entities as well.
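One way to assemble such a translator is sketched below, assuming commons-text on the classpath. The entity &brandmark; is invented purely for illustration, as are the class and method names; swap in whatever non-standard entities your content actually uses:

```java
import java.util.HashMap;
import java.util.Map;

import org.apache.commons.text.translate.AggregateTranslator;
import org.apache.commons.text.translate.CharSequenceTranslator;
import org.apache.commons.text.translate.EntityArrays;
import org.apache.commons.text.translate.LookupTranslator;
import org.apache.commons.text.translate.NumericEntityUnescaper;

class CustomEntityDemo {
    private static final CharSequenceTranslator UNESCAPER = build();

    private static CharSequenceTranslator build() {
        // Made-up entity for this example; it is not part of any HTML standard.
        Map<CharSequence, CharSequence> custom = new HashMap<>();
        custom.put("&brandmark;", "\u2122");

        // Standard HTML 4 tables first, then the custom map, then numeric references.
        return new AggregateTranslator(
                new LookupTranslator(EntityArrays.BASIC_UNESCAPE),
                new LookupTranslator(EntityArrays.ISO8859_1_UNESCAPE),
                new LookupTranslator(EntityArrays.HTML40_EXTENDED_UNESCAPE),
                new LookupTranslator(custom),
                new NumericEntityUnescaper());
    }

    static String unescape(String escaped) {
        return UNESCAPER.translate(escaped);
    }

    public static void main(String[] args) {
        System.out.println(unescape("Acme&brandmark; &copy; 2024 &lt;ok&gt;"));
    }
}
```

Building the translator once and reusing it avoids recreating the lookup tables on every call.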
Not just named but numeric too: Scenarios
In the wilderness of HTML, predators lurk in forms beyond named entities like &lt;. Keep your guard up for numeric entities too, written with a hash and a number in decimal or hexadecimal (e.g., &#60; or &#x3C; for <). Rejoice, as Apache Commons Text and Jsoup bravely tackle them.
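To make the two numeric forms concrete, here is a JDK-only sketch of what decoding them involves. It is not how the libraries above are implemented, and it deliberately ignores named entities and the lenient parsing browsers apply; in real code, prefer one of the libraries. All names in it are illustrative:

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class NumericEntityDemo {
    // Matches hexadecimal (&#x3C;) and decimal (&#60;) character references.
    private static final Pattern NUMERIC =
            Pattern.compile("&#(?:[xX]([0-9a-fA-F]{1,6})|([0-9]{1,7}));");

    static String decodeNumeric(String input) {
        Matcher m = NUMERIC.matcher(input);
        StringBuffer out = new StringBuffer();
        while (m.find()) {
            int codePoint = m.group(1) != null
                    ? Integer.parseInt(m.group(1), 16)
                    : Integer.parseInt(m.group(2), 10);
            // Leave references outside the Unicode range untouched.
            String replacement = Character.isValidCodePoint(codePoint)
                    ? new String(Character.toChars(codePoint))
                    : m.group();
            // quoteReplacement guards against '$' and '\' in the decoded text.
            m.appendReplacement(out, Matcher.quoteReplacement(replacement));
        }
        m.appendTail(out);
        return out.toString();
    }

    public static void main(String[] args) {
        // prints: 1 < 2 and 3 > 2
        System.out.println(decodeNumeric("1 &#60; 2 and 3 &#x3E; 2"));
    }
}
```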
Commit to Open Source: Tools
Apache Commons Text, Jsoup, Spring's HtmlUtils, and unbescape: every tool mentioned in this guide is open source. Community contributions fuel their continuous improvement, and their licenses grant you the liberty to use them in your own projects.