
Bash script to convert from HTML entities to characters

by Alex Kataev · Jan 24, 2025
TLDR

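Decode the three most common HTML entities with sed:

echo "String with &lt;entities&gt;" | sed 's/&lt;/</g; s/&gt;/>/g; s/&amp;/\&/g'

This snippet swaps &lt;, &gt;, and &amp; for <, >, and &. The replacement for &amp; is written \& because a bare & in a sed replacement stands for the matched text, and that substitution runs last so double-escaped input such as &amp;lt; decodes to &lt; rather than <. It gets the job done for these three entities, but for a broader range you should adopt a more robust approach, such as a Perl one-liner with comprehensive entity coverage.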
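To see why the sed command decodes &amp; last, here is a minimal Python sketch of the same three substitutions in both orders. The function names are made up for this illustration.

```python
# Sketch: the same three replacements as the sed one-liner, in two orders.
# decode_basic and decode_wrong_order are hypothetical names for this example.

def decode_basic(text: str) -> str:
    """Decode &lt;, &gt; and finally &amp; (the order used by the sed command)."""
    for entity, char in (("&lt;", "<"), ("&gt;", ">"), ("&amp;", "&")):
        text = text.replace(entity, char)
    return text


def decode_wrong_order(text: str) -> str:
    """Decode &amp; first, which accidentally decodes the input twice."""
    for entity, char in (("&amp;", "&"), ("&lt;", "<"), ("&gt;", ">")):
        text = text.replace(entity, char)
    return text


double_escaped = "&amp;lt;b&amp;gt;"
right = decode_basic(double_escaped)        # one level removed: &lt;b&gt;
wrong = decode_wrong_order(double_escaped)  # two levels removed: <b>
print(right)
print(wrong)
```

Decoding &amp; first turns &amp;lt; into &lt;, which the next substitution then decodes again, so text that was meant to display as a literal entity is silently changed.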

Decoding with recode for full coverage

For a comprehensive solution that handles all HTML entities, recode is your friend:

# Who needs a subtle approach when you can use a cannon to kill a fly?
echo 'String with &#x27;entities&#x27;' | recode html..utf8

To install recode on Debian or Ubuntu, use sudo apt-get install recode. On macOS, brew install recode is your ticket.
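The &#x27; in the recode example is a hexadecimal numeric reference, something the three-entity sed command does not touch. As a rough illustration of what decoding such references involves, the sketch below handles only the decimal and hexadecimal forms. The name decode_numeric is an assumption for this example, and the code ignores the special cases a full decoder applies.

```python
import re

# Matches &#NNN; (decimal) and &#xHHH; (hexadecimal) references only.
NUMERIC_REF = re.compile(r"&#(?:[xX]([0-9a-fA-F]+)|([0-9]+));")


def decode_numeric(text: str) -> str:
    """Replace numeric character references with the characters they name."""

    def replace(match: re.Match) -> str:
        hex_digits, dec_digits = match.groups()
        code = int(hex_digits, 16) if hex_digits else int(dec_digits)
        try:
            return chr(code)
        except (ValueError, OverflowError):
            # Not a valid code point: leave the reference as it was.
            return match.group(0)

    return NUMERIC_REF.sub(replace, text)


quoted = decode_numeric("String with &#x27;entities&#39;")
print(quoted)
```

Named entities such as &copy; pass through unchanged here, which is exactly the gap that recode and the other full decoders fill.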

Detailed solution: Perl

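Another versatile tool is perl. It's like a Swiss Army knife for programmers. The decoding is done by the HTML::Entities module; if it is missing, install it from CPAN (for example with cpan HTML::Entities), then run:

# Perl: The Power Rangers of Programming Languages - It has a solution for everything!
echo 'String &copy; with entities' | perl -CS -MHTML::Entities -pe 'decode_entities($_);'

This transforms every entity the module recognizes into its corresponding character. The -CS switch tells Perl to treat the standard streams as UTF-8, so characters such as © are written out correctly encoded.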

Direct decoding: PHP and Python

If you are more comfortable with PHP or Python, you can use these too! Check this out:

# PHP: Not just good for creating websites, eh?
php -r "echo html_entity_decode('String with &mdash; entities', ENT_QUOTES, 'UTF-8');"

And with Python 3, it becomes a cakewalk:

# Python: Like coffee, once you get used to it, you can't live without it!
python3 -c "import html; print(html.unescape('String with &cent; entities'))"

The html module ships with Python 3, so nothing extra needs installing. Moreover, a list comprehension lets you decode a whole batch of lines in a single expression.
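As a small sketch of that batch approach, the snippet below runs html.unescape over a list of sample lines. The sample strings are invented for this illustration.

```python
import html

# Sample input: named, decimal and hexadecimal references, plus one unknown name.
lines = [
    "5 &cent; &#162; &#xA2;",
    "&lt;b&gt;bold&lt;/b&gt;",
    "&zzz; is not a known entity",
]

# One list comprehension decodes every line.
decoded = [html.unescape(line) for line in lines]

for line in decoded:
    print(line)
```

All three spellings of the cent sign decode to the same character, while the unrecognized &zzz; is left as it is.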

Command-line heroes: w3m and xmlstarlet

In some cases, you might have limited access to software. If w3m or xmlstarlet is already installed, either one can save your day:

# w3m: This ain't your grandma's browser!
echo 'String &laquo;with&raquo; entities' | w3m -dump -T text/html

# xmlstarlet: The Dark Knight of XML Processing!
echo 'hello &amp;lt; world' | xmlstarlet unesc

Keep in mind that w3m renders its input as a web page, so it also strips tags and reflows whitespace, while xmlstarlet unesc removes one level of escaping, turning &amp;lt; into &lt;.
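If neither tool is present but Python 3 is, the standard library's xml.sax.saxutils.unescape does a similar XML-level job. This is a substitute shown for comparison, not what xmlstarlet runs internally. By default it handles only &amp;, &lt;, and &gt;; anything else has to be passed in explicitly.

```python
from xml.sax.saxutils import unescape

# Removes one level of escaping, like the xmlstarlet example above.
one_level = unescape("hello &amp;lt; world")

# The three predefined entities that are handled by default.
markup = unescape("&lt;p&gt;Tom &amp; Jerry&lt;/p&gt;")

# Other entities must be supplied as an explicit mapping.
quoted = unescape("&quot;hi&quot;", {"&quot;": '"'})

print(one_level)
print(markup)
print(quoted)
```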

Handling large files: Cat command and Python

For larger files, the cat command, coupled with a short Python loop for the conversion, can be a practical approach:

# Cat: More than just a laser pointer obsessed creature!
cat yourfile.html | python3 -c '
import html, sys
for line in sys.stdin:
    sys.stdout.write(html.unescape(line))
'

This streams the input one line at a time, so memory use stays flat however large the file is.
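The same line-by-line idea can live in a reusable function. The sketch below is one possible shape, with decode_stream as a hypothetical name; it is exercised here with in-memory streams so it runs without any input file.

```python
import html
import io
from typing import TextIO


def decode_stream(src: TextIO, dst: TextIO) -> int:
    """Copy src to dst, decoding entities one line at a time.

    Only the current line is held in memory. Returns the number of lines copied.
    """
    count = 0
    for line in src:
        dst.write(html.unescape(line))
        count += 1
    return count


# Demo with in-memory streams standing in for a real file.
source = io.StringIO("Tom &amp; Jerry\n&copy; 2025\n")
sink = io.StringIO()
line_count = decode_stream(source, sink)
print(line_count, repr(sink.getvalue()))
```

In a real script you would pass sys.stdin and sys.stdout, or two files opened with an explicit encoding. An entity that is split across a line break would not be decoded by this approach.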

Keeping it simple and understandable

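Tools like recode, Python, and Perl keep a script short and maintainable. When only a single entity matters, one sed substitution is just as readable:

# Sed: Turning coding into a magic show!
echo "String with &amp; entities" | sed 's/&amp;/\&/g'

The backslash matters: a bare & in a sed replacement stands for the matched text, so without it the command would change nothing.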

Adapting to diverse environments

Different environments have different software support. Whether it is recode, w3m, xmlstarlet, or a scripting language such as Perl, PHP, or Python, you have a mixed bag of tools to convert HTML entities, so start with whichever one is already installed.
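One way to adapt automatically is to probe for the tools before choosing one. The sketch below uses shutil.which for that. The pick_decoder helper and the preference order are assumptions made for this illustration, not a recommendation from any of the tools.

```python
import shutil
from typing import Callable, Iterable, Optional

# Hypothetical preference order; adjust to taste.
DECODER_COMMANDS = ("recode", "perl", "php", "python3", "w3m", "xmlstarlet")


def pick_decoder(
    candidates: Iterable[str] = DECODER_COMMANDS,
    which: Callable[[str], Optional[str]] = shutil.which,
) -> Optional[str]:
    """Return the first candidate command found on PATH, or None."""
    for name in candidates:
        if which(name) is not None:
            return name
    return None


# Prints whichever tool is found first on this machine (or None).
print(pick_decoder())
```

Finding perl on PATH does not guarantee that HTML::Entities is installed, so a real script would still need to handle a failing command.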

Avoid the pitfalls of regex

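While regular expressions can be used, they get complicated quickly: HTML defines more than two thousand named entities on top of decimal and hexadecimal numeric references, and a hand-written pattern rarely covers them all. Tools like recode and xmlstarlet provide efficient alternatives without the headache.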
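To make the pitfall concrete, here is a sketch comparing a hand-rolled regex decoder with a four-entry table against Python's html.unescape. The naive_decode function and its table are invented for this comparison.

```python
import html
import re

# A deliberately small table, typical of hand-written decoders.
SMALL_TABLE = {"lt": "<", "gt": ">", "amp": "&", "quot": '"'}
NAMED_ENTITY = re.compile(r"&([A-Za-z]+);")


def naive_decode(text: str) -> str:
    """Decode only the names in SMALL_TABLE; everything else is left alone."""
    return NAMED_ENTITY.sub(
        lambda m: SMALL_TABLE.get(m.group(1), m.group(0)), text
    )


sample = "caf&eacute; &mdash; 5 &lt; 7"
by_regex = naive_decode(sample)     # &eacute; and &mdash; survive untouched
by_library = html.unescape(sample)  # every entity is decoded
print(by_regex)
print(by_library)
```

Every entity missing from the table slips through unnoticed, which is why a maintained decoder is usually the safer choice.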