How to convert Strings to and from UTF8 byte arrays in Java

java
encoding
utf8
byte-arrays
by Anton Shumikhin · Nov 18, 2024
TLDR

Here's your quick fix.

Encode a String to UTF-8 bytes:

// What does a String say before it becomes bytes? "Byte me!"
byte[] bytes = "example".getBytes(StandardCharsets.UTF_8);

Decode bytes back to a String:

// Bytes to String: the real string theory!
String text = new String(bytes, StandardCharsets.UTF_8);

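The road is smoother than it looks: the overloads that take a Charset, as above, throw no checked exceptions. Only the variants that take the charset name as a String, such as getBytes("UTF-8"), make you catch UnsupportedEncodingException or declare it in a throws clause.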
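
Whether any exception handling is needed depends on which overload you call. Here is a minimal sketch of both flavors side by side. The class and method names (Utf8QuickFix, encode, decode, encodeByName) are made up for illustration; only the String and StandardCharsets calls come from the JDK.

```java
import java.io.UnsupportedEncodingException;
import java.nio.charset.StandardCharsets;

// Hypothetical demo class; the names are illustrative only.
class Utf8QuickFix {

    // Charset overload: declares no checked exception.
    static byte[] encode(String text) {
        return text.getBytes(StandardCharsets.UTF_8);
    }

    // Charset overload: declares no checked exception.
    static String decode(byte[] bytes) {
        return new String(bytes, StandardCharsets.UTF_8);
    }

    // String-name overload: the compiler insists on handling
    // UnsupportedEncodingException, even though every JVM must support UTF-8.
    static byte[] encodeByName(String text) {
        try {
            return text.getBytes("UTF-8");
        } catch (UnsupportedEncodingException e) {
            throw new IllegalStateException("UTF-8 should always be available", e);
        }
    }

    public static void main(String[] args) {
        byte[] bytes = encode("example");
        System.out.println(bytes.length);   // 7
        System.out.println(decode(bytes));  // example
    }
}
```

The try/catch in encodeByName is pure ceremony, which is a good reason to stick with the StandardCharsets constant.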

Deep dive: encoding and charset

Alright, coding wizard, let's get into it.

While encoding strings, stay as consistent as your morning coffee. Using StandardCharsets.UTF_8 prevents both typos and anxiety attacks. Here's how:

Encode like a boss:

// To the byte land we go!
byte[] utf8Bytes = someString.getBytes(StandardCharsets.UTF_8);

Remember to decode with the same charset for symmetry:

// Hello bytes, welcome back to String land!
String fromUtf8Bytes = new String(utf8Bytes, StandardCharsets.UTF_8);

Using StandardCharsets.UTF_8 beautifies your code and says goodbye to messy charset lookups.
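
To see the symmetry hold for text that isn't plain ASCII, here is a small sketch. The class name Utf8Symmetry, its helper methods and the sample text are assumptions chosen for illustration. The accented letters are written as Unicode escapes so the result doesn't depend on how the source file itself is encoded.

```java
import java.nio.charset.StandardCharsets;

// Hypothetical demo class; the names are illustrative only.
class Utf8Symmetry {

    static byte[] toUtf8(String text) {
        return text.getBytes(StandardCharsets.UTF_8);
    }

    static String fromUtf8(byte[] utf8Bytes) {
        return new String(utf8Bytes, StandardCharsets.UTF_8);
    }

    public static void main(String[] args) {
        // "naive cafe" with i-diaeresis (U+00EF) and e-acute (U+00E9)
        String original = "na\u00EFve caf\u00E9";
        byte[] utf8Bytes = toUtf8(original);

        System.out.println(original.length());  // 10 chars
        System.out.println(utf8Bytes.length);   // 12 bytes: each accented letter takes two
        System.out.println(original.equals(fromUtf8(utf8Bytes)));  // true
    }
}
```

Note that the byte count and the character count differ, so never size a buffer from String.length().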

Data loss: 404 not found?

Warning: encoding and decoding can sometimes turn into a wild game of hide and seek. Here's how to win:

Ensure the charset used for decoding matches the bytes' original encoding. If you're using StandardCharsets.US_ASCII, beware of non-ASCII characters: getBytes quietly replaces each one with '?'. For pure-ASCII text it works fine:

// ASCII encode: Easy peasy lemon squeezy!
byte[] asciiBytes = "example".getBytes(StandardCharsets.US_ASCII);
// ASCII decode: We meet again!
String asciiString = new String(asciiBytes, StandardCharsets.US_ASCII);

Be mindful: ASCII covers only 128 characters. Decoding UTF-8 bytes as ASCII is like pouring water into a basket, because every non-ASCII character leaks out as a replacement character. Match the charset to the byte array's encoding.
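
Here is what the leak looks like in practice, as a rough sketch. The class name CharsetMismatch, its methods and the sample word are hypothetical. The behavior shown is the JDK's default replacement handling for String.getBytes(Charset) and new String(byte[], Charset).

```java
import java.nio.charset.StandardCharsets;

// Hypothetical demo class; the names are illustrative only.
class CharsetMismatch {

    // Encode AND decode with US_ASCII: unmappable characters become '?'.
    static String throughAscii(String text) {
        byte[] asciiBytes = text.getBytes(StandardCharsets.US_ASCII);
        return new String(asciiBytes, StandardCharsets.US_ASCII);
    }

    // Encode with UTF-8 but decode with US_ASCII: each byte outside the
    // ASCII range becomes the replacement character U+FFFD.
    static String utf8ReadAsAscii(String text) {
        byte[] utf8Bytes = text.getBytes(StandardCharsets.UTF_8);
        return new String(utf8Bytes, StandardCharsets.US_ASCII);
    }

    public static void main(String[] args) {
        String cafe = "caf\u00E9";  // "cafe" with e-acute
        System.out.println(throughAscii(cafe));              // caf?
        System.out.println(utf8ReadAsAscii(cafe).length());  // 5: "caf" plus two U+FFFD
    }
}
```

Neither path throws, and that is the trap: the data is already gone by the time anyone notices.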

The cavalry: utilities and third-party libraries

When Java tools feel like juggling with one hand, third-party libraries can lend you the other:

Apache Commons Codec to the rescue:

// You scratch my back, I'll scratch yours - Apache Commons Codec, probably.
// StringUtils here is org.apache.commons.codec.binary.StringUtils
byte[] bytes = StringUtils.getBytesUtf8("example");
String text = StringUtils.newStringUtf8(bytes);

Google Guava for the win:

// Google Guava: Admit it, you just love saying 'Guava'!
byte[] bytes = "example".getBytes(Charsets.UTF_8);
String text = new String(bytes, Charsets.UTF_8);

Third-party libraries: for when you just can't be bothered to reinvent the wheel.

Visualization of encoding and decoding

Imagine encoding and decoding with UTF-8 in Java as shuttling between two dimensions: Strings and byte arrays.

String Dimension: "Hello" 👋
Byte Array Dimension (UTF-8 encoded): [72, 101, 108, 108, 111]

Departure - Encoding to UTF-8:

👋 String "Hello" begins its journey to the other dimension from the String Dimension
🚌 The shuttle takes off and descends to the Byte Array Dimension
🚪 Shuttle doors open to reveal the UTF-8 bytes [72, 101, 108, 108, 111]

Return - Decoding from UTF-8:

🔽 The UTF-8 encoded byte array [72, 101, 108, 108, 111] braces itself for the return journey from the Byte Array Dimension
🚀 The shuttle lifts off and ascends to the String Dimension
🚪 Shuttle doors open to drop off the decoded String "Hello" 👋

This visualization helps you remember that the shuttle (conversion) runs round trips: we encode Strings to go down to the land of bytes, and decode byte arrays to go back up to the String Dimension.
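
If you'd like to ride the shuttle yourself, this short sketch prints the exact bytes from the picture above. The class name HelloShuttle and its two methods are invented for the example.

```java
import java.nio.charset.StandardCharsets;
import java.util.Arrays;

// Hypothetical demo class; the names are illustrative only.
class HelloShuttle {

    // Departure: String -> UTF-8 bytes, rendered as a readable list.
    static String departure(String text) {
        return Arrays.toString(text.getBytes(StandardCharsets.UTF_8));
    }

    // Return: UTF-8 bytes -> String.
    static String returnTrip(byte[] utf8Bytes) {
        return new String(utf8Bytes, StandardCharsets.UTF_8);
    }

    public static void main(String[] args) {
        System.out.println(departure("Hello"));  // [72, 101, 108, 108, 111]
        System.out.println(returnTrip(new byte[] {72, 101, 108, 108, 111}));  // Hello
    }
}
```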

Battling the edge cases

Edge cases. They don't always play fair. Here are some keys to victory:

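  • When dealing with user inputs or external data, don't forget to wear your armor of encoding validation. new String(bytes, charset) silently swaps malformed bytes for U+FFFD, so use a CharsetDecoder set to CodingErrorAction.REPORT if you'd rather be warned by a MalformedInputException than ambushed by quiet corruption.
  • In the battle of efficiency, StandardCharsets.UTF_8 is already a shared constant. If you look a charset up by name with Charset.forName instead, caching the Charset instance in a private static final field is your secret weapon.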
  • When exploring uncharted territories like file systems or network resources, always fly your charset flag explicitly.
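
The first and last tips can be sketched with the JDK alone. Everything named here (StrictUtf8, decodeStrict, isValidUtf8, readUtf8) is hypothetical, and this is one possible approach rather than the definitive one. The decoder is configured to report bad input instead of replacing it, and the stream reader is handed its charset explicitly rather than relying on the platform default.

```java
import java.io.IOException;
import java.io.InputStream;
import java.io.InputStreamReader;
import java.io.Reader;
import java.io.UncheckedIOException;
import java.nio.ByteBuffer;
import java.nio.charset.CharacterCodingException;
import java.nio.charset.CharsetDecoder;
import java.nio.charset.CodingErrorAction;
import java.nio.charset.StandardCharsets;

// Hypothetical demo class; the names are illustrative only.
class StrictUtf8 {

    // Strict decode: malformed input raises a CharacterCodingException
    // (MalformedInputException is a subclass) instead of becoming U+FFFD.
    static String decodeStrict(byte[] bytes) throws CharacterCodingException {
        // Decoders are not thread-safe, so a fresh one is created per call.
        CharsetDecoder decoder = StandardCharsets.UTF_8.newDecoder()
                .onMalformedInput(CodingErrorAction.REPORT)
                .onUnmappableCharacter(CodingErrorAction.REPORT);
        return decoder.decode(ByteBuffer.wrap(bytes)).toString();
    }

    static boolean isValidUtf8(byte[] bytes) {
        try {
            decodeStrict(bytes);
            return true;
        } catch (CharacterCodingException e) {
            return false;
        }
    }

    // Explicit charset when reading from a stream (file, socket, ...).
    static String readUtf8(InputStream in) {
        try (Reader reader = new InputStreamReader(in, StandardCharsets.UTF_8)) {
            StringBuilder sb = new StringBuilder();
            char[] buffer = new char[1024];
            int n;
            while ((n = reader.read(buffer)) != -1) {
                sb.append(buffer, 0, n);
            }
            return sb.toString();
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        byte[] good = "Hello".getBytes(StandardCharsets.UTF_8);
        byte[] bad = {(byte) 0xC3, (byte) 0x28};  // lead byte without a valid continuation
        System.out.println(isValidUtf8(good));  // true
        System.out.println(isValidUtf8(bad));   // false
    }
}
```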

Stay on the winning side. Code reliability and compatibility are your ultimate allies.