
Split string into array of character strings

java
unicode-hazards
regex-patterns
internationalization
by Anton Shumikhin · Mar 7, 2025
TLDR

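To split a string into individual characters in Java, use str.split(""). On Java 8 and later the result has no leading empty string. On Java 7 and earlier the array starts with an empty element, which split("(?!^)") avoids. Do not call substring(1) on the input first, because that drops the first character:

String str = "hello";
String[] charStrings = str.split("");

This yields {"h", "e", "l", "l", "o"}: each character of "hello" becomes its own one-character string.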

Splitting strings with regex and other techniques

Splitting a Java string into characters is not always as simple as cutting at every char. Java strings are sequences of UTF-16 code units, so surrogate pairs and other Unicode details need special attention.

The case of surrogate pairs

Characters outside the Basic Multilingual Plane, such as most emoji, are stored as surrogate pairs: two char values that together encode one code point. A plain split("") cuts between the two halves, leaving unpaired surrogates that no longer represent the original character.

String highFive = "Hi👏";
String[] brokenChars = highFive.split(""); // {"H", "i", "\uD83D", "\uDC4F"}: the emoji is cut into two unpaired surrogates

str.codePoints() comes to the rescue. It iterates over whole code points rather than individual char values:

String[] codePointsArray = highFive.codePoints()
    .mapToObj(cp -> String.valueOf(Character.toChars(cp)))
    .toArray(String[]::new);

Result: {"H", "i", "👏"}. Surrogate pairs stay intact, so no character is corrupted.
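To make the difference concrete, here is a minimal, self-contained sketch that runs both approaches on the same input. The class name CodePointSplit and the helper splitByCodePoint are illustrative names, not part of any library, and the emoji is written with \u escapes so the file does not depend on source encoding.

```java
public class CodePointSplit {

    // Splits a string into one string per Unicode code point (hypothetical helper).
    static String[] splitByCodePoint(String s) {
        return s.codePoints()
                .mapToObj(cp -> new String(Character.toChars(cp)))
                .toArray(String[]::new);
    }

    public static void main(String[] args) {
        String highFive = "Hi\uD83D\uDC4F"; // "Hi" followed by U+1F44F (clapping hands)

        String[] byCodeUnit = highFive.split("");          // splits between UTF-16 code units
        String[] byCodePoint = splitByCodePoint(highFive); // keeps the surrogate pair together

        System.out.println(byCodeUnit.length);   // 4 on Java 8+
        System.out.println(byCodePoint.length);  // 3
        System.out.println(byCodePoint[2].codePointAt(0) == 0x1F44F); // true
    }
}
```

The helper uses new String(char[]) rather than String.valueOf(char[]); both produce the same result here.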

Regex pattern for splitting

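The negative lookahead (?!^) matches the empty string at every position except the start of the input, so split never produces a leading empty element, even on Java 7 and earlier. It does not protect surrogate pairs: like split(""), it still splits between UTF-16 code units.

String example = "split";
String[] splitArray = example.split("(?!^)");

This yields {"s", "p", "l", "i", "t"} regardless of Java version.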

Subtleties of different methods

Not all methods are created equal. Each has its unique behavior, and your use case defines which is the most suitable.

Sidestepping the first empty element

On Java 7 and earlier, split("") returns an unexpected empty first element. Calling substring(1) on the input is not a fix, since it discards the first character. To get the same result on every Java version without regex, build the array from toCharArray():

char[] charArray = str.toCharArray();
String[] stringArray = new String[charArray.length];
for (int i = 0; i < charArray.length; i++) {
    stringArray[i] = String.valueOf(charArray[i]); // one string per char (UTF-16 code unit)
}

Welcome to the Unicode world

Characters outside the Basic Multilingual Plane (most emoji and the rarer CJK ideographs in the extension blocks) occupy two chars each. Working with code points instead of chars keeps them from being split or miscounted.
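A quick way to see whether a string contains such characters is to compare length(), which counts chars, with codePointCount, which counts code points. The sketch below is illustrative; the class and method names are made up for this example.

```java
public class CodePointCount {

    // Number of Unicode code points in s; never larger than s.length().
    static int countCodePoints(String s) {
        return s.codePointCount(0, s.length());
    }

    public static void main(String[] args) {
        String bmp = "\u6F22\u5B57";            // two common CJK ideographs, both in the BMP
        String supplementary = "\uD840\uDC0B";  // U+2000B, a CJK Extension B ideograph

        System.out.println(bmp.length() + " / " + countCodePoints(bmp));                     // 2 / 2
        System.out.println(supplementary.length() + " / " + countCodePoints(supplementary)); // 2 / 1
    }
}
```

When the two numbers differ, any char-based split will cut at least one character in half.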

A dash of StringUtils

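Apache Commons Lang's StringUtils offers splitByCharacterType, but it does not split a string into single characters. It groups consecutive characters of the same type, as reported by Character.getType, so "foo200Bar" becomes {"foo", "200", "B", "ar"} and "hello" comes back as a single element:

String[] typeRuns = StringUtils.splitByCharacterType(str); // {"hello"} when str is "hello"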

The nitty-gritty of splitting strings

Use toCharArray for efficiency

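When efficiency is paramount, toCharArray() avoids the regex engine entirely and is typically faster than split-based solutions. It works on char values, though, so it has the same surrogate-pair limitation as split(""). Whether the explicit loop reads more clearly than a one-line split is a matter of taste.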

Considering internationalization

Robust internationalization (i18n) support means not assuming that one char equals one character. Preferring code points over split("") avoids corrupting text in scripts and symbol sets that rely on supplementary characters.
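Some user-perceived characters span more than one code point, for example a base letter followed by a combining accent. One possible approach, sketched here as an assumption rather than a recommendation from the original text, is java.text.BreakIterator, which reports character boundaries as a reader would see them. The class and method names are illustrative, and how well newer emoji sequences are handled depends on the JDK version.

```java
import java.text.BreakIterator;
import java.util.ArrayList;
import java.util.List;

public class GraphemeSplit {

    // Splits s at the character boundaries reported by BreakIterator (hypothetical helper).
    static List<String> splitByGrapheme(String s) {
        BreakIterator it = BreakIterator.getCharacterInstance();
        it.setText(s);
        List<String> out = new ArrayList<>();
        int start = it.first();
        for (int end = it.next(); end != BreakIterator.DONE; start = end, end = it.next()) {
            out.add(s.substring(start, end));
        }
        return out;
    }

    public static void main(String[] args) {
        String text = "e\u0301a"; // 'e' + combining acute accent, then 'a'
        System.out.println(text.codePoints().count());    // 3 code points
        System.out.println(splitByGrapheme(text).size()); // 2 user-perceived characters
    }
}
```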

Regular expressions: Handle with care!

Regular expressions are a powerful tool, but like all power tools, they can be a source of complexity and performance issues. Use regex wisely and only when necessary.
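If a regex split does end up in frequently called code, one way to limit the cost is to compile the pattern once and reuse it, since String.split compiles a pattern like (?!^) on every call. This is a minimal sketch; the class, constant, and method names are illustrative.

```java
import java.util.Arrays;
import java.util.regex.Pattern;

public class PrecompiledSplit {

    // Compiled once and reused; matches every position except the start of the input.
    private static final Pattern NOT_AT_START = Pattern.compile("(?!^)");

    // Splits s into one-char strings (UTF-16 code units, so surrogate pairs are not preserved).
    static String[] splitChars(String s) {
        return NOT_AT_START.split(s);
    }

    public static void main(String[] args) {
        System.out.println(Arrays.toString(splitChars("split"))); // [s, p, l, i, t]
    }
}
```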