Explain Codes LogoExplain Codes Logo

Regular expression to match non-ASCII characters?

javascript
unicode-properties
regex-engineering
browser-compatibility
Nikita BarsukovbyNikita Barsukov·Dec 8, 2024
TLDR

Detect non-ASCII characters conveniently with the regular expression /[^\x00-\x7F]+/:

const nonAsciiRegex = /[^\x00-\x7F]+/; // No non-ASCII character can escape this. console.log("Café • テスト".match(nonAsciiRegex)); // ["Café • テスト"]

This code snippet hunts for characters beyond the ASCII character set, such as international symbols and emoticons.

Unicode Handling in JavaScript

For coding in an international environment, JavaScript facilitates dealing with Unicode characters effectively. Characters beyond the basic ASCII set (0x7F) cater to diverse languages, special symbols, and even emojis.

Generics to specifics

To catch Unicode characters regardless of languages, one can use Unicode Property Escapes represented as \p{L}:

const unicodeRegex = /\p{L}+/u; // Catch me all the letters in the world! console.log("안녕하세요".match(unicodeRegex)); // ["안녕하세요"]

Notice the u flag, a best friend of Unicode characters.

Libraries to cover gaps

Fan of legacy code? XRegExp library has got your back. It extends JavaScript's regex adeptness, covers the Unicode properties even for older environments:

const XRegExp = require('xregexp'); let regex = XRegExp('\\p{L}+'); // Even your grandma could read Résumé. console.log(regex.test('Résumé')); // true

Customized matches

To target Unicode characters from specific languages such as Cyrillic or Chinese, custom ranges tailored from Unicode code charts come in handy:

const chineseCharsRegex = /[\u4e00-\u9fa5]+/; // Ni Hao, Chinese Characters! console.log("你好".match(chineseCharsRegex)); // ["你好"]

Browser Compatibility and Transpilation

Legacy Support

Before adventuring into the realms of Unicode regex, ensure friendly browser support. While recent versions welcome Unicode with open arms, some older ones might break your heart.

Transform it up

For those irreplaceable vintage JavaScript environments, transpile Unicode regex with trustable sidekicks such as regexpu or Babel:

npm install --save-dev babel-plugin-transform-unicode-regex // Invite the transformer home.

Add the specific plugin to your Babel configuration for the magic to unfold:

{ "plugins": ["transform-unicode-regex"] // All set to slay Unicode! }

Advanced Use Cases and Best Practices

Testing? Yes, please!

Ensure extensive testing of your regex logic with diverse and unpredictable data sets. After all, with power comes great responsibility.

Simplifying with Functions

Encapsulate the regex logic inside functions, keeping your code neat, readable and reusable. For example, a grabNonAscIIWords(text) function can work wonders:

function grabNonAscIIWords(text) { return text.match(/[\p{L}-]+/gu) || []; // Gotcha non-ASCII! } console.log(grabNonAscIIWords("El niño played σκάκι")); // ["niño", "σκάκι"]

Tools that help

Fine-tune your pattern in real-time using platforms like regex101, which further provides a breakdown of each component and its role, because we all appreciate a little help.