Using Unicode to match special characters in regular expressions


Author: zwhu

Original article on GitHub

First of all, note that all the code in this article runs under ES6 and needs to be modified before it can run under ES5. That said, the article does not use many new ES6 features, and because V8 does not support the u flag, the final implementation is written almost entirely with ES5 knowledge.

At first, I only wanted to record how to match special characters in regular expressions by using Unicode escapes. While writing, I found that V8 did not support the u flag, so I turned to studying how to convert strings to their UTF-16 form. In the process, I found that ES5 regular expressions do not support characters whose code points are above 0xFFFF, so I went on to implement the conversion for those characters as well. The result is recorded here.

I ran into a practical need for a regular expression that matches special characters. For example, given a piece of text such as 'AB * CD $Hello, I'm good, too] \ nseg $me * ntfault, nhello, world', the user can choose to use * or $ to split the string.

In JavaScript, $ and * are predefined special characters in regular expressions. They cannot be written directly; they need to be escaped and written as /\$/ or /\*/.
We need to build a regular expression according to the user's choice, so we wrap this in a function:

function reg(input) {
    // Prefix the user's character with a backslash to escape it
    return new RegExp(`\\${input}`)
}

This looks elegant at first: by escaping every character, the special characters can be matched. Reality is less kind. When the user enters a character such as n or t, the returned regular expression is /\n/ or /\t/, which matches a newline or a tab instead of the letter. That goes against the user's intention.
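To see the problem concretely, here is a small sketch of the backslash-prefix idea. The name `naiveReg` is made up for this illustration and string concatenation is used instead of a template literal; the behaviour is what matters.

```javascript
// Hypothetical helper for illustration: escape by blindly prefixing a backslash.
function naiveReg(input) {
  return new RegExp('\\' + input)
}

// Punctuation is escaped as intended.
console.log(naiveReg('$').test('a$b'))   // true
console.log(naiveReg('*').test('a*b'))   // true

// Letters that have an escape meaning of their own are not.
console.log(naiveReg('n').source)        // \n
console.log(naiveReg('n').test('n'))     // false: the pattern matches a line feed
console.log(naiveReg('n').test('\n'))    // true
```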

The usual approach is to list every special character that needs escaping and escape only those. This is laborious, and any special character missing from the list will silently fail to match.
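For comparison, the list-based approach might look like the sketch below. The function names are invented here, and the character class is an assumption about which characters need escaping; it follows a commonly used pattern and is exactly the kind of list that can turn out to be incomplete.

```javascript
// Sketch of the "list every special character" approach.
// Assumption: the characters in this class are the ones that need escaping.
function escapeByList(input) {
  // $& in the replacement stands for the matched character
  return input.replace(/[.*+?^${}()|[\]\\]/g, '\\$&')
}

function regByList(input) {
  return new RegExp(escapeByList(input))
}

console.log(escapeByList('$'))                   // \$
console.log(escapeByList('n'))                   // n
console.log('ab*cd$ef'.split(regByList('*')))    // [ 'ab', 'cd$ef' ]
console.log(regByList('n').test('n'))            // true
```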

This is where Unicode comes in. In JavaScript, any character can be written as a Unicode escape. For example, 'a' can be written as '\u{61}', and '你' can be written as '\u{4f60}'.

For an introduction to Unicode, see "Detailed explanation of Unicode and JavaScript".

ES5 provides the charCodeAt() method, which returns the UTF-16 code unit at the specified index. It therefore cannot return the value of a character whose code point is above 0xFFFF. ES2015 adds the codePointAt() method, which returns the full code point even when it is above 0xFFFF. The returned value is a decimal number, so we convert it to hexadecimal with toString(16).
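The difference between the two methods is easiest to see on a character above U+FFFF. The sketch below uses 𠮷 (U+20BB7), chosen here only as an example.

```javascript
var ascii = '$'
var astral = '\u{20BB7}' // 𠮷, a character above U+FFFF, chosen for illustration

// For characters up to U+FFFF both methods agree.
console.log(ascii.charCodeAt(0).toString(16))    // 24
console.log(ascii.codePointAt(0).toString(16))   // 24

// charCodeAt sees the two UTF-16 code units separately.
console.log(astral.length)                       // 2
console.log(astral.charCodeAt(0).toString(16))   // d842
console.log(astral.charCodeAt(1).toString(16))   // dfb7

// codePointAt returns the whole code point.
console.log(astral.codePointAt(0).toString(16))  // 20bb7
```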
Wrapped in a function, it looks like this:

function toUnicode(s) {
    // Code point of the first character, written as a \u{...} escape in hexadecimal
    return `\\u{${s.codePointAt(0).toString(16)}}`
}

// toUnicode('$') --> "\u{24}"

The reg function is then rewritten as:

function reg(input) {
    // The u flag is required for the \u{...} syntax in a pattern
    return new RegExp(toUnicode(input), 'u')
}

I had hoped the article could end here, but unfortunately V8 did not support RegExp's u flag at the time of writing. In an engine that supports the flag, the solution above is complete. Either way, the idea stands: Unicode escapes give us a way to escape special characters.
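Whether the u flag is available can be checked at runtime, because constructing a RegExp with an unsupported flag throws a SyntaxError. A minimal sketch, where `supportsUnicodeFlag` and `regWithFlag` are names invented for this example:

```javascript
// Hypothetical helper: returns true if this engine accepts the u flag.
function supportsUnicodeFlag() {
  try {
    new RegExp('', 'u')
    return true
  } catch (e) {
    return false
  }
}

// Build the \u{...} pattern only when the flag is usable; otherwise return
// null so the caller can fall back to another approach.
function regWithFlag(input) {
  if (!supportsUnicodeFlag()) return null
  return new RegExp('\\u{' + input.codePointAt(0).toString(16) + '}', 'u')
}

var r = regWithFlag('$')
console.log(r === null || r.test('a$b'))  // true
```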

Although V8 doesn't support the u flag, as coders with aspirations we can't stop here. We can get the same effect another way:

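function toUnicode(s) {
  // Escape the first UTF-16 code unit as \uXXXX
  var a = `\\u${utf(s.charCodeAt(0).toString(16))}`
  // A character above U+FFFF has a second code unit (the low surrogate)
  if (s.length > 1)
    a = `${a}\\u${utf(s.charCodeAt(1).toString(16))}`
  return a
}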

function utf(s) {
    // Left-pad the hex string with zeros to exactly four digits
    return Array.from('000').concat(Array.from(s)).slice(-4).join('')
}

// var is used here instead of let so that the code can be pasted straight into the Chrome console to see the result
// Test it
// toUnicode('a')   --> "\u0061"
// toUnicode('𠮷')  --> "\ud842\udfb7"

function reg(input) {
    return new RegExp(toUnicode(input))
}
// Test it again
// reg('$').test('$') --> true
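The same idea extends to delimiters longer than one character by escaping every UTF-16 code unit of the input. The sketch below is one possible generalisation, not part of the original code; `pad4`, `escapeAll` and `regAll` are hypothetical names.

```javascript
// Pad a hex string with zeros to four digits.
function pad4(hex) {
  return ('000' + hex).slice(-4)
}

// Hypothetical generalisation: escape every UTF-16 code unit of the input.
function escapeAll(input) {
  var out = ''
  for (var i = 0; i < input.length; i++) {
    out += '\\u' + pad4(input.charCodeAt(i).toString(16))
  }
  return out
}

function regAll(input) {
  return new RegExp(escapeAll(input))
}

console.log(escapeAll('$*'))                    // \u0024\u002a
console.log('ab$*cd$*ef'.split(regAll('$*')))   // [ 'ab', 'cd', 'ef' ]
console.log(regAll('n').test('n'))              // true
console.log(regAll('n').test('\n'))             // false
```

Because every character is escaped the same way, no list of special characters has to be maintained.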

That's all. If you have read this far and found the article helpful, please give it a recommendation.