Regular expression front end user manual

Time:2020-9-27

Reading guide

Have you ever racked your brains searching for text, tried one expression after another, and still could not get a match?

Have you ever validated a form with only a token check (anything non-empty passes), and then burned incense and prayed devoutly that nothing would go wrong?

Have you ever used the sed and grep commands without knowing which metacharacters they support, and simply could not get a match?

Or perhaps you have never run into any of these situations and just call replace again and again (replacing everything that is not the search text with an empty string until only the search text is left). Faced with other people’s concise and efficient statements, you can only shout in your heart: replace is good enough!

Why learn regular expressions? As one netizen put it: legend has it that the programmer’s regular expression, the doctor’s prescription and the Taoist’s talisman are the three artifacts that ordinary people cannot read. This legend tells us at least two things. First, regular expressions are powerful: they are mentioned in the same breath as prescriptions and talismans, which shows their standing. Second, regular expressions are hard, which also implies that if you master them and apply them skillfully, you will be in the ascendant on the road to showing off (don’t ask me who Zhongtian is)!

Obviously, there is no need for me to say more about the value of regular expressions. To get the discussion started, here is a passage from the preface to Jeffrey Friedl’s “Mastering Regular Expressions”:

“If we list the great inventions in the field of computer software, I believe there will be no more than 20. The list should of course include such famous names as packet-switched networks, the Web, Lisp, hash algorithms, UNIX, compiler technology, the relational model, object orientation and XML, and regular expressions should never be left out.

For much practical work, regular expressions are simply a panacea that can improve development efficiency and program quality a hundredfold. Their key role in bioinformatics and human gene mapping research is widely known. Mr. Jiang Tao, the founder of CSDN, experienced the power of this tool when he developed professional software products in his early years, and has been impressed ever since.”

Therefore, we have no reason not to understand regular expressions, and indeed to master and use them

This article starts with the basic syntax of regular expressions and explains, step by step and with concrete examples, how regular expression matching works. The code examples use JS, PHP, Python and Java (some matching features are not supported by JS, so other languages are needed to demonstrate them). The content covers both elementary and advanced skills, so it suits beginners as well as readers who want to go further. The article tries to be simple, easy to understand and comprehensive. It covers a lot of ground, about 12K words in total, so please read patiently. If you have dyslexia, please contact me in time

Looking back on history

The origin of regular expressions can be traced back to early research on how the human nervous system works. Warren McCulloch and Walter Pitts, two neurophysiologists, developed a mathematical way to describe these neural networks

In 1956, a mathematician named Stephen Kleene, building on the earlier work of McCulloch and Pitts, published a paper entitled “Representation of Events in Nerve Nets and Finite Automata”, which introduced the concept of the regular expression

Later, it was found that this work could be applied to early research on computational search algorithms by Ken Thompson, who was also the main inventor of UNIX. As a result, the QED editor of half a century ago (it came out in 1966) became the first application to use regular expressions

Since then, regular expressions have become a well-known text processing tool. Almost every major programming language treats regular expression support as a selling point, and JavaScript is no exception

Definition of regular expression

A regular expression is a text pattern composed of ordinary characters and special characters (also called metacharacters or quantifiers). Here are two simple regular expressions that match consecutive digits

/[0-9]+/
/\d+/

“\d” is a metacharacter and “+” is a quantifier

Metacharacters

Metacharacter    Description
.     Matches any character except a newline
\d    Matches a digit, equivalent to the character group [0-9]
\w    Matches a letter, digit or underscore (some engines, such as .NET, also match Chinese characters; JavaScript does not)
\s    Matches any whitespace character (tab, space, newline, etc.)
\b    Matches the beginning or end of a word
^     Matches the start of a line
$     Matches the end of a line

Negated metacharacters

Metacharacter    Description
\D      Matches any character that is not a digit, equivalent to [^0-9]
\W      Matches any character that \w does not match
\S      Matches any character that is not whitespace
\B      Matches a position that is not the beginning or end of a word
[^x]    Matches any character except x

As you can see, regular expressions are case sensitive

Repetition quantifiers

There are 6 quantifiers in total. If x is the number of repetitions, the rules are as follows:

Quantifier    Description
*        x>=0
+        x>=1
?        x=0 or x=1
{n}      x=n
{n,}     x>=n
{n,m}    n<=x<=m
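
As a quick check of the six quantifiers, the sketch below tests each of them against strings made only of the letter “a”. The helper name fullMatch is made up for this illustration and is not part of any library.

```javascript
// Hypothetical helper: true when the whole string is "a" repeated
// a number of times allowed by the given quantifier.
function fullMatch(quantifier, text) {
  return new RegExp("^a" + quantifier + "$").test(text);
}

console.log(fullMatch("*", ""));         // true:  x >= 0
console.log(fullMatch("+", ""));         // false: x >= 1
console.log(fullMatch("?", "aa"));       // false: x = 0 or x = 1
console.log(fullMatch("{2}", "aa"));     // true:  x = 2
console.log(fullMatch("{2,}", "aaaa"));  // true:  x >= 2
console.log(fullMatch("{2,3}", "aaaa")); // false: 2 <= x <= 3
```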

Character group

[…] matches one of the characters in the brackets. For example, [xyz] matches the character x, y or z. If the brackets contain metacharacters, those metacharacters are demoted to ordinary characters and lose their special function; for example, [+.?] matches a plus sign, a period or a question mark

Exclusive character set

[^…] matches any character that is not listed. For example, [^x] matches any character except x

Multi-choice structure

| is the alternation operator: A|B matches A or B

Parentheses

Parentheses are often used to define the scope of a quantifier and to group characters. For example, (ab)+ matches abab and so on, where ab is a group

Escape character

The escape character is the backslash (\). The characters * + ? | { [ ( ) ] } ^ $ . # and whitespace usually need to be escaped
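
When a pattern is built from arbitrary text, the escaping can be automated. The sketch below is one common way to do it in JS; the function name escapeRegExp is a made-up helper, not a built-in, and it only covers the characters that are special in JavaScript patterns (“#” and whitespace are ordinary characters there).

```javascript
// Hypothetical helper: put a backslash in front of every character that
// JavaScript treats as a regex metacharacter.
function escapeRegExp(text) {
  return text.replace(/[.*+?^${}()|[\]\\]/g, "\\$&");
}

var price = "$9.99 (sale)";
var re = new RegExp(escapeRegExp(price));

console.log(escapeRegExp(price));         // \$9\.99 \(sale\)
console.log(re.test("now $9.99 (sale)")); // true
console.log(re.test("now $9x99 (sale)")); // false: the period is literal
```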

Operator precedence

  1. \ escape character

  2. (), (?:), (?=), [] parentheses and square brackets

  3. *, +, ?, {n}, {n,}, {n,m} quantifiers

  4. ^, $ positions

  5. | alternation (or)
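
The effect of this ordering can be seen in a small sketch. Because alternation has the lowest precedence, anchors bind to the individual branches unless parentheses are added, and a quantifier applies only to the single item in front of it. The variable names are chosen for this illustration only.

```javascript
var loose = /^ab|cd$/;    // read as (^ab) or (cd$)
var strict = /^(ab|cd)$/; // parentheses limit the scope of |

console.log(loose.test("abXYZ"));     // true:  the string starts with "ab"
console.log(strict.test("abXYZ"));    // false: the string must be exactly "ab" or "cd"
console.log(/^ab+$/.test("abbb"));    // true:  + repeats only "b"
console.log(/^(ab)+$/.test("abbb"));  // false: + repeats the group "ab"
console.log(/^(ab)+$/.test("abab"));  // true
```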

Test

Let’s test the knowledge above by writing a regular expression that matches a mobile phone number:

(\+86)?1\d{10}

① “\+86” matches the text “+86”. It is followed by the question mark quantifier, which means 1 or 0 occurrences. Taken together, “(\+86)?” matches “+86” or “”

② The ordinary character “1” matches the text “1”

③ The metacharacter “\d” matches a digit from 0 to 9, and the interval quantifier “{10}” means 10 repetitions, so together “\d{10}” matches 10 consecutive digits

The matching results are as follows:

[Figure: screenshot of the matching results]
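
The same check can be sketched in JS. The anchors ^ and $ are an addition made here so that the whole string has to be a phone number; the sample numbers are invented.

```javascript
// The article's pattern, anchored (an assumption of this sketch) so that
// nothing may precede or follow the number.
var phone = /^(\+86)?1\d{10}$/;

console.log(phone.test("13812345678"));    // true:  "1" plus 10 digits
console.log(phone.test("+8613812345678")); // true:  optional "+86" prefix
console.log(phone.test("1381234567"));     // false: only 9 digits after "1"
console.log(phone.test("23812345678"));    // false: does not start with "1"
```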

Modifiers

Regular expressions in JavaScript have the following five modifiers:

  • g (global search); the screenshot above was in fact taken with global search switched on

  • i (case-insensitive search)

  • m (multiline search)

  • y (sticky modifier, added in ES6)

  • u (Unicode modifier, added in ES6)
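
A short sketch of what each modifier changes. The sample text is invented for the illustration, and "\u{1F600}" is simply a character outside the Basic Multilingual Plane.

```javascript
var text = "Cat\ncat\nCAT";

console.log(text.match(/cat/g).length);     // 1: g finds every match, but only "cat" fits
console.log(text.match(/cat/gi).length);    // 3: i ignores case
console.log(text.match(/^cat$/gi));         // null: without m, ^ and $ mean start/end of the text
console.log(text.match(/^cat$/gim).length); // 3: m lets ^ and $ match at every line

// y: the match must begin exactly at lastIndex
var sticky = /cat/y;
sticky.lastIndex = 4;
console.log(sticky.test(text)); // true: "cat" starts at index 4
sticky.lastIndex = 5;
console.log(sticky.test(text)); // false: no "cat" starts at index 5

// u: treat the pattern and the text as Unicode code points
console.log(/^.$/.test("\u{1F600}"));  // false: two UTF-16 code units
console.log(/^.$/u.test("\u{1F600}")); // true: one code point
```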

Common regular expressions

  1. Chinese characters: ^[\u4e00-\u9fa5]{0,}$

  2. Email: ^\w+([-+.]\w+)*@\w+([-.]\w+)*\.\w+([-.]\w+)*$

  3. URL: ^https?://([\w-]+\.)+[\w-]+(/[\w-./?%&=]*)?$

  4. Mobile phone number: ^1\d{10}$

  5. ID card number: ^(\d{15}|\d{17}(\d|x))$

  6. Postcode of China: [1-9]\d{5}(?!\d) (6-digit postcode)

Password verification

Password verification is a common requirement. A conventional password usually has to satisfy the following rules: it is 6 to 16 characters long, it contains at least two of the three types digits, letters and symbols, and it contains no Chinese characters or spaces. The following regular expression describes this rule:

var reg = /(?!^[0-9]+$)(?!^[A-Za-z]+$)(?!^[^A-Za-z0-9]+$)^[^\s\u4e00-\u9fa5]{6,16}$/;
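
To see how the three negative lookaheads reject single-type passwords, the rule can be exercised with a few invented samples. This sketch writes the letter class as [A-Za-z]; the wider range [A-z] would also cover a few punctuation characters such as “_” and “^”.

```javascript
var passwordRule = /(?!^[0-9]+$)(?!^[A-Za-z]+$)(?!^[^A-Za-z0-9]+$)^[^\s\u4e00-\u9fa5]{6,16}$/;

console.log(passwordRule.test("abc123"));  // true:  letters and digits
console.log(passwordRule.test("123456"));  // false: digits only
console.log(passwordRule.test("abcdef"));  // false: letters only
console.log(passwordRule.test("!@#$%^"));  // false: symbols only
console.log(passwordRule.test("ab 123!")); // false: contains a space
console.log(passwordRule.test("a1!"));     // false: shorter than 6 characters
```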

Regular expression families

Classification of regular expressions

Under Linux and OSX, there are at least three common kinds of regular expression:

  • Basic regular expressions, also called Basic RegEx (BREs)

  • Extended regular expressions, also called Extended RegEx (EREs)

  • Perl regular expressions, also called Perl RegEx (PREs)

Regular expression comparison

Character | Description | Basic RegEx | Extended RegEx | Python RegEx | Perl RegEx
\ | Escape character | \ | \ | \ | \
^ | Matches the start of a line, for example '^dog' matches a line that begins with the string dog (note: in awk, '^' matches the start of the string) | ^ | ^ | ^ | ^
$ | Matches the end of a line, for example 'dog$' matches a line that ends with the string dog (note: in awk, '$' matches the end of the string) | $ | $ | $ | $
^$ | Matches blank lines | ^$ | ^$ | ^$ | ^$
^string$ | Matches a whole line, for example '^dog$' matches lines that contain only the string dog | ^string$ | ^string$ | ^string$ | ^string$
\< | Matches the start of a word, for example '\<frog' (equivalent to '\bfrog') matches words that start with frog | \< | \< | not supported | not supported (but \b can be used to match words, for example '\bfrog')
\> | Matches the end of a word, for example 'frog\>' (equivalent to 'frog\b') matches words that end with frog | \> | \> | not supported | not supported (but \b can be used to match words, for example 'frog\b')
\<x\> | Matches a whole word or a specific character, for example '\<frog\>' (equivalent to '\bfrog\b'), '\<g\>' | \<x\> | \<x\> | not supported | not supported (but \b can be used to match words, for example '\bfrog\b')
() | Matches an expression, for example '(frog)' | not supported (but \(\) can be used, for example \(dog\)) | () | () | ()
\(\) | Matches an expression, for example '\(frog\)' | \(\) | not supported (same as ()) | not supported (same as ()) | not supported (same as ())
? | Matches the preceding subexpression 0 or 1 times (equivalent to {0,1}), for example 'where(is)?' matches "where" and "whereis" | not supported (same as \?) | ? | ? | ?
\? | Matches the preceding subexpression 0 or 1 times (equivalent to '\{0,1\}'), for example 'where\(is\)\?' matches "where" and "whereis" | \? | not supported (same as ?) | not supported (same as ?) | not supported (same as ?)
? | When this character follows any other quantifier (*, +, ?, {n}, {n,}, {n,m}), matching becomes non-greedy. Non-greedy matching takes as few characters as possible, while the default greedy matching takes as many as possible. For example, for the string "oooo", 'o+?' matches a single "o", while 'o+' matches all the 'o' characters | not supported | not supported | ? | ?
. | Matches any single character other than the newline character ('\n') (note: the period in awk can match a newline). To match any character including '\n' in Python or Perl, '[\s\S]' can be used | . | . | . | .
* | Matches the preceding subexpression 0 or more times (equivalent to {0,}), for example 'zo*' matches "z" and "zoo" | * | * | * | *
\+ | Matches the preceding subexpression one or more times (equivalent to '\{1,\}'), for example 'where\(is\)\+' matches "whereis" and "whereisis" | \+ | not supported (same as +) | not supported (same as +) | not supported (same as +)
+ | Matches the preceding subexpression one or more times (equivalent to {1,}), for example 'zo+' matches "zo" and "zoo", but not "z" | not supported (same as \+) | + | + | +
{n} | n must be 0 or a positive integer; matches the subexpression exactly n times, for example 'zo{2}' matches "zooz" but not "Bob" | not supported (same as \{n\}) | {n} | {n} | {n}
{n,} | n must be 0 or a positive integer; matches the subexpression n or more times, for example 'go{2,}' matches "good" but not "god" | not supported (same as \{n,\}) | {n,} | {n,} | {n,}
{n,m} | m and n are non-negative integers with n <= m; matches at least n and at most m times, for example 'o{1,3}' matches the first three o's in "fooooood" (note: there must be no space between the comma and the two numbers) | not supported (same as \{n,m\}) | {n,m} | {n,m} | {n,m}
x|y | Matches x or y | not supported (same as x\|y) | x|y | x|y | x|y
[0-9] | Matches any digit from 0 to 9 | [0-9] | [0-9] | [0-9] | [0-9]
[xyz] | Character set; matches any one of the listed characters, for example '[abc]' matches the 'a' in "lay" (note: metacharacters such as . and * become ordinary characters inside []) | [xyz] | [xyz] | [xyz] | [xyz]
[^xyz] | Negated character set; matches any character that is not listed (note: newlines are not included), for example '[^abc]' matches the 'l' in "lay" (note: in awk, [^xyz] matches any character not listed, plus the newline character) | [^xyz] | [^xyz] | [^xyz] | [^xyz]
[A-Za-z] | Matches any uppercase or lowercase letter | [A-Za-z] | [A-Za-z] | [A-Za-z] | [A-Za-z]
[^A-Za-z] | Matches any character except uppercase and lowercase letters | [^A-Za-z] | [^A-Za-z] | [^A-Za-z] | [^A-Za-z]
\d | Matches any digit from 0 to 9 (equivalent to [0-9]) | not supported | not supported | \d | \d
\D | Matches a non-digit character (equivalent to [^0-9]) | not supported | not supported | \D | \D
\S | Matches any non-whitespace character (equivalent to [^\f\n\r\t\v]) | not supported | not supported | \S | \S
\s | Matches any whitespace character, including space, tab, form feed and so on (equivalent to [ \f\n\r\t\v]) | not supported | not supported | \s | \s
\W | Matches any non-word character (equivalent to [^A-Za-z0-9_]) | \W | \W | \W | \W
\w | Matches any word character, including the underscore (equivalent to [A-Za-z0-9_]) | \w | \w | \w | \w
\B | Matches a non-word boundary, for example 'er\B' matches the 'er' in "verb" but not the 'er' in "never" | \B | \B | \B | \B
\b | Matches a word boundary, that is, the position between a word and a space, for example 'er\b' matches the 'er' in "never" but not the 'er' in "verb" | \b | \b | \b | \b
\t | Matches a horizontal tab (equivalent to \x09 and \cI) | not supported | not supported | \t | \t
\v | Matches a vertical tab (equivalent to \x0b and \cK) | not supported | not supported | \v | \v
\n | Matches a newline character (equivalent to \x0a and \cJ) | not supported | not supported | \n | \n
\f | Matches a form feed (equivalent to \x0c and \cL) | not supported | not supported | \f | \f
\r | Matches a carriage return (equivalent to \x0d and \cM) | not supported | not supported | \r | \r
\\ | Matches the escape character "\" itself | \\ | \\ | \\ | \\
\cx | Matches the control character indicated by x, for example \cM matches Control-M, the carriage return. The value of x must be one of A-Z or a-z; otherwise c is treated as a literal 'c' character | not supported | not supported | | \cx
\xn | Matches n, where n is a hexadecimal escape value of exactly two digits, for example '\x41' matches "A", and '\x041' is equivalent to '\x04' followed by "1". This allows ASCII codes to be used in regular expressions | not supported | not supported | | \xn
\num | Matches num, where num is a positive integer; a reference to a previously captured match | not supported | \num | \num |
[:alnum:] | Matches any letter or digit ([A-Za-z0-9]), for example '[[:alnum:]]' | [:alnum:] | [:alnum:] | [:alnum:] | [:alnum:]
[:alpha:] | Matches any letter ([A-Za-z]), for example '[[:alpha:]]' | [:alpha:] | [:alpha:] | [:alpha:] | [:alpha:]
[:digit:] | Matches any digit ([0-9]), for example '[[:digit:]]' | [:digit:] | [:digit:] | [:digit:] | [:digit:]
[:lower:] | Matches any lowercase letter ([a-z]), for example '[[:lower:]]' | [:lower:] | [:lower:] | [:lower:] | [:lower:]
[:upper:] | Matches any uppercase letter ([A-Z]) | [:upper:] | [:upper:] | [:upper:] | [:upper:]
[:space:] | Any whitespace character, including tab and space, for example '[[:space:]]' | [:space:] | [:space:] | [:space:] | [:space:]
[:blank:] | Spaces and tabs (horizontal and vertical), for example '[[:blank:]]', equivalent to '[\s\t\v]' | [:blank:] | [:blank:] | [:blank:] | [:blank:]
[:graph:] | Any character that is visible and printable (note: spaces and newlines are not included), for example '[[:graph:]]' | [:graph:] | [:graph:] | [:graph:] | [:graph:]
[:print:] | Any printable character (note: excludes [:cntrl:], the string terminator '\0' and the EOF marker (-1), but includes spaces), for example '[[:print:]]' | [:print:] | [:print:] | [:print:] | [:print:]
[:cntrl:] | Any control character (the first 32 characters of the ASCII character set, decimal 0 to 31, such as line feed and tab), for example '[[:cntrl:]]' | [:cntrl:] | [:cntrl:] | [:cntrl:] | [:cntrl:]
[:punct:] | Any punctuation mark (excludes the character sets [:alnum:], [:cntrl:] and [:space:]) | [:punct:] | [:punct:] | [:punct:] | [:punct:]
[:xdigit:] | Any hexadecimal digit (0-9, a-f, A-F) | [:xdigit:] | [:xdigit:] | [:xdigit:] | [:xdigit:]

Notes

  • JS supports EREs

  • When using BREs (basic regular expressions), the symbols ?, +, |, {, }, ( and ) must be preceded by the escape character \

  • Patterns of the form [[:xxx:]] are built-in character classes in PHP; they are not supported in JS

The relationship between common commands and regular expressions in Linux / OSX

When I tried to write regular expressions for the grep and sed commands, I often found that metacharacters did not work, and that sometimes they had to be escaped and sometimes not, and I could never work out the rule. If you have had the same confusion, read on; you should get something out of it

Characteristics of grep, egrep, sed and awk regular expressions

grep supports BREs, EREs and PREs

  • grep without any option uses BREs

  • grep with the “-E” option uses EREs

  • grep with the “-P” option uses PREs

egrep supports EREs and PREs

  • egrep without any option uses EREs

  • egrep with the “-P” option uses PREs

sed supports BREs and EREs

  • sed uses BREs by default

  • sed with the “-r” option uses EREs

awk supports EREs and uses them by default

Elementary skills of regular expression

Greedy mode and non-greedy mode

By default, all quantifiers are greedy, which means they capture as many characters as possible. Adding “?” after a quantifier makes it non-greedy, which means it captures as few characters as possible

var str = "aaab",
    reg1 = /a+/,  // greedy mode
    reg2 = /a+?/; // non-greedy mode
console.log(str.match(reg1)); // ["aaa"], greedy mode captures all the a's
console.log(str.match(reg2)); // ["a"], non-greedy mode captures only the first a

In fact, non-greedy mode is very useful, especially when matching HTML tags. For example, when matching a paired div, scheme 1 may match many div tag pairs at once, while scheme 2 matches only one div tag pair

var str = "<div class='v1'><div class='v2'>test</div><input type='text'/></div>";
var reg1 = /<div.*<\/div>/;  // scheme 1: greedy matching
var reg2 = /<div.*?<\/div>/; // scheme 2: non-greedy matching
console.log(str.match(reg1));//"<div class='v1'><div class='v2'>test</div><input type='text'/></div>"
console.log(str.match(reg2));//"<div class='v1'><div class='v2'>test</div>"

Non-greedy mode of interval quantifiers

The non-greedy forms we usually use are “*?” and “+?”; there is one more, “{n,m}?”

The interval quantifier “{n,m}” is also greedy. Although it has an upper limit on the number of matches, it still matches as many times as possible up to that limit, while “{n,m}?” matches as few times as possible within the range
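
A minimal sketch of the difference, using an invented sample string:

```javascript
var digits = "12345";

// Greedy: takes as many digits as the upper limit allows.
console.log(digits.match(/\d{2,4}/)[0]);  // "1234"

// Non-greedy: stops as soon as the lower limit is satisfied.
console.log(digits.match(/\d{2,4}?/)[0]); // "12"
```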

It should be noted that:

  • When a greedy pattern and a non-greedy pattern produce the same result, the greedy pattern is usually more efficient

  • Every non-greedy pattern can be converted to a greedy pattern by changing the subexpression that the quantifier applies to

  • Greedy mode can be combined with solidified (atomic) groups, discussed later, to improve matching efficiency; non-greedy mode cannot
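
As an illustration of the second point, a non-greedy “.*?” can often be replaced by a greedy negated character group that cannot run past the closing delimiter. The sample string below is invented, and the two patterns are equivalent only because a quoted value never contains a double quote.

```javascript
var tag = '<a href="x.html" title="X">';

var lazy = /".*?"/g;     // non-greedy: expand one character at a time
var greedy = /"[^"]*"/g; // greedy: consume everything that is not a quote

console.log(tag.match(lazy));   // [ '"x.html"', '"X"' ]
console.log(tag.match(greedy)); // [ '"x.html"', '"X"' ]
```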

Grouping

Grouping is done mainly with parentheses. A subexpression wrapped in parentheses is a group, and the closing parenthesis can be followed by a quantifier to indicate the number of repetitions

/(abc)+/.test("abc123") == true

So what is grouping for? Generally speaking, it makes repetition convenient to express. It also has a capturing function; please read on

Capturing groups

A capturing group usually consists of a pair of parentheses and a subexpression. A capturing group creates a backreference, and each backreference is identified by a number or a name. In JS it is written as “$ + number” or “\ + number”. The following is an example of a capturing group

var color = "#808080";
var output = color.replace(/#(\d+)/,"$1"+"~~");
console.log(RegExp.$1);//808080
console.log(output);//808080~~

Above, (\d+) is a capturing group, and “RegExp.$1” points to the content captured by the group. The “$ + number” form is usually used outside the regular expression. The “\ + number” form can be used inside the regular expression to match the same substring at different positions

var url = "www.google.google.com";
var re = /([a-z]+)\.\1/;
console.log(url.replace(re,"$1"));//"www.google.com"

Above, the repeated “google” substring is replaced by a single occurrence
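
The same idea, a “\ + number” reference inside the pattern and a “$ + number” reference in the replacement, can be used to collapse accidentally repeated words. The sentence is invented for this sketch.

```javascript
var sentence = "this is is a test test case";

// \b(\w+) captures a whole word; \s+\1\b requires the same word again.
var repeated = /\b(\w+)\s+\1\b/g;

console.log(sentence.match(repeated));         // [ 'is is', 'test test' ]
console.log(sentence.replace(repeated, "$1")); // "this is a test case"
```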

Non-capturing groups

A non-capturing group usually consists of a pair of parentheses, a “?:” and a subexpression. A non-capturing group does not create a backreference, just as if the parentheses were not there

var color = "#808080";
var output = color.replace(/#(?:\d+)/,"$1"+"~~");
console.log(RegExp.$1);//""
console.log(output);//$1~~

Above, (?:\d+) is a non-capturing group. Since the group captures nothing, RegExp.$1 points to an empty string
At the same time, since the backreference $1 does not exist, it ends up in the output as an ordinary string
In fact, there is no difference in search efficiency between capturing and non-capturing groups; neither is faster than the other

Named groups

Syntax: (?<name>…)

A named group is also a capturing group. It captures the matched string under a group name as well as a number, and after a match the result can be retrieved by that name. The following is an example of a named group in Python

import re
data = "#808080"
regExp = r"#(?P<one>\d+)"
replaceString = r"\g<one>" + "~~"
print(re.sub(regExp, replaceString, data))  # 808080~~

Compared with the standard format, Python’s named group syntax has a capital P after the “?”, and Python refers to the group with the “\g<name>” notation (a numbered capturing group is referred to with “\g<number>”)

Unlike Python, JavaScript did not support named groups before ES2018; engines that implement ES2018 accept the (?<name>…) syntax
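
For engines that implement ES2018, the color example can be sketched in JS with a named group. The group name “one” is taken from the Python example; “$<one>” is the JS replacement notation for a named group.

```javascript
var color = "#808080";
var re = /#(?<one>\d+)/;

// After a match, named captures are available on the groups property.
console.log(color.match(re).groups.one);    // "808080"

// In a replacement string, $<name> refers to the named group.
console.log(color.replace(re, "$<one>~~")); // "808080~~"
```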

Solidified groups

A solidified group is also known as an atomic group

Syntax: (?>…)

As mentioned above, non-greedy patterns may backtrack many times during matching, and the more backtracking there is, the less efficient the regular expression becomes. Atomic groups are used to reduce the amount of backtracking

In fact, the only difference between an atomic group (?>…) and an ordinary group is this: once matching of the group has finished, the text it matched is solidified into a single unit that can only be kept or given up as a whole, and the unused saved states inside the parentheses are discarded, so backtracking can never choose them (and the group therefore cannot take part in backtracking). An example makes this easier to understand

Suppose we have to process a batch of data whose original format is 123.456. Because of a floating-point display problem, some of the values turn into the form 123.456000000789. We now want to keep only 2 or 3 digits after the decimal point, and the last digit must not be 0. How should the regular expression be written?

var str = "123.456000000789";
str = str.replace(/(\.\d\d[1-9]?)\d*/,"$1"); //123.456

In order to improve the efficiency, we change the last “*” to “+”, as follows:

var str = "123.456";
str = str.replace(/(\.\d\d[1-9]?)\d+/,"$1"); //123.45

This time, the subexpression “\d\d[1-9]?” matches “45” instead of “456”. The “+” at the end of the pattern requires at least one digit there, so the trailing subexpression “\d+” takes the “6”. Obviously, “123.45” is not the result we want. What should we do? Can we prevent backtracking once “[1-9]?” has matched? This is where the atomic group introduced above comes in

“(\.\d\d(?>[1-9]?))\d+” is the atomic group form of the pattern above. Since the string “123.456” does not satisfy this pattern, the match fails, which is what we expect

Let’s analyze why the atomic group pattern (\.\d\d(?>[1-9]?))\d+ cannot match the string “123.456”

Obviously, there are only two possible outcomes for the atomic group

Case 1: [1-9] fails to match. The engine returns to the saved state left by “?”, and matching leaves the atomic group and moves on to \d+. When control leaves the atomic group, there is no saved state to give up (because no saved state created inside the group remains)

Case 2: [1-9] matches successfully. When matching leaves the atomic group, the saved state left by “?” still exists, but it is discarded because it belongs to an atomic group that has finished

For the string “123.456”, [1-9] matches successfully, so case 2 applies. Let’s reconstruct what happens in case 2

  1. Matching has reached the position after “6” and tries to continue

  2. The subexpression \d+ finds nothing to match, so the engine tries to backtrack

  3. The engine checks whether a saved state is available for backtracking

  4. The saved state left by “?” belongs to an atomic group that has already finished, so it has been discarded

  5. The “6” matched inside the atomic group therefore cannot be given back by backtracking

  6. The backtracking attempt fails

  7. The regular expression match fails

  8. The text “123.456” is not matched by the regular expression, as expected

The corresponding flow chart is as follows:

[Figure: flow chart of the atomic group matching process]

Unfortunately, JavaScript does not support the atomic group syntax, and Python has supported it only since version 3.11. It works well in PHP and .NET, however. Here is the atomic group pattern in a PHP version for you to try

$str = "123.456";
echo preg_replace("/(\.\d\d(?>[1-9]?))\d+/", "\\1", $str); // atomic group

In addition, PHP provides the possessive quantifier syntax

$str = "123.456";
echo preg_replace("/(\.\d\d[1-9]?+)\d+/", "\\1", $str); // possessive quantifier

Java also provides the possessive quantifier syntax, which likewise avoids backtracking

String str = "123.456";
System.out.println(str.replaceAll("(\\.\\d\\d[1-9]?+)\\d+", "$1"));// 123.456

It is worth noting that in Java the backslashes in the pattern passed to replaceAll must themselves be escaped
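
JS has no atomic group or possessive quantifier syntax, but the behaviour can be approximated with a commonly used workaround, sketched here under the assumption that the engine follows the ECMAScript rule that a lookahead is never re-entered by backtracking. The content is captured inside a lookahead and then consumed with a backreference, so (?=([1-9]?))\2 plays the role of (?>[1-9]?). This is an emulation, not the syntax discussed above.

```javascript
// Group 1 is the part to keep; group 2 is the text captured inside the
// lookahead, which \2 then consumes without leaving a state to backtrack to.
var emulated = /(\.\d\d(?=([1-9]?))\2)\d+/;

// "6" is locked in by \2, so \d+ has nothing left and the match fails:
console.log("123.456".replace(emulated, "$1"));          // "123.456"

// Here \d+ can still consume the trailing digits:
console.log("123.456000000789".replace(emulated, "$1")); // "123.456"
```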

Advanced skills of regular expression

Zero-width assertions

If regex grouping is the Sharingan, then the zero-width assertion is the ultimate technique of the Mangekyō Sharingan, Susanoo (to borrow an analogy from Naruto). Used well, zero-width assertions can do what grouping cannot, greatly strengthen the matching power of a regular expression, and even help you locate text quickly when the matching conditions are very fuzzy

A zero-width assertion is also known as lookaround. Lookaround only tests a subexpression; what it matches is not included in the final result. Because the match is zero-width, what it finally matches is just a position

By direction, lookaround comes in two kinds, forward and backward (lookahead and lookbehind). By whether the subexpression must match, it comes in two kinds, positive and negative. Combined, this gives four kinds of lookaround. They are not complicated, and are described below

Character | Description | Example
(?:pattern) | Non-capturing group: matches pattern but creates no backreference, as if the parentheses were not there | 'abcd(?:e)' matches 'abcde'
(?=pattern) | Positive lookahead: matches a position that is followed by pattern; the matched pattern is not captured | 'Windows(?=2000)' matches 'Windows' in 'Windows2000', but not 'Windows' in 'Windows3.1'
(?!pattern) | Negative lookahead: matches a position that is not followed by pattern; the matched pattern is not captured | 'Windows(?!2000)' matches 'Windows' in 'Windows3.1', but not 'Windows' in 'Windows2000'
(?<=pattern) | Positive lookbehind: matches a position that is preceded by pattern; the matched pattern is not captured | '(?<=Office)2000' matches '2000' in 'Office2000', but not '2000' in 'Windows2000'
(?<!pattern) | Negative lookbehind: matches a position that is not preceded by pattern; the matched pattern is not captured | '(?<!Office)2000' matches '2000' in 'Windows2000', but not '2000' in 'Office2000'

The non-capturing group is listed in the table for comparison because its structure looks similar to lookaround. Of the four kinds of lookaround above, JavaScript before ES2018 supports only the first two, positive lookahead and negative lookahead (lookbehind was added in ES2018). Let’s use examples to help understand them:

var str = "123abc789",s;
// No lookaround: abc is replaced directly
s = str.replace(/abc/,456);
console.log(s); //123456789

// Positive lookahead: matches the 3 that is followed by abc, so abc is not replaced and 3 is replaced by 3456
s = str.replace(/3(?=abc)/,3456);
console.log(s); //123456abc789

// Negative lookahead: 3 is followed by abc, so the condition is not met, the match fails and the original string is unchanged
s = str.replace(/3(?!abc)/,3456);
console.log(s); //123abc789

Let’s use Python to demonstrate positive lookbehind and negative lookbehind

import re
data = "123abc789"
# Positive lookbehind: replace the run of lowercase letters that has 123 on its left. The match succeeds, so abc is replaced by 456
regExp = r"(?<=123)[a-z]+"
replaceString = "456"
print(re.sub(regExp, replaceString, data))  # 123456789

# Negative lookbehind: the letters must not have 123 on their left, so [a-z]+ captures bc, and bc is replaced by 456
regExp = r"(?<!123)[a-z]+"
replaceString = "456"
print(re.sub(regExp, replaceString, data))  # 123a456789

Note: in Python and Perl, the subexpression of a lookbehind must have a fixed width. Otherwise, for example, the Python interpreter reports an error: “error: look-behind requires fixed-width pattern”

Scene review

Get HTML fragment

Now, JS gets a piece of HTML code through Ajax, as follows:

var responseText = "<div data='dev.xxx.txt'></div><img src='dev.xxx.png' />";

Now we need to replace the “dev” string in the SRC attribute of img tag with the “test” string

① Since the above responseText string contains at least two substrings “dev”, it is obviously not possible to directly replace the string “dev” with “test”

② At the same time, because JS does not support reverse look, we can not judge the prefix as “SRC = ‘” in the regular, and then replace “dev”

③ We notice that the SRC attribute of the IMG tag ends with “. PNG”. Based on this, we can use the sequential affirmative look

var reg = /dev(?=[^']*png)/; // to avoid matching the first dev, the wildcard part must exclude single quotes (or angle brackets)
var str = responseText.replace(reg,"test");
console.log(str);//<div data='dev.xxx.txt'></div><img src='test.xxx.png' />

Of course, positive lookahead is not the only solution here; capture groups work as well. So what makes lookaround the more advanced tool? Its strength is that a single match pins down the exact position. In complex text replacement scenarios it often works wonders, whereas grouping requires more operations
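The capture group alternative mentioned above can be sketched as follows. The variable names viaGroup and viaLookbehind are only illustrative. The second variant assumes an engine that implements ES2018 lookbehind, which older browsers do not.

```javascript
var responseText = "<div data='dev.xxx.txt'></div><img src='dev.xxx.png' />";

// Capture group: match the prefix up to src=' as group 1, then put it back with $1
var viaGroup = responseText.replace(/(<img\b[^>]*\bsrc=')dev/, "$1test");

// Lookbehind (ES2018 and later): assert the prefix without consuming it
var viaLookbehind = responseText.replace(/(?<=src=')dev/, "test");

console.log(viaGroup);      // <div data='dev.xxx.txt'></div><img src='test.xxx.png' />
console.log(viaLookbehind); // <div data='dev.xxx.txt'></div><img src='test.xxx.png' />
```

The group version has to re-insert the text it consumed, while the lookaround version touches only “dev” itself.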

Thousands separator

The thousands separator, as the name implies, is the comma inside a number. Following Western custom, a symbol is inserted into a number so that its value stays easy to read even when the number is long. A comma is therefore added every three digits, and this comma is the thousands separator

So how do we format a string of digits with thousands separators?

var str = "1234567890";
(+str).toLocaleString();//"1,234,567,890"

As shown above, toLocaleString() returns the localized string form of the current object

  • If the object is a Number, a string of the value divided by separator symbols is returned

  • If the object is an Array, each item in the array is converted to a string, and the strings are joined with a locale-specific separator and returned

The toLocaleString method is special in that it is locale-aware. For China, the default separator is the English comma, so it can be used to convert a number into a string with thousands separators. If internationalization has to be considered, the above method may fail
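One way to reduce the dependence on the default locale is to pass the locale explicitly. This is a minimal sketch, assuming the runtime ships the Intl API with data for “en-US”:

```javascript
var n = 1234567890;

// Ask for the en-US convention explicitly instead of relying on the default locale
var viaLocale = n.toLocaleString("en-US");
var viaIntl = new Intl.NumberFormat("en-US").format(n);

console.log(viaLocale); // "1,234,567,890"
console.log(viaIntl);   // "1,234,567,890"
```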

Let’s try to handle it with lookaround instead

function thousand(str){
  return str.replace(/(?!^)(?=([0-9]{3})+$)/g,',');
}
console.log(thousand(str));//"1,234,567,890"
console.log(thousand("123456"));//"123,456"
console.log(thousand("1234567879876543210"));//"1,234,567,879,876,543,210"

The regex used above has two parts, (?!^) and (?=([0-9]{3})+$). Let’s look at the latter part first and analyze it step by step

  1. “[0-9]{3}” matches three consecutive digits

  2. “([0-9]{3})+” matches three consecutive digits repeated one or more times

  3. “([0-9]{3})+$” matches a run of digits whose count is a positive multiple of 3, reaching to the end of the string

  4. So (?=([0-9]{3})+$) matches a zero-width position such that the number of digits from that position to the end of the string is a positive multiple of 3

  5. The regex uses the global flag g, so it keeps matching until no further match can be found

  6. Replacing each such position with a comma is in effect adding a comma every three digits

  7. Of course, for a string such as “123456”, whose length is itself a positive multiple of 3, no comma may be placed before the 1. This is what (?!^) is for: the replacement position cannot be the start of the string

The thousands separator example shows how powerful lookaround is: the job is done in one step
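To see that the regex really matches positions rather than characters, the replacement callback can record the offset of every match. A small sketch (the names positions and result are only illustrative):

```javascript
var re = /(?!^)(?=([0-9]{3})+$)/g;
var positions = [];

// The callback receives (match, group1, offset); match is always the empty string here
var result = "1234567".replace(re, function (match, group1, offset) {
  positions.push(offset);
  return ",";
});

console.log(positions); // [1, 4]
console.log(result);    // "1,234,567"
```

Offset 0 is rejected by (?!^), and offset 7 (the end of the string) is rejected because ([0-9]{3})+ needs at least three digits after the position.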

Application of regular expression in JS

ES6 extensions to regular expressions

ES6 adds two more modifiers to regular expressions (other languages may not support them)

  • y (the sticky modifier) is similar to g: it is also a global match, and each match starts at the position right after the previous successful match. The difference is that the g modifier only needs a match somewhere in the remaining text, whereas the y modifier requires the match to begin exactly at the first remaining position

var s = "abc_ab_a";
var r1 = /[a-z]+/g;
var r2 = /[a-z]+/y;
console.log(r1.exec(s),r1.lastIndex); // ["abc", index: 0, input: "abc_ab_a"] 3
console.log(r2.exec(s),r2.lastIndex); // ["abc", index: 0, input: "abc_ab_a"] 3

console.log(r1.exec(s),r1.lastIndex); // ["ab", index: 4, input: "abc_ab_a"] 6
console.log(r2.exec(s),r2.lastIndex); // null 0

As shown above, the second match starts at index 3, where the character is “_”. The regex object r2, which carries the y modifier, must match starting exactly at that first remaining position, so the match fails and null is returned

The sticky property of a regex object indicates whether the y modifier is set
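A typical use of the y modifier is a tokenizer that must not silently skip unexpected characters. The sketch below is only an illustration: the function name tokenize and the tiny token grammar (integers, arithmetic operators, parentheses) are assumptions, not something taken from this article.

```javascript
// Hypothetical helper: splits an arithmetic expression into tokens using a sticky regex
function tokenize(input) {
  // Optional whitespace, one token (captured), optional trailing whitespace
  var tokenRe = /\s*(\d+|[-+*\/()])\s*/y;
  var tokens = [];
  while (tokenRe.lastIndex < input.length) {
    var start = tokenRe.lastIndex; // remember it: a failed exec resets lastIndex to 0
    var m = tokenRe.exec(input);
    if (m === null) {
      throw new SyntaxError("Unexpected character at index " + start);
    }
    tokens.push(m[1]);
  }
  return tokens;
}

console.log(tokenize("12 + 3*(4-1)")); // ["12", "+", "3", "*", "(", "4", "-", "1", ")"]
```

With the g modifier instead of y, an input such as "1 $ 2" would skip over the “$” and yield ["1", "2"]; with y the second exec fails at index 2 and the error is reported.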

  • The u modifier adds support for 4-byte code points to regular expressions. For example, the character “𝌆” is a 4-byte character. Matching it with a plain regular expression fails; with the u modifier the correct result is obtained

var s = "𝌆";
console.log(/^.$/.test(s));//false
console.log(/^.$/u.test(s));//true

UCS-2 encoding

In terms of character encoding, JavaScript can only handle UCS-2 (JS was designed by Brendan Eich in May 1995, more than a year before the UTF-16 specification was released in July 1996, so UCS-2 was the only option at the time). Because of this inherent limitation of UCS-2, every character in JS is 2 bytes; a 4-byte character is treated by default as two 2-byte characters. As a result, the character-handling functions of JS are constrained and cannot return correct results for such characters

var s = "𝌆";
console.log(s === "\ud834\udf06"); // true, 𝌆 is encoded as 0xD834 0xDF06 in UTF-16
console.log(s.length); // 2, a length of 2 shows that this is a 4-byte character

Fortunately, ES6 recognizes 4-byte characters automatically, so a for...of loop can be used directly to traverse a string. In addition, when a code point is written directly in JS to represent a Unicode character, ES5 cannot recognize the code points of 4-byte characters. ES6 fixes this problem: put the code point inside braces

console.log(s === "\u1d306"); // false, ES5 cannot recognize 𝌆
console.log(s === "\u{1d306}"); // true, ES6 recognizes 𝌆 with the brace syntax

Appendix: new ES6 functions for handling 4-byte code points

  • String.fromCodePoint(): returns the corresponding character from a Unicode code point

  • String.prototype.codePointAt(): returns the corresponding code point from a character

  • String.prototype.at(): returns the character at the given position of a string (this one did not make it into ES6; it was standardized later, in ES2022, and indexes by 2-byte code unit)
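The first two functions, together with for...of, can be tried on the character U+1D306 used earlier. A minimal sketch:

```javascript
var s = String.fromCodePoint(0x1d306); // the character 𝌆

console.log(s.length);                      // 2, two UTF-16 code units
console.log(s.codePointAt(0).toString(16)); // "1d306", the full code point
console.log(s.charCodeAt(0).toString(16));  // "d834", only the first code unit

// for...of iterates by code point, so the character is visited once
var chars = [];
for (var ch of s) {
  chars.push(ch);
}
console.log(chars.length);                     // 1
console.log(Array.from("a" + s + "b").length); // 3
```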

For more on the Unicode character set in JS, please refer to Ruan Yifeng’s “Unicode and JavaScript”.

On the other hand, from the perspective of methods, the methods related to regular expressions in JavaScript include:

Method name   compile   test     exec     match    search   replace   split
Object        RegExp    RegExp   RegExp   String   String   String    String

As shown above, JS has seven regex-related methods, which come from the RegExp and String objects respectively. First, let’s look at RegExp, the regular expression class in JS

RegExp

RegExp objects represent regular expressions and are mainly used to perform pattern matching on strings

Syntax: new RegExp(pattern[, flags])

The pattern parameter is a string that specifies the regular expression pattern, or another regular expression object

The flags parameter is an optional string containing the flags “g”, “i” and “m”, which specify global, case-insensitive and multiline matching respectively. If pattern is a regular expression rather than a string, this parameter must be omitted (this is the ES5 rule; ES6 allows flags here, and they replace the flags of the original regex)

var pattern = "[0-9]";
var reg = new RegExp(pattern,"g");
// The constructor call above can be replaced by the literal form below, which is also the recommended style
var reg = /[0-9]/g;
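The constructor is still useful when the pattern is only known at run time. Two things are easy to get wrong there: backslashes must be doubled inside a string pattern, and text that should be matched literally has to be escaped first. The sketch below illustrates both; escapeRegExp is a hypothetical helper written for this example, not a built-in function.

```javascript
// In a string pattern every backslash has to be written twice
var fromLiteral = /\d+\.\d+/;
var fromString = new RegExp("\\d+\\.\\d+");
console.log(fromLiteral.source === fromString.source); // true

// Hypothetical helper: escapes regex metacharacters so that text is matched literally
function escapeRegExp(text) {
  return text.replace(/[.*+?^${}()|[\]\\]/g, "\\$&");
}

var literalDot = new RegExp(escapeRegExp("1.5"));
console.log(literalDot.test("1x5")); // false, the dot no longer matches any character
console.log(literalDot.test("1.5")); // true
```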

There is a small historical footnote about creating regular expressions with a literal versus the constructor

“ECMAScript 3 specifies that a regular expression literal returns the same RegExp object every time it is evaluated, so regular expressions created from the same literal share one instance. It was not until ECMAScript 5 that a different instance is returned each time.”

Therefore, we no longer need to worry about this problem today. We only need to take care to create regular expressions with the constructor in old versions of non-IE browsers (in this respect IE has always followed the ES5 rule, while old versions of other browsers followed ES3)

A RegExp instance has the following properties:

Instance property   Description
global              Whether the global flag g is set (true / false)
ignoreCase          Whether the case-insensitive flag i is set (true / false)
multiline           Whether the multiline flag m is set (true / false)
source              The text of the pattern specified when the RegExp instance was created
lastIndex           The position just past the end of the matched text in the original string. The default value is 0
flags (ES6)         The modifiers of the regular expression
sticky (ES6)        Whether the y (sticky) modifier is set (true / false)
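The properties in the table can be inspected directly. A minimal sketch, assuming an ES6 environment for flags and sticky:

```javascript
var reg = /[a-z]+/gi;

console.log(reg.global);     // true
console.log(reg.ignoreCase); // true
console.log(reg.multiline);  // false
console.log(reg.sticky);     // false
console.log(reg.source);     // "[a-z]+"
console.log(reg.flags);      // "gi"
console.log(reg.lastIndex);  // 0

reg.test("abc def");
console.log(reg.lastIndex);  // 3, just past the matched text "abc"
```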

compile

The compile method is used to change and recompile regular expressions during execution

Syntax: compile(pattern[, flags])

For the parameters, please refer to the RegExp constructor above

var reg = new RegExp("abc", "gi"); 
var reg2 = reg.compile("new abc", "g");
console.log(reg);// /new abc/g
console.log(reg2);// undefined

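It can be seen that the compile method changes the original regular expression object and recompiles it. In the run above its return value is undefined (not null). Note that compile is deprecated and is kept only for compatibility, so creating a new RegExp object is usually the better choice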

test

The test method is used to detect whether a string matches a regular rule. As long as the string contains text matching the regular rule, the method returns true, otherwise it returns false

Syntax: test(string), the usage is as follows:

console.log(/[0-9]+/.test("abc123"));//true
console.log(/[0-9]+/.test("abc"));//false

Above, the string “abc123” contains digits, so the test method returns true; the string “abc” contains no digits, so it returns false

To use the test method to check whether an entire string matches a pattern, add the start (^) and end ($) metacharacters to the regular expression

console.log(/^[0-9]+$/.test("abc123"));//false

Since the string “abc123” does not consist only of digits from start to end (it begins with letters), the test method returns false

In fact, if a regular expression has the global flag (the g modifier), the test method is also affected by the lastIndex property of the regex object, as follows:

var reg = /[a-z]+/; // regex without the global flag
console.log(reg.test("abc"));//true
console.log(reg.test("de"));//true

var reg = /[a-z]+/g; // regex with the global flag g
console.log(reg.test("abc"));//true
console.log(reg.lastIndex); // 3, the next call to test starts searching at index 3
console.log(reg.test("de"));//false

This influence will be analyzed in the explanation of exec method

exec

The exec method is used to detect the string matching the regular expression. If the matched text is found, a result array is returned, otherwise null is returned

Syntax: exec(string)

The array returned by the exec method carries two additional properties, index and input

  • Item 0 is the text matched by the whole regular expression

  • Items 1 to n are the 1st to nth backreferences, that is, the text captured by groups 1 to n in order. The same text can also be read through RegExp.$ followed by the group number (RegExp.$1 through RegExp.$9)

  • index is the starting position of the matched text

  • input is the string that was searched

Whether or not a regular expression has the global flag “g”, exec itself behaves the same way. The regular expression object, however, behaves somewhat differently. Let’s look at those differences in detail

Suppose the regular expression object is reg, the string being searched is string, and the return value of reg.exec(string) is array

If reg has the global flag “g”, the reg.lastIndex property holds the position just past the end of the matched text in the original string, which is where the next match begins. At this point reg.lastIndex == array.index (start position of the match) + array[0].length (length of the matched text)

var reg = /([a-z]+)/gi,
    string = "World Internet Conference";
var array = reg.exec(string);
console.log(array);//["World", "World", index: 0, input: "World Internet Conference"]
console.log(RegExp.$1);//World
console.log(reg.lastIndex); // 5, which is exactly array.index + array[0].length

As matching continues, array.index increases and so does reg.lastIndex. We can therefore call the exec method repeatedly to traverse all the matching text in the string, until exec finds no more matches, returns null, and resets the reg.lastIndex property to 0

Next, let’s continue to execute the code to see if the above is correct, as shown below:

array = reg.exec(string);
console.log(array);//["Internet", "Internet", index: 6, input: "World Internet Conference"]
console.log(reg.lastIndex);//14

array = reg.exec(string);
console.log(array);//["Conference", "Conference", index: 15, input: "World Internet Conference"]
console.log(reg.lastIndex);//25

array = reg.exec(string);
console.log(array);//null
console.log(reg.lastIndex);//0

In the code above, after repeated calls to the exec method, the reg.lastIndex property is eventually reset to 0
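The repeated calls shown above are usually written as a loop. A minimal sketch of that idiom (the array name words is only illustrative):

```javascript
var reg = /([a-z]+)/gi;
var string = "World Internet Conference";
var words = [];
var match;

// exec returns null once no further match exists, which ends the loop
while ((match = reg.exec(string)) !== null) {
  words.push(match[0] + "@" + match.index);
}

console.log(words);         // ["World@0", "Internet@6", "Conference@15"]
console.log(reg.lastIndex); // 0, reset by the final failed match
```

The loop relies on the g flag. Without it, lastIndex never advances and the same first match would be returned forever.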

Problem review

In the explanation of the test method we left a question open. If the regular expression has the global flag g, the result of the test method is affected by reg.lastIndex, and so is the exec method. The value of reg.lastIndex is not always zero, and it determines where the next match begins. If you want to start searching a new string after having matched in another one, you must manually reset the lastIndex property to 0

var reg = /[0-9]+/g,
    str1 = "123abc",
    str2 = "123456";
reg.exec(str1);
console.log(reg.lastIndex);//3
var array = reg.exec(str2);
console.log(array);//["456", index: 3, input: "123456"]

The intended result of the code above is “123456”, not “456”. It is therefore advisable to add “reg.lastIndex = 0;” before calling the exec method the second time
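The fix can be sketched as follows, together with an alternative that avoids the problem by not using the g flag at all (the variable name plain is only illustrative):

```javascript
var reg = /[0-9]+/g;
reg.exec("123abc");          // leaves reg.lastIndex at 3

reg.lastIndex = 0;           // reset before searching a different string
var array = reg.exec("123456");
console.log(array[0]);       // "123456"

// Alternative: a regex without the g flag always starts from index 0
var plain = /[0-9]+/;
plain.exec("123abc");
console.log(plain.exec("123456")[0]); // "123456"
console.log(plain.lastIndex);         // 0
```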

If reg does not have the global flag “g”, the result (array) of the exec method is exactly the same as the result of string.match(reg)
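This equivalence, and how it breaks once the g flag is added, can be checked with a short sketch (the sample pattern and string are made up for this illustration):

```javascript
var reg = /([a-z]+)(\d+)/;
var string = "abc123 def456";

var viaExec = reg.exec(string);
var viaMatch = string.match(reg);

// Without g, both return the first match with its groups and the index property
console.log(viaExec[0], viaExec[1], viaExec[2], viaExec.index);     // abc123 abc 123 0
console.log(viaMatch[0], viaMatch[1], viaMatch[2], viaMatch.index); // abc123 abc 123 0

// With g, match returns every full match and drops the groups
var all = string.match(/([a-z]+)(\d+)/g);
console.log(all); // ["abc123", "def456"]
```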

String

For the match, search, replace and split methods, please refer to the explanation in “Common methods of string”

The following shows how to use a capture group to process a text template and generate the complete string:

var tmp = "An ${a} a ${b} keeps the ${c} away";
var obj = {
  a:"apple",
  b:"day",
  c:"doctor"
};
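function tmpl(t,o){
  // \$\{ and \} match the literal "${" and "}"; (.) captures the one-character key in between
  return t.replace(/\$\{(.)\}/g,function(m,p){
    console.log('m:'+m+' p:'+p);
    return o[p];
  });
}
console.log(tmpl(tmp,obj)); // An apple a day keeps the doctor away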

The same result can be achieved with an ES6 template literal:

var obj = {
  a:"apple",
  b:"day",
  c:"doctor"
};
with(obj){
  console.log(`An ${a} a ${b} keeps the ${c} away`);
}

Application of regular expression in H5

The pattern attribute was added in H5 to specify the pattern used to validate an input field. Its value is written as a regular expression. By default the pattern must match the whole value; that is, whether or not the regular expression contains the “^” and “$” metacharacters, it is matched against the entire text

Note: pattern applies to the following input types: text, search, url, tel, email and password. To cancel form validation, add the novalidate attribute to the form tag
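The whole-value behavior can be imitated in plain JS by wrapping the pattern in ^(?: and )$. The function matchesPattern below is a hypothetical helper for illustration only. It is not a browser API, and real browsers may compile the pattern with additional flags.

```javascript
// Hypothetical helper: emulates the whole-value matching of the pattern attribute
function matchesPattern(pattern, value) {
  // The non-capturing group keeps alternations such as "a|b" inside the anchors
  var anchored = new RegExp("^(?:" + pattern + ")$");
  return anchored.test(value);
}

console.log(matchesPattern("[0-9]{6}", "123456"));   // true
console.log(matchesPattern("[0-9]{6}", "a123456b")); // false, extra text around the digits
console.log(/[0-9]{6}/.test("a123456b"));            // true, an unanchored regex finds a substring
console.log(matchesPattern("cat|dog", "dog"));       // true
console.log(matchesPattern("cat|dog", "hotdog"));    // false
```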

Regex engines

At present there are two kinds of regex engines, DFA and NFA. NFA engines are further divided into traditional NFA and POSIX NFA

  • DFA: deterministic finite automaton

  • NFA: non-deterministic finite automaton

  • Traditional NFA

  • POSIX NFA

A DFA engine does not backtrack, matches quickly, and does not support capture groups, so it does not support backreferences either. The awk and egrep commands mentioned above use a DFA engine

POSIX NFA mainly refers to NFA engines that conform to the POSIX standard. Languages such as JavaScript, Java, PHP, Python and C# use NFA engines of the traditional, backtracking kind
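One visible difference between the engine types is how alternation is resolved. A traditional NFA, which is how the JS engine behaves here, takes the first alternative that leads to a match, while POSIX rules ask for the longest match at the leftmost position. A small sketch of the JS behavior:

```javascript
// The alternatives are tried from left to right; the first one that succeeds wins
var first = /a|ab/.exec("ab");
console.log(first[0]); // "a", although "ab" would be the longer match

// Reordering the alternatives changes the result
var second = /ab|a/.exec("ab");
console.log(second[0]); // "ab"
```

Under POSIX leftmost-longest rules both patterns would be expected to return “ab”, so the order of alternatives matters only in the traditional NFA case.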

As for the detailed matching principle of regular expressions, I have not yet found a suitable article online. I recommend reading Chapter 4, “The Mechanics of Expression Processing” (p143-p183), of Jeffrey Friedl’s “Mastering Regular Expressions” (Third Edition). Jeffrey Friedl has a profound understanding of regular expressions, and I believe he can help you learn them better

For a simple implementation of an NFA engine, you can refer to the article “A regular expression engine based on ε-NFA” by twoon

Summary

In the beginner stage of learning regular expressions, the focus is on understanding ① greedy and non-greedy modes, ② grouping, ③ capturing and non-capturing groups, ④ named groups, and ⑤ atomic groups. In the advanced stage, the focus is on knowing how regex matching works and on skillfully using ⑥ zero-width assertions (lookaround) to solve problems

In fact, regular expressions in JavaScript are not especially powerful. JS (before ES2018) supports only ① greedy and non-greedy modes, ② grouping, ③ capturing and non-capturing groups, and the lookahead part of ⑥ zero-width assertions. If you are also familiar with the seven regex-related methods in JS (compile, test, exec, match, search, replace, split), processing text or strings will be easy

Regular expressions are very powerful for text processing and are sometimes the only solution. They are not limited to JS: popular editors (such as Sublime Text and Atom) and IDEs (such as WebStorm and IntelliJ IDEA) support them too. You can try regular expressions in almost any language at any time; problems that could not be solved before may now be solved easily

Regular expression resources for other languages are attached below

  • Python regular expression operation guide

  • Java regular expression


Author: Louis
About this article: it runs to about 12K words and was written on and off over two months. To present the rules of regular expression usage in front-end scenarios concisely and comprehensively, I collected a large amount of material on regular expressions and cut many redundant words. Writing is not easy; if you like it, please give it a like or bookmark it. I will keep updating it
Original address: http://louiszhai.github.io/20…

Reference articles

  • Jeffrey Friedl, “Mastering Regular Expressions”

  • Comparison of Linux shell regular expressions (BREs, EREs, PREs)

  • Introduction to capture groups and non-capture groups in regular expressions (Script House)

  • Regular expressions (1): metacharacters (counter, Blog Garden)

  • Regular expressions explained (guisu, CSDN.NET blog)

  • Atomic groups in regular expressions (taek, Blog Garden)

  • A detailed explanation of greedy and non-greedy patterns in regular expressions (Script House)

  • Learning JavaScript regular expressions: basics and zero-width assertions, reposted from Situ Zhengmei (feather of the wind, CSDN.NET blog)

  • Unicode and JavaScript in detail (Ruan Yifeng’s Weblog)


  1. [0-9]
  2. [ \f\n\r\t\v]
  3. [A-Za-z0-9_]