Advanced Regular Expression Techniques

Time:2021-1-8

Regular expressions (abbreviated regex) are powerful and can be used to find information in a large string of characters. They work by describing conventional patterns of characters. Unfortunately, simple regular expressions are far from enough for some advanced applications. If the structure you need to filter is complex, you may need advanced regular expressions.

This article introduces advanced regular expression techniques. Eight commonly used concepts have been selected and are analyzed with examples. Each example is a simple way to meet a complex requirement. If you don't know the basic concepts of regular expressions, please read an introductory article, a tutorial, or the wiki entry first.

The regular expression syntax here applies to PHP and is compatible with Perl.

1. Greediness / laziness

All regex quantifiers that can repeat a pattern multiple times are greedy. They match as much of the target string as possible, which means the result is as long as possible. Unfortunately, this is not always what we want, so we add the "lazy" qualifier to solve the problem. Adding "?" after a greedy operator makes the expression match only the shortest possible length. In addition, the modifier "U" makes such quantifiers lazy by default. Understanding the difference between greediness and laziness is the basis of using advanced regular expressions.

Greedy operator
The * operator matches the previous expression zero or more times. It is a greedy operator. Here is an example:

The code is as follows:
preg_match( '/<h1>.*<\/h1>/', '<h1>This is a title.</h1><h1>This is another one.</h1>', $matches );

A period (.) represents any character except a newline. The regular expression above matches the h1 tag and everything within it, using a period (.) and an asterisk (*) to match everything inside the tag. The result is as follows:

<h1>This is a title.</h1><h1>This is another one.</h1>
The entire string is returned. The * operator keeps matching everything, even the closing h1 tag in the middle. Because it is greedy, matching the whole string is consistent with its principle of matching as much as possible.

Lazy operator
A small modification of the expression above, adding a question mark (?), makes it lazy:

/<h1>.*?<\/h1>/
This way, the expression considers its task complete as soon as it reaches the first closing h1 tag.
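
The difference can be checked side by side. The following is a minimal sketch, assuming a hypothetical one-line sample string; it also shows the "U" modifier mentioned earlier, which inverts the default greediness.

```php
<?php
// Hypothetical sample input: two headings on a single line.
$html = '<h1>This is a title.</h1><h1>This is another one.</h1>';

// Greedy: .* runs on to the last closing tag.
preg_match('/<h1>.*<\/h1>/', $html, $greedy);

// Lazy: .*? stops at the first closing tag.
preg_match('/<h1>.*?<\/h1>/', $html, $lazy);

// The U modifier inverts greediness, so a plain .* behaves lazily.
preg_match('/<h1>.*<\/h1>/U', $html, $ungreedy);

echo $greedy[0], "\n";   // the whole string
echo $lazy[0], "\n";     // <h1>This is a title.</h1>
echo $ungreedy[0], "\n"; // <h1>This is a title.</h1>
```

Writing the ? explicitly is usually easier to read than relying on U, because the laziness is visible right next to the quantifier it affects.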

Another greedy operator with similar properties is {n,}. It means that the preceding pattern is repeated n times or more. Without a question mark, it looks for as many repetitions as possible. With one, it repeats as few times as possible (of course, n repetitions is the minimum).

The code is as follows:
#Create the string
$str = 'hihihi oops hi';
#Match with the greedy {n,} operator
preg_match( '/(hi){2,}/', $str, $matches ); # $matches[0] will be 'hihihi'
#Match with the lazy {n,}? operator
preg_match( '/(hi){2,}?/', $str, $matches ); # $matches[0] will be 'hihi'

2. Back referencing

What is it for?
Back referencing (also rendered as "backward reference"; "back reference" is used here) is a way to refer, inside a regular expression, to content that was captured earlier in the same expression. For example, the purpose of the following simple example is to match the content inside quotation marks:

The code is as follows:
#Build the array for the matches
$matches = array();

#Create the string
$str = "\"this is a 'string'\"";

#Capture content with a regular expression
preg_match( "/(\"|').*?(\"|')/", $str, $matches );

#Output the whole matched string
echo $matches[0];

It outputs:

"this is a '
Obviously, this is not what we want.

This expression starts matching at the opening double quotation mark and incorrectly ends the match at the first single quotation mark. This is because the expression says ("|'), that is, either a double quotation mark (") or a single quotation mark ('). To fix this problem, you can use a back reference. The expressions \1, \2, ..., \9 hold the numbers of the previously captured groups and can be used as "pointers" to those groups. In this case, the first matched quotation mark is represented by \1.

How to use it?
In the above example, replace the closing quotation mark group with \1:

preg_match( '/("|\').*?\1/', $str, $matches );
This returns the string correctly:

"this is a 'string'"
Translator's note:

If the quotation marks are Chinese ones, where the opening mark and the closing mark are different characters, what should be done?

Remember the PHP function preg_replace? It supports back references too. There, instead of \1 ... \9, we use $1 ... $9 ... $n (any number) as the back reference. For example, if you want to replace all paragraph tags <p> with text:

The code is as follows:
$text = preg_replace( '/<p>(.*?)<\/p>/',
"&lt;p&gt;$1&lt;/p&gt;", $html );

The parameter $1 is a back reference that represents the text inside the paragraph tag <p> and is inserted into the replacement text. This easy-to-use notation gives us a simple way to reuse matched text, even in the replacement.
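
Both kinds of back reference can appear in a single call. The sketch below is only an illustration with a hypothetical input sentence: \1 inside the pattern finds a word that is immediately repeated, and $1 in the replacement keeps one copy of it.

```php
<?php
// Hypothetical input containing an accidentally doubled word.
$sentence = 'this is is a test';

// \1 in the pattern must match the same text that group 1 captured;
// $1 in the replacement inserts that captured text once.
$fixed = preg_replace('/\b(\w+) \1\b/', '$1', $sentence);

echo $fixed, "\n"; // this is a test
```

The word boundaries keep the pattern from firing on the "is" that ends "this".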

3. Named groups
It is easy to get confused when you use back references many times in one expression. Working out which sub-match each number (\1 ... \9) stands for is very troublesome. An alternative to back references is to use named capture groups (hereinafter "named groups"). A named group is written as (?P<name>pattern), where name is the group name and pattern is the regular expression that the group matches. Here is an example:

/(?P<quote>"|').*?(?P=quote)/
In the expression above, quote is the group name, and "|' is the pattern it matches. The later (?P=quote) refers back to the named group quote. The effect is the same as that of the back reference example above, except that it is implemented with a named group. Isn't it easier to read and understand?

Named groups can also be used to work with the array of matched content. The name given to a group can be used as the key of the matched content in the array.

The code is as follows:
preg_match( '/(?P<quote>"|\')/', "'string'", $matches );

#The following statement outputs "'" (without the double quotes)
echo $matches[1];

#Using the group name also outputs "'"
echo $matches['quote'];

So a named group doesn't just make code easier to write; it can also be used to organize code.
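
Named groups pay off most when a pattern has several captures. Here is a small sketch; the input string and the group names year, month and day are assumptions made up for this illustration.

```php
<?php
// Hypothetical input; the group names are chosen for illustration only.
$line = 'Released on 2021-01-08.';

preg_match('/(?P<year>\d{4})-(?P<month>\d{2})-(?P<day>\d{2})/', $line, $m);

// Each group is available under its name and under its number.
echo $m['year'], '/', $m['month'], '/', $m['day'], "\n"; // 2021/01/08
echo $m[1], "\n";                                          // 2021

// (?P=quote) reuses a named group inside the pattern itself.
preg_match('/(?P<quote>"|\').*?(?P=quote)/', "\"this is a 'string'\"", $q);
echo $q[0], "\n"; // "this is a 'string'"
```

If a group is later added in front of the others, code that reads $m['day'] keeps working, while code that reads $m[3] would need to be renumbered.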

4. Word boundaries

A word boundary is the position between a word character (letters, digits, and underscores) and a non-word character in a string. What is special is that it does not match an actual character: its length is zero. \b matches all word boundaries.

Unfortunately, word boundaries are generally overlooked, and most people don't care about their practical significance. For example, if you want to match the word "import":

/import/
Attention! Regular expressions can be naughty. The following string also matches the expression above:

important
You may think that adding spaces before and after import is enough to match only the standalone word:

/ import /
But what if this happens:

the trader voted for the import
When the word import is at the beginning or end of a string, the modified expression no longer matches. Therefore, the different cases have to be considered:

/(^import | import | import$)/i
Don't panic, it's not over yet. What about punctuation? To match this word, your expression may need to be written as follows:

/(^import(:|;|,)? | import(:|;|,)? | import(\.|\?|!)?$)/i
That is a lot of effort to match just one word. This is why the word boundary matters so much. To meet the above requirements, and many other variations, with word boundaries, the code we need to write is as follows:

/\bimport\b/
All of the above cases are solved. The flexibility of \b comes from being a zero-length match. It only matches the imaginary position between two actual characters, checking whether one of the two adjacent characters is a word character and the other a non-word character. If so, the position matches. At the beginning or end of the string, \b treats the missing neighbor as a non-word character. Since the i in import is a word character, import is matched.

Note that, as opposed to \b, we also have \B, which matches the position between two word characters or between two non-word characters. Therefore, if you want to match "hi" inside a word, you can use:

\Bhi\B
"this" and "thigh" return a match, while "hi there" and "high" do not, because there "hi" begins at a word boundary.
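
The cases discussed in this section can be run together. This is a minimal sketch; the sample strings are assumptions picked to cover a word inside another word, a word at the end of the string, and a word followed by punctuation.

```php
<?php
// Hypothetical sample subjects covering the cases discussed above.
$subjects = array(
    'important',
    'the trader voted for the import',
    'Import: done',
    'imports',
);

$wholeWord = array();
foreach ($subjects as $subject) {
    // preg_match() returns 1 on a match and 0 otherwise.
    $wholeWord[$subject] = preg_match('/\bimport\b/i', $subject);
}
print_r($wholeWord);

// \B requires "hi" to have word characters on both sides.
$inside = array();
foreach (array('this', 'hi there') as $subject) {
    $inside[$subject] = preg_match('/\Bhi\B/', $subject);
}
print_r($inside);
```

Only the second and third subjects contain import as a standalone word, and only "this" has "hi" strictly inside a word.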

5. Atomic groups

An atomic group is a special non-capturing regular expression group. It is usually used to improve the efficiency of regular expressions, and can also be used to eliminate specific matches. An atomic group is defined by (?>pattern), where pattern is the expression to match.

/(?>his|this)/
When the regex engine leaves an atomic group, it discards the backtracking positions it remembered for the tokens inside the group. Take the word "smashing" as an example. With the expression above, the engine looks for "his" and then for "this" at each position, finds neither, and reports no match. The atomic group only makes a difference after one of its alternatives has matched: the engine then commits to that alternative and, if a later part of the expression fails, does not return to the group to try the remaining alternatives.

The above example is not practical; a similar effect can be achieved with /t?his?/. Let's look at the following example:

/\b(engineer|engrave|end)\b/
If "engineering" is the subject, the engine first matches "engineer", but then it reaches the word boundary \b, so the match fails. The engine then goes back and tries the next alternative, engrave. After matching "eng", the following characters do not fit, and this attempt fails too. Finally it tries "end", which also fails. If you look closely, you will see that once engineer has matched and the word boundary check has failed, it is impossible for "engrave" or "end" to succeed. Both words are shorter than engineer, so the engine should not bother trying them.

/\b(?>engineer|engrave|end)\b/
The atomic version above saves the engine matching time and improves the efficiency of the code.
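
A short sketch can confirm that the atomic version returns the same results as the plain one here, and show how an atomic group can also eliminate a match. The second pair of patterns, built around "abc", is a made-up illustration and not part of the example above.

```php
<?php
$plain  = '/\b(engineer|engrave|end)\b/';
$atomic = '/\b(?>engineer|engrave|end)\b/';

// Same answers either way; the atomic form just gives up sooner.
$results = array(
    'plain_engineering'  => preg_match($plain, 'engineering'),
    'atomic_engineering' => preg_match($atomic, 'engineering'),
    'plain_engineer'     => preg_match($plain, 'the engineer'),
    'atomic_engineer'    => preg_match($atomic, 'the engineer'),
);
print_r($results);

// Hypothetical patterns showing a match being eliminated:
// the plain group can backtrack from "bc" to "b" so the final c matches.
$withBacktracking = preg_match('/a(bc|b)c/', 'abc');

// The atomic group keeps "bc", leaving nothing for the final c.
$withoutBacktracking = preg_match('/a(?>bc|b)c/', 'abc');

echo $withBacktracking, ' ', $withoutBacktracking, "\n"; // 1 0
```

Because an atomic group can change what matches, it is worth checking that the alternatives cannot succeed after backtracking before switching to it as an optimization.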

6. Recursion

Recursion is used to match nested structures, such as nested parentheses, (this (that)), or nested HTML tags, <div><div></div></div>. We use (?R) to represent the whole pattern in the recursive process. Here is an example of matching nested parentheses:

/\(((?>[^()]+)|(?R))*\)/
The outermost escaped parenthesis "\(" matches the beginning of the nested structure. Then comes an alternation (|), which either matches all characters other than parentheses, "(?>[^()]+)", or matches the whole expression again through the subpattern "(?R)". Note that the * operator matches as many nested levels as possible.
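
To see the recursion at work, the pattern can be run against a string with parentheses nested two levels deep. This is a minimal sketch with a hypothetical subject string.

```php
<?php
$pattern = '/\(((?>[^()]+)|(?R))*\)/';

// Hypothetical subject: one nested group and one flat group.
$subject = 'f(a(b)c) + (d)';

// The first match covers the whole balanced group, inner pair included.
preg_match($pattern, $subject, $first);
echo $first[0], "\n"; // (a(b)c)

// preg_match_all() then continues after that match and finds the next group.
preg_match_all($pattern, $subject, $all);
print_r($all[0]); // (a(b)c) and (d)
```

A pattern without recursion would have to hard-code a maximum nesting depth; (?R) removes that limit.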

Another example of recursion is as follows:

/<([\w]+).*?>((?>[^<>]+)|(?R))*<\/\1>/
The above expression combines character classes, greedy operators, back references and atomic groups to match nested tags. The first parenthesized group, ([\w]+), matches the tag name, which is used later. Once an opening tag in angle brackets is found, the expression tries to find the rest of the tag content. The next parenthesized subexpression is very similar to the previous example: it either matches all characters that are not angle brackets, (?>[^<>]+), or recursively matches the entire expression, (?R). The last part of the whole expression is the closing tag in angle brackets, <\/\1>.
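
As a quick check, the tag pattern can be tried on a small well-formed fragment. This is only a sketch with a made-up input; because .*? will also swallow attributes and stray text, it should not be relied on to validate real HTML.

```php
<?php
$pattern = '/<([\w]+).*?>((?>[^<>]+)|(?R))*<\/\1>/';

// Hypothetical, well-formed input with one level of nesting.
$html = '<div><p>text</p></div>';

preg_match($pattern, $html, $m);

echo $m[0], "\n"; // <div><p>text</p></div>
echo $m[1], "\n"; // div
```

Inside the recursive call, \1 refers to the inner tag name (p); once the recursion returns, group 1 holds the outer name again, which is why the final <\/\1> matches </div>.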

7. Callbacks

Sometimes specific content in the matching result needs special modification. When the modifications are numerous or complex, regular expression callbacks have their place. Callbacks are used with the function preg_replace_callback to modify a string dynamically. You pass preg_replace_callback a function as a parameter. This function receives the array of match results as its argument and returns the replacement string.

For example, suppose we want to capitalize the first letter of every word in a string. Unfortunately, PHP has no regex operator that converts letter case directly. To do this, you can use a callback. First, the expression should match all the letters that need to be capitalized:

/\b\w/
The expression above uses both a word boundary and a character class. The expression alone is not enough; we also need a callback function:

The code is as follows:
function upper_case( $matches ) {
return strtoupper( $matches[0] );
}

The function upper_case receives the array of match results and converts the whole match to uppercase. In this case, $matches[0] is the letter that needs to be capitalized. Then, we use preg_replace_callback to apply the callback:

preg_replace_callback( '/\b\w/', "upper_case", $str );
A simple callback is this powerful.
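
Put together, the pieces above form a complete script. The sketch below uses a hypothetical input string and also shows the same call written with an anonymous function, which avoids defining a named helper.

```php
<?php
// Named callback, as described above.
function upper_case($matches) {
    return strtoupper($matches[0]);
}

// Hypothetical input string.
$str = 'hello regex world';

$viaName = preg_replace_callback('/\b\w/', 'upper_case', $str);

// The same replacement with an anonymous function.
$viaClosure = preg_replace_callback('/\b\w/', function ($matches) {
    return strtoupper($matches[0]);
}, $str);

echo $viaName, "\n";    // Hello Regex World
echo $viaClosure, "\n"; // Hello Regex World
```

The callback is invoked once per match, so any logic that decides the replacement from the matched text can live inside it.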

8. Commenting

Comments are not used to match strings, but they are arguably the most important part of a regular expression. As an expression grows deeper and more complex, it becomes harder and harder to work out what it matches. Adding comments inside the regular expression is the best way to minimize confusion later on.

To comment a regular expression, use the (?#comment) format. Replace "comment" with your own comment text:

/(?#digit)\d/
If you are going to make your code public, commenting regular expressions is important. It makes it easier for others to understand and modify your code. As with comments elsewhere, it also makes it easier for you to revisit programs you wrote earlier.

Consider using the "x" or "(?x)" modifier to format comments. This modifier makes the engine ignore whitespace between the parts of the expression. "Useful" spaces can still be matched by [ ] or by \ followed by a space (a backslash plus a space).

The code is as follows:
/
\d #digit
[ ] #space
\w+ #word
/x

The above code works the same as the following expression:

/\d(?#digit)[ ](?#space)\w+(?#word)/
Always pay attention to the readability of the code.
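
Both comment styles can be compared directly in PHP. This is a minimal sketch, assuming a hypothetical subject string; the two patterns are the ones shown above.

```php
<?php
// Free-spacing form: whitespace and # comments are ignored under x.
$spaced = '/
    \d     # digit
    [ ]    # space
    \w+    # word
/x';

// Inline form using (?#...) comments.
$inline = '/\d(?#digit)[ ](?#space)\w+(?#word)/';

// Hypothetical subject string.
$subject = 'Room 7 north';

preg_match($spaced, $subject, $a);
preg_match($inline, $subject, $b);

echo $a[0], "\n"; // 7 north
echo $b[0], "\n"; // 7 north
```

The literal space has to be written as [ ] in the first pattern, since a bare space would be ignored under the x modifier.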

Pattern modifiers
A pattern modifier enhances and supplements a regular expression, and is written outside the expression, after the closing delimiter.
Example: in /regex/u, the trailing u is a pattern modifier.
The following are commonly used in PHP (note: they are case sensitive):
i makes matching case insensitive (the default is case sensitive)
m enables multi-line mode, so ^ and $ match at the start and end of each line
s makes the dot (.) match newline characters as well
x ignores whitespace in the pattern
A forces the match to start at the beginning of the subject
D forces $ to match only at the very end of the subject
U disables greedy matching, so a quantifier stops at the nearest match; commonly used in scraping programs
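
The effect of several of these modifiers is easiest to see by running the same pattern with and without them. The subjects below are assumptions made up for this sketch.

```php
<?php
$results = array(
    // i: case-insensitive matching.
    'i_off' => preg_match('/import/', 'IMPORT'),
    'i_on'  => preg_match('/import/i', 'IMPORT'),

    // m: ^ and $ match at every line, not only at the ends of the subject.
    'm_off' => preg_match_all('/^\w+$/', "one\ntwo\nthree"),
    'm_on'  => preg_match_all('/^\w+$/m', "one\ntwo\nthree"),

    // s: the dot also matches a newline.
    's_off' => preg_match('/a.b/', "a\nb"),
    's_on'  => preg_match('/a.b/s', "a\nb"),

    // A: the match must begin at the start of the subject.
    'A_off' => preg_match('/\d+/', 'abc123'),
    'A_on'  => preg_match('/\d+/A', 'abc123'),

    // D: $ no longer matches before a trailing newline.
    'D_off' => preg_match('/abc$/', "abc\n"),
    'D_on'  => preg_match('/abc$/D', "abc\n"),
);

print_r($results);
```

Modifiers can be combined after the closing delimiter, for example /pattern/im.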