Advanced regular expression techniques and examples


The original English article is from Smashing Magazine. Translated by "Stupid Workers". Please credit the source when reprinting.

Regular expressions (abbr. regex) are powerful and can be used to find the required information in a long string of characters. They work by describing the structure of the text to match with a pattern of ordinary and special characters. Unfortunately, simple regular expressions are far from enough for some advanced applications. If the structure you need to filter is complex, you may need advanced regular expressions.

This article introduces advanced regular expression techniques. We have picked eight commonly used concepts and analyze each of them with examples. Each example is a simple way to meet some complex requirement. If you don't yet know the basic concepts of regular expressions, please read an introductory article, a tutorial, or the Wiki entry first.

The regular expression syntax here applies to PHP and is Perl-compatible.

1. Greed / laziness


All regex operators that can repeat a pattern (the quantifiers) are greedy. They match as much of the target string as they can; in other words, the match will be as long as possible. Unfortunately, this is not always what we want. So we add a "lazy" qualifier to solve the problem: adding "?" after a greedy operator makes the expression match the shortest possible length instead. In addition, the "U" modifier inverts greediness for the whole pattern, so quantifiers become lazy by default. Understanding the difference between greed and laziness is the basis of using advanced regular expressions.

Greedy operator

The * operator matches the previous expression zero or more times. It’s a greedy operator. Here’s an example:

preg_match( '/<h1>.*<\/h1>/', '<h1>This is a title</h1> <h1>This is another one</h1>', $matches );

A period (.) can represent any character except a newline character. The regular expression above matches the h1 tag and everything inside it. It uses a period (.) and an asterisk (*) to match everything in the tag. The result is as follows:

<h1>This is a title</h1> <h1>This is another one</h1>

The entire string is returned. The * operator keeps matching everything, even the closing h1 tag in the middle. Because it is greedy, matching the whole string is consistent with its principle of matching as much as possible.

Lazy operator

Modify the above pattern slightly by adding a question mark (?) to make the expression lazy:

/<h1>.*?<\/h1>/

This way, the expression considers its job done as soon as it has matched the first closing h1 tag.
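
The difference between the greedy and lazy forms can be checked side by side. The following is a minimal sketch that reuses the heading string from the example above; the variable names are only illustrative.

```php
<?php
// Two adjacent headings, as in the example above.
$html = '<h1>This is a title</h1> <h1>This is another one</h1>';

// Greedy: .* runs to the end of the line, then backs up to the LAST </h1>.
preg_match('/<h1>.*<\/h1>/', $html, $greedy);

// Lazy: .*? stops at the FIRST </h1> it can reach.
preg_match('/<h1>.*?<\/h1>/', $html, $lazy);

// The U modifier inverts greediness, so a plain .* behaves lazily here.
preg_match('/<h1>.*<\/h1>/U', $html, $ungreedy);

echo $greedy[0], "\n";   // the whole string
echo $lazy[0], "\n";     // <h1>This is a title</h1>
echo $ungreedy[0], "\n"; // <h1>This is a title</h1>
```

Note that under the U modifier a trailing "?" flips a quantifier back to greedy, so mixing both in one pattern can be confusing.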

Another greedy operator with similar properties is {n,}. It means that the preceding pattern is repeated n or more times. Without a question mark, it looks for as many repetitions as possible. With one, it repeats as few times as possible (of course, n repetitions is the minimum).

# Create the string
$str = 'hihihi oops hi';
# Use the greedy {n,} operator to match
preg_match( '/(hi){2,}/', $str, $matches );   # $matches[0] will be 'hihihi'
# Use the lazy {n,}? operator to match
preg_match( '/(hi){2,}?/', $str, $matches );  # $matches[0] will be 'hihi'
2. Back referencing

What is it for?
A back reference is a way to refer, inside a regular expression, to content that was captured earlier in the same expression. For example, the purpose of the following simple example is to match the content inside quotation marks:
# Create the matches array
$matches = array();
# Create the string
$str = "\"This is a 'string'\"";
# Capture the content with a regular expression
preg_match( "/(\"|').*?(\"|')/", $str, $matches );
# Output the whole matched string
echo $matches[0];

It outputs:
"This is a '
Obviously, this is not what we want.
This expression starts matching at the opening double quotation mark and wrongly ends the match at the first single quotation mark. This is because the expression says ("|'), that is, either a double quotation mark (") or a single quotation mark ('). To fix this problem, you can use a back reference. The expressions \1, \2, ... \9 refer by number to the groups captured earlier and can be used as "pointers" to those groups. In this case, the first quotation mark that was matched is represented by \1.
How to use it?
In the above example, replace the closing quotation mark with \1:
preg_match( '/("|\').*?\1/', $str, $matches );
This returns the string correctly:
"This is a 'string'"
Translator's note, something to think about:
With Chinese quotation marks, the opening mark and the closing mark are not the same character. What should we do then?
Remember the PHP function preg_replace? It supports back references too. There, instead of \1 ... \9, we use $1 ... $9 ... $n (any number). For example, if you want to replace all paragraph tags with text:
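
The back-reference fix described above can be tried out directly. This is a minimal sketch using the same sample string; the names $loose and $strict are only illustrative.

```php
<?php
$str = "\"This is a 'string'\"";

// Without a back reference: either kind of quote may close the match.
preg_match('/("|\').*?("|\')/', $str, $loose);

// With \1: the closing quote must be the same character as the opening one.
preg_match('/("|\').*?\1/', $str, $strict);

echo $loose[0], "\n";  // "This is a '
echo $strict[0], "\n"; // "This is a 'string'"
```

The pattern is written in single quotes so that PHP passes \1 to the regex engine unchanged.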
$text = preg_replace( '/<p>(.*?)<\/p>/', "&lt;p&gt;$1&lt;/p&gt;", $html );
The parameter $1 is a back reference that represents the text inside the paragraph tag <p> and is inserted into the replacement text. This easy-to-use notation gives us a simple way to get at the matched text, even while replacing text.
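
As a rough illustration of $1 in a replacement, the sketch below runs the same substitution on a small, made-up HTML fragment (the input string is an assumption, not from the original article).

```php
<?php
// Hypothetical input for illustration.
$html = '<p>First</p><p>Second</p>';

// $1 in the replacement stands for whatever group 1 captured.
// The closing tag is written <\/p> because / is the pattern delimiter.
$text = preg_replace('/<p>(.*?)<\/p>/', '&lt;p&gt;$1&lt;/p&gt;', $html);

echo $text, "\n"; // &lt;p&gt;First&lt;/p&gt;&lt;p&gt;Second&lt;/p&gt;
```

The lazy (.*?) matters here: a greedy (.*) would swallow both paragraphs in a single match.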
3. Named groups
When back references are used many times in an expression, things easily get confusing. It is troublesome to work out which sub-pattern each number (1 ... 9) stands for. An alternative to numbered back references is named capture groups (hereinafter "named groups"). A named group is written (?P<name>pattern), where name is the group name and pattern is the regex for that group. Here's an example:
/(?P<quote>"|').*?(?P=quote)/
In the pattern above, quote is the name of the group, and "|' is the regex it matches. The later (?P=quote) refers back to the named group quote. The effect of this pattern is the same as that of the back reference example above, except that it is implemented with a named group. Isn't it easier to read and understand?
Named groups can also be used to organize the data in the matches array. The name given to a group can also be used as an index into the matches array.
preg_match( '/(?P<quote>"|\')/', "'String'", $matches );
# The following statement outputs "'" (excluding the double quotes)
echo $matches[1];
# Using the group name also outputs "'"
echo $matches['quote'];
So named groups don't just make code easier to write; they can also be used to organize it.
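
Putting the two ideas together, the sketch below matches the quoted string from section 2 with a named group and a named back reference, then reads the capture both by name and by number. It is only an illustration under the same sample input.

```php
<?php
$str = "\"This is a 'string'\"";

// (?P<quote>...) names the group; (?P=quote) refers back to it by name.
preg_match('/(?P<quote>"|\').*?(?P=quote)/', $str, $matches);

echo $matches[0], "\n";       // "This is a 'string'"
echo $matches['quote'], "\n"; // "
echo $matches[1], "\n";       // " (the same capture, by number)
```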
4. Word boundaries

A word boundary is the position in a string between a word character (letters, digits, and underscores) and a non-word character. What's special is that it doesn't match a real character: its length is zero. \b matches all word boundaries.
Unfortunately, word boundaries are often overlooked, and most people don't care about their practical significance. For example, if you want to match the word "import":
/import/
Attention! Regular expressions can be tricky. A string that merely contains the word, for example "important", also matches the above pattern successfully.
You may think that simply adding spaces before and after import is enough to match only the standalone word:
/ import /
But what if this happens:
The trader voted for the import
When the word import is at the beginning or end of a string, the modified expression fails. Therefore, several cases have to be considered:
/(^import | import | import$)/i
Don't panic. It's not over yet. What about punctuation? To match this word, your regex may need to be written as follows:
/(^import(:|;|,)? | import(:|;|,)? | import(\.|\?|\!)?$)/i
That is a lot of work just to match one word. This is exactly why word boundaries matter. To meet the above requirements, and many other variations, with word boundaries, the code we need to write is as follows:
/\bimport\b/i
All of the above cases are handled. The flexibility of \b is that it is a zero-length match. It only matches the imaginary position between two actual characters. It checks whether, of two adjacent characters, one is a word character and the other is a non-word character. If so, a match is returned. At the beginning or end of the string, \b treats the edge as a non-word character. Since the i in import is still a word character, import is matched.
Note that, as the opposite of \b, we also have \B, which matches the position between two word characters or between two non-word characters. Therefore, if you want to match "hi" inside a word, you can use:
/\Bhi\B/
"This" and "thigh" will return a match, while "hi there" will not. Neither will "high", because there "hi" sits at the start of the word.
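
To see \b and \B at work, here is a minimal sketch. The sample strings are made up for illustration and cover the cases discussed in this section.

```php
<?php
// Hypothetical sample strings, chosen to cover the cases discussed above.
$samples = array(
    'important',                        // "import" is only part of a word
    'The trader voted for the import',  // word at the end of the string
    'Import: approved',                 // start of string, then punctuation
    'reimported goods',                 // "import" buried inside a word
);

$hits = array();
foreach ($samples as $text) {
    // preg_match returns 1 on a match and 0 otherwise.
    $hits[$text] = preg_match('/\bimport\b/i', $text);
    echo $text, ' => ', $hits[$text], "\n";
}

// \B is the opposite: here "hi" needs a word character on both sides.
$inside = preg_match('/\Bhi\B/', 'This');     // 1
$atEdge = preg_match('/\Bhi\B/', 'hi there'); // 0
```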
5. Atomic groups

An atomic group is a special regular expression group that does not capture. It is usually used to improve the efficiency of regular expressions, and it can also be used to rule out specific matches. An atomic group is written (?>pattern), where pattern is the expression to match.
/(?>his|this)/
When the regex engine leaves an atomic group after matching it, it discards the backtracking positions of all tokens inside the group. Take the word "smashing" as an example. With the expression above, the engine looks for "his" and, failing that, "this" at each position in "smashing" and finds no match. The atomic group makes a difference once one alternative has matched: the engine commits to that choice, and if a later part of the pattern fails, it does not go back into the group to try "this". In this small pattern nothing follows the group, so the result is the same either way.
The above example is not very practical; /t?his/ achieves the same effect. Let's look at the following example:
/\b(engineer|engrave|end)\b/
If "engineering" is used as the subject, the regex engine matches "engineer" first, but then it runs into the word boundary \b, so the match fails. Then the engine tries the next alternative: engrave. After matching "eng", the rest does not fit, and the match fails again. Finally it tries "end", which also fails. If you look closely, you will see that once the match of engineer has failed at the word boundary, neither "engrave" nor "end" can possibly succeed. Both alternatives are shorter than engineer, so the regex engine should not keep trying.
/\b(?>engineer|engrave|end)\b/
The alternative above, which wraps the alternation in an atomic group, saves the regex engine matching time and improves the efficiency of the code.
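
The sketch below compares the plain and atomic versions of the engineer pattern. It also adds a second, made-up example (the \d+ patterns are an illustration, not from the original article) showing how an atomic group can rule out a match that ordinary backtracking would find.

```php
<?php
$plain  = '/\b(engineer|engrave|end)\b/';
$atomic = '/\b(?>engineer|engrave|end)\b/';

// Both patterns agree on these inputs; the atomic version simply gives up
// sooner on "engineering" because it may not re-enter the group.
$results = array();
foreach (array('engineer', 'engineering', 'the end') as $word) {
    $results[$word] = array(preg_match($plain, $word), preg_match($atomic, $word));
    echo $word, ': ', $results[$word][0], ' ', $results[$word][1], "\n";
}

// Atomic groups can also rule out matches. \d+ takes "123" and, being
// atomic, will not give back the "3" that the rest of the pattern needs.
$normal = preg_match('/^\d+3$/', '123');     // 1: \d+ backtracks to "12"
$locked = preg_match('/^(?>\d+)3$/', '123'); // 0: no backtracking allowed
```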
6. Recursion

Recursion is used to match nested structures, such as nested brackets, (this (that)), or nested HTML tags, <div><div></div></div>. We use (?R) to represent the sub-pattern in the recursive process. Here is an example of matching nested brackets:
/\(((?>[^()]+)|(?R))*\)/
The outermost escaped bracket "\(" matches the beginning of the nested structure. Then comes an alternation (|), which may match all characters other than brackets with "(?>[^()]+)", or may match the entire expression again through the sub-pattern "(?R)". Note that the * operator after the group matches as many nested parts as possible.
Another example of recursion is as follows:
/<([\w]+).*?>((?>[^<>]+)|(?R))*<\/\1>/
The above expression combines character classes, greedy operators, back references, and atomic groups to match nested tags. The first bracketed group ([\w]+) matches the tag name, which is kept for later use. Once the opening angle-bracket tag is found, the expression tries to find the rest of the tag content. The next bracketed sub-expression is very similar to the previous example: either match all characters other than angle brackets with (?>[^<>]+), or recursively match the entire expression with (?R). The <\/\1> at the end of the expression matches the closing tag.
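
Both recursive patterns described in this section can be exercised with a short script. This is a minimal sketch; the subject strings are made up for illustration.

```php
<?php
// Balanced brackets: the group repeats either a run of non-bracket
// characters or the whole pattern again via (?R).
$brackets = '/\(((?>[^()]+)|(?R))*\)/';
preg_match($brackets, 'call (this (that)) now', $m);
echo $m[0], "\n"; // (this (that))

// Nested tags: \1 ties each closing tag to its own opening tag name.
$tags = '/<([\w]+).*?>((?>[^<>]+)|(?R))*<\/\1>/';
preg_match($tags, 'x <div><div>inner</div></div> y', $t);
echo $t[0], "\n"; // <div><div>inner</div></div>
```

Real-world HTML has attributes, comments, and unclosed tags that a pattern like this does not handle, so it is best treated as a demonstration of (?R) rather than a parser.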
7. Callbacks

Sometimes specific content in the match result needs some special modification. When multiple or complex modifications have to be applied, regular expression callbacks have their place. Callbacks are used with the function preg_replace_callback to modify a string dynamically. You pass preg_replace_callback a function as a parameter. This function receives the matches array as its parameter and returns the string to use as the replacement.
For example, we want to capitalize the first letter of every word in a string. Unfortunately, PHP doesn't have a regex operator that converts letter case directly. To do this, you can use a regex callback. First, the expression should match all the letters that need to be capitalized:
/\b\w/
The above pattern uses both a word boundary and a character class. The pattern alone is not enough. We also need a callback function:
function upper_case( $matches ) { return strtoupper( $matches[0] ); }

The function upper_case receives the matches array and converts the whole match to uppercase. In this case, $matches[0] is the letter that needs to be capitalized. Then we use preg_replace_callback to apply the callback:
preg_replace_callback( '/\b\w/', "upper_case", $str );
A simple callback is so powerful.
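
Here is the whole callback example as one runnable sketch. The input string is made up for illustration, and the anonymous-function variant is an alternative spelling, not something the original article shows.

```php
<?php
// Callback: receives the matches array, returns the replacement string.
function upper_case($matches) {
    return strtoupper($matches[0]);
}

$str = 'regular expressions are fun'; // hypothetical input
$result = preg_replace_callback('/\b\w/', 'upper_case', $str);
echo $result, "\n"; // Regular Expressions Are Fun

// The same thing with an anonymous function instead of a named one.
$again = preg_replace_callback('/\b\w/', function ($m) {
    return strtoupper($m[0]);
}, $str);
```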
8. Commenting

Comments are not used to match strings, but they are truly one of the most important parts of a regular expression. As a regex grows deeper and more complex, it becomes harder and harder to work out what is being matched. Adding comments inside the regular expression is the best way to minimize confusion in the future.
To comment a regular expression, use the (?#comment) format. Replace "comment" with your comment text:
/(?#digit)\d/
If you're going to make your code public, it's important to comment your regular expressions. This makes it easier for others to understand and modify your code. As with comments anywhere else, they also make it easier for you to revisit programs you wrote earlier.
Consider using the "x" or "(?x)" modifier to format the comments. This modifier makes the regex engine ignore whitespace between the parts of the expression. Meaningful whitespace can still be matched with [ ] or \s, or with "\ " (a backslash followed by a space).
/
\d    #digit
[ ]   #space
\w+   #word
/x
The above code works the same as the following pattern:
/\d(?#digit)[ ](?#space)\w+(?#word)/
Always pay attention to the readability of the code.
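
The two commenting styles can be compared with a short script. This is a minimal sketch; the subject string is made up for illustration.

```php
<?php
// With the x modifier, whitespace and "# ..." comments inside the
// pattern are ignored, so the pattern can span several lines.
$commented = '/
    \d    # digit
    [ ]   # space (a literal space needs a character class under x)
    \w+   # word
/x';

// The same pattern with inline (?#...) comments.
$inline = '/\d(?#digit)[ ](?#space)\w+(?#word)/';

$subject = 'Room 4 you'; // hypothetical input
preg_match($commented, $subject, $a);
preg_match($inline, $subject, $b);
echo $a[0], "\n"; // 4 you
echo $b[0], "\n"; // 4 you
```

One thing to watch for in the x style: the pattern delimiter must not appear inside a comment, because it would end the pattern early.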
More resources:
Comprehensive website on regular expressions
Cheat Sheet: informative regular expressions cheat sheet
Regex Generator: JavaScript regular expressions generator
About the author: Karthik Viswanathan is a high school student who likes programming and building websites. You can check out his work on his blog, Lateral Code. You can also follow his online Twitter app.
