Regular expression basic tutorial and description

Time:2021-11-23

Preface
Regular expressions are cumbersome but powerful. Once learned, they will not only make you more efficient but also give you a real sense of achievement. If you read this material carefully and keep a reference at hand when applying it, mastering regular expressions is well within reach.
1   Introduction
Regular expressions are now widely used in all kinds of software. They show up in operating systems such as *nix (Linux, UNIX, HP, and others), in development environments such as PHP, C#, and Java, and in many applications.
Regular expressions achieve powerful results by simple means. The price of that terseness and power is that regular expression code is hard to learn at first and takes some effort. Once you have the basics, though, writing expressions with a reference at hand is fairly simple and effective.
Example:  ^[email protected]+\\..+$ 
Code like this has scared me off more than once, and it probably scares many other people away too. Read on, and you will be able to use such expressions freely.
Note: Part 7 may appear to repeat earlier material. It restates parts of the preceding tables in more detail to make them easier to understand.
2   History of regular expressions
The "ancestor" of regular expressions can be traced back to early research on how the human nervous system works. Two neurophysiologists, Warren McCulloch and Walter Pitts, developed a mathematical way of describing these neural networks.
In 1956, a mathematician named Stephen Kleene, building on the earlier work of McCulloch and Pitts, published a paper entitled "Representation of Events in Nerve Nets" that introduced the concept of regular expressions. Regular expressions were expressions used to describe what he called "the algebra of regular sets", hence the term "regular expression".
Subsequently, this work was found to be applicable to early research on computational search algorithms by Ken Thompson, the principal inventor of Unix. The first practical application of regular expressions was the qed editor in Unix.
As they say, the rest is history. Ever since, regular expressions have been an important part of text-based editors and search tools.
3   Regular expression definition
A regular expression describes a pattern for matching strings. It can be used to check whether a string contains a certain substring, to replace matched substrings, or to extract from a string the substrings that meet some condition.
When listing directories, the *.txt in dir *.txt or ls *.txt is not a regular expression, because * means something different there than it does in a regular expression.
A regular expression is a text pattern made up of ordinary characters (such as the letters a through z) and special characters (called metacharacters). It serves as a template that matches a character pattern against the string being searched.
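The three uses named above (checking, replacing, extracting) can be sketched as follows. This is a minimal illustration that assumes Python's standard re module; the sample text and variable names are made up for the example and are not part of the original tutorial.

```python
import re

text = "error: disk full; error: timeout"

# 1. Check whether the string contains a substring matching the pattern.
has_error = re.search(r"error", text) is not None

# 2. Replace every matched substring.
replaced = re.sub(r"error", "warning", text)

# 3. Extract the substrings that meet a condition
#    (here, the word that follows each "error: ").
messages = re.findall(r"error: (\w+)", text)

print(has_error, replaced, messages)
```

Other languages expose the same three operations under different names; only the pattern syntax is shared.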
  3.1   Ordinary characters
Ordinary characters are all printing and non-printing characters that are not explicitly designated as metacharacters. This includes all uppercase and lowercase letters, all digits, all punctuation marks, and some symbols.
  3.2   Non-printing characters
Character    Meaning
\cx    Matches the control character indicated by x. For example, \cM matches Control-M, a carriage return. The value of x must be in the range A-Z or a-z; otherwise, c is treated as a literal 'c' character.
\f    Matches a form feed. Equivalent to \x0c and \cL.
\n    Matches a newline. Equivalent to \x0a and \cJ.
\r    Matches a carriage return. Equivalent to \x0d and \cM.
\s    Matches any whitespace character, including space, tab, form feed, and so on. Equivalent to [ \f\n\r\t\v].
\S    Matches any non-whitespace character. Equivalent to [^ \f\n\r\t\v].
\t    Matches a tab. Equivalent to \x09 and \cI.
\v    Matches a vertical tab. Equivalent to \x0b and \cK.
 
  3.3   Special characters
Special characters are characters with a special meaning, such as the * in "*.txt" above, which, put simply, stands for any string. To find files with a literal * in their names, escape the * by putting a \ in front of it: ls \*.txt. Regular expressions have the following special characters.
Special character    Explanation
$    Matches the end of the input string. If the Multiline property of the RegExp object is set, $ also matches the position before '\n' or '\r'. To match the $ character itself, use \$.
( )    Mark the beginning and end of a subexpression. Subexpressions can be captured for later use. To match these characters, use \( and \).
*    Matches the preceding subexpression zero or more times. To match the * character, use \*.
+    Matches the preceding subexpression one or more times. To match the + character, use \+.
.    Matches any single character except the newline \n. To match ., use \.
[    Marks the beginning of a bracket expression. To match [, use \[.
?    Matches the preceding subexpression zero or one time, or marks a qualifier as non-greedy. To match the ? character, use \?.
\    Marks the next character as a special character, a literal character, a backward reference, or an octal escape. For example, 'n' matches the character 'n', while '\n' matches a newline. The sequence '\\' matches "\", and '\(' matches "(".
^    Matches the start of the input string, unless it is used inside a bracket expression, where it negates the character set. To match the ^ character itself, use \^.
{    Marks the beginning of a qualifier expression. To match {, use \{.
|    Indicates a choice between two items. To match |, use \|.
Regular expressions are constructed the same way arithmetic expressions are: small expressions are combined with metacharacters and operators to create larger ones. The components of a regular expression can be a single character, a character set, a character range, a choice between characters, or any combination of these.
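To illustrate why special characters need escaping, the sketch below compares an unescaped pattern with an escaped one. It assumes Python's re module; the file name and the helper matches() are hypothetical, chosen only for this example.

```python
import re

filename = "notes*.txt"

# Unescaped: * repeats the preceding "s", and . matches any character.
loose = re.compile(r"notes*.txt")
# Escaped by hand: \* and \. match the literal characters.
strict = re.compile(r"notes\*\.txt")
# re.escape builds an escaped pattern from a literal string.
auto = re.compile(re.escape(filename))

def matches(pattern, s):
    """Return True if the whole string s matches the pattern."""
    return pattern.fullmatch(s) is not None

print(matches(loose, "notesXtxt"), matches(loose, filename))
print(matches(strict, filename), matches(auto, filename))
```

The unescaped pattern accepts "notesXtxt" yet rejects the literal file name, which is usually the opposite of what was intended.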
 
  3.4   Qualifiers
Qualifiers (more commonly called quantifiers) specify how many times a given component of a regular expression must occur for a match to succeed. They are *, +, ?, {n}, {n,}, and {n,m}.
The *, +, and ? qualifiers are greedy: they match as much text as possible. Adding a ? after any of them makes the match non-greedy, or minimal.
The regular expression qualifiers are:
Character    Description
*    Matches the preceding subexpression zero or more times. For example, 'zo*' matches "z" as well as "zoo". * is equivalent to {0,}.
+    Matches the preceding subexpression one or more times. For example, 'zo+' matches "zo" and "zoo", but not "z". + is equivalent to {1,}.
?    Matches the preceding subexpression zero or one time. For example, 'do(es)?' matches "do" or the "do" in "does". ? is equivalent to {0,1}.
{n}    n is a nonnegative integer. Matches exactly n times. For example, 'o{2}' does not match the 'o' in "Bob", but matches the two o's in "food".
{n,}    n is a nonnegative integer. Matches at least n times. For example, 'o{2,}' does not match the 'o' in "Bob", but matches all the o's in "foooood". 'o{1,}' is equivalent to 'o+'. 'o{0,}' is equivalent to 'o*'.
{n,m}    m and n are nonnegative integers, where n <= m. Matches at least n and at most m times. For example, 'o{1,3}' matches the first three o's in "fooooood". 'o{0,1}' is equivalent to 'o?'. Note that there must be no space between the comma and the numbers.
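The examples in the table, together with the greedy and non-greedy behaviour described earlier, can be checked with a short sketch. It assumes Python's re module, whose qualifier syntax agrees with the table; the variable names are illustrative only.

```python
import re

# Counted repetition: {n}, {n,}, {n,m}
exactly_two = re.search(r"o{2}", "food").group()
none_in_bob = re.search(r"o{2}", "Bob")           # no two o's in a row
at_least_two = re.search(r"o{2,}", "foooood").group()
one_to_three = re.search(r"o{1,3}", "fooooood").group()

# Greedy (default) versus non-greedy (qualifier followed by ?)
greedy = re.search(r"o+", "oooo").group()
lazy = re.search(r"o+?", "oooo").group()

print(exactly_two, none_in_bob, at_least_two, one_to_three)
print(greedy, lazy)
```

The greedy form takes the whole run of o's, while the non-greedy form stops after the first one.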
  3.5   Locators
Locators (anchors) describe the boundary of a string or a word: ^ and $ refer to the beginning and end of the string respectively, \b describes the leading or trailing boundary of a word, and \B represents a non-word boundary. Qualifiers cannot be applied to locators.
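A quick way to see what each locator does is to test it against a few strings. The sketch below assumes Python's re module; the helper found() and the sample words are hypothetical.

```python
import re

def found(pattern, text):
    """Return True if the pattern matches anywhere in text."""
    return re.search(pattern, text) is not None

checks = [
    (r"^cat", "catalog"),        # ^   : "cat" at the start of the string
    (r"^cat", "bobcat"),         #       not at the start
    (r"cat$", "bobcat"),         # $   : "cat" at the end of the string
    (r"\bcat\b", "a cat sat"),   # \b  : "cat" as a whole word
    (r"\bcat\b", "catalog"),     #       "cat" is only part of a word
    (r"\Bcat\B", "education"),   # \B  : "cat" strictly inside a word
]
results = [found(p, t) for p, t in checks]
print(results)
```

Locators match positions rather than characters, so none of them adds anything to the matched text.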
  3.6   Choice
Enclose all the alternatives in parentheses and separate adjacent alternatives with |. Parentheses have a side effect, however: the match is captured (cached). Placing ?: before the first alternative eliminates this side effect.
?: is one of the non-capturing elements. The other two are ?= and ?!, which carry additional meaning. The former is a positive lookahead: it matches the search string at any position where the pattern in parentheses begins to match. The latter is a negative lookahead: it matches the search string at any position where the pattern does not begin to match.
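The difference between capturing and non-capturing groups, and between the two kinds of lookahead, is easier to see in running code. The following is a sketch that assumes Python's re module, which supports (?:...), (?=...), and (?!...) with the meanings described here; the sample strings are made up.

```python
import re

# Capturing group: the alternative that matched is stored as group 1.
capturing = re.search(r"(cat|dog)s", "hotdogs")
# Non-capturing group (?:...): same overall match, nothing is stored.
non_capturing = re.search(r"(?:cat|dog)s", "hotdogs")

# Positive lookahead: "Windows " must be followed by one of the versions.
positive = re.search(r"Windows (?=95|98|NT|2000)", "Windows 2000")
# Negative lookahead: "Windows " must NOT be followed by one of them.
negative = re.search(r"Windows (?!95|98|NT|2000)", "Windows 3.1")
rejected = re.search(r"Windows (?!95|98|NT|2000)", "Windows 2000")

print(capturing.group(), capturing.groups())
print(non_capturing.group(), non_capturing.groups())
print(repr(positive.group()), repr(negative.group()), rejected)
```

In both lookahead cases the matched text is only "Windows " with its trailing space, because the lookahead checks what follows without consuming it.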
  3.7   Backward references
Adding parentheses around a regular expression pattern, or part of one, causes the corresponding match to be stored in a temporary buffer. Each captured submatch is stored in the order in which it is encountered, from left to right, in the pattern. Buffer numbers start at 1 and run consecutively up to a maximum of 99 subexpressions. Each buffer can be accessed with '\n', where n is a one- or two-digit decimal number identifying the buffer.
You can use the non-capturing metacharacters '?:', '?=', or '?!' to skip saving the corresponding match.
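As an illustration of buffer 1 being read back with \1, the sketch below finds and removes doubled words. It assumes Python's re module, where \1 has the same meaning inside both the pattern and the replacement string; the sentence is invented for the example.

```python
import re

# ([a-z]+) captures a word into buffer 1; \1 requires the same text again.
doubled = re.compile(r"\b([a-z]+) \1\b")

text = "it was the the best of of times"

repeated_words = doubled.findall(text)   # contents of buffer 1 per match
cleaned = doubled.sub(r"\1", text)       # keep a single copy of each word

print(repeated_words)
print(cleaned)
```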
4   Operator precedence
Operators of equal precedence are evaluated from left to right; operators of different precedence are evaluated from highest to lowest. From highest to lowest, the precedence of the operators is:
Operator    Description
\    Escape character
(), (?:), (?=), []    Parentheses and square brackets
*, +, ?, {n}, {n,}, {n,m}    Qualifiers
^, $, \anymetacharacter    Locators and sequences
|    "Or" operation
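Two consequences of this ordering tend to surprise people: | binds more loosely than locators and sequences, and qualifiers bind more tightly than sequences. The sketch below demonstrates both, assuming Python's re module follows the same precedence; the patterns are illustrative only.

```python
import re

# | has the lowest precedence: ^ab|cd$ means (^ab) or (cd$).
loose = re.compile(r"^ab|cd$")
# Grouping makes both locators apply to each alternative.
tight = re.compile(r"^(?:ab|cd)$")

# Qualifiers bind tighter than sequences: ab+ is a(b+), not (ab)+.
b_repeated = re.compile(r"ab+")
ab_repeated = re.compile(r"(?:ab)+")

print(bool(loose.search("abXXX")), bool(tight.search("abXXX")))
print(bool(b_repeated.fullmatch("abbb")), bool(b_repeated.fullmatch("abab")))
print(bool(ab_repeated.fullmatch("abab")), bool(ab_repeated.fullmatch("abbb")))
```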
5   Reference for all symbols
Character    Description
\    Marks the next character as a special character, a literal character, a backward reference, or an octal escape. For example, 'n' matches the character "n", while '\n' matches a newline. The sequence '\\' matches "\", and '\(' matches "(".
^    Matches the start of the input string. If the Multiline property of the RegExp object is set, ^ also matches the position after '\n' or '\r'.
$    Matches the end of the input string. If the Multiline property of the RegExp object is set, $ also matches the position before '\n' or '\r'.
*    Matches the preceding subexpression zero or more times. For example, 'zo*' matches "z" as well as "zoo". * is equivalent to {0,}.
+    Matches the preceding subexpression one or more times. For example, 'zo+' matches "zo" and "zoo", but not "z". + is equivalent to {1,}.
?    Matches the preceding subexpression zero or one time. For example, 'do(es)?' matches "do" or the "do" in "does". ? is equivalent to {0,1}.
{n}    n is a nonnegative integer. Matches exactly n times. For example, 'o{2}' does not match the 'o' in "Bob", but matches the two o's in "food".
{n,}    n is a nonnegative integer. Matches at least n times. For example, 'o{2,}' does not match the 'o' in "Bob", but matches all the o's in "foooood". 'o{1,}' is equivalent to 'o+'. 'o{0,}' is equivalent to 'o*'.
{n,m}    m and n are nonnegative integers, where n <= m. Matches at least n and at most m times. For example, 'o{1,3}' matches the first three o's in "fooooood". 'o{0,1}' is equivalent to 'o?'. Note that there must be no space between the comma and the numbers.
?    When this character immediately follows any other qualifier (*, +, ?, {n}, {n,}, {n,m}), the match is non-greedy. A non-greedy pattern matches as little of the searched string as possible, whereas the default greedy pattern matches as much as possible. For example, in the string "oooo", 'o+?' matches a single "o", while 'o+' matches all the o's.
.    Matches any single character except "\n". To match any character including '\n', use a pattern such as '[\s\S]'.
(pattern)    Matches pattern and captures the match. The captured match can be retrieved from the resulting Matches collection, using the SubMatches collection in VBScript or the $0…$9 properties in JScript. To match parenthesis characters, use '\(' or '\)'.
(?:pattern)    Matches pattern but does not capture the match; that is, it is a non-capturing match and is not stored for later use. This is useful for combining parts of a pattern with the "or" character (|). For example, 'industr(?:y|ies)' is a more economical expression than 'industry|industries'.
(?=pattern)    Positive lookahead. Matches the search string at any position where a string matching pattern begins. This is a non-capturing match; that is, the match is not stored for later use. For example, 'Windows (?=95|98|NT|2000)' matches the "Windows" in "Windows 2000" but not the "Windows" in "Windows 3.1". Lookahead does not consume characters: after a match occurs, the search for the next match begins immediately after the last match, not after the characters that made up the lookahead.
(?!pattern)    Negative lookahead. Matches the search string at any position where a string not matching pattern begins. This is a non-capturing match; that is, the match is not stored for later use. For example, 'Windows (?!95|98|NT|2000)' matches the "Windows" in "Windows 3.1" but not the "Windows" in "Windows 2000". Lookahead does not consume characters: after a match occurs, the search for the next match begins immediately after the last match, not after the characters that made up the lookahead.
x|y    Matches x or y. For example, 'z|food' matches "z" or "food". '(z|f)ood' matches "zood" or "food".
[xyz]    Character set. Matches any one of the enclosed characters. For example, '[abc]' matches the 'a' in "plain".
[^xyz]    Negative character set. Matches any character not enclosed. For example, '[^abc]' matches the 'p' in "plain".
[a-z]    Character range. Matches any character in the specified range. For example, '[a-z]' matches any lowercase letter in the range 'a' through 'z'.
[^a-z]    Negative character range. Matches any character not in the specified range. For example, '[^a-z]' matches any character not in the range 'a' through 'z'.
\b    Matches a word boundary, that is, the position between a word and a space. For example, 'er\b' matches the 'er' in "never" but not the 'er' in "verb".
\B    Matches a non-word boundary. 'er\B' matches the 'er' in "verb" but not the 'er' in "never".
\cx    Matches the control character indicated by x. For example, \cM matches Control-M, a carriage return. The value of x must be in the range A-Z or a-z; otherwise, c is treated as a literal 'c' character.
\d    Matches a digit character. Equivalent to [0-9].
\D    Matches a non-digit character. Equivalent to [^0-9].
\f    Matches a form feed. Equivalent to \x0c and \cL.
\n    Matches a newline. Equivalent to \x0a and \cJ.
\r    Matches a carriage return. Equivalent to \x0d and \cM.
\s    Matches any whitespace character, including space, tab, form feed, and so on. Equivalent to [ \f\n\r\t\v].
\S    Matches any non-whitespace character. Equivalent to [^ \f\n\r\t\v].
\t    Matches a tab. Equivalent to \x09 and \cI.
\v    Matches a vertical tab. Equivalent to \x0b and \cK.
\w    Matches any word character, including the underscore. Equivalent to '[A-Za-z0-9_]'.
\W    Matches any non-word character. Equivalent to '[^A-Za-z0-9_]'.
\xn    Matches n, where n is a hexadecimal escape value. Hexadecimal escape values must be exactly two digits long. For example, '\x41' matches "A". '\x041' is equivalent to '\x04' followed by "1". This allows ASCII codes to be used in regular expressions.
\num    Matches num, where num is a positive integer. A reference back to a captured match. For example, '(.)\1' matches two consecutive identical characters.
\n    Identifies either an octal escape value or a backward reference. If \n is preceded by at least n captured subexpressions, n is a backward reference. Otherwise, if n is an octal digit (0-7), n is an octal escape value.
\nm    Identifies either an octal escape value or a backward reference. If \nm is preceded by at least nm captured subexpressions, nm is a backward reference. If \nm is preceded by at least n captures, n is a backward reference followed by the literal m. If neither condition holds and n and m are both octal digits (0-7), \nm matches the octal escape value nm.
\nml    Matches the octal escape value nml when n is an octal digit (0-3) and m and l are both octal digits (0-7).
\un    Matches n, where n is a Unicode character expressed as four hexadecimal digits. For example, \u00A9 matches the copyright symbol (©).
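A few of the shorthand entries in the table can be compared with their bracket equivalents. This sketch assumes Python's re module; note that in Python 3, \d and \w also match non-ASCII digits and letters unless re.ASCII is used, so the equivalences shown hold for the ASCII sample below. The sample string is made up.

```python
import re

sample = "Order_42 shipped\ton 2021-11-23"

digits = re.findall(r"\d+", sample)
digits_by_range = re.findall(r"[0-9]+", sample)

words = re.findall(r"\w+", sample)
words_by_range = re.findall(r"[A-Za-z0-9_]+", sample)

# \s+ splits on runs of spaces, tabs, and other whitespace.
pieces = re.split(r"\s+", sample)

# Hexadecimal and Unicode escapes written inside the pattern itself.
hex_a = re.fullmatch(r"\x41", "A")
copyright_sign = re.fullmatch(r"\u00A9", "\u00a9")

print(digits, words, pieces)
```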
6   Some examples
Regular expression    Explanation
/\b([a-z]+) \1\b/gi    Finds places where a word occurs twice in a row
/(\w+):\/\/([^/:]+)(:\d*)?([^# ]*)/    Breaks a URL into protocol, domain, port, and relative path
/^(?:Chapter|Section) [1-9][0-9]{0,1}$/    Locates chapter headings
/[-a-z]/    The 26 letters a through z plus the - sign
/ter\b/    Matches chapter, but not terminal
/\Bapt/    Matches chapter, but not aptitude
/Windows(?=95 |98 |NT )/    Matches Windows95, Windows98, or WindowsNT. After a match is found, the search for the next match continues from just after Windows.
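Some of these examples can be tried out directly. The sketch below assumes Python's re module, where patterns are ordinary strings, so the / delimiters are dropped and the slashes inside the URL pattern need no escaping. The URL is a hypothetical example, not one from the original text.

```python
import re

# Break a URL into protocol, domain, port, and relative path.
url_pattern = re.compile(r"(\w+)://([^/:]+)(:\d*)?([^# ]*)")
m = url_pattern.search("http://www.example.com:80/docs/index.html")
protocol, domain, port, path = m.groups()

# Chapter or section headings numbered 1 through 99.
chapter = re.compile(r"^(?:Chapter|Section) [1-9][0-9]{0,1}$")

print(protocol, domain, port, path)
print(bool(chapter.search("Chapter 12")), bool(chapter.search("Chapter 123")))
print(bool(re.search(r"ter\b", "chapter")), bool(re.search(r"ter\b", "terminal")))
print(bool(re.search(r"\Bapt", "chapter")), bool(re.search(r"\Bapt", "aptitude")))
```

The port group keeps its leading colon because the colon is inside the parentheses; moving it outside would capture only the digits.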
7   Regular expression matching rules
  7.1   Basic pattern matching
Let's start with the basics. Patterns are the most basic elements of regular expressions: a set of characters that describes the features of a string. A pattern can be very simple, made up of ordinary strings, or very complex, often using special characters to represent a range of characters, repetition, or context. For example:
  ^once 
This pattern contains the special character ^, which means it matches only strings that begin with once. For example, it matches the string "once upon a time" but not "There once was a man from NewYork". Just as ^ marks the beginning, the $ symbol matches strings that end with a given pattern.
  bucket$ 
This pattern matches "Who kept all of this cash in a bucket" but not "buckets". Used together, ^ and $ denote an exact match (the string is identical to the pattern). For example:
  ^bucket$ 
This matches only the string "bucket". If a pattern includes neither ^ nor $, it matches any string that contains the pattern. For example, the pattern
  once 
and the strings
  There once was a man from NewYork
  Who kept all of his cash in a bucket.
are a match.
The letters in this pattern (o-n-c-e) are literal characters: they stand for themselves, and the same is true of digits. Slightly more complex characters, such as punctuation and whitespace (spaces, tabs, and so on), require escape sequences. All escape sequences begin with a backslash (\). The escape sequence for a tab is \t. So to check whether a string starts with a tab, use this pattern:
  ^\t 
Similarly, \n stands for a newline and \r for a carriage return. Other special symbols can be escaped with a preceding backslash: the backslash itself is written \\, the period is written \., and so on.
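The patterns in this section can be run against the example strings as a check. This is a sketch assuming Python's re module, where re.search looks for the pattern anywhere in the string unless ^ or $ pins it down; the helper found() is hypothetical.

```python
import re

def found(pattern, text):
    """Return True if the pattern matches somewhere in text."""
    return re.search(pattern, text) is not None

line = "There once was a man from NewYork"

results = {
    "^once / once upon a time": found(r"^once", "once upon a time"),
    "^once / There once ...": found(r"^once", line),
    "once / There once ...": found(r"once", line),
    "bucket$ / ... in a bucket": found(r"bucket$", "Who kept all of this cash in a bucket"),
    "bucket$ / buckets": found(r"bucket$", "buckets"),
    "^bucket$ / bucket": found(r"^bucket$", "bucket"),
    "^\\t / tab-indented": found(r"^\t", "\tindented"),
    "\\. / a.b": found(r"\.", "a.b"),
    "\\. / ab": found(r"\.", "ab"),
}
for label, value in results.items():
    print(label, value)
```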
  7.2   Character clusters
In Internet programs, regular expressions are usually used to validate user input. When a user submits a form, ordinary literal characters are not enough to judge whether the phone number, address, email address, credit card number, and so on are valid.
A more flexible way to describe the pattern we want is the character cluster (also called a character class). To create a character cluster representing all vowels, put all the vowels inside square brackets:
  [AaEeIiOoUu] 
This pattern matches any vowel, but it stands for only one character. A hyphen can be used to express a range of characters, for example:
  [a-z]  // Matches any lowercase letter  
  [A-Z]  // Matches any uppercase letter  
  [a-zA-Z]  // Matches any letter  
  [0-9]  // Matches any digit  
  [0-9\.\-]  // Matches any digit, period, or minus sign  
  [ \f\r\t\n]  // Matches any whitespace character  
Again, each of these stands for only one character, which is very important. To match a string consisting of one lowercase letter followed by one digit, such as "z2", "t6", or "g7", but not "ab2", "r2d3", or "b52", use this pattern:
  ^[a-z][0-9]$ 
Although [a-z] represents a range of 26 letters, here it matches only a string whose first character is a lowercase letter.
As mentioned earlier, ^ marks the beginning of a string, but it has another meaning. Inside square brackets, ^ means "not" or "exclude", and is commonly used to rule out certain characters. Continuing the previous example, suppose the first character must not be a digit:
  ^[^0-9][0-9]$ 
This pattern matches "&5", "g7", and "-2", but not "12" or "66". Here are some more examples that exclude specific characters:
  [^a-z]  // All characters except lowercase letters  
  [^\\\/\^]  // All characters except (\), (/), and (^)  
  [^\"\']  // All characters except double quotation marks (") and single quotation marks (')  
The special character "." (dot, period) represents any character except a newline. So the pattern "^.5$" matches any two-character string that ends with the digit 5 and begins with any character other than a newline. The pattern "." matches any string except the empty string and a string containing only a newline.
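The character cluster patterns above can be tested against the strings mentioned in the text. The sketch assumes Python's re module; the helper accepted() and the extra candidates for the "^.5$" pattern are hypothetical.

```python
import re

def accepted(pattern, candidates):
    """Return the candidates that the pattern matches."""
    return [s for s in candidates if re.search(pattern, s)]

letter_digit = accepted(r"^[a-z][0-9]$",
                        ["z2", "t6", "g7", "ab2", "r2d3", "b52"])
nondigit_digit = accepted(r"^[^0-9][0-9]$",
                          ["&5", "g7", "-2", "12", "66"])
any_then_five = accepted(r"^.5$",
                         ["a5", "55", "5", "a55", "\n5"])

print(letter_digit)
print(nondigit_digit)
print(any_then_five)
```

The last list leaves out "\n5" because the dot does not match a newline.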
PHP regular expressions have some built-in general character clusters, listed below:
Character cluster   Meaning  
  [[:alpha:]]   Any letter  
  [[:digit:]]   Any digit  
  [[:alnum:]]   Any letter or digit  
  [[:space:]]   Any whitespace character  
  [[:upper:]]   Any uppercase letter  
  [[:lower:]]   Any lowercase letter  
  [[:punct:]]   Any punctuation mark  
  [[:xdigit:]]   Any hexadecimal digit, equivalent to [0-9a-fA-F]  
  7.3   Determining repetition
By now you know how to match a single letter or digit, but more often you will want to match a word or a group of digits. A word consists of several letters, and a number consists of several single digits. Curly braces ({}) following a character or character cluster specify how many times the preceding item is repeated.
Pattern   Meaning  
  ^[a-zA-Z_]$   Any single letter or underscore  
  ^[[:alpha:]]{3}$   Any three-letter word  
  ^a$   The letter a  
  ^a{4}$   aaaa  
  ^a{2,4}$   aa, aaa, or aaaa  
  ^a{1,3}$   a, aa, or aaa  
  ^a{2,}$   A string of two or more a's  
  ^a{2,}   Matches aardvark and aaab, but not apple  
  a{2,}   Matches baad and aaa, but not Nantucket  
  \t{2}   Two tabs  
  .{2}   Any two characters  
These examples show the three different uses of curly braces. A single number, {x}, means "the preceding character or character cluster occurs exactly x times". A number followed by a comma, {x,}, means "the preceding item occurs x or more times". Two numbers separated by a comma, {x,y}, means "the preceding item occurs at least x times but no more than y times". We can extend these patterns to more words or numbers:
  ^[a-zA-Z0-9_]{1,}$  // All strings made up of one or more letters, digits, or underscores  
  ^[0-9]{1,}$  // All non-negative integers (strings of digits)  
  ^\-{0,1}[0-9]{1,}$  // All integers  
  ^\-{0,1}[0-9]{0,}\.{0,1}[0-9]{0,}$  // All decimals  
The last example is not easy to read, is it? Here is how it works: the string begins (^) with an optional minus sign (\-{0,1}), followed by zero or more digits ([0-9]{0,}), then an optional decimal point (\.{0,1}), then zero or more digits ([0-9]{0,}), and nothing else ($). A simpler way to write this follows.
The special character "?" is equivalent to {0,1}. Both mean "zero or one of the preceding item", in other words "the preceding item is optional". So the example above can be simplified to:
  ^\-?[0-9]{0,}\.?[0-9]{0,}$ 
The special character "*" is equivalent to {0,}; both mean "zero or more of the preceding item". Finally, the character "+" is equivalent to {1,}, meaning "one or more of the preceding item". So the four examples above can be written as:
  ^[a-zA-Z0-9_]+$  // All strings made up of one or more letters, digits, or underscores  
  ^[0-9]+$  // All non-negative integers (strings of digits)  
  ^\-?[0-9]+$  // All integers  
  ^\-?[0-9]*\.?[0-9]*$  // All decimals  
Of course, this does not technically reduce the complexity of the regular expressions, but it does make them easier to read.
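To confirm that the short forms are interchangeable with the curly-brace forms, both sets can be run against the same inputs. This is a sketch assuming Python's re module; the dictionary keys and sample strings are made up for the comparison.

```python
import re

long_forms = {
    "identifier": r"^[a-zA-Z0-9_]{1,}$",
    "digits": r"^[0-9]{1,}$",
    "integer": r"^\-{0,1}[0-9]{1,}$",
    "decimal": r"^\-{0,1}[0-9]{0,}\.{0,1}[0-9]{0,}$",
}
short_forms = {
    "identifier": r"^[a-zA-Z0-9_]+$",
    "digits": r"^[0-9]+$",
    "integer": r"^\-?[0-9]+$",
    "decimal": r"^\-?[0-9]*\.?[0-9]*$",
}
samples = ["abc_1", "42", "-42", "3.14", "-.5", "", "1.2.3", "x-y"]

def verdicts(patterns):
    """Map each pattern name to its match result for every sample."""
    return {name: [bool(re.search(p, s)) for s in samples]
            for name, p in patterns.items()}

agree = verdicts(long_forms) == verdicts(short_forms)
print(agree)
print(verdicts(short_forms)["decimal"])
```

One thing the comparison brings out is that the decimal pattern, in either form, also accepts the empty string, since every part of it is optional. Real validation code would probably want to require at least one digit.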