A detailed introduction to regular expressions (I)


This is a translation of a tutorial written by Jan Goyvaerts for RegexBuddy. Let’s take a look!

1. What is a regular expression

Basically, a regular expression is a pattern that describes a certain amount of text. “Regex” is short for “regular expression”. In this article, we write <<regex>> to denote a specific regular expression.

The most basic pattern is a piece of literal text, which simply matches that same text.

2. Different regular expression engines

A regular expression engine is a piece of software that can process regular expressions. Typically, the engine is part of a larger application. In the software world, different regular expression engines are not fully compatible with each other. This tutorial focuses on the Perl 5 flavor, which is the most widely used, and also mentions some differences from other engines. Many modern engines are similar, but not exactly the same; examples include the .NET regular expression library and the JDK regex package.

3. Literal characters

The most basic regular expression consists of a single literal character. For example, <<a>> matches the first occurrence of the character “a” in the string. In the string “Jack is a boy”, the “a” after “J” is matched; the second “a” is not.

A regular expression can also match the second “a”, but only when you tell the engine to continue searching after the first match. In a text editor, you can use “Find Next”. In programming languages, there is usually a function that lets you continue searching forward from the end of the previous match.

Similarly, <<cat>> matches “cat” in “About cats and dogs”. This is equivalent to telling the regular expression engine to find a <<c>>, followed by an <<a>>, followed by a <<t>>.

Note that regular expression engines are case sensitive by default. <<cat>> will not match “Cat” unless you tell the engine to ignore case.
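
The behaviour described above can be tried out with Python’s built-in re module. This is only a minimal sketch: the choice of Python and the variable names are assumptions made for illustration, and other engines expose “find next” and case-insensitive matching through different functions or flags.

```python
import re

# A single literal character matches its first occurrence in the string.
pattern = re.compile("a")
text = "Jack is a boy"
first = pattern.search(text)

# "Find next": resume the search where the previous match ended.
second = pattern.search(text, first.end())
print(first.start(), second.start())  # 1 8 (0-based positions)

# Matching is case sensitive unless the engine is told otherwise.
case_sensitive = re.search("cat", "About Cats and dogs")
ignoring_case = re.search("cat", "About Cats and dogs", re.IGNORECASE)
print(case_sensitive)         # None
print(ignoring_case.group())  # Cat
```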

(1) Special characters

Among literal characters, 12 are reserved for special purposes. They are:

[ ] \ ^ $ . | ? * + ( )

These special characters are also called metacharacters.

If you want to use these characters as literal characters in a regular expression, you need to escape them with a backslash “\”. For example, to match “1+1=2”, the correct expression is <<1\+1=2>>.

Note that <<1+1=2>> is also a valid regular expression. However, it does not match “1+1=2”; instead it matches “111=2” in “123+111=234”, because “+” has a special meaning here (repeat one or more times).

In programming languages, note that some special characters are processed by the compiler before the string is passed to the regex engine. Therefore, the regular expression <<1\+1=2>> must be written as “1\\+1=2” in C++. To match “c:\temp”, you need the regular expression <<c:\\temp>>, which in C++ becomes “c:\\\\temp”.
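
The same escaping rules can be checked in Python, which is used here purely as an illustration (the tutorial’s own example is C++). Ordinary Python string literals need the doubled backslashes just as C++ does, while raw strings (the r"..." form) avoid the doubling. The sample strings are made up for this sketch.

```python
import re

# Escaped plus sign: matches the literal text "1+1=2".
literal = re.search(r"1\+1=2", "We know that 1+1=2")

# Unescaped plus sign: "1+" means "one or more 1s".
quantified = re.search("1+1=2", "123+111=234")

# In a normal (non-raw) string literal the backslash must itself be
# escaped, as in the C++ example; a raw string avoids the doubling.
same_pattern = "1\\+1=2" == r"1\+1=2"

# The regex c:\\temp matches the text c:\temp; written as a normal
# string literal it needs four backslashes.
path = re.search("c:\\\\temp", "c:\\temp")

print(literal.group())     # 1+1=2
print(quantified.group())  # 111=2
print(same_pattern)        # True
print(path.group())        # c:\temp
```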

(2) Non-printable characters

Special character sequences can be used to represent some non-printable characters:

<<\t>> for tab (0x09)

<<\r>> for carriage return (0x0D)

<<\n>> for line feed (0x0A)

Note that text files on Windows use “\r\n” to end a line, while UNIX uses “\n”.

4. How a regular expression engine works internally

Knowing how the regular expression engine works will help you quickly understand why a regular expression doesn’t work as you expect.

There are two types of engines: text-directed and regex-directed. Jeffrey Friedl calls them DFA and NFA engines. This article covers regex-directed engines, because some very useful features, such as lazy quantifiers and backreferences, can only be implemented in a regex-directed engine. So it is no surprise that this type of engine is the most popular.

You can easily tell whether the engine you are using is text-directed or regex-directed. If it supports backreferences or lazy quantifiers, you can be sure that it is regex-directed. You can also test this by applying the regular expression <<regex|regex not>> to the string “regex not”. If the match result is “regex”, the engine is regex-directed. If the result is “regex not”, it is text-directed. This is because a regex-directed engine is “eager”: it is keen to do its work and reports the first match it finds.

Regex-directed engines always return the leftmost match

This is an important point to understand: even if a “better” match could be found later in the string, a regex-directed engine always returns the leftmost match.

When <<cat>> is applied to “He captured a catfish for his cat”, the engine first compares <<c>> with “H”, which fails. It then compares <<c>> with “e”, which also fails, and so on until the fourth character, where <<c>> matches “c”. <<a>> then matches the fifth character. At the sixth character, <<t>> fails to match “p”. The engine then restarts the match attempt from the fifth character. Starting at the 15th character, <<cat>> matches “cat” in “catfish”. The regular expression engine eagerly returns this first match instead of continuing to look for other, “better” matches.
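
Both observations can be reproduced with Python’s re module, which is a backtracking (regex-directed) engine. The snippet is a small sketch with illustrative variable names, not a general engine-detection tool.

```python
import re

# Alternation test: an eager, regex-directed engine stops at the first
# alternative that succeeds, even though a longer match exists.
engine_test = re.search("regex|regex not", "regex not")
print(engine_test.group())  # regex

# Leftmost match: the "cat" inside "catfish" is reported, not the
# standalone word "cat" at the end of the sentence.
leftmost = re.search("cat", "He captured a catfish for his cat")
print(leftmost.start() + 1)  # 15 (1-based position, as counted in the text)
```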

5. Character sets

A character set is a set of characters enclosed in square brackets “[]”. With a character set, you can tell the regular expression engine to match only one out of several characters. To match an “a” or an “e”, use <<[ae]>>. You can use <<gr[ae]y>> to match “gray” or “grey”. This is especially useful when you are not sure whether the text you are searching is in American or British English. Conversely, <<gr[ae]y>> will not match “graay” or “graey”. The order of the characters inside the set does not matter; the results are the same.

You can use a hyphen “-” inside a character set to define a range of characters. <<[0-9]>> matches a single digit between 0 and 9. You can use more than one range: <<[0-9a-fA-F]>> matches a single hexadecimal digit, case insensitively. You can also combine ranges with single characters: <<[0-9a-fA-FX]>> matches a hexadecimal digit or the letter X. Again, the order of the characters and ranges has no effect on the result.
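
A short Python sketch of the character sets described above; the test strings are made up for illustration.

```python
import re

# One character out of a set: matches "gray" and "grey", nothing else.
spellings = re.findall(r"gr[ae]y", "gray grey graay graey")
print(spellings)  # ['gray', 'grey']

# Ranges: a single hexadecimal digit in either case.
hex_digits = re.findall(r"[0-9a-fA-F]", "3f Zx")
print(hex_digits)  # ['3', 'f']

# Ranges combined with a single character (here the letter X).
hex_or_x = re.findall(r"[0-9a-fA-FX]", "7X z")
print(hex_or_x)  # ['7', 'X']
```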

(1) Some applications of character sets

Find a word that may be misspelled, such as <<sep[ae]r[ae]te>> or <<li[cs]en[cs]e>>.

Find an identifier in a programming language: <<[A-Za-z_][A-Za-z_0-9]*>>. (* means repeat zero or more times.)

Find a C-style hexadecimal number: <<0[xX][A-Fa-f0-9]+>>. (+ means repeat one or more times.)
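
The three applications can be sketched in Python as follows. The input strings are invented examples, and real source code would of course need more than a regex to tokenize properly.

```python
import re

# Tolerate a common misspelling.
misspelled = re.findall(r"sep[ae]r[ae]te", "separate or seperate")
print(misspelled)  # ['separate', 'seperate']

# Identifiers: a letter or underscore, then letters, digits or underscores.
identifiers = re.findall(r"[A-Za-z_][A-Za-z_0-9]*", "x1 = _tmp + 42")
print(identifiers)  # ['x1', '_tmp']

# C-style hexadecimal numbers; a bare "0x" without digits is not matched.
hex_numbers = re.findall(r"0[xX][A-Fa-f0-9]+", "0x1F, 0XaB, 0x")
print(hex_numbers)  # ['0x1F', '0XaB']
```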

(2) Negated character sets

Typing a caret “^” immediately after the opening square bracket “[” negates the character set. The result is that the character set matches any character that is not listed in the square brackets. Unlike “.”, a negated character set also matches line break characters.

It is important to remember that a negated character set still must match a character. <<q[^u]>> means: match a q followed by a character that is not a u. So it does not match the q in “Iraq”, but it does match the q and the following space in “Iraq is a country”. The space is part of the match because it is “a character that is not a u”.

If you want to match only the q, on the condition that it is followed by a character that is not a u, you can use lookahead, which is covered later.
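
The following Python sketch shows that a negated set must consume a character, and that this character may be a line break. The variable names are illustrative only.

```python
import re

negated = r"q[^u]"

# Nothing follows the q, so the negated set has nothing to match.
no_follower = re.search(negated, "Iraq")
print(no_follower)  # None

# The space after the q is "a character that is not a u" and is
# therefore part of the match.
with_space = re.search(negated, "Iraq is a country")
print(repr(with_space.group()))  # 'q '

# Unlike the dot, a negated set also matches a line break.
with_newline = re.search(negated, "Iraq\n")
print(repr(with_newline.group()))  # 'q\n'
```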

(3) Metacharacters in character sets

Note that only four characters have a special meaning inside a character set: “] \ ^ -”. “]” ends the character set definition; “\” is the escape character; “^” means negation; “-” defines a range. The other usual metacharacters are ordinary characters inside a character set and do not need to be escaped. For example, to search for an asterisk or a plus sign, you can use <<[+*]>>. Your regular expression will work just as well if you escape the usual metacharacters anyway, but doing so reduces readability.

Inside a character set, to use the backslash “\” as a literal character rather than as the escape character, you need to escape it with another backslash. <<[\\x]>> matches a backslash or an x. The characters “] ^ -” can be escaped with a backslash, or placed in a position where they cannot take on their special meaning. We recommend the latter because it improves readability. For “^”, place it anywhere except immediately after the opening bracket “[”, so that it is taken literally instead of as negation: <<[x^]>> matches an x or a ^. For “]”, place it immediately after the opening bracket: <<[]x]>> matches a “]” or an “x”. For “-”, place it first or last: either <<[-x]>> or <<[x-]>> matches a “-” or an “x”.
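
The placement rules can be verified with a few one-liners in Python. This sketch assumes Python’s re module, which follows the same conventions for these cases; other engines may differ in edge cases such as an unescaped “]” at the start of a set.

```python
import re

# Usual metacharacters are plain characters inside a set.
stars_and_pluses = re.findall(r"[+*]", "1+2*3")
print(stars_and_pluses)  # ['+', '*']

# A literal backslash inside a set must be escaped with another backslash.
backslash_or_x = re.findall(r"[\\x]", "a\\x")
print(backslash_or_x == ["\\", "x"])  # True

# "^" anywhere but first, "]" first, "-" first or last: all literal.
caret_not_first = re.findall(r"[x^]", "x^2")
bracket_first = re.findall(r"[]x]", "a]x")
hyphen_first = re.findall(r"[-x]", "x-y")
hyphen_last = re.findall(r"[x-]", "x-y")
print(caret_not_first)  # ['x', '^']
print(bracket_first)    # [']', 'x']
print(hyphen_first)     # ['x', '-']
print(hyphen_last)      # ['x', '-']
```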

(4) Shorthand character sets

Because some character sets are very common, there are some shorthand notations.

<<\d>> stands for <<[0-9]>>;

<<\w>> stands for word characters. The exact set varies between regular expression implementations, but in most of them the word characters are <<[A-Za-z0-9_]>>.

<<\s>> stands for whitespace characters. This also depends on the implementation. In most implementations it includes the space and tab characters, as well as the carriage return and line feed characters <<\r\n>>.

Shorthand character sets can be used both inside and outside square brackets. <<\s\d>> matches a whitespace character followed by a digit. <<[\s\d]>> matches a single character that is either whitespace or a digit. <<[\da-fA-F]>> matches a hexadecimal digit.
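
Here is a brief Python sketch of the shorthands inside and outside square brackets. One implementation detail to be aware of, in line with the remark that the shorthands vary between engines: for Python 3 text strings \d, \w and \s are Unicode-aware by default, and the re.ASCII flag restricts them to the ASCII sets described here.

```python
import re

# Outside brackets: a whitespace character followed by a digit.
space_then_digit = re.search(r"\s\d", "room 7").group()
print(repr(space_then_digit))  # ' 7'

# Inside brackets: a single character that is whitespace or a digit.
space_or_digit = re.findall(r"[\s\d]", "a 1")
print(space_or_digit)  # [' ', '1']

# Shorthand combined with ranges: a hexadecimal digit.
hex_digits = re.findall(r"[\da-fA-F]", "9fZ")
print(hex_digits)  # ['9', 'f']

# With re.ASCII, \w is exactly [A-Za-z0-9_].
word_chars = re.findall(r"\w", "a_1-!", re.ASCII)
print(word_chars)  # ['a', '_', '1']
```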

Shorthand for negated character sets

<<[\S]>> = <<[^\s]>>

<<[\W]>> = <<[^\w]>>

<<[\D]>> = <<[^\d]>>

(5) Repetition of character sets

If you repeat a character set with the “?”, “*” or “+” operator, you repeat the entire character set, not just the character that it matched. The regular expression <<[0-9]+>> matches both 837 and 222.

If you want to repeat only the character that was matched, you can use a backreference. Backreferences are covered later.
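
To make the difference concrete, the sketch below contrasts a repeated set with a backreference. The pattern ([0-9])\1+ is an illustrative assumption about how the backreference variant could be written; the tutorial introduces backreferences properly later.

```python
import re

any_digits = r"[0-9]+"      # repeats the set: each digit may differ
same_digit = r"([0-9])\1+"  # repeats whatever digit the group matched

set_matches_837 = re.fullmatch(any_digits, "837") is not None
set_matches_222 = re.fullmatch(any_digits, "222") is not None
backref_matches_837 = re.fullmatch(same_digit, "837") is not None
backref_matches_222 = re.fullmatch(same_digit, "222") is not None

print(set_matches_837, set_matches_222)          # True True
print(backref_matches_837, backref_matches_222)  # False True
```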

6. Repetition with ?, * or +

?: tells the engine to match the preceding character zero times or once. In effect, it makes the preceding character optional.

+: tells the engine to match the preceding character one or more times.

*: tells the engine to match the preceding character zero or more times.

<<<[A-Za-z][A-Za-z0-9]*>>> matches an HTML tag without attributes; the “<” and “>” are literal characters. The first character set matches a letter, and the second character set matches a letter or digit.

It might seem that we could also use <<<[A-Za-z0-9]+>>>. However, that would also match <1>. Still, this regular expression is good enough when you know that the string you are searching does not contain such invalid tags.

(1) Limiting repetition

Many modern regular expression implementations let you specify how many times a character is repeated. The syntax is {min,max}, where min and max are non-negative integers. If the comma is present but max is omitted, there is no upper limit. If both the comma and max are omitted, the character is repeated exactly min times.

So {0,} is the same as *, and {1,} is the same as +.

You can use <<\b[1-9][0-9]{3}\b>> to match a number between 1000 and 9999 (“\b” denotes a word boundary). <<\b[1-9][0-9]{2,4}\b>> matches a number between 100 and 99999.
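
The two range patterns can be checked against boundary values in Python; the number lists are arbitrary test data chosen for this sketch.

```python
import re

# Exactly four digits, the first one non-zero: 1000 to 9999.
four_digits = re.findall(r"\b[1-9][0-9]{3}\b", "999 1000 9999 10000")
print(four_digits)  # ['1000', '9999']

# Three to five digits, the first one non-zero: 100 to 99999.
three_to_five = re.findall(r"\b[1-9][0-9]{2,4}\b", "99 100 99999 100000")
print(three_to_five)  # ['100', '99999']

# {1,} behaves like +.
plus_equivalent = (re.findall(r"a{1,}", "a aa b")
                   == re.findall(r"a+", "a aa b"))
print(plus_equivalent)  # True
```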

(2) Watch out for greediness

Suppose you want to match an HTML tag with a regular expression. You know that the input will be a valid HTML file, so the regular expression does not need to exclude invalid tags. Anything between two angle brackets should therefore be an HTML tag.

Many new regular expression users will first think of using the regular expression <<<.+>>>, and they will be surprised by the result on the test string “This is a <EM>first</EM> test”. You might expect it to return <EM> first, and then </EM> when you continue matching.

But that is not what happens. The regular expression matches “<EM>first</EM>”, which is obviously not the result we want. The reason is that “+” is greedy: it makes the regular expression engine repeat the preceding character as many times as possible. Only when this repetition causes the overall match to fail does the engine backtrack. That is, it gives up the last repetition and then continues with the rest of the regular expression.

Like “+”, the “?” and “*” repetition operators are greedy.
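
The greedy behaviour is easy to observe in Python with the same test string; this is a minimal sketch.

```python
import re

text = "This is a <EM>first</EM> test"

# Greedy "+": the dot runs to the end of the line, then the engine
# backtracks only as far as the last ">".
greedy_plus = re.search(r"<.+>", text)
print(greedy_plus.group())  # <EM>first</EM>

# "*" is greedy in the same way.
greedy_star = re.search(r"<.*>", text)
print(greedy_star.group())  # <EM>first</EM>
```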

(3) Inside the regular expression engine

Let’s see how the regex engine matches the previous example. The first token is “<”, which is a literal character. The second token is “.”, which matches the character “E”, and “+” then keeps repeating it to match the remaining characters up to the end of the line. At the line break, the dot fails to match (“.” does not match line breaks). The engine then moves on to the next token in the regular expression, “>”. At this point, “<.+” has matched “<EM>first</EM> test”. The engine tries to match “>” against the line break, which fails. So the engine backtracks: “<.+” now matches “<EM>first</EM> tes”, and the engine tries to match “>” against “t”. This obviously fails as well. The process continues until “<.+” matches “<EM>first</EM” and “>” matches “>”. The engine has thus found the match “<EM>first</EM>”. Remember that a regex-directed engine is “eager”, so it reports the first match it finds rather than backtracking further, even though there might be a better match such as “<EM>”. So, because “+” is greedy, the regular expression engine returns the longest match at the leftmost position.

(4) Replacing greediness with laziness

One possible fix for the problem above is to make the “+” lazy instead of greedy. You do this by putting a question mark “?” after the “+”. The same works for repetition expressed with “*”, “{}” and “?”. So in the example above we can use “<.+?>”. Let’s see how the regular expression engine processes it.

Again, the token “<” matches the first “<” in the string. The next token is “.”, this time repeated by a lazy “+”. This tells the regex engine to repeat the dot as few times as possible. So the engine matches “.” with the character “E”, and then tries to match “>” with “M”, which fails. The engine backtracks, but unlike in the previous example, the repetition is lazy, so the engine expands it rather than shrinking it: “<.+” now matches “<EM”. The engine then tries the next token “>” again, and this time it succeeds. The engine reports “<EM>” as a successful match, which is the result we wanted.
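
With the lazy quantifier, the same test string yields the two tags separately. A minimal Python sketch:

```python
import re

text = "This is a <EM>first</EM> test"

# Lazy "+?": the dot is repeated as few times as possible, so each
# match stops at the nearest ">".
lazy_tags = re.findall(r"<.+?>", text)
print(lazy_tags)  # ['<EM>', '</EM>']
```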

(5) An alternative to laziness

There is a better alternative: a greedy repetition combined with a negated character set, “<[^>]+>”. This is better because with lazy repetition the engine backtracks for every character it adds before it finds a successful match. With the negated character set, no backtracking is needed.
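
The negated character set gives the same tags as the lazy version on the test string, as this small Python sketch shows. It does not measure the backtracking difference; it only confirms that the results agree.

```python
import re

text = "This is a <EM>first</EM> test"

# Greedy repetition of "anything except >" cannot run past the closing
# bracket, so no backtracking is needed to find each tag.
negated_tags = re.findall(r"<[^>]+>", text)
print(negated_tags)  # ['<EM>', '</EM>']

same_as_lazy = negated_tags == re.findall(r"<.+?>", text)
print(same_as_lazy)  # True
```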

Finally, keep in mind that this tutorial only covers regex-directed engines. Text-directed engines do not backtrack, but they also do not support lazy repetition.

7. Use “.” to match almost any character

In regular expressions, “.” is one of the most commonly used symbols. Unfortunately, it is also one of the most easily misused symbols.

“.” matches a single character, no matter what that character is. The only exception is the newline character. The engines discussed in this tutorial do not match newline characters with the dot by default. So by default, “.” is shorthand for the character set [^\n\r] (Windows) or [^\n] (Unix).

This exception exists for historical reasons. Early tools that used regular expressions were line-based: they read a file line by line and applied the regular expression to each line separately. In these tools, the string never contained a newline character, so “.” never matched one.

Modern tools and languages can apply regular expressions to very large strings or even entire files. All regular expression implementations discussed in this tutorial provide an option to make “.” match all characters, including newlines. In tools like RegexBuddy, EditPad Pro or PowerGREP, you can simply tick the “dot matches newline” option. In Perl, the mode in which “.” matches newline characters is called “single-line mode”. Unfortunately, this is a very confusing term, because there is also a so-called “multi-line mode”. Multi-line mode only affects the anchors for the start and end of a line, while single-line mode only affects “.”.

Other languages and regular expression libraries have adopted Perl’s terminology. When using the regular expression classes of the .NET Framework, you can activate single-line mode with a statement like Regex.Match("string", "regex", RegexOptions.Singleline).
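
For comparison, Python calls the same option re.DOTALL (also available as re.S). The sketch below uses Python rather than the .NET call quoted above, and the sample text is invented.

```python
import re

text = "first line\nsecond line"

# By default the dot does not match the newline between the two lines.
default_dot = re.search(r"line.second", text)
print(default_dot)  # None

# With re.DOTALL ("single-line mode" in Perl terms) it does.
dotall = re.search(r"line.second", text, re.DOTALL)
print(repr(dotall.group()))  # 'line\nsecond'
```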

Use the dot “.” sparingly

The dot is the most powerful metacharacter. It lets you be lazy: with a dot, you can match almost any character. The problem is that it also often matches characters that should not be matched.

Here is a simple example. Let’s see how to match a date in mm/dd/yy format while letting the user choose the separator. A solution that quickly comes to mind is <<\d\d.\d\d.\d\d>>. It looks like it matches the date “02/12/03”. The problem is that 02512703 is also considered a valid date.

<<\d\d[-/.]\d\d[-/.]\d\d>> looks like a better solution. Remember that the dot is not a metacharacter inside a character set. This solution is still far from perfect: it matches “99/99/99”. <<[0-1]\d[-/.][0-3]\d[-/.]\d\d>> goes a step further, although it still matches “19/39/99”. How perfect your regular expression needs to be depends on what you want to achieve. If you are validating user input, it needs to be as strict as possible. If you are only parsing a known source that you know contains no bad data, a reasonably good regular expression that matches the text you are looking for is enough.
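
The three date patterns can be compared side by side in Python. The helper function accepts is a hypothetical name introduced for this sketch, and it uses re.fullmatch so that the whole input has to match.

```python
import re

loose = r"\d\d.\d\d.\d\d"
better = r"\d\d[-/.]\d\d[-/.]\d\d"
stricter = r"[0-1]\d[-/.][0-3]\d[-/.]\d\d"


def accepts(pattern: str, candidate: str) -> bool:
    """Return True if the whole candidate string matches the pattern."""
    return re.fullmatch(pattern, candidate) is not None


print(accepts(loose, "02/12/03"))     # True
print(accepts(loose, "02512703"))     # True: the dot matches the digits 5 and 7
print(accepts(better, "02512703"))    # False
print(accepts(better, "99/99/99"))    # True: still not a real date
print(accepts(stricter, "99/99/99"))  # False
print(accepts(stricter, "19/39/99"))  # True: month 19 and day 39 slip through
```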

8. Anchors for the start and end of the string

Unlike ordinary regular expression tokens, anchors do not match any characters. Instead, they match a position before or after characters. “^” matches the position before the first character of the string. <<^a>> matches the a in the string “abc”. <<^b>> does not match anything in “abc”.

Similarly, “$” matches the position after the last character in the string. So <<c$>> matches the c in “abc”.

(1) Applications of anchors

Anchors are very important when validating user input in a programming language. If you want to verify that the user’s input is an integer, use <<^\d+$>>.

User input often contains superfluous leading or trailing spaces. You can use <<^\s*>> and <<\s*$>> to match leading or trailing whitespace.
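
Both uses of anchors can be sketched in Python. The inputs are made-up examples; in real code the built-in str.strip() would normally be used for trimming, so the regex version is shown only to illustrate the anchors.

```python
import re

# Validation: the anchors force the digits to span the whole input.
is_integer = re.search(r"^\d+$", "12345") is not None
not_integer = re.search(r"^\d+$", "123abc") is not None
unanchored = re.search(r"\d+", "123abc") is not None
print(is_integer, not_integer, unanchored)  # True False True

# Trimming: remove leading whitespace, then trailing whitespace.
raw_input_text = "  42  "
trimmed = re.sub(r"\s*$", "", re.sub(r"^\s*", "", raw_input_text))
print(repr(trimmed))  # '42'
```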

(2) Use “^” and “$” as line start and end anchors

Suppose you have a string that contains more than one line, for example “first line\nsecond line” (where \n represents a newline character). You often need to work with each line separately rather than with the entire string. For this reason, almost all regular expression engines provide an option that extends the meaning of both anchors. “^” then matches at the start of the string (before the f) and at the position after each newline character (between \n and s). Similarly, “$” matches at the end of the string (after the last e) and before each newline character (between e and \n).

In .NET, the following code makes the anchors match before and after each newline character: Regex.Match("string", "regex", RegexOptions.Multiline)

Application: string str = Regex.Replace(original, "^", ">", RegexOptions.Multiline) inserts “>” at the beginning of each line.
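
The Python equivalent of this option is re.MULTILINE. The sketch below uses a two-line string with a plain \n line break; it mirrors the .NET calls above but is not a translation of any particular program.

```python
import re

text = "first line\nsecond line"

# Without the flag, "^" only matches at the start of the whole string.
without_flag = re.findall(r"^\w+", text)
print(without_flag)  # ['first']

# With re.MULTILINE, "^" and "$" also match around each line break.
line_starts = re.findall(r"^\w+", text, re.MULTILINE)
line_ends = re.findall(r"\w+$", text, re.MULTILINE)
print(line_starts)  # ['first', 'second']
print(line_ends)    # ['line', 'line']

# Insert ">" at the beginning of every line.
quoted = re.sub(r"^", ">", text, flags=re.MULTILINE)
print(repr(quoted))  # '>first line\n>second line'
```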

(3) Absolute anchors

<<\A>> only ever matches at the start of the whole string, and <<\Z>> only ever matches at the end of the whole string. Even in multi-line mode, <<\A>> and <<\Z>> never match at newline characters.

Even though \Z and $ only match at the end of the string, there is one exception. If the string ends with a newline character, \Z and $ match at the position before that newline rather than at the very end of the string. This “improvement” was introduced by Perl and has been adopted by many regular expression implementations, including Java and .NET. If you apply <<^[a-z]+$>> to “joe\n”, the match result is “joe” rather than “joe\n”.
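
The trailing-newline exception for “$” can be observed in Python as well. One caveat, which concerns Python and not the Perl, Java or .NET behaviour described above: Python’s \Z matches only at the very end of the string, so in this sketch only “$” shows the exception.

```python
import re

# "$" matches before a single trailing newline.
dollar = re.search(r"^[a-z]+$", "joe\n")
print(repr(dollar.group()))  # 'joe'

# Python's \Z is strict: it does not match before the trailing newline.
strict_end = re.search(r"^[a-z]+\Z", "joe\n")
print(strict_end)  # None

# \A never matches after a line break, even in multi-line mode.
start_only = re.findall(r"\A\w+", "first\nsecond", re.MULTILINE)
print(start_only)  # ['first']
```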

That concludes this detailed introduction to regular expressions; hopefully it helps you understand them better.