Regular expressions in C#: learning materials

Time:2021-11-25

Regular expressions in C#

Jeffrey E. F. Friedl wrote a book on regular expressions, Mastering Regular Expressions. To help readers understand and master regular expressions, the author made up a story. The examples in the book are mainly in Perl. As far as I know, regular expressions in C# are also modeled on Perl 5, so the two should have a lot in common. http://ike.126.com
In fact, I do not intend to translate the book verbatim. First, the book is long, and I am not qualified to translate it; second, if I translated it and converted the code to C# without the original author’s consent, I might be infringing copyright. So let’s treat this as reading notes.

Skipping the lengthy preface, we can go directly to Chapter 1:  

Introducing regular expressions  

The author says this chapter is written for absolute rookies in regular expressions, to lay a solid foundation for later chapters. If you are not a rookie, you can skip it.

Story scene:  
The head of your archives department wants a tool that checks for doubled words (e.g. “this this”), a common problem when editing a large number of documents. Your job is to create a solution that will:
Accept any number of files to check, report each line of each file that contains doubled words, highlight the doubled words, and make sure the source file name appears with each line in the report.
Work across lines, finding a word at the end of one line that is repeated at the beginning of the next.
Find doubled words regardless of differences in case (e.g. “The the”), and allow any amount of whitespace (spaces, tabs, newlines, etc.) between them.
Find doubled words even when they are separated by HTML tags (for example: “...it is <B>very</B> very important.”).

To solve these practical problems, we first need to write regular expressions that find the text we want and ignore the text we don’t need, and then use our C# code to process the text they find.

Even before using regular expressions, you may already know more or less what they are. And even if you don’t, you are almost certainly familiar with the basic idea.
You know that report.txt is a specific file name, but if you have any UNIX or DOS/Windows experience, you also know that “*.txt” can be used to select multiple files. In a file name pattern like this, some characters have special meanings: an asterisk matches any sequence of characters, and a question mark matches exactly one character. For example, “*.txt” means any file whose name ends in .txt.
File name patterns are a form of pattern matching with a limited set of special characters. Search engines on the web also allow searches with a few special characters. Regular expressions offer a much richer set of such characters and can handle far more complex problems.

First, we introduce two position matchers:  
^  :  Represents the beginning of a line of text  
$  :  Indicates the end position of a line of text  

For example, the expression “^cat” matches the word cat only when it appears at the beginning of a line. Note that ^ matches a position; it is not itself a character to be matched.
Similarly, the expression “cat$” matches cat only when it appears at the end of a line.
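Here is a minimal C# sketch of the two anchors, assuming the Regex class from System.Text.RegularExpressions. The sample text and variable names are made up for illustration. One detail that is not in the notes above: in .NET, ^ and $ apply to the whole string by default, and RegexOptions.Multiline is what makes them work line by line.

```csharp
// Written as top-level statements (C# 9 or later); in older projects, put the body inside Main.
using System;
using System.Text.RegularExpressions;

// Three lines of made-up sample text.
string text = "cat nap\nbobcat\ncatalog";

// With Multiline, ^ and $ match at the start and end of every line.
int atLineStart = Regex.Matches(text, "^cat", RegexOptions.Multiline).Count;
int atLineEnd = Regex.Matches(text, "cat$", RegexOptions.Multiline).Count;

// Without the option, ^ only matches at the very start of the string.
int atStringStart = Regex.Matches(text, "^cat").Count;

Console.WriteLine($"^cat per line: {atLineStart}");
Console.WriteLine($"cat$ per line: {atLineEnd}");
Console.WriteLine($"^cat without Multiline: {atStringStart}");
```

The line “bobcat” is counted for “cat$” but not for “^cat”, which shows that the anchors constrain where the match sits rather than adding characters to it.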

Next, we introduce square brackets “[]”, which match any one of the characters listed inside them. For example:
The expression “[0123456789]” matches any one digit from 0 to 9.
If we want to find all text that contains gray or grey, we can write the expression “gr[ea]y”.
Here [ea] means match either e or a, not the sequence ea.

If we want to match the HTML tags <H1>, <H2>, <H3>, <H4>, <H5> and <H6>, we can write the expression
“<H[123456]>”. But what if we want to match any one of a long run of characters? Do we have to write every character inside the square brackets? Fortunately not. We introduce the range symbol “-”.
With the range symbol, we only need to give the two boundary characters of a range. For the HTML example above, we can write “<H[1-6]>”.
Is the expression “[0-9a-zA-Z]” clear now? It matches any one character that is a digit, one of the 26 lowercase letters, or one of the 26 uppercase letters.
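The bracket expressions above can be tried out with a short C# sketch like the following. The sample strings and variable names are made up; only the three expressions come from the text.

```csharp
using System;
using System.Text.RegularExpressions;

// A listed set: either e or a between "gr" and "y".
bool gray = Regex.IsMatch("a gray cat", "gr[ea]y");
bool grey = Regex.IsMatch("a grey cat", "gr[ea]y");
// [ea] stands for one character, so "greay" is not matched.
bool greay = Regex.IsMatch("a greay cat", "gr[ea]y");

// A range: any heading tag from <H1> to <H6>.
bool h3 = Regex.IsMatch("<H3>Title</H3>", "<H[1-6]>");
bool h7 = Regex.IsMatch("<H7>Title</H7>", "<H[1-6]>");

// Several ranges in one class: the first letter or digit in the text.
string first = Regex.Match("-- x9 --", "[0-9a-zA-Z]").Value;

Console.WriteLine($"gray={gray} grey={grey} greay={greay}");
Console.WriteLine($"H3={h3} H7={h7} first={first}");
```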

The “^” symbol inside []
If you see an expression such as “[^0-9]”, the “^” is no longer the position symbol described earlier. As the first character inside the brackets it is a negation symbol, indicating exclusion: the expression matches any one character that is not a digit from 0 to 9.
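One practical use of a negated class, sketched in C#: removing everything that is not a digit. The sample string is made up, and Regex.Replace is simply one convenient way to show what the class matches.

```csharp
using System;
using System.Text.RegularExpressions;

// Made-up sample input.
string phone = "Tel: 555-0199";

// Replace every character that is NOT a digit with nothing.
string digitsOnly = Regex.Replace(phone, "[^0-9]", "");

// A string made only of digits contains no match for [^0-9].
bool hasNonDigit = Regex.IsMatch("2021", "[^0-9]");

Console.WriteLine(digitsOnly);
Console.WriteLine(hasNonDigit);
```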

Think 1: what does the expression “q[^u]” mean? Given the following words, which ones will it match?
Iraqi
Iraqian
miqra
qasida
qintar
qoph
zaqqum
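If you want to check your answer, a small C# sketch like this one will do it. The two extra words are not in the list above; they are added here for contrast, because a negated class still has to match some character after the q.

```csharp
using System;
using System.Linq;
using System.Text.RegularExpressions;

string[] words = { "Iraqi", "Iraqian", "miqra", "qasida", "qintar", "qoph", "zaqqum" };
// Added for contrast (not part of the exercise list).
string[] extras = { "Iraq", "quit" };

var pattern = new Regex("q[^u]");

// Keep only the words in which the expression finds a match.
string[] matched = words.Where(w => pattern.IsMatch(w)).ToArray();
string[] matchedExtras = extras.Where(w => pattern.IsMatch(w)).ToArray();

Console.WriteLine("Matched: " + string.Join(", ", matched));
Console.WriteLine("Extras matched: " + matchedExtras.Length);
```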

In addition to bracket expressions, there is the dot “.”, which matches any single character (by default, any character except a newline).
For example, the expression “07.04.76” will match:
07/04/76, 07-04-76, 07.04.76.
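A quick C# check of the dot, with made-up sample strings. The last sample is included to show that the dot is not limited to separators: it accepts digits too.

```csharp
using System;
using System.Linq;
using System.Text.RegularExpressions;

var date = new Regex("07.04.76");

// The first three samples come from the text; the last one is added here.
string[] samples = { "07/04/76", "07-04-76", "07.04.76", "07504976" };

foreach (string s in samples)
    Console.WriteLine($"{s}: {date.IsMatch(s)}");

bool allMatch = samples.All(s => date.IsMatch(s));
// Each dot must consume one character, so a string with no separators fails.
bool tooShort = date.IsMatch("070476");
```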

When we need to choose among several alternatives, we can use the alternation symbol “|”:
It means “or”. For example, the expression “Bob|Robert” matches either Bob or Robert.
Now look at the expression from earlier, “gr[ea]y”. Using alternation we can write “grey|gray”, which matches the same text.
Use of parentheses: parentheses are also metacharacters in expressions. For example, the previous expression can be written as “gr(e|a)y”. The parentheses here are necessary: without them, the expression “gre|ay” matches gre or ay, which is not what we want. If this is still unclear, take a look at the following example:
Find all lines in an e-mail that start with From:, Subject: or Date:. Compare the following two expressions:
Expression 1: “^From|Subject|Date: ”
Expression 2: “^(From|Subject|Date): ”
Which one do we want?
Obviously, expression 1 is not what we want: it matches From at the beginning of a line, or Subject anywhere, or “Date: ” anywhere. Expression 2 meets our needs by enclosing the alternatives in parentheses.
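The difference between the two expressions can be seen in a C# sketch like this. The lines and the address are made up for illustration; only the first three are meant to look like real header lines.

```csharp
using System;
using System.Linq;
using System.Text.RegularExpressions;

// Made-up sample lines.
string[] lines =
{
    "From: alice@example.com",
    "Subject: hello",
    "Date: Mon, 1 Jan",
    "Re: Subject matter",
    "Due Date: Friday",
};

// Without parentheses, ^ belongs only to From and ": " only to Date.
var loose = new Regex("^From|Subject|Date: ");
// With parentheses, ^ and ": " apply to all three alternatives.
var strict = new Regex("^(From|Subject|Date): ");

int looseCount = lines.Count(l => loose.IsMatch(l));
int strictCount = lines.Count(l => strict.IsMatch(l));
Console.WriteLine($"Expression 1 accepts {looseCount} lines, expression 2 accepts {strictCount}");

// The same effect on a smaller scale: gre|ay versus gr(e|a)y.
string stray = Regex.Match("okay", "gre|ay").Value;
bool grouped = Regex.IsMatch("okay", "gr(e|a)y");
Console.WriteLine($"gre|ay finds \"{stray}\" in okay; gr(e|a)y matches okay: {grouped}");
```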

Word boundary  
We can already match text at the beginning and end of a line, but what if we want to anchor a match somewhere else? For that we need the word boundary symbol, “\b”. The backslash cannot be omitted; without it, the expression would match the letter b. With word boundary symbols, we can require that a match begin or end at the edge of a word, not in the middle of one. For example, in the string “This is a cat.”, the expression “\bis\b” matches the word “is” but not the “is” inside “This”.
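A small C# sketch of the word boundary example. The expression is written as a verbatim string (prefixed with @) so that the backslash reaches the regex engine unchanged; the variable names are made up.

```csharp
using System;
using System.Text.RegularExpressions;

string text = "This is a cat.";

// With word boundaries, only the standalone word "is" is found.
MatchCollection wholeWord = Regex.Matches(text, @"\bis\b");
// Without them, the "is" inside "This" is found as well.
MatchCollection anywhere = Regex.Matches(text, "is");

Console.WriteLine($"\\bis\\b: {wholeWord.Count} match at index {wholeWord[0].Index}");
Console.WriteLine($"is: {anywhere.Count} matches");
```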

String boundary symbol  
In addition to the position symbols above, if we want to match an entire string (which may contain many words), we can use the following two symbols:
\A  : Represents the beginning of the string;  
\z  : Represents the end of a string.  
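The expression “\AThis is a cat\z” matches the string “This is a cat”.
An important concept when using boundary symbols is the word character. Word characters are the characters that can make up a word: any character in [a-zA-Z0-9] (in .NET, the underscore and non-ASCII letters also count as word characters). This is why an expression delimited by word boundaries, such as “\bThis is a cat\b”, also finds a match in the sentence “This is a cat.”, and the match does not include the period. The expression “\AThis is a cat\z”, by contrast, does not match that sentence, because the period lies between cat and the end of the string.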
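The following C# sketch compares string anchors with word boundaries on the same sentence. The variable names are made up; as far as I can tell, the behavior shown is that \A and \z pin the match to the two ends of the string, while \b only requires the edge of a word.

```csharp
using System;
using System.Text.RegularExpressions;

string anchored = @"\AThis is a cat\z";

// The whole string must be exactly "This is a cat".
bool exact = Regex.IsMatch("This is a cat", anchored);
// A trailing period sits between "cat" and the end of the string, so \z fails.
bool withPeriod = Regex.IsMatch("This is a cat.", anchored);

// Word boundaries only need a non-word character (or the string end) next to the match.
Match bounded = Regex.Match("This is a cat.", @"\bThis is a cat\b");

Console.WriteLine($"exact={exact} withPeriod={withPeriod}");
Console.WriteLine($"bounded=\"{bounded.Value}\"");
```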

Repeated quantity symbol  
Let’s look at the expression “colou?r”. It contains a question mark, which we have not seen in an expression before (its meaning here differs from the question mark in file name patterns). A repetition symbol specifies how many times the character immediately before it may occur, and “?” means 0 or 1 times. In this expression, the question mark says that u may appear 0 or 1 times, so the expression matches “color” or “colour”.
The following are other repeated quantity symbols:  
+  : Indicates one or more times  
*  : Indicates 0 or more times  
For example, to match one or more spaces, we can write the expression “ +” (a space followed by a plus sign).
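The three repetition symbols can be sketched in C# as follows. The expression “ab*c” and the sample strings are made up for illustration; “colou?r” and “ +” come from the text.

```csharp
using System;
using System.Text.RegularExpressions;

// ? : the u may appear 0 or 1 times.
var colour = new Regex("colou?r");
bool american = colour.IsMatch("color");
bool british = colour.IsMatch("colour");
bool doubled = colour.IsMatch("colouur");   // two u's are too many

// + : one or more spaces, here collapsed into a single space.
string squeezed = Regex.Replace("a   b    c", " +", " ");

// * : zero or more b's between a and c.
bool noB = Regex.IsMatch("ac", "ab*c");
bool manyB = Regex.IsMatch("abbbc", "ab*c");

Console.WriteLine($"color={american} colour={british} colouur={doubled}");
Console.WriteLine($"squeezed=\"{squeezed}\" ac={noB} abbbc={manyB}");
```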

What if you want to specify an exact number of repetitions? For that we introduce braces, {}.
{n}  :  n is a specific number, meaning exactly n repetitions.
{n,m}:  at least n and at most m repetitions.
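A short C# sketch of counted repetition. The two expressions and the sample strings are made up; they are wrapped in ^ and $ so that the count applies to the whole input rather than to some part of it.

```csharp
using System;
using System.Text.RegularExpressions;

// Exactly four digits, and nothing else in the string.
var fourDigits = new Regex("^[0-9]{4}$");
// Between two and four digits.
var twoToFour = new Regex("^[0-9]{2,4}$");

bool year = fourDigits.IsMatch("2021");
bool tooFew = fourDigits.IsMatch("202");
bool tooMany = fourDigits.IsMatch("20211");

bool two = twoToFour.IsMatch("21");
bool four = twoToFour.IsMatch("2021");
bool one = twoToFour.IsMatch("2");
bool five = twoToFour.IsMatch("20211");

Console.WriteLine($"{{4}}: 2021={year} 202={tooFew} 20211={tooMany}");
Console.WriteLine($"{{2,4}}: 21={two} 2021={four} 2={one} 20211={five}");
```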

These symbols limit the repetitions of the single character before them. But what if you want to repeat several characters, such as a word? We use parentheses again. Earlier we used parentheses to delimit alternatives; here is another use: they form a group. In the expression “(this)”, for example, this is a group. A repetition symbol placed after a group applies to the whole group.

Now back to the problem of finding doubled words. If we want to find “the the”, what we have learned so far lets us write the expression:
“\bthe +the\b” 

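This expression matches two occurrences of the separated by one or more spaces.
We can write something similar with a group and a repetition count (note that this version also requires at least one space after the second the):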
“\b(the +){2}” 
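The two expressions above are close but not identical, as this C# sketch suggests. The sample sentences are made up. The grouped form repeats “the” plus its trailing spaces, so it needs a space after the second word.

```csharp
using System;
using System.Text.RegularExpressions;

string plain = @"\bthe +the\b";
string grouped = @"\b(the +){2}";

// Both find the doubled word in the middle of a sentence.
string a = Regex.Match("see the   the cat", plain).Value;
string b = Regex.Match("see the   the cat", grouped).Value;   // includes the trailing space

// At the end of a sentence only the first form still matches.
bool plainAtEnd = Regex.IsMatch("see the the.", plain);
bool groupedAtEnd = Regex.IsMatch("see the the.", grouped);

// The final \b keeps "the theory" from being reported.
bool theory = Regex.IsMatch("the theory", plain);

Console.WriteLine($"plain=\"{a}\" grouped=\"{b}\"");
Console.WriteLine($"plainAtEnd={plainAtEnd} groupedAtEnd={groupedAtEnd} theory={theory}");
```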

But what if you want to find every possible doubled word? What we know so far is not enough to solve this problem. Next, we introduce the concept of the back reference. We have seen that parentheses delimit groups, and an expression can contain several groups. Groups are numbered by default in the order in which they appear: the first group is number 1, and so on. A back reference uses “\n”, where n is a group number, to refer to that group at a later position in the expression; it matches the same text that the group matched. A back reference works much like a variable in a program. Let’s look at a concrete example:
Using a back reference, the doubled-word expression above can now be written as:
“\b(the) +\1\b” 

Now, if we want to match all repeated words, we can rewrite the expression as:  
“\b([a-zA-Z]+) +\1\b” 
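Here is a C# sketch that runs the back reference expression over a made-up sentence. The second call adds RegexOptions.IgnoreCase, which this chapter has not introduced; it is included only as a hint at how the “The the” requirement from the story might be approached.

```csharp
using System;
using System.Linq;
using System.Text.RegularExpressions;

// Made-up sample sentence with three doubled words.
string text = "This is is a test of the the regex and The the end";
string pattern = @"\b([a-zA-Z]+) +\1\b";

// Case-sensitive: \1 must repeat the group's text exactly.
string[] exact = Regex.Matches(text, pattern)
    .Cast<Match>()
    .Select(m => m.Value)
    .ToArray();

// With IgnoreCase, the back reference is compared without regard to case.
string[] anyCase = Regex.Matches(text, pattern, RegexOptions.IgnoreCase)
    .Cast<Match>()
    .Select(m => m.Value)
    .ToArray();

Console.WriteLine("Exact: " + string.Join(" | ", exact));
Console.WriteLine("Ignoring case: " + string.Join(" | ", anyCase));
```

The leading \b is what stops the “is” at the end of “This” from pairing with the following “is”.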

The last question: what if the character we want to match is itself a regular expression symbol? The answer is the escape symbol “\”. For example, to match a decimal point, write “\.”. Also note that when you use an expression in a program, C# string rules require you to write each “\” as “\\”, or to prefix the string with @.
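The two ways of writing a backslash in C# produce the same expression, as this sketch shows. The sample strings are made up. Regex.Escape is a .NET helper that is not mentioned in the notes above; it is shown as one way to escape text you did not write yourself.

```csharp
using System;
using System.Text.RegularExpressions;

// Two spellings of the same two-character expression: backslash, dot.
string escaped = "\\.";
string verbatim = @"\.";
bool same = escaped == verbatim;

// An escaped dot matches only a literal dot.
bool literalDot = Regex.IsMatch("3.14", @"3\.14");
bool notADot = Regex.IsMatch("3x14", @"3\.14");
// An unescaped dot matches any character, including x.
bool anyChar = Regex.IsMatch("3x14", "3.14");

// Regex.Escape adds the backslashes for you.
string auto = Regex.Escape("1+1=2");

Console.WriteLine($"same={same} literalDot={literalDot} notADot={notADot} anyChar={anyChar}");
Console.WriteLine(auto);
```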

This chapter gives rookies only the basics of regular expressions, and only part of them. There is much more to learn, which later chapters will introduce. Learning regular expressions is not difficult; mastering them takes patience and practice. Some people say, “I don’t want to know the details of the car, I just want to learn how to drive.” If you think that way, you will never learn how to use regular expressions to solve your own problems, and you will never understand their real power.