Four rules of regular expression application

Time:2021-12-30

I have been meaning to write a summary of this topic but never found the time. Today I came across a good article, so I am sharing it here. Enjoy the power of regular expressions!
The following is the text:
————————————————————

Regular expressions provide an efficient and convenient way to match string patterns. Almost all high-level languages support regular expressions natively or offer a ready-made library for them. This article introduces practical regular expression techniques, using common processing tasks in the ASP environment as examples.

1. Checking the format of passwords and email addresses

Our first example demonstrates a basic capability of regular expressions: describing complex strings in the abstract. Regular expressions give programmers a formal way to describe strings, so that the string patterns an application encounters can be described with very little code. For example, for non-technical people the password format requirements can be stated as follows: the first character of the password must be a letter, the password must be at least 4 and no more than 15 characters long, and the password cannot contain characters other than letters, numbers, and underscores.

As programmers, we must convert this natural-language description of the password format into a form that an ASP page can understand and apply to reject invalid passwords. The regular expression describing this password format is: ^[a-zA-Z]\w{3,14}$.

In the ASP application, we can write the password verification process as a reusable function, as shown below:

  Function TestPassword(strPassword)
    Dim re
    Set re = New RegExp

    re.IgnoreCase = False   ' the pattern already lists both letter cases
    re.Global = False       ' a single match is enough for validation
    re.Pattern = "^[a-zA-Z]\w{3,14}$"

    TestPassword = re.Test(strPassword)
  End Function

Let’s compare the regular expression for checking the password format with the natural language description:

The first character of the password must be a letter: the regular expression is "^[a-zA-Z]", where "^" marks the beginning of the string and the hyphen tells RegExp to match every character in the specified range.

The password must be at least 4 characters and no more than 15 characters: the regular expression is "{3,14}".

The password cannot contain characters other than letters, numbers, and underscores: the regular expression is "\w".

Notes: {3,14} means that the preceding pattern matches at least 3 but no more than 14 characters (4 to 15 characters once the first character is counted). The syntax inside the curly braces is extremely strict: no spaces are allowed on either side of the comma. Adding a space changes the meaning of the regular expression and causes password validation to go wrong. In addition, the fragments above leave out the trailing "$" character. In the complete pattern, "$" makes the regular expression match through to the end of the string, ensuring that nothing else follows a valid password.
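For readers who want to try the pattern outside ASP, here is a minimal Python sketch of the same check. It is a counterpart to the VBScript function, not part of the original article; the name is_valid_password is hypothetical, and the re.ASCII flag is an assumption about the intended character set.

```python
import re

# Same pattern as the VBScript function. re.ASCII limits \w to
# [A-Za-z0-9_]; without it Python would also accept non-ASCII letters.
PASSWORD_RE = re.compile(r"^[a-zA-Z]\w{3,14}$", re.ASCII)


def is_valid_password(password: str) -> bool:
    """Return True when the password follows the three rules above."""
    # fullmatch is used because "$" in Python also matches just before a
    # trailing newline, which would otherwise let such input through.
    return PASSWORD_RE.fullmatch(password) is not None


# Try a few candidates: valid, too short, leading digit, contains a space,
# exactly 15 characters, and 16 characters.
for candidate in ["a123", "abc", "1abcd", "ab cd", "a" * 15, "a" * 16]:
    print(repr(candidate), is_valid_password(candidate))
```

The 15-character candidate is accepted and the 16-character one is rejected, which is the {3,14} quantifier plus the leading letter at work.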

Like password validation, checking the validity of an email address is a common task. A simple email address check with a regular expression can be implemented as follows:

  <%
  Dim re
  Set re = New RegExp

  re.Pattern = "^\w+@[a-zA-Z_]+?\.[a-zA-Z]{2,3}$"
  Response.Write re.Test("user@example.com") ' placeholder address
  %>
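The same kind of check can be sketched in Python. This sketch assumes the pattern reads ^\w+@[a-zA-Z_]+?\.[a-zA-Z]{2,3}$ (word characters, an "@", a single domain label, a dot, and a two- or three-letter suffix); the function name and the sample addresses are illustrative only.

```python
import re

# Assumed pattern: word characters, "@", one domain label made of letters
# or underscores, a dot, then a 2-3 letter suffix.
EMAIL_RE = re.compile(r"^\w+@[a-zA-Z_]+?\.[a-zA-Z]{2,3}$", re.ASCII)


def looks_like_email(address: str) -> bool:
    """Very rough format check; it does not prove the address exists."""
    return EMAIL_RE.fullmatch(address) is not None


for address in ["user@example.com", "user@example",
                "first.last@example.com", "user@mail.example.com"]:
    print(address, looks_like_email(address))
```

The check is deliberately simple: it rejects addresses with dots in the local part, sub-domains, or suffixes longer than three letters, so a production form would need a more permissive rule.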
——————————————————
2. Extracting specific parts of an HTML page

The main difficulty in extracting content from an HTML page is finding a way to identify exactly the part we want. For example, here is a snippet of HTML code that displays news headlines:

  <table border="0" width="11%" class="Somestory">
  <tr>
  <td width="100%">
  <p align="center">other contents…</td>
  </tr>
  </table>
  <table border="0" width="11%" class="Headline">
  <tr>
  <td width="100%">
  <p align="center">Iraq war!</td>
  </tr>
  </table>
  <table border="0" width="11%" class="Someotherstory">
  <tr>
  <td width="100%">
  <p align="center">other contents…</td>
  </tr>
  </table>

Looking at the code above, it is easy to see that the news headline is displayed by the middle table, whose class attribute is set to Headline. If the HTML page is very complex, an add-on feature available for Microsoft IE 5.0 and later lets you view only the HTML code of the selected part of the page. Visit http://www.microsoft.com/Windows/ie/WebAccess/default.ASP to learn more. For this example, we assume that this is the only table whose class attribute is set to Headline. Now we need to create a regular expression that finds the Headline table so that we can include the table in our own page. The first step is to write the code that sets up regular expression support:

  <%
  Dim re, strHTML
  Set re = New RegExp ' create regular expression object

  re.IgnoreCase = True
  re.Global = False ' end search after first match
  %>

Now consider the area we want to extract: the whole <table> structure, including the end tag and the text of the news headline. The search should therefore begin at the <table> start tag: re.Pattern = "<table.*(?=Headline)". This regular expression matches the start tag of the table and returns everything from the start tag up to "Headline" (line breaks excluded, because "." does not match them). The following code returns the matched HTML:

  ' put all matching HTML code into the Matches collection
  Set Matches = re.Execute(strHTML)

  ' show all matching HTML code
  For Each Item In Matches
    Response.Write Item.Value
  Next

  ' show only the first match
  Response.Write Matches.Item(0).Value

Run this code against the HTML fragment shown above, and the regular expression returns a single match: <table border="0" width="11%" class=". The "(?=Headline)" lookahead in the regular expression does not consume the characters it checks, so the value of the table's class attribute does not appear in the result. The code to get the rest of the table is also quite simple: re.Pattern = "<table.*(?=Headline)(.|\n)*?</table>". Here the "*" after "(.|\n)" matches zero or more arbitrary characters, and the "?" makes the "*" match as few characters as possible before the next part of the expression is found. </table> is the end tag of the table.
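Both patterns can be tried in Python as a rough cross-check. In this sketch re.IGNORECASE stands in for IgnoreCase = True and re.search returns only the first match, like Global = False; the variable names are illustrative and the HTML string is the fragment from this section.

```python
import re

HTML = """<table border="0" width="11%" class="Somestory">
<tr>
<td width="100%">
<p align="center">other contents…</td>
</tr>
</table>
<table border="0" width="11%" class="Headline">
<tr>
<td width="100%">
<p align="center">Iraq war!</td>
</tr>
</table>
<table border="0" width="11%" class="Someotherstory">
<tr>
<td width="100%">
<p align="center">other contents…</td>
</tr>
</table>"""

# The lookahead (?=Headline) must succeed at the current position but
# contributes nothing to the matched text.
start_only = re.search(r"<table.*(?=Headline)", HTML, re.IGNORECASE)

# Adding (.|\n)*?</table> extends the match to the nearest end tag.
whole_table = re.search(r"<table.*(?=Headline)(.|\n)*?</table>",
                        HTML, re.IGNORECASE)

print(start_only.group(0))
print(whole_table.group(0))
```

The first print shows the start tag cut off just before "Headline"; the second shows only the Headline table, from its start tag to its own end tag.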

The "?" qualifier is important because it prevents the expression from returning code from other tables. For example, for the HTML fragment given earlier, if you delete this "?", the returned content will be:

  <table border="0" width="11%" class="Headline">
  <tr>
  <td width="100%">
  <p align="center">Iraq war!</td>
  </tr>
  </table>
  <table border="0" width="11%" class="Someotherstory">
  <tr>
  <td width="100%">
  <p align="center">other contents…</td>
  </tr>
  </table>

   
The returned content includes not only the <table> structure of the Headline table but also the Someotherstory table. As you can see, the "?" here is essential.
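The difference between the lazy and the greedy form is easy to reproduce on a shortened, two-table input. The following Python sketch uses a made-up compact HTML string (not the article's full fragment) purely to keep the example small.

```python
import re

# Compact stand-in for the page: the Headline table followed by another one.
HTML = (
    '<table class="Headline">\n<tr><td>Iraq war!</td></tr>\n</table>\n'
    '<table class="Someotherstory">\n<tr><td>other contents</td></tr>\n</table>'
)

# Lazy: "*?" stops at the first </table> that lets the match succeed.
lazy = re.search(r"<table.*(?=Headline)(.|\n)*?</table>", HTML, re.IGNORECASE)

# Greedy: "*" runs to the end of the input and backs up to the last </table>.
greedy = re.search(r"<table.*(?=Headline)(.|\n)*</table>", HTML, re.IGNORECASE)

print(lazy.group(0).count("</table>"))    # number of tables captured lazily
print(greedy.group(0).count("</table>"))  # number of tables captured greedily
```

With the lazy quantifier the match ends at the Headline table's own end tag; without it the match swallows every table that follows.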

This example rests on some rather idealized assumptions. Real applications are often much more complex, and writing the ASP code is particularly difficult when you have no control over how the source HTML is written. The most effective approach is to spend extra time analyzing the HTML around the content you want to extract and to test frequently, making sure that the extracted content is exactly what you need. You should also handle the case where the regular expression matches nothing in the source HTML page. Content can change very quickly, so do not let your page show an embarrassing error just because someone else changed the format of their content.
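One way to handle the no-match case is to return a fallback value instead of indexing into an empty result. The Python sketch below illustrates the idea; extract_headline and its fallback parameter are hypothetical names, and the fallback text is only an example.

```python
import re

HEADLINE_RE = re.compile(r"<table.*(?=Headline)(.|\n)*?</table>",
                         re.IGNORECASE)


def extract_headline(html: str, fallback: str = "") -> str:
    """Return the Headline table, or the fallback when nothing matches."""
    match = HEADLINE_RE.search(html)
    if match is None:
        # The source page changed or is empty: degrade gracefully.
        return fallback
    return match.group(0)


page = '<table class="Headline">\n<tr><td>Iraq war!</td></tr>\n</table>'
print(extract_headline(page))
print(extract_headline("<p>redesigned page</p>", "No headline available"))
```

In the VBScript version the equivalent guard would be to check Matches.Count before reading Matches.Item(0).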
—————————————————-
3. Parsing text data files

Data files come in many formats and types. XML documents, structured text, and even unstructured text often serve as data sources for ASP applications. The example below is a structured text file that uses qualifiers. A qualifier (such as a quotation mark) indicates that everything it encloses is an indivisible string, even if the string contains the separators that divide records into fields.

Here is a simple structured text file:

Last name, first name, telephone number, description
Sun, Wukong, 312 555 5656, ASP is very good
Pig, Bajie, 847 555 5656, I’m a film producer

This file is very simple. Its first line is the header, and the next two lines are records whose fields are separated by commas. Parsing it is easy too: split the file into lines (at the newline characters), then split each record into fields. However, suppose we add a comma to the content of a field:

Last name, first name, telephone number, description
Sun, Wukong, 312 555 5656, I like ASP, VB and SQL
Pig, Bajie, 847 555 5656, I’m a film producer
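The simple approach just described (split into lines, then split each line on commas) can be sketched in a few lines of Python; parse_naive is a hypothetical name, and the two inputs are the files shown above.

```python
from typing import List


def parse_naive(text: str) -> List[List[str]]:
    """Split text into lines, then split every non-empty line on commas."""
    return [
        [field.strip() for field in line.split(",")]
        for line in text.splitlines()
        if line.strip()
    ]


SIMPLE = (
    "Last name, first name, telephone number, description\n"
    "Sun, Wukong, 312 555 5656, ASP is very good\n"
    "Pig, Bajie, 847 555 5656, I'm a film producer\n"
)
WITH_COMMA = SIMPLE.replace("ASP is very good", "I like ASP, VB and SQL")

print([len(row) for row in parse_naive(SIMPLE)])      # field count per row
print([len(row) for row in parse_naive(WITH_COMMA)])  # one row gains a field
```

For the first file every row has four fields; for the second, the row with the extra comma comes back with five.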

A problem occurs when parsing the first record: to a parser that recognizes only comma separators, its last field appears to contain two fields. To avoid such problems, fields that contain separators must be enclosed in qualifiers. The single quotation mark is a common qualifier. After adding single-quote qualifiers to the text file above, its contents are as follows:

Last name, first name, telephone number, description
Sun, Wukong, 312 555 5656, 'I like ASP, VB and SQL'
Pig, Bajie, 847 555 5656, 'I am a film producer'

Now we can tell which commas are separators and which are field content: any comma inside the quotation marks is field content. Next we need to implement a regular expression parser that decides when to split fields at a comma and when to treat a comma as field content.

The problem here is slightly different from that faced by most regular expressions. Usually we look at a small portion of the text to see if it matches a regular expression. But here, only after considering the whole line of text can we reliably determine what is within quotation marks.  

The following example illustrates the problem. Suppose we take a random half line from a text file and get: 1, beach, black, 21', dog, cat, duck, '. Because there is more data to the left of the "1", this fragment is extremely difficult to parse. We do not know how many single quotes precede the fragment, so we cannot determine which characters are inside quotation marks (text inside quotation marks must not be split during parsing). If an even number of single quotes (or none) precedes the fragment, then "', dog, cat, duck, '" is a quoted string and cannot be split. If the number of preceding quotes is odd, then "1, beach, black, 21'" is the end of a quoted string and cannot be split.

Therefore, the regular expression must analyze the whole line of text and take into account how many quotation marks appear in order to determine whether a character is inside or outside a pair of quotation marks. The expression is: ,(?=([^']*'[^']*')*(?![^']*')). It first finds a comma, then looks ahead to make sure that the number of single quotation marks after the comma is either even or zero. It relies on the observation that if the number of single quotes after a comma is even, the comma is outside any quoted string. The following table gives a more detailed description:

,          look for a comma
(?=        look ahead to match the following pattern:
(          start a new group
[^']*'     zero or more non-quote characters, followed by a quotation mark
[^']*'     zero or more non-quote characters, followed by a quotation mark; combined with the previous part, this matches a pair of quotation marks
)*         end the group and match it (a quotation mark pair) zero or more times
(?!        look ahead and exclude the following pattern:
[^']*'     zero or more non-quote characters, followed by a quotation mark
)          end the excluded pattern
)          end the lookahead
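The "even number of quotes after the comma" rule can be checked directly by counting, and compared with what the regular expression finds. The Python sketch below does both for one of the sample records; the function names are illustrative.

```python
import re

SEPARATOR_RE = re.compile(r",(?=([^']*'[^']*')*(?![^']*'))")


def separators_by_counting(line: str) -> list:
    """A comma is a separator when an even number of quotes follows it."""
    return [
        index
        for index, char in enumerate(line)
        if char == "," and line.count("'", index + 1) % 2 == 0
    ]


def separators_by_regex(line: str) -> list:
    """Positions of the commas matched by the lookahead pattern."""
    return [match.start() for match in SEPARATOR_RE.finditer(line)]


LINE = "Sun, Wukong, 312 555 5656, 'I like ASP, VB and SQL'"
print(separators_by_counting(LINE))
print(separators_by_regex(LINE))
```

Both functions report the same three positions; the comma inside the quoted description is followed by a single quote, so neither method treats it as a separator.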

The following is a VBScript function, which accepts a string parameter, divides the string according to the comma separator and single quote qualifier in the string, and returns the result array:

  Function SplitAdv(strInput)
    Dim objRE
    Set objRE = New RegExp

    ' set up the RegExp object
    objRE.IgnoreCase = True
    objRE.Global = True
    objRE.Pattern = ",(?=([^']*'[^']*')*(?![^']*'))"

    ' The Replace method swaps each separator comma for Chr(8), the
    ' backspace character (\b), which very rarely appears in a string.
    ' Then we split the string on that character and return the array.
    SplitAdv = Split(objRE.Replace(strInput, Chr(8)), Chr(8))
  End Function
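A rough Python counterpart of this function is shown below. It is a sketch, not a translation of the author's code: Python can split on the pattern directly, so no placeholder character is needed, and the group is written as non-capturing (?:...) because re.split would otherwise insert the captured text into the result. The name split_adv is hypothetical.

```python
import re

# Same logic as the VBScript pattern, with a non-capturing group so that
# re.split returns only the fields.
SEPARATOR_RE = re.compile(r",(?=(?:[^']*'[^']*')*(?![^']*'))")


def split_adv(line: str) -> list:
    """Split on commas that lie outside single-quoted text."""
    return SEPARATOR_RE.split(line)


fields = split_adv("Sun, Wukong, 312 555 5656, 'I like ASP, VB and SQL'")
print(fields)
print([field.strip().strip("'") for field in fields])  # tidy-up step
```

As in the VBScript version, the fields keep their leading spaces and their quotation marks; the last line shows one possible clean-up.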

In short, parsing text data files with regular expressions is efficient and shortens development time: it saves much of the effort of analyzing files and extracting useful data under complex conditions. Even in a fast-changing environment, plenty of legacy data remains in use, so knowing how to build efficient data-parsing routines is a valuable skill.
——————————————————————-
4. String substitution

In the last example, we look at the replace feature of VBScript regular expressions. ASP is often used to dynamically format text obtained from various data sources. With VBScript regular expressions, ASP can dynamically change text that matches complex patterns. Highlighting certain words by wrapping them in HTML tags is a common application, for example highlighting search keywords in search results.

To illustrate the implementation, let's look at an example that highlights every ".NET" in a string. The string can come from anywhere, such as a database or another website.

  <%
  Set regEx = New RegExp
  regEx.Global = True
  regEx.IgnoreCase = True

  ' regular expression pattern:
  ' look for any word or URL ending with ".NET"
  regEx.Pattern = "(\b[a-zA-Z\._]+?\.NET\b)"

  ' string used to test the substitution
  strText = "Microsoft has established a new website www.asp.net."

  ' call the Replace method of the regular expression;
  ' $1 inserts the matched text at that position
  Response.Write regEx.Replace(strText, _
    "<b style='color: #000099; font-size: 18pt'>$1</b>")
  %>

There are several important points to note in this example. The entire regular expression is enclosed in a pair of parentheses, which capture everything that matches for later use; the replacement text refers to the capture as $1. Up to nine such captures can be used per replacement, referenced as $1 through $9. The Replace method of the regular expression object differs from VBScript's own Replace function: it takes only two parameters, the text to search and the replacement text.
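The same capture-and-replace idea can be sketched in Python, where the captured group is written \1 in the replacement text instead of $1. The function name highlight is hypothetical; the pattern, test string, and markup are those of the example above.

```python
import re

# Whole pattern in parentheses so the match is available as group 1.
NET_RE = re.compile(r"(\b[a-zA-Z\._]+?\.NET\b)", re.IGNORECASE)


def highlight(text: str) -> str:
    """Wrap every word or URL ending in .NET in a styled <b> tag."""
    return NET_RE.sub(
        r"<b style='color: #000099; font-size: 18pt'>\1</b>", text
    )


print(highlight("Microsoft has established a new website www.asp.net."))
```

Here sub replaces every match by default, which corresponds to Global = True in the VBScript code.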

In this example, to highlight the ".NET" strings that were found, we wrap them in bold tags with additional style attributes. With this search-and-replace technique, we can easily add keyword highlighting to a website's search feature, or automatically link keywords that appear on a page to other pages.

Conclusion

I hope the regular expression techniques introduced in this article give you some ideas about when and how to apply regular expressions. Although the examples in this article are written in VBScript, regular expressions are also very useful in ASP.NET. They are one of the main mechanisms for form validation by server-side controls, and they are exposed to the entire .NET Framework through the System.Text.RegularExpressions namespace.
