Four applications of regular expressions in web page processing


Regular expressions provide an efficient and convenient way to match string patterns. Almost all high-level languages support regular expressions or offer an off-the-shelf library for them. Using common processing tasks in the ASP environment as examples, this article introduces practical techniques for applying regular expressions.

1、 Verify the format of passwords and email addresses

Our first example demonstrates a basic function of regular expressions: describing arbitrarily complex strings abstractly. Regular expressions give programmers a formal way to describe strings, so that any string pattern an application encounters can be described with very little code. For example, for non-technical people, the password format requirements might be stated as follows: the first character of the password must be a letter, the password must be at least 4 and no more than 15 characters long, and the password must not contain characters other than letters, digits and underscores.

As programmers, we must convert this natural-language description of the password format into a form that an ASP page can understand and apply to reject invalid passwords. The regular expression describing this password format is: ^[a-zA-Z]\w{3,14}$. In an ASP application, we can write the password check as a reusable function, as shown below:

Function TestPassword(strPassword)
    Dim re
    Set re = New RegExp
    ' the match is case-sensitive; the pattern lists both cases itself
    re.IgnoreCase = False
    re.Pattern = "^[a-zA-Z]\w{3,14}$"
    TestPassword = re.Test(strPassword)
End Function

Let’s compare the regular expression for checking the password format with the natural language description:
The first character of the password must be a letter: the regular expression for this is “^[a-zA-Z]”, where “^” represents the beginning of the string and the hyphen tells RegExp to match every character in the specified range.
The password must be at least 4 and no more than 15 characters long: the regular expression for this is “{3,14}”.
The password must not contain characters other than letters, digits and underscores: the regular expression for this is “\w”.

Several notes: {3,14} indicates that the preceding pattern matches at least 3 but no more than 14 characters (with the first character, that makes 4 to 15 characters). Note that the syntax inside the curly braces is extremely strict: no spaces are allowed on either side of the comma. Adding a space changes the meaning of the regular expression and causes password validation to go wrong. Also note the “$” character at the end of the regular expression above. The $ character makes the regular expression match all the way to the end of the string, ensuring that no other characters follow a valid password.

Similar to password format validation, checking the validity of an email address is another common problem. A simple email address check with a regular expression can be implemented as follows:

<%
Dim re
Set re = New RegExp
re.Pattern = "^\w+@[a-zA-Z_]+?\.[a-zA-Z]{2,3}$"
' placeholder address for illustration
Response.Write re.Test("someone@example.com")
%>
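A simple pattern of this kind accepts only a narrow family of addresses. The Python sketch below makes that visible; it assumes the pattern is \w+@[a-zA-Z_]+?\.[a-zA-Z]{2,3} (word characters, “@”, one domain label, and a 2- or 3-letter top-level domain), and both the function name looks_like_email and the sample addresses are hypothetical.

```python
import re

# Assumed pattern: word characters, "@", a single domain label made of
# letters or underscores, a dot, and a 2- or 3-letter top-level domain.
EMAIL_RE = re.compile(r"\w+@[a-zA-Z_]+?\.[a-zA-Z]{2,3}", re.ASCII)


def looks_like_email(address):
    """Return True if the whole string fits the simple email pattern."""
    return EMAIL_RE.fullmatch(address) is not None


# Hypothetical sample addresses, chosen to show what the pattern rejects.
for sample in ("user@example.com",        # accepted
               "user@example",            # no top-level domain
               "first.last@example.com",  # dot in the local part
               "user@mail.example.com",   # more than one domain label
               "user@example.info"):      # top-level domain longer than 3
    print(sample, looks_like_email(sample))
```

Only the first sample passes. The others are valid-looking addresses that this simple check turns away, which is worth keeping in mind before using it on real input.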

2、 Extract specific parts of an HTML page

The main problem in extracting content from an HTML page is finding a way to identify exactly the part of the content we want. For example, here is a snippet of HTML code that displays a news headline:

<table border="0" width="11%"><tr><td width="100%"><p align="center">other contents...</td></tr></table>
<table border="0" width="11%" class="headline"><tr><td width="100%"><p align="center">Iraq war!</td></tr></table>
<table border="0" width="11%"><tr><td width="100%"><p align="center">other contents...</td></tr></table>

Looking at the code above, it is easy to see that the news headline is displayed by the middle table, whose class attribute is set to headline. If the HTML page is very complex, an add-on that Microsoft provides for IE 5.0 and later lets you view the HTML code of only the selected part of the page. For this example, we assume that this is the only table whose class attribute is set to headline. Now we need to create a regular expression that finds the headline table so that we can include the table in our own page. The first step is to write the code that sets up regular expression support:

<%
Dim re, strHTML
Set re = New RegExp     ' create regular expression object
re.IgnoreCase = True
re.Global = False       ' end search after first match
%>

Next, consider the region we want to extract: the whole <table> structure, including its end tag and the text of the news headline. The search should therefore start at the <table> start tag: re.Pattern = "<table.*(?=headline)".

This regular expression matches the start tag of the table and returns everything from the start tag up to “headline” (excluding line breaks). The following code returns the matched HTML:

' put all matching HTML code into the Matches collection
Set Matches = re.Execute(strHTML)
' display all matching HTML code
For Each Item In Matches
    Response.Write Item.Value
Next
' display just one of the matches
Response.Write Matches.Item(0).Value

Running this code against the HTML fragment shown above returns a single match: the text from “<table” up to, but not including, “headline”. Because (?=headline) is a lookahead, “headline” itself is not part of the match. To capture the rest of the table, extend the pattern to "<table.*(?=headline)(.|\n)*?</table>". Here “(.|\n)” matches any character, including a line break, and the “*” after it repeats that zero or more times; the “?” minimizes the range matched by “*”, that is, it matches as few characters as possible before finding the next part of the expression. </table> is the end tag of the table.

The “?” qualifier is very important because it prevents the expression from returning code from other tables. For example, for the HTML fragment given earlier, if you delete this “?”, the returned content will be:

<table border="0" width="11%" class="headline"><tr><td width="100%"><p align="center">Iraq war!</td></tr></table>
<table border="0" width="11%"><tr><td width="100%"><p align="center">other contents...</td></tr></table>

The returned content includes not only the headline table but also the table that follows it (the someotherstore table). It can be seen that the “?” here is essential.
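The difference between the minimal and the greedy form can be reproduced with a short Python sketch. It is an illustration only: the class names somestory and someotherstory and the shortened table markup are hypothetical, and Python's re module is used in place of the VBScript RegExp object (in both, “.” does not match a line break by default).

```python
import re

# Three tables, one per line. The class names of the first and third
# tables are hypothetical; only "headline" matters to the pattern.
lines = [
    '<table class="somestory"><tr><td>other contents...</td></tr></table>',
    '<table class="headline"><tr><td>Iraq war!</td></tr></table>',
    '<table class="someotherstory"><tr><td>other contents...</td></tr></table>',
]
html = "\n".join(lines)

# Minimal form: "*?" stops at the first </table> after "headline".
lazy_re = re.compile(r"<table.*(?=headline)(.|\n)*?</table>", re.IGNORECASE)
# Greedy form: "*" runs on to the last </table> in the page.
greedy_re = re.compile(r"<table.*(?=headline)(.|\n)*</table>", re.IGNORECASE)

lazy = lazy_re.search(html).group(0)
greedy = greedy_re.search(html).group(0)

print(lazy == lines[1])                      # only the headline table
print(greedy == lines[1] + "\n" + lines[2])  # headline table plus the next one
```

Both comparisons come out true: the minimal form returns exactly the headline table, while the greedy form also swallows the table after it.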

This example assumes rather idealized conditions. In practice the situation is often much more complex, and writing the ASP code is particularly difficult when you have no influence over how the source HTML is written. The most effective approach is to spend more time analyzing the HTML near the content to be extracted, and to test frequently to make sure the extracted content is exactly what you need.

In addition, we should detect and handle the case where the regular expression matches nothing in the source HTML page. The content may be updated very quickly. Don't let embarrassing errors appear on your page just because someone else changed the format of their content.
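One way to guard against a failed match is to check the result before using it and fall back to a safe default. The Python sketch below shows the idea; the function name extract_headline and the fallback text are assumptions made for illustration. In VBScript, the corresponding check would be to test the Count property of the collection returned by Execute before reading Item(0).

```python
import re

HEADLINE_RE = re.compile(r"<table.*(?=headline)(.|\n)*?</table>", re.IGNORECASE)


def extract_headline(html, fallback=""):
    """Return the headline table, or the fallback text if nothing matches."""
    match = HEADLINE_RE.search(html)
    if match is None:
        # The source page changed its format (or is empty): do not fail,
        # return something harmless for the caller to display instead.
        return fallback
    return match.group(0)


page = '<table class="headline"><tr><td>News</td></tr></table>'
print(extract_headline(page))
print(extract_headline("<p>layout changed</p>",
                       fallback="<!-- headline unavailable -->"))
```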

3、 Parsing text data files
There are many formats and types of data files. XML documents, structured text and even unstructured text often serve as data sources for ASP applications. The example below is a structured text file that uses qualifiers. A qualifier (such as a quotation mark) marks a string as indivisible, even if the string contains the separator that splits records into fields. Here is a simple structured text file:

Last name, first name, telephone, description
sun, Wukong, 312 555 5656, ASP very good
pig, Bajie, 847 555 5656, I am a film producer

This file is very simple. Its first line is the header, and the next two lines are comma-separated records. Parsing it is also very simple: split the file into lines (at the newline characters), then split each record into fields. However, suppose we add a comma to the content of a field:

Last name, first name, telephone, description
sun, Wukong, 312 555 5656, I like ASP, VB and SQL
pig, Bajie, 847 555 5656, I am a film producer

Parsing the first record now goes wrong: to a parser that recognizes only comma separators, the last field looks like two separate fields. To avoid such problems, fields that contain separators must be enclosed in qualifiers. Single quotation marks are a common qualifier. After adding single-quote qualifiers to the text file above, its contents are as follows:

Last name, first name, telephone, description
sun, Wukong, 312 555 5656, 'I like ASP, VB and SQL'
pig, Bajie, 847 555 5656, 'I am a film producer'

Now we can tell which commas are separators and which are field content: a comma inside quotation marks is field content. What we need next is a regular expression parser that decides when to split fields at a comma and when to treat the comma as field content.

The problem here differs slightly from the ones regular expressions usually face. Usually we look at a small portion of text to see whether it matches a regular expression. Here, however, we can reliably determine what lies inside quotation marks only after considering the whole line of text.

An example illustrates the problem. Take half a line at random from a text file: 1, beach, black, 21', dog, cat, duck,'. Because other data lies to the left of the “1”, this fragment is extremely hard to analyze. We do not know how many single quotation marks precede the fragment, so we cannot tell which characters are inside quotation marks (text inside quotation marks must not be split during parsing). If the number of single quotation marks before the fragment is even (or zero), then “', dog, cat, duck,'” is a quoted string and cannot be split. If the number is odd, then “1, beach, black, 21'” is the end of a quoted string and cannot be split.

Therefore, the regular expression must analyze the whole line and count the quotation marks to determine whether a character is inside or outside a pair of quotation marks: ,(?=([^']*'[^']*')*(?![^']*')). The regular expression first finds a comma, then looks ahead to make sure that the number of single quotation marks after the comma is even or zero. It relies on this rule: if the number of single quotation marks after a comma is even, the comma is outside any quoted string. The following table gives a more detailed description:

,            Find a comma
(?=          Look ahead to match the following pattern:
(            Start a new group
[^']*'       Zero or more non-quote characters, followed by a quotation mark
[^']*'       Zero or more non-quote characters, followed by a quotation mark; together with the previous item, this matches a pair of quotation marks
)*           End the group and match the whole group (a quote pair) zero or more times
(?!          Look ahead to exclude the following pattern:
[^']*'       Zero or more non-quote characters, followed by a quotation mark
))           End the excluding lookahead, then end the outer lookahead

The following is a VBScript function, which accepts a string parameter, divides the string according to the comma separator and single quote qualifier in the string, and returns the result array:

Function SplitAdv(strInput)
    Dim objRE
    Set objRE = New RegExp
    ' set up the RegExp object
    objRE.IgnoreCase = True
    objRE.Global = True
    objRE.Pattern = ",(?=([^']*'[^']*')*(?![^']*'))"
    ' Replace each separator comma with Chr(8) (the backspace character, \b),
    ' which is very unlikely to occur in the string,
    ' then split the string into an array at Chr(8).
    SplitAdv = Split(objRE.Replace(strInput, Chr(8)), Chr(8))
End Function
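The same lookahead can be tried in Python, where re.split removes the need for a placeholder character. This is a sketch under stated assumptions rather than a port of the function above: the name split_adv is hypothetical, the inner group is made non-capturing so that re.split does not return it, and surrounding whitespace is stripped from each field, which the VBScript version does not do.

```python
import re

# A comma counts as a separator only if an even number (or zero) of single
# quotation marks follows it on the line. (?: ... ) is non-capturing, so
# re.split returns only the fields.
SEPARATOR_RE = re.compile(r",(?=(?:[^']*'[^']*')*(?![^']*'))")


def split_adv(line):
    """Split one record at the commas that lie outside single quotes."""
    return [field.strip() for field in SEPARATOR_RE.split(line)]


record = "sun, Wukong, 312 555 5656, 'I like ASP, VB and SQL'"
fields = split_adv(record)
print(len(fields))  # 4
print(fields[3])    # 'I like ASP, VB and SQL'  (qualifiers are kept)

# An apostrophe inside a field makes the quote count odd for every comma
# before it, so none of those commas is treated as a separator.
broken = split_adv("pig, Bajie, 847 555 5656, 'I'm a film producer'")
print(len(broken))  # 1
```

The last call shows a limit of the counting approach: it assumes the qualifier character never appears inside field content, so data containing apostrophes would need a different qualifier or an escaping rule.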

In short, parsing text data files with regular expressions is efficient and shortens development time, saving much of the effort of analyzing files and extracting useful data under complex conditions. Even in a rapidly changing environment, a great deal of legacy data remains in use, so knowing how to build efficient data parsing routines is a valuable skill.

4、 String substitution

In the last example, we look at the replacement function of VBScript regular expressions. ASP is often used to dynamically format text obtained from various data sources. With the power of VBScript regular expressions, ASP can find complex text and change it on the fly. A common application is highlighting certain words by wrapping them in HTML tags, for example highlighting the search keywords in search results.
To illustrate the implementation, let us look at an example that highlights every “.NET” in a string. The string can come from anywhere, such as a database or another web site.

<%
Set regEx = New RegExp
regEx.Global = True
regEx.IgnoreCase = True
' regular expression pattern:
' look for any word or URL ending with ".NET"
regEx.Pattern = "(\b[a-zA-Z\.]+?\.NET\b)"
' string used to test the replacement function
strText = "Microsoft has established a new website, www.ASP.NET"
' call the Replace method of the regular expression;
' $1 inserts the matched text at that position
Response.Write regEx.Replace(strText, "<b style='color: #000099; font-size: 18pt'>$1</b>")
%>

There are several important points to note in this example. The entire regular expression is enclosed in a pair of parentheses. Their purpose is to capture the matched content for later use; the replacement text refers to it as $1. Up to nine such captures can be used per replacement, referenced as $1 to $9. The Replace method of the regular expression differs from VBScript's own Replace function: it needs only two parameters, the text to search and the replacement text.
In this example, to highlight the “.NET” strings that were found, we surround them with bold tags and other style attributes. Using this search-and-replace technique, we can easily add keyword highlighting to a site search program, or automatically turn keywords that appear on a page into links to other pages.
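The capture-and-reuse idea carries over to other regular expression engines with small syntax differences. The Python sketch below is an illustration only: the function name highlight and the sample sentences are assumptions, and Python writes the reference to the first capture as \1 where VBScript uses $1.

```python
import re

# The whole pattern sits in one pair of parentheses, so group 1 holds the
# complete word or URL ending in ".NET" (matched case-insensitively).
NET_RE = re.compile(r"(\b[a-zA-Z.]+?\.NET\b)", re.IGNORECASE)

# \1 in the replacement text inserts whatever group 1 captured.
REPLACEMENT = r"<b style='color: #000099; font-size: 18pt'>\1</b>"


def highlight(text):
    """Wrap every word or URL ending in ".NET" in a styled <b> tag."""
    return NET_RE.sub(REPLACEMENT, text)


# Hypothetical sample text.
print(highlight("Microsoft has established a new website, www.ASP.NET"))
```

Text that contains no “.NET” word is returned unchanged, so the function can be applied to any string without a prior check.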


I hope the regular expression techniques introduced in this article give you ideas about when and how to apply regular expressions. Although the examples in this article are written in VBScript, regular expressions are also very useful in ASP.NET, where they are one of the main mechanisms for form validation in server-side controls, and they are exposed to the entire .NET Framework through the System.Text.RegularExpressions namespace.
