Basic Usage of Regular Expressions in Python 3

Time:2019-11-19

Regular expressions

This section covers the basic usage of regular expressions. A regular expression is a powerful tool for string processing with its own syntax. With it, string searching, replacement and validation become straightforward.

For crawlers in particular, regular expressions make it very convenient to extract the information we want from HTML.

An introductory example

All this may still sound abstract, so let's get a feel for regular expressions through a few examples.

Open the regular expression test tool provided by OSChina at http://tool.oschina.net/regex/. There we can enter the text to be matched, select one of the common regular expressions, and see the corresponding matches from our text.

For example, enter the following text to be matched:

Hello, my phone number is 010-86432100 and email is [email protected], and my website is http://cuiqingcai.com.

This string contains a phone number, an email address and a URL. Next, we try to extract them with regular expressions.

If we choose the pattern for email addresses in the tool, the email address in the text is shown in the results. If we choose the pattern for URLs, the URL in the text is shown. Isn't it amazing?

What happens here is regular expression matching: text that follows certain rules is extracted. For example, an email address starts with a string, followed by an @ symbol and then a domain name, so it has a specific format. Likewise, a URL starts with a protocol type, followed by a colon and a double slash, and then a domain name and a path.

For URLs, we can match them with the following regular expression:

[a-zA-Z]+://[^\s]*

If we match a string against this regular expression and the string contains URL-like text, that text is extracted.

This regular expression may look like a mess, but it follows specific syntax rules. For example, a-z matches any lowercase letter, \s matches any whitespace character, and * matches the preceding item any number of times. The whole expression is a combination of such rules, and together they implement a specific matching function.

Once the regular expression is written, we can run it against a long string. Whatever the string contains, every part that conforms to our rules can be found. So if we want to find the URLs in the source code of a web page, we can match the source against the URL regular expression and get them.
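
As a rough sketch of the same idea in Python, a URL pattern of this kind can be applied with the re module. The sample sentence and its placeholder email address are assumptions made for illustration, not the exact text used in the online tool.

```python
import re

# Hypothetical sample text; the email address is only a placeholder.
text = 'My email is user@example.com and my website is http://cuiqingcai.com today.'

# Letters, then ://, then any run of non-whitespace characters.
url_pattern = r'[a-zA-Z]+://[^\s]*'

urls = re.findall(url_pattern, text)
print(urls)  # ['http://cuiqingcai.com']
```

Because [^\s]* keeps going until the next whitespace character, punctuation directly after a URL would be captured as well, which is why the sample sentence puts a space after the address.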

We have mentioned a few matching rules above. How many rules are there in total? The common ones are summarized here:

Pattern  Description

\w  Matches a letter, digit or underscore

\W  Matches any character that is not a letter, digit or underscore

\s  Matches any whitespace character, equivalent to [ \t\n\r\f\v]

\S  Matches any non-whitespace character

\d  Matches any digit, equivalent to [0-9]

\D  Matches any non-digit character

\A  Matches the start of the string

\Z  Matches the end of the string. In Python's re module it matches only at the very end of the string; in some other regex flavors it also matches just before a trailing line break

\z  Matches the very end of the string (used in other regex flavors; Python's re module has traditionally used \Z for this)

\G  Matches the position where the last match finished (used in other regex flavors; not supported by Python's re module)

\n  Matches a line break

\t  Matches a tab

^  Matches the start of the string

$  Matches the end of the string, or just before a line break at the end of the string

.  Matches any character except a line break. When the re.DOTALL flag is specified, it matches any character including a line break

[...]  Represents a set of characters, listed individually: [amk] matches 'a', 'm' or 'k'

[^...]  Matches characters not in the set: [^abc] matches any character other than a, b and c

*  Matches 0 or more of the preceding expression

+  Matches 1 or more of the preceding expression

?  Matches 0 or 1 of the preceding expression; placed after another quantifier (as in *? or +?), it makes that quantifier non-greedy

{n}  Matches exactly n of the preceding expression

{n,m}  Matches the preceding expression n to m times, greedily

a|b  Matches a or b

( )  Matches the expression inside the parentheses and also defines a group

This may be a little dizzying at first. Don't worry: below we explain the common rules in detail and show how to use them to extract the information we want from web pages.
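
To make a few of these rules concrete, here is a small sketch that applies them with re.findall(). The sample string is made up for illustration.

```python
import re

# Hypothetical sample string chosen to exercise several rules at once.
sample = 'id_7 costs 42 USD'

single_digits = re.findall(r'\d', sample)     # one digit at a time
digit_runs = re.findall(r'\d+', sample)       # one or more digits
words = re.findall(r'\w+', sample)            # letters, digits and underscores
spaces = re.findall(r'\s', sample)            # whitespace characters
three_caps = re.findall(r'[A-Z]{3}', sample)  # exactly three capital letters
either = re.findall(r'costs|USD', sample)     # one alternative or the other

print(single_digits)  # ['7', '4', '2']
print(digit_runs)     # ['7', '42']
print(words)          # ['id_7', 'costs', '42', 'USD']
print(len(spaces))    # 3
print(three_caps)     # ['USD']
print(either)         # ['costs', 'USD']
```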

Usage in Python

Regular expressions are not unique to Python; they are available in other programming languages as well. Python's re library provides a full regular expression implementation, and it is what we almost always use to work with regular expressions in Python.

Let’s take a look at its usage.

match()

First, we introduce a commonly used matching method, match(). We pass it the regular expression and the string to be matched, and it checks whether the regular expression matches the string.

The match() method tries to match the regular expression from the beginning of the string. If the match succeeds, it returns the match result; otherwise, it returns None.

Let's look at an example:


import re
content = 'Hello 123 4567 World_This is a Regex Demo'
print(len(content))
# A raw string (r'...') passes backslashes such as \s and \d to the regex engine unchanged
result = re.match(r'^Hello\s\d\d\d\s\d{4}\s\w{10}', content)
print(result)
print(result.group())
print(result.span())

Output:

41
<_sre.SRE_Match object; span=(0, 25), match='Hello 123 4567 World_This'>
Hello 123 4567 World_This 
(0, 25)

Here we first declare a string containing English letters, whitespace characters, digits and so on. Then we write the regular expression ^Hello\s\d\d\d\s\d{4}\s\w{10} to match this long string.

The ^ at the start matches the beginning of the string, so the match must begin with Hello. Then \s matches a whitespace character, here the space in the target string. \d matches a digit, so three \d match 123. Another \s matches the next space. For 4567 we could write four \d, but that is cumbersome, so we write \d followed by {4}, which repeats the preceding item four times and thus matches four digits. After one more whitespace character, \w{10} matches 10 letters, digits or underscores. Note that the regular expression does not cover the whole target string. The match still succeeds; the matched result is simply shorter than the string.

We call the match() method with the regular expression as the first argument and the string to be matched as the second.

Printing the result shows an SRE_Match object, which proves the match succeeded. The object has two useful methods. group() returns the matched content, here Hello 123 4567 World_This, which is exactly what our regular expression describes. span() returns the matched range, here (0, 25), which is the position range of the matched text within the original string.

From this example we get a basic idea of how to match text with regular expressions in Python.
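
Since match() returns None when nothing matches, it is usually safer to check the result before calling group(). A minimal sketch follows; the helper name leading_greeting is hypothetical and only serves the illustration.

```python
import re

def leading_greeting(text):
    """Return the text matched at the start of `text`, or None if there is no match."""
    result = re.match(r'Hello\s\d+', text)
    if result is None:
        return None
    return result.group()

matched = leading_greeting('Hello 123 4567 World_This')
missed = leading_greeting('Say Hello 123')
print(matched)  # Hello 123
print(missed)   # None
```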

Matching target

We just used match() to get the matched string, but what if we only want to extract part of it? For example, as in the first example, we may want to extract an email address or phone number from a piece of text.

Here we can enclose the substring we want to extract in parentheses (). The parentheses mark the start and end of a subexpression, and each marked subexpression corresponds to a group, numbered in order. We can then call group() with the index of a group to get the extracted result.

Here is an example:


import re
content = 'Hello 1234567 World_This is a Regex Demo'
# The parentheses around \d+ define group 1
result = re.match(r'^Hello\s(\d+)\sWorld', content)
print(result)
print(result.group())
print(result.group(1))
print(result.span())

We use a string similar to the previous one and want to extract 1234567 from it. We enclose the digit part of the regular expression in () and then call group(1) to get the matched result.

The output is as follows:

<_sre.SRE_Match object; span=(0, 19), match='Hello 1234567 World'>
Hello 1234567 World 
1234567 
(0, 19)

We can see that 1234567 was extracted successfully using group(1). Unlike group(), which returns the complete match, group(1) returns the part matched by the first pair of parentheses. If the regular expression contains further parenthesized groups, we can get them with group(2), group(3) and so on.
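
As a sketch of how several groups work together, the pattern below adds a second group to the example. The second group is an addition made here for illustration and is not part of the original example.

```python
import re

content = 'Hello 1234567 World_This is a Regex Demo'
# Group 1 captures the digits, group 2 the word that follows them.
result = re.match(r'^Hello\s(\d+)\s(\w+)', content)

number = result.group(1)
word = result.group(2)
both = result.groups()       # all groups at once, as a tuple
word_span = result.span(2)   # position range of group 2 only

print(number)     # 1234567
print(word)       # World_This
print(both)       # ('1234567', 'World_This')
print(word_span)  # (14, 24)
```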

Universal matching

The regular expression we just wrote is rather laborious: for every whitespace character we write \s, and for every digit we write \d. This is usually unnecessary. There is a universal pattern, .* (dot star). The . matches any character except a line break, and * repeats the preceding item any number of times, so together they match any sequence of characters. With it, we no longer need to match character by character.

So we can rewrite the regular expression from the example above:


import re
content = 'Hello 123 4567 World_This is a Regex Demo'
result = re.match(r'^Hello.*Demo$', content)
print(result)
print(result.group())
print(result.span())

Here we omit the middle part entirely, replacing it with .*, and end the pattern with the closing text of the string. The output is as follows:

<_sre.SRE_Match object; span=(0, 41), match='Hello 123 4567 World_This is a Regex Demo'>
Hello 123 4567 World_This is a Regex Demo 
(0, 41)

We can see that group() returns the whole string, which means our regular expression matches the entire target string. span() returns (0, 41), which covers the whole string.

So we can use .* to simplify regular expressions.

Greedy and non-greedy matching

When using the universal pattern .* above, we sometimes do not get the result we want. Look at the following example:


import re
content = 'Hello 1234567 World_This is a Regex Demo'
result = re.match(r'^He.*(\d+).*Demo$', content)
print(result)
print(result.group(1))

Here we still want to get the number in the middle, so we still write (\d+) there. Since the content on both sides of the number is messy, we replace it with .*, giving ^He.*(\d+).*Demo$. It looks fine. Let's see the output:


<_sre.SRE_Match object; span=(0, 40), match='Hello 1234567 World_This is a Regex Demo'>

7

Something strange happened: we only got the digit 7. Why?

This is where greedy and non-greedy matching come in. In greedy mode, .* matches as many characters as possible. In our regular expression, .* is followed by \d+, which needs at least one digit but no specific number of digits. So .* grabs as much as it can, including 123456, and leaves just the single digit 7 to satisfy \d+. That is why \d+ captures only 7.

This is clearly inconvenient, since part of the content we want can go missing without an obvious reason. What we need here is non-greedy matching, which is written .*?, with one extra ?. What does it do? Let's look at another example:


import re
content = 'Hello 1234567 World_This is a Regex Demo'
result = re.match(r'^He.*?(\d+).*Demo$', content)
print(result)
print(result.group(1))

Here we only change the first .* to .*?, turning it into a non-greedy match. The output is as follows:


<_sre.SRE_Match object; span=(0, 40), match='Hello 1234567 World_This is a Regex Demo'>
1234567

Now we get 1234567 successfully. The reason is that greedy matching takes as many characters as possible, while non-greedy matching takes as few as possible. When .*? reaches the whitespace character after Hello, the next character is a digit, which \d+ can match. So .*? stops there and leaves the following digits to \d+. Because .*? matched as few characters as possible, \d+ captures 1234567.

So when matching, prefer non-greedy matching in the middle of a string, that is, use .*? instead of .*, to avoid losing part of the result.

Note, however, that if the part to be matched is at the end of the string, .*? may match nothing at all, because it matches as few characters as possible. For example:


import re
content = 'http://weibo.com/comment/kEraCN'
result1 = re.match(r'http.*?comment/(.*?)', content)
result2 = re.match(r'http.*?comment/(.*)', content)
print('result1', result1.group(1))
print('result2', result2.group(1))

Output:

result1
result2 kEraCN

We can see that .*? matched nothing, while .* matched as much as possible and captured the result successfully.

Understanding how greedy and non-greedy matching work will be very helpful when writing regular expressions later.
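
The following sketch puts the greedy and non-greedy variants side by side. The last pattern, which anchors a trailing .*? with $, is an addition for illustration rather than something taken from the examples above: the anchor forces the non-greedy group to extend to the end of the string.

```python
import re

content = 'Hello 1234567 World_This is a Regex Demo'
greedy = re.match(r'^He.*(\d+).*Demo$', content).group(1)
lazy = re.match(r'^He.*?(\d+).*Demo$', content).group(1)

url = 'http://weibo.com/comment/kEraCN'
# A trailing non-greedy group is satisfied by matching nothing at all...
lazy_tail = re.match(r'http.*?comment/(.*?)', url).group(1)
# ...unless something after it, such as $, forces it to keep going.
anchored_tail = re.match(r'http.*?comment/(.*?)$', url).group(1)

print(greedy)          # 7
print(lazy)            # 1234567
print(repr(lazy_tail)) # ''
print(anchored_tail)   # kEraCN
```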

Modifiers

A regular expression can take optional flags, called modifiers, that control how matching is performed.

Let’s start with an example:


import re
content = '''Hello 1234567 World_This
is a Regex Demo
'''
result = re.match(r'^He.*?(\d+).*?Demo$', content)
print(result.group(1))

As in the example above, except that we add a line break to the string. The regular expression is unchanged and is meant to extract the number from the string. Let's see the output:


AttributeError Traceback (most recent call last)
<ipython-input-18-c7d232b39645> in <module>()
   5 '''
   6 result = re.match(r'^He.*?(\d+).*?Demo$', content)
----> 7 print(result.group(1))
AttributeError: 'NoneType' object has no attribute 'group'

In other words, the regular expression did not match the string, so match() returned None, and calling group() on None raised AttributeError.

Why does adding a line break make the match fail? Because . matches any character except a line break. When .*? reaches the line break it cannot continue, so the match fails.

To fix this, we only need to add the modifier re.S.


result = re.match(r'^He.*?(\d+).*?Demo$', content, re.S)

We pass re.S as the third argument of match(). It makes . match every character, including line breaks.

Output:

1234567

re.S is used very often when matching web pages, because HTML nodes are frequently separated by line breaks. With it, we can match across the line breaks between nodes.

Some other modifiers are available when needed:

Modifier  Description

re.I  Makes matching case-insensitive

re.L  Makes matching locale-aware

re.M  Multi-line matching, affecting ^ and $

re.S  Makes . match every character, including line breaks

re.U  Interprets characters according to the Unicode character set. This flag affects \w, \W, \b and \B

re.X  Allows a more flexible layout, so that regular expressions can be written in a more readable way

re.S and re.I are the ones most commonly used when matching web pages.
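
The sketch below illustrates what some of these modifiers do, including how to combine two of them with the | operator. The sample text and the date pattern are made up for this illustration.

```python
import re

# Hypothetical two-line sample text.
text = 'first line\nSecond LINE'

without_m = re.findall(r'^\w+', text)        # ^ matches only at the start of the string
with_m = re.findall(r'^\w+', text, re.M)     # ^ also matches at the start of each line
ignore_case = re.findall(r'line', text, re.I)

# Flags are combined with |: here . crosses the line break and case is ignored.
combined = re.search(r'FIRST.*line$', text, re.S | re.I)

# re.X lets a pattern be spread over several lines with comments.
verbose_date = re.compile(r'''
    (\d{4})   # year
    -
    (\d{2})   # month
    -
    (\d{2})   # day
''', re.X)
date_parts = verbose_date.match('2016-12-15').groups()

print(without_m)        # ['first']
print(with_m)           # ['first', 'Second']
print(ignore_case)      # ['line', 'LINE']
print(combined.span())  # (0, 22)
print(date_parts)       # ('2016', '12', '15')
```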

Escape matching

We know that regular expressions define many special patterns. For example, . matches any character except a line break. But what if the target string itself contains a ., and we want to match it literally?

This is where escaping comes in. Let's look at an example:

import re
content = '(Baidu)www.baidu.com'
result = re.match(r'\(Baidu\)www\.baidu\.com', content)
print(result)

When a character that has a special meaning in regular expressions must be matched literally, we escape it with a backslash. For example, \. matches a literal dot. The output is:

<_sre.SRE_Match object; span=(0, 20), match='(Baidu)www.baidu.com'>

You can see that the original string was successfully matched.

These are the common building blocks for writing regular expressions. Mastering them will be very helpful when writing patterns later.
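
When a whole string should be matched literally, the backslashes do not have to be written by hand: the standard library function re.escape() adds them. This is an aside that goes slightly beyond the example above, shown as a minimal sketch with made-up variable names.

```python
import re

literal = '(Baidu)www.baidu.com'
escaped = re.escape(literal)  # special characters such as ( ) and . are escaped

exact = re.fullmatch(escaped, literal)
# With the dots escaped, a look-alike string no longer matches.
lookalike = re.fullmatch(escaped, '(Baidu)wwwXbaiduXcom')
# An unescaped dot matches any character, so this one does match.
unescaped_dot = re.fullmatch(r'www.baidu.com', 'wwwXbaiduXcom')

print(exact is not None)          # True
print(lookalike is None)          # True
print(unescaped_dot is not None)  # True
```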

search()

We mentioned earlier that match() starts matching from the beginning of the string. If the beginning does not match, the whole match fails.

Let’s look at the following example:


import re
content = 'Extra stings Hello 1234567 World_This is a Regex Demo Extra stings'
result = re.match(r'Hello.*?(\d+).*?Demo', content)
print(result)

Here the string starts with Extra, but the regular expression starts with Hello. The text described by the regular expression is part of the string, yet the match fails, because with match() the match cannot succeed unless it starts at the first character. The output is as follows:

None

So when using match() we have to take the beginning of the string into account, which is not always convenient. It is better suited to checking whether a string conforms to the rules of a regular expression.

There is another method, search(), which scans the whole string and returns the first successful match. In other words, the regular expression may describe only part of the string. search() scans the string from left to right until it finds the first match and returns it. If it reaches the end without finding one, it returns None.

Let's change match() in the code above to search() and look at the output:


<_sre.SRE_Match object; span=(13, 53), match='Hello 1234567 World_This is a Regex Demo'> 

This time we get a match.

So, for convenience, we can use search() whenever possible.
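
The difference between the two methods can be seen by running the same pattern through both. This is a minimal sketch that reuses the string from the example above.

```python
import re

content = 'Extra stings Hello 1234567 World_This is a Regex Demo Extra stings'
pattern = r'Hello.*?(\d+).*?Demo'

from_start = re.match(pattern, content)   # must match at position 0
anywhere = re.search(pattern, content)    # may match at any position

print(from_start)         # None
print(anywhere.group(1))  # 1234567
print(anywhere.span())    # (13, 53)
```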

Let's go through a few more examples of search().

First, here is a piece of HTML text to be matched. We then write a few regular expressions to extract information from it.

html = '''<div>
  <h2>Classic old songs</h2>
  <p>
    List of classic songs
  </p>
  <ul>
    <li data-view="2">With You All the Way</li>
    <li data-view="7">
      <a href="/2.mp3" singer="Ren Xianqi">A Smile from the Sea</a>
    </li>
    <li data-view="4" class="active">
      <a href="/3.mp3" singer="Qi Qin">Past Events Follow the Wind</a>
    </li>
    <li data-view="6"><a href="/4.mp3" singer="beyond">Glorious Years</a></li>
    <li data-view="5"><a href="/5.mp3" singer="Chen Huilin">Notepad</a></li>
    <li data-view="5">
      <a href="/6.mp3" singer="Teresa Teng">May We All Be Blessed with Longevity</a>
    </li>
  </ul>
</div>'''

We can see that the <ul> node contains many <li> nodes. Some of them contain an <a> node and some do not, and the <a> nodes carry attributes such as the hyperlink and the singer's name.

First, we try to extract the singer's name and the song title from the hyperlink inside the <li> node whose class is active.

That is, we need the singer attribute and the text of the <a> node under the third <li> node.

The regular expression can start with <li and then look for the marker active, with the part in between matched by .*?. Next we need the value of the singer attribute, so we write singer="(.*?)". The part to extract is enclosed in parentheses so that we can retrieve it with group(), and it is delimited by double quotation marks on both sides. Then we need the text of the <a> node, whose left boundary is > and whose right boundary is </a>. The target content between them is again matched with (.*?). The final regular expression is therefore <li.*?active.*?singer="(.*?)">(.*?)</a>. We then call search(), which scans the entire HTML text and returns the first content that matches the regular expression.

In addition, because the HTML contains line breaks, we need to pass re.S as the third argument.

The complete matching code is as follows:


result = re.search(r'<li.*?active.*?singer="(.*?)">(.*?)</a>', html, re.S)
if result:
    print(result.group(1), result.group(2))

Since the singer's name and song title are enclosed in parentheses, we can get them with group(), whose argument is the sequence number of the group.

Output:

Qi Qin Past Events Follow the Wind

We can see that this is exactly the singer's name and song title from the hyperlink inside the <li> node whose class is active.

What if the regular expression does not contain active, so that we no longer require the node's class to be active? We remove active from the regular expression and rewrite the code as follows:


result = re.search(r'<li.*?singer="(.*?)">(.*?)</a>', html, re.S)
if result:
    print(result.group(1), result.group(2))

Since search() returns the first match, the result changes.

The output is as follows:

Ren Xianqi A Smile from the Sea

With active removed, the search starts from the beginning of the string and the first node that qualifies is the second <li> node. Later nodes are not matched, so the output naturally becomes the content of the second <li> node.

Note that in both matches above we passed re.S as the third argument of search(), so that .*? can match line breaks and <li> nodes that span several lines can be matched. What happens if we remove it?


result = re.search(r'<li.*?singer="(.*?)">(.*?)</a>', html)
if result:
    print(result.group(1), result.group(2))

Output:

beyond Glorious Years

The result is now the content of the fourth <li> node. The second and third <li> nodes contain line breaks, and without re.S, .*? can no longer match a line break, so the regular expression cannot match those two nodes. The fourth <li> node contains no line break, so it matches.

Because most HTML text contains line breaks, these examples show that we should add the re.S modifier whenever possible to avoid failed matches.

findall()

We have seen that search() returns the first content that matches the regular expression. What if we want all the content that matches? That is what findall() is for.

findall() searches the entire string and returns everything that matches the regular expression.

To get the hyperlinks, singers and song titles of all <a> nodes, we can replace search() with findall(). The result, if any, is a list, so we iterate over it to get each group of content in turn.


results = re.findall(r'<li.*?href="(.*?)".*?singer="(.*?)">(.*?)</a>', html, re.S)
print(results)
print(type(results))
for result in results:
    print(result)
    print(result[0], result[1], result[2])

Output:

[('/2.mp3', 'Ren Xianqi', 'A Smile from the Sea'), ('/3.mp3', 'Qi Qin', 'Past Events Follow the Wind'), ('/4.mp3', 'beyond', 'Glorious Years'), ('/5.mp3', 'Chen Huilin', 'Notepad'), ('/6.mp3', 'Teresa Teng', 'May We All Be Blessed with Longevity')]
<class 'list'>
('/2.mp3', 'Ren Xianqi', 'A Smile from the Sea')
/2.mp3 Ren Xianqi A Smile from the Sea
('/3.mp3', 'Qi Qin', 'Past Events Follow the Wind')
/3.mp3 Qi Qin Past Events Follow the Wind
('/4.mp3', 'beyond', 'Glorious Years')
/4.mp3 beyond Glorious Years
('/5.mp3', 'Chen Huilin', 'Notepad')
/5.mp3 Chen Huilin Notepad
('/6.mp3', 'Teresa Teng', 'May We All Be Blessed with Longevity')
/6.mp3 Teresa Teng May We All Be Blessed with Longevity

As we can see, each element of the returned list is a tuple, and we can take its items out by index.

So if we only need the first match, we can use search(); when we need to extract multiple matches, we can use findall().
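
The shape of the list returned by findall() depends on how many groups the pattern contains. The sketch below shows the three cases on a made-up string.

```python
import re

# Hypothetical sample text.
text = 'a=1, b=22, c=333'

no_group = re.findall(r'\d+', text)           # whole matches, as strings
one_group = re.findall(r'(\w)=\d+', text)     # only the group, as strings
two_groups = re.findall(r'(\w)=(\d+)', text)  # one tuple per match

print(no_group)    # ['1', '22', '333']
print(one_group)   # ['a', 'b', 'c']
print(two_groups)  # [('a', '1'), ('b', '22'), ('c', '333')]
```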

sub()

Besides extracting information, we sometimes need regular expressions to modify text. For example, suppose we want to remove all the digits from a string. Doing this with the string's replace() method alone would be too cumbersome. Here we can use sub().

Let's look at an example:


import re 
content = '54aK54yr5oiR54ix5L2g' 
content = re.sub(r'\d+', '', content)
print(content)

Output:

aKyroiRixLg

Here we pass \d+ as the first argument to match all the digits. The second argument is the replacement string; to remove the matches, we pass an empty string. The third argument is the original string.

The return value is the string after replacement.
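
The replacement does not have to be an empty string, and the optional count argument of re.sub() limits how many matches are replaced. A small sketch follows; the '#' replacement is chosen only for illustration.

```python
import re

content = '54aK54yr5oiR54ix5L2g'

removed = re.sub(r'\d+', '', content)              # delete every run of digits
masked = re.sub(r'\d+', '#', content)              # replace every run of digits
first_only = re.sub(r'\d+', '#', content, count=1) # replace only the first run

print(removed)     # aKyroiRixLg
print(masked)      # #aK#yr#oiR#ix#L#g
print(first_only)  # #aK54yr5oiR54ix5L2g
```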

In the HTML text above, if we want to get the song titles of all <li> nodes, extracting them directly with a regular expression can be tedious. For example, it could be written like this:


results = re.findall(r'<li.*?>\s*?(<a.*?>)?([^<>]+?)(</a>)?\s*?</li>', html, re.S)
for result in results:
    print(result[1])

Output:

With You All the Way
A Smile from the Sea
Past Events Follow the Wind
Glorious Years
Notepad
May We All Be Blessed with Longevity

But it becomes easier with sub(): we first use sub() to remove the <a> nodes, leaving only the text, and then extract it with findall().


html = re.sub(r'<a.*?>|</a>', '', html)
print(html)
results = re.findall(r'<li.*?>(.*?)</li>', html, re.S)
for result in results:
    print(result.strip())

Output:

<div>
  <h2>Classic old songs</h2>
  <p>
    List of classic songs
  </p>
  <ul>
    <li data-view="2">With You All the Way</li>
    <li data-view="7">
      A Smile from the Sea
    </li>
    <li data-view="4" class="active">
      Past Events Follow the Wind
    </li>
    <li data-view="6">Glorious Years</li>
    <li data-view="5">Notepad</li>
    <li data-view="5">
      May We All Be Blessed with Longevity
    </li>
  </ul>
</div>

With You All the Way
A Smile from the Sea
Past Events Follow the Wind
Glorious Years
Notepad
May We All Be Blessed with Longevity

We can see that the <a> tags are all gone after sub(), and the titles can then be extracted directly with findall(). So, where appropriate, a little preprocessing with sub() gets twice the result with half the effort.

compile()

The methods above all work directly on strings. Finally, we introduce compile(), which compiles a regular expression string into a regular expression object so that it can be reused in later matching.


import re
content1 = '2016-12-15 12:00'
content2 = '2016-12-17 12:55'
content3 = '2016-12-22 13:21'
# Compile once, then reuse the same pattern object for every string
pattern = re.compile(r'\d{2}:\d{2}')
result1 = re.sub(pattern, '', content1)
result2 = re.sub(pattern, '', content2)
result3 = re.sub(pattern, '', content3)
print(result1, result2, result3)

For example, there are three date strings here, and we want to remove the time from each of them. We can use sub(), whose first argument is a regular expression. Instead of writing the same regular expression three times, we use compile() to turn it into a regular expression object and reuse it.

Output:

2016-12-15  2016-12-17  2016-12-22 

In addition, compile() accepts modifiers such as re.S, so they do not need to be passed again to search(), findall() and other methods. compile() can thus be seen as a layer of encapsulation around a regular expression that makes it easier to reuse.
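
As a sketch of this, the pattern below is compiled once together with re.S and then used through the methods of the compiled object. The small HTML fragment and the variable names are made up for illustration.

```python
import re

# Hypothetical HTML fragment in which one <li> node spans several lines.
html = '<ul>\n<li>\nfirst\n</li>\n<li>second</li>\n</ul>'

# The flag is given once, at compile time.
item_pattern = re.compile(r'<li.*?>(.*?)</li>', re.S)

# The compiled object offers search(), findall(), sub() and so on as methods.
first_item = item_pattern.search(html).group(1).strip()
all_items = [item.strip() for item in item_pattern.findall(html)]

print(first_item)  # first
print(all_items)   # ['first', 'second']
```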

This concludes the basic usage of regular expressions. Later, we will work through a hands-on case to put regular expressions into practice.
