Talking about Python: consolidating the foundation of Python


Regular expressions are used in almost every back-end language. I think what makes them difficult is that there are many metacharacters to remember, so I suggest you bookmark this article. In Python programming, especially when filtering and cleaning data after crawling, many operations have to be carried out on strings, and regular expressions are often the most convenient tool for that kind of string processing.

1. Metacharacters and how regular expressions are assembled

A regular expression is a string made up of metacharacters, one after another. That string serves as a pattern to match against any other string, and the matched text becomes the new data. So, let's take a look at the metacharacters available in Python.

.                     Matches any character (excluding newline)
^                     Matches the start position; in multi-line mode, also matches the start of each line
$                     Matches the end position; in multi-line mode, also matches the end of each line
*                     Matches the previous metacharacter zero or more times
+                     Matches the previous metacharacter one or more times
?                     Matches the previous metacharacter zero or one time
{m,n}                 Matches the previous metacharacter m to n times
\                     Escape character. The character that follows loses its meaning as a special metacharacter; for example, \. matches only a literal "." and no longer matches any character
[]                    Character set; matches any one of the characters in the set
|                     Logical OR; for example, a|b matches a or b
(...)                 Group. Groups capture by default, that is, the grouped content can be taken out separately. Each group has an index starting from 1, determined by the order of the "("
(?iLmsux)             Sets modes inside the pattern; each letter in iLmsux stands for one mode (see the mode flags in section 2)
(?:...)               Non-capturing group, skipped when indexes are counted
(?P<name>...)         Named group. Its content can be retrieved by index or by name
(?P=name)             Reference to a named group; matches the same text that the group called name matched earlier in the same regular expression
(?#...)               Comment; does not affect the rest of the regular expression
(?=...)               Positive lookahead: the text to the right of this position must match the pattern in brackets
(?!...)               Negative lookahead: the text to the right of this position must not match the pattern in brackets
(?<=...)              Positive lookbehind: the text to the left of this position must match the pattern in brackets
(?<!...)              Negative lookbehind: the text to the left of this position must not match the pattern in brackets
(?(id/name)yes|no)    If the group with the given id or name matched, the pattern yes is used; otherwise the pattern no is used
\number               Matches the same string captured earlier by the group with index number
\A                    Matches the start of the string, ignoring multi-line mode
\Z                    Matches the end of the string, ignoring multi-line mode
\b                    Matches an empty string at the beginning or end of a word
\B                    Matches an empty string that is not at the beginning or end of a word
\d                    Matches a digit, equivalent to [0-9] for ASCII text
\D                    Matches a non-digit, equivalent to [^0-9] for ASCII text
\s                    Matches any whitespace character, equivalent to [ \t\n\r\f\v] for ASCII text
\S                    Matches any non-whitespace character, equivalent to [^ \t\n\r\f\v] for ASCII text
\w                    Matches any digit, letter or underscore, equivalent to [a-zA-Z0-9_] for ASCII text
\W                    Matches any character that is not a digit, letter or underscore, equivalent to [^a-zA-Z0-9_] for ASCII text

Source: the metacharacter table above is quoted from the Blog Garden blog "rookie's daily life".
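To make the grouping and lookaround entries of the table more concrete, here is a minimal sketch. The sample strings and variable names are made up for illustration and are not from the original article.

```python
import re

# Numbered and named groups: take the parts of a date out separately.
date = re.search(r"(?P<year>\d{4})-(\d{2})", "released 2021-05")
year_by_name = date.group("year")   # named group, fetched by name
year_by_index = date.group(1)       # the same group, fetched by index
month = date.group(2)

# Backreference: \1 must repeat whatever group 1 captured.
doubled = re.findall(r"\b(\w+) \1\b", "it is is what it it is")

# Lookbehind (?<=...) and lookahead (?=...) check the context
# around a position without including it in the match.
prices = re.findall(r"(?<=\$)\d+", "cost $30, qty 4, tax $2")
before_px = re.findall(r"\d+(?=px)", "width 100px, height 50em")

print(year_by_name, year_by_index, month)
print(doubled)
print(prices, before_px)
```

Note that a named group still receives a numeric index, which is why both lookups return the same year.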

2. Matching modes of Python's built-in re module

The re module of Python provides many built-in functions for working with regular expressions. Mastering regular expressions and these functions greatly improves work efficiency, because most of the time we are dealing with strings. The functions of the re module also accept different matching modes. For example, the flags parameter of compile(pattern, flags=0) specifies the matching mode. By default, flags is 0 for every function, which means no mode is specified.

re.I      Makes matching case-insensitive
re.L      Does locale-aware matching
re.M      Multi-line matching; affects ^ and $
re.S      Makes . match every character, including the newline
re.U      Parses characters according to the Unicode character set; affects \w, \W, \b and \B
re.X      Verbose mode; lets you lay the regular expression out more flexibly so that it is easier to read
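The effect of the flags is easier to see side by side. The following is a small sketch with an invented sample string; it shows re.I, re.M, re.S and re.X, and how two flags can be combined with the bitwise OR operator.

```python
import re

text = "Python\npython rocks"

# Without flags, ^ anchors only at the very start and matching is case-sensitive.
plain = re.findall(r"^python", text)

# re.I ignores case; re.M lets ^ match at the start of every line.
flagged = re.findall(r"^python", text, flags=re.I | re.M)

# By default "." stops at a newline; re.S lets it cross line breaks.
dot_default = re.search(r"Python.python", text)
dot_all = re.search(r"Python.python", text, flags=re.S)

# re.X ignores whitespace inside the pattern and allows comments.
number = re.compile(r"""
    \d+      # integer part
    \.       # decimal point
    \d+      # fractional part
""", re.X)

print(plain, flagged)
print(dot_default, dot_all.group())
print(number.findall("pi is 3.14"))
```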

3. Operating on strings with the functions of the re module

compile(pattern, flags=0) precompiles a regular expression. Precompiling makes it easy to reuse the same matching rule later. Of course, it is also possible to call the functions of the re module directly, without precompiling.
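As a rough illustration of the two styles, the sketch below runs the same rule once through a precompiled pattern object and once through the module-level function. The sample lines are invented.

```python
import re

lines = ["abc 123", "456 def ghi"]

# Precompile once, then reuse the pattern object on several strings.
word = re.compile(r"[a-z]+")
with_compile = [word.findall(line) for line in lines]

# The module-level function takes the same pattern as its first argument.
without_compile = [re.findall(r"[a-z]+", line) for line in lines]

print(with_compile)
print(with_compile == without_compile)
```

Both styles give the same result; the precompiled object mainly keeps the rule in one named place when it is used many times.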

findall(pattern, string, flags=0) finds all the matches and returns them as a list.

# Import the built-in module
import re
# Define the multi-line string sc
sc = '''str1
str2
str3'''
# Use the compile() function to precompile the regular expression.
# In the pattern ".+" (see the metacharacter table), "." matches any character except a newline
# and "+" matches the previous metacharacter one or more times,
# so ".+" matches every run of characters up to, but not including, a newline
pattern = re.compile(".+")
# Call findall() on the pattern object; it returns a list
print(pattern.findall(sc))
# Print the returned list. The result should be the following list:
# ['str1', 'str2', 'str3']
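One detail worth knowing is that the items returned by findall() change shape when the pattern contains groups. A minimal sketch, using an invented sample string:

```python
import re

log = "alice=30, bob=25"

# No group: each item is the whole match.
whole = re.findall(r"\w+=\d+", log)
# One group: each item is the content of that group.
names = re.findall(r"(\w+)=\d+", log)
# Several groups: each item is a tuple with one entry per group.
pairs = re.findall(r"(\w+)=(\d+)", log)

print(whole)
print(names)
print(pairs)
```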

split(pattern, string, maxsplit=0, flags=0) uses the text matched by the regular expression as cutting points, splits the original string there and returns the pieces as a list.

# Import the built-in module
import re
# Define the original string
sc = '''strw 1 laow
    strd 2 laow
    strc 3 laow'''
# Split on the numbers
print(re.split(r'\d+', sc))
# Printed result:
# ['strw ', ' laow\n    strd ', ' laow\n    strc ', ' laow']
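The maxsplit parameter and capturing groups both change what split() returns. The following sketch uses an invented string to show the difference.

```python
import re

row = "a1b22c333d"

# Split at every run of digits.
all_parts = re.split(r"\d+", row)
# maxsplit=2 stops after two cuts; the rest of the string stays whole.
first_two = re.split(r"\d+", row, maxsplit=2)
# A capturing group keeps the separators in the returned list.
keep_sep = re.split(r"(\d+)", row)

print(all_parts)
print(first_two)
print(keep_sep)
```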

sub(pattern, repl, string, count=0, flags=0) finds the text that matches the regular expression and replaces it with the specified string.

# Import the built-in module
import re
# Define the original string
sc = "the sum of 6 and 9 is [6+9]."
# Replace [6+9] with 15; the brackets and the plus sign are metacharacters, so they are escaped
print(re.sub(r'\[6\+9\]', '15', sc))
# Printed result:
# the sum of 6 and 9 is 15.
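The count parameter and a function passed as repl are two further options of sub(). The sketch below reuses the sentence from the example; the helper name add is made up for illustration.

```python
import re

sc = "the sum of 6 and 9 is [6+9]."

# count=1 replaces only the first match.
first_only = re.sub(r"\d", "#", sc, count=1)

# repl may also be a function that receives the match object.
# This hypothetical helper adds the two numbers found inside the brackets.
def add(match):
    return str(int(match.group(1)) + int(match.group(2)))

computed = re.sub(r"\[(\d+)\+(\d+)\]", add, sc)

print(first_only)
print(computed)
```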

search(pattern, string, flags=0) scans the string and returns the first match of the regular expression.

# Import the built-in module
import re
# Define the raw data
sc = '''strw 1 laow
    strd 2 laow
    strc 3 laow'''
# Find the first substring that starts with "t"
sc_res = re.search(r't\w+', sc)
# search() returns a match object, so call the object's group() function to print the matched value
print(sc_res.group())
# Printed result:
# trw
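Because search() returns an object rather than a string, it helps to check the result before calling group(). A minimal sketch with an invented one-line string:

```python
import re

sc = "strw 1 laow"

hit = re.search(r"\d+", sc)
miss = re.search(r"xyz", sc)

# search() returns None when nothing matches, so test before calling group().
found = hit.group() if hit else None
missing = miss.group() if miss else None

# A match object also reports where the match was found.
position = hit.span()

print(found, missing, position)
```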

escape(pattern) is the string escape function. When the text to be matched literally contains regular expression metacharacters, it must be escaped first; otherwise the match will be incorrect.

# Import the built-in module
import re
# Define the original string
sc = r".+\d222"
# Escape the regular expression .+\d222
pattern_str = re.escape(r".+\d222")
# Print the regular expression after escaping
print(pattern_str)
# Printed result:
# \.\+\\d222
# Print the matched string
print(re.findall(pattern_str, sc))
# Printed result:
# ['.+\\d222']
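To see why escaping matters, the sketch below searches for the same keyword with and without escape(). The keyword and the text are invented for illustration.

```python
import re

# A hypothetical search keyword that happens to contain a metacharacter.
keyword = "1+1"
text = "1+1=2 but 11 is eleven"

# Unescaped, "+" means "one or more of the previous character",
# so the pattern matches "11" instead of the literal keyword.
unescaped = re.findall(keyword, text)

# Escaped, every character is matched literally.
escaped = re.findall(re.escape(keyword), text)

print(unescaped)
print(escaped)
```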

4. Summary

Besides the common built-in functions listed in section 3, Python's regular expressions also offer features such as grouping and lookaround. Together they can handle most string problems. Regular expressions in Python are very powerful and are used heavily in back-end languages, so I suggest you bookmark this article.

