Python Standard Library 19. Text Processing Services: Re Regular Expression Operations

Time:2019-8-13


This module provides regular expression matching operations similar to Perl.

Both patterns and strings to be searched can be Unicode strings (str) as well as 8-bit strings (bytes). However, Unicode strings and 8-bit strings cannot be mixed: you cannot match a Unicode string with a bytes pattern or vice versa; similarly, when asking for a substitution, the replacement string must be of the same type as both the pattern and the search string.

Regular expressions use the backslash character ('\') to indicate special forms or to allow special characters to be used without invoking their special meaning. This collides with Python's use of the same character for the same purpose in string literals. For example, to match a literal backslash, one might have to write '\\\\' as the pattern string, because the regular expression must be \\, and each backslash must be expressed as \\ inside a regular Python string literal.

The solution is to use Python's raw string notation for regular expression patterns; backslashes are not handled in any special way in a string literal prefixed with 'r'. So r"\n" is a two-character string containing '\' and 'n', while "\n" is a one-character string containing a newline. Patterns are usually expressed in Python code using this raw string notation.
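As a small illustration of the raw string notation described above, the sketch below compares the two spellings of a pattern that matches one literal backslash. The variable names and the sample path are made up for this example.

```python
import re

# A regular (non-raw) literal needs four backslashes to produce the
# two-character regex \\ that matches one literal backslash.
plain_pattern = "\\\\"
# The raw literal spells the same two-character pattern directly.
raw_pattern = r"\\"

path = "C:\\temp"  # the actual text is C:\temp (a single backslash)

same_pattern = plain_pattern == raw_pattern
backslash_index = re.search(raw_pattern, path).start()

# r"\n" is two characters (backslash, n); "\n" is one newline character.
raw_length = len(r"\n")
newline_length = len("\n")
```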

Most regular expression operations are available both as module-level functions and as methods on compiled regular expressions. The functions are shortcuts that do not require you to compile a regular expression object first, but they lack some fine-tuning parameters.

See also

The third-party module regex provides an API interface compatible with the standard library re module, as well as additional functions and more comprehensive Unicode support.

Regular expression syntax

A regular expression (or RE) specifies a set of matching strings; functions in a module allow you to check whether a string matches a given regular expression (or whether a regular expression matches a string, which means the same thing).

Regular expressions can be concatenated to form new regular expressions; if A and B are both regular expressions, then AB is also a regular expression. In general, if a string p matches A and another string q matches B, the string pq will match AB. This holds unless A or B contain low-precedence operations, boundary conditions between A and B, or numbered group references. Thus, complex expressions can easily be constructed from simpler primitive expressions like the ones described here. For details of the theory and implementation of regular expressions, consult the Friedl book [Frie09], or almost any textbook about compiler construction.

The following is a brief description of the regular expression format. For more detailed information and demonstration, refer to the regular expression HOWTO.

Regular expressions can contain both special and ordinary characters. Most ordinary characters, like 'A', 'a', or '0', are the simplest regular expressions; they simply match themselves. You can concatenate ordinary characters, so last matches the string 'last'. (In the rest of this section, regular expressions are written without quotes, and strings to be matched are written 'in single quotes'.)

Some characters, like '|' or '(', are special. Special characters either stand for classes of ordinary characters, or affect how the regular expressions around them are interpreted.

Repetition qualifiers (*, +, ?, {m,n}, etc.) cannot be directly nested. This avoids ambiguity with the non-greedy modifier suffix ?, and with other modifiers in other implementations. To apply a second repetition to an inner repetition, parentheses may be used. For example, the expression (?:a{6})* matches any multiple of six 'a' characters.

Special characters are:

  • .

    (Dot.) In the default mode, this matches any character except a newline. If the DOTALL flag has been specified, it matches any character including a newline.

  • ^

    (Caret.) Matches the start of the string, and in MULTILINE mode also matches immediately after each newline.

  • $

    Matches the end of the string or just before the newline at the end of the string, and in MULTILINE mode also matches before a newline. foo matches both 'foo' and 'foobar', while the regular expression foo$ matches only 'foo'. More interestingly, searching for foo.$ in 'foo1\nfoo2\n' matches 'foo2' normally, but 'foo1' in MULTILINE mode; searching for a single $ in 'foo\n' will find two (empty) matches: one just before the newline, and one at the end of the string.
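    The behaviour of $ with and without MULTILINE can be checked with a short sketch like the following (variable names are illustrative only):

```python
import re

text = "foo1\nfoo2\n"

# Without MULTILINE, $ matches only at the end of the string
# (or just before a trailing newline).
default_match = re.search(r"foo.$", text).group()

# With MULTILINE, $ also matches before every newline, so the first line wins.
multiline_match = re.search(r"foo.$", text, re.MULTILINE).group()

# A bare $ finds two empty matches in 'foo\n':
# one just before the newline and one at the very end.
end_positions = [m.start() for m in re.finditer(r"$", "foo\n")]
```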

  • *

    Causes the resulting RE to match 0 or more repetitions of the preceding RE, as many repetitions as are possible. ab* will match 'a', 'ab', or 'a' followed by any number of 'b's.

  • +

    Causes the resulting RE to match 1 or more repetitions of the preceding RE. ab+ will match 'a' followed by any non-zero number of 'b's; it will not match just 'a'.

  • ?

    Causes the resulting RE to match 0 or 1 repetitions of the preceding RE. ab? will match either 'a' or 'ab'.

  • *?, +?, ??

    The '*', '+', and '?' qualifiers are all greedy; they match as much text as possible. Sometimes this behaviour is not desired; if the RE <.*> is matched against '<a> b <c>', it will match the entire string, and not just '<a>'. Adding ? after the qualifier makes it perform the match in a non-greedy or minimal fashion; as few characters as possible will be matched. Using the RE <.*?> will match only '<a>'.

  • {m}

Specifies that exactly m copies of the previous RE should be matched; fewer matches cause the entire RE not to match. For example, a{6} will match exactly six 'a' characters, but not five.

  • {m,n}

Causes the resulting RE to match from m to n repetitions of the preceding RE, attempting to match as many repetitions as possible. For example, a{3,5} will match from 3 to 5 'a' characters. Omitting m specifies a lower bound of zero, and omitting n specifies an infinite upper bound. For example, a{4,}b will match 'aaaab' or a thousand 'a' characters followed by a 'b', but not 'aaab'. The comma may not be omitted, or the modifier would be confused with the previously described form.

  • {m,n}?

The non-greedy version of the previous qualifier: it matches as few repetitions as possible. For example, on the 6-character string 'aaaaaa', a{3,5} will match 5 'a' characters, while a{3,5}? will only match 3 characters.
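A minimal sketch contrasting the greedy and non-greedy qualifiers described above, using the same sample strings as the text (variable names are illustrative):

```python
import re

tagged = "<a> b <c>"
greedy_star = re.match(r"<.*>", tagged).group()   # grabs as much as possible
lazy_star = re.match(r"<.*?>", tagged).group()    # stops at the first '>'

letters = "aaaaaa"
greedy_range = re.match(r"a{3,5}", letters).group()   # up to five 'a's
lazy_range = re.match(r"a{3,5}?", letters).group()    # only the minimum three
```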

  • \

Either escapes special characters (permitting you to match characters like '*', '?', and so forth), or signals a special sequence; special sequences are discussed below.

If you are not using a raw string (r'raw') to express the pattern, remember that Python also uses the backslash as an escape sequence in string literals; if the escape sequence is not recognized by Python's parser, the backslash and the following character are included in the resulting string. However, if Python would recognize the resulting sequence, the backslash should be repeated twice. This is complicated and hard to understand, so it is highly recommended that you use raw strings for all but the simplest expressions.

  • []

Used to indicate a set of characters. In a set:

* Characters can be listed individually, e.g. [amk] will match 'a', 'm', or 'k'.

* Ranges of characters can be indicated by giving two characters and separating them by a '-'. For example, [a-z] will match any lowercase ASCII letter, [0-5][0-9] will match all the two-digit numbers from 00 to 59, and [0-9A-Fa-f] will match any hexadecimal digit. If - is escaped (e.g. [a\-z]) or if it is placed as the first or last character (e.g. [-a] or [a-]), it will match a literal '-'.

* Special characters lose their special meaning inside sets. For example, [(+*)] will match any of the literal characters '(', '+', '*', or ')'.

* Character classes such as \w or \S (defined below) are also accepted inside a set, although the characters they match depend on whether ASCII or LOCALE mode is in force.

* Characters that are not within a range can be matched by complementing the set. If the first character of the set is '^', all the characters that are not in the set will be matched. For example, [^5] will match any character except '5', and [^^] will match any character except '^'. ^ has no special meaning if it is not the first character in the set.

* To match a literal ']' inside a set, precede it with a backslash, or place it at the beginning of the set. For example, both [()[\]{}] and []()[{}] will match a parenthesis, bracket, or brace.

* Support for nested sets and set operations as in Unicode Technical Standard #18 might be added in the future. This would change the syntax, so to facilitate this change a FutureWarning is raised in ambiguous cases for the time being. That includes sets starting with a literal '[' or containing the literal character sequences '--', '&&', '~~', and '||'. To avoid a warning, escape them with a backslash.

Changed in version 3.7: FutureWarning is raised if a character set contains constructs that will change semantically in the future.
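The set rules listed above can be exercised with a few short calls. This is only a sketch; the sample strings and variable names are made up.

```python
import re

# A range inside a set: two-digit values from 00 to 59.
minute = re.compile(r"[0-5][0-9]")
valid_minute = minute.fullmatch("42") is not None
invalid_minute = minute.fullmatch("75") is not None

# Special characters lose their meaning inside a set.
operators = re.findall(r"[(+*)]", "(1+2)*3")

# A leading ^ complements the set.
non_digits = re.findall(r"[^0-9]", "a1b2")

# ']' can be escaped, or placed first in the set.
brackets_escaped = re.findall(r"[()[\]{}]", "f(x)[0]{}")
brackets_first = re.findall(r"[]()[{}]", "f(x)[0]{}")
```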

  • |

A|B, where A and B can be arbitrary REs, creates a regular expression that will match either A or B. An arbitrary number of REs can be separated by '|' in this way. This can be used inside groups (see below) as well. As the target string is scanned, REs separated by '|' are tried from left to right. When one pattern completely matches, that branch is accepted. This means that once A matches, B will not be tested further, even if it would produce a longer overall match. In other words, the '|' operator is never greedy. To match a literal '|', use \|, or enclose it inside a character class, as in [|].
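A short sketch of the left-to-right behaviour of '|' (the variable names are illustrative):

```python
import re

# Alternatives are tried left to right; the first one that matches wins,
# even when a later alternative would match more text.
first_wins = re.match(r"a|ab", "ab").group()

# Putting the longer alternative first changes the result.
longer_first = re.match(r"ab|a", "ab").group()

# A literal '|' must be escaped (or placed in a set).
pieces = re.split(r"\|", "x|y|z")
```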

  • (…)

(Group.) Matches whatever regular expression is inside the parentheses, and indicates the start and end of a group. The contents of a group can be retrieved after a match has been performed, and can be matched later in the string with the \number special sequence, described below. To match the literals '(' or ')', use \( or \), or enclose them inside a character class: [(], [)].

  • (?…)

This is an extension notation (a '?' following a '(' is not meaningful otherwise). The first character after the '?' determines what the meaning and further syntax of the construct is. Extensions usually do not create a new group; (?P<name>...) is the only exception to this rule. The following are the currently supported extensions.

  • (?aiLmsux)

(One or more letters from the set 'a', 'i', 'L', 'm', 's', 'u', 'x'.) The group matches the empty string; the letters set the corresponding flags for the entire regular expression: re.A (ASCII-only matching), re.I (ignore case), re.L (locale dependent), re.M (multi-line), re.S (dot matches all), re.U (Unicode matching), and re.X (verbose). (The flags are described in Module contents.) This is useful if you wish to include the flags as part of the regular expression, instead of passing a flags argument to re.compile(). Flags should be used first in the expression string.

  • (?:…)

A non-capturing version of regular parentheses. Matches whatever regular expression is inside the parentheses, but the substring matched by the group cannot be retrieved after performing a match or referenced later in the pattern.

  • (?aiLmsux-imsx:…)

(Zero or more letters from the set 'a', 'i', 'L', 'm', 's', 'u', 'x', optionally followed by '-' followed by one or more letters from 'i', 'm', 's', 'x'.) The letters set or remove the corresponding flags for the part of the expression inside the group: re.A (ASCII-only matching), re.I (ignore case), re.L (locale dependent), re.M (multi-line), re.S (dot matches all), re.U (Unicode matching), and re.X (verbose). (The flags are described in Module contents.)

The letters 'a', 'L' and 'u' are mutually exclusive when used as inline flags, so they cannot be combined or follow '-'. Instead, when one of them appears in an inline group, it overrides the matching mode in the enclosing group. In Unicode patterns, (?a:...) switches to ASCII-only matching, and (?u:...) switches to Unicode matching (the default). In bytes patterns, (?L:...) switches to locale-dependent matching, and (?a:...) switches to ASCII-only matching (the default). This override is only in effect inside the inline group; the original matching mode is restored outside of the group.

New in version 3.6.

Changed in version 3.7: The letters 'a', 'L' and 'u' can also be used in a group.
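As a sketch of the difference between a global inline flag and a scoped one (requires Python 3.6 or later for the scoped form; sample strings are made up):

```python
import re

text = "Ab ab aB"

# (?i) at the start applies to the whole pattern.
global_flag = re.findall(r"(?i)ab", text)

# (?i:...) applies only inside the group, so 'b' stays case-sensitive.
scoped_flag = re.findall(r"(?i:a)b", text)

# (?a:...) restricts \w to ASCII inside the group only.
ascii_words = re.findall(r"(?a:\w)+", "naïve")
```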

  • (?P<name>…)

(Named group.) Similar to regular parentheses, but the substring matched by the group is accessible via the symbolic group name name. Group names must be valid Python identifiers, and each group name must be defined only once within a regular expression. A symbolic group is also a numbered group, just as if the group were not named.

Named groups can be referenced in three contexts. If the pattern is (?P<quote>['"]).*?(?P=quote) (i.e. matching a string quoted with either single or double quotes):

* in the same pattern itself: (?P=quote) (as shown) or \1
* when processing the match object m: m.group('quote'), m.end('quote') (etc.)
* in a string passed to the repl argument of re.sub(): \g<quote>, \g<1> or \1
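The three ways of referring to a named group can be tried out as follows. This is a sketch only; the sample sentence and variable names are invented for illustration.

```python
import re

pattern = re.compile(r"""(?P<quote>['"]).*?(?P=quote)""")
sentence = """say "it's fine" now"""

# 1. Inside the pattern: (?P=quote) requires the same quote character to close.
m = pattern.search(sentence)
matched_text = m.group(0)

# 2. On the match object: look the group up by name.
quote_char = m.group("quote")

# 3. In a replacement string passed to sub(): \g<quote> inserts the group text.
masked = pattern.sub(r"\g<quote>...\g<quote>", sentence)
```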

  • (?P=name)

A backreference to a named group; it matches whatever text was matched by the earlier group named name.

  • (?#…)

A comment; the contents of the parentheses are simply ignored.

  • (?=…)

Matches if ... matches next, but does not consume any of the string. This is called a lookahead assertion. For example, Isaac (?=Asimov) will match 'Isaac ' only if it is followed by 'Asimov'.

  • (?!…)

Matches if ... does not match next. This is called a negative lookahead assertion. For example, Isaac (?!Asimov) will match 'Isaac ' only if it is not followed by 'Asimov'.
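A brief sketch of both lookahead forms; note that the text inside the assertion is checked but not included in the match (variable names are illustrative):

```python
import re

# Positive lookahead: 'Asimov' must follow, but is not consumed.
positive = re.search(r"Isaac (?=Asimov)", "Isaac Asimov").group()

# Negative lookahead: fails when 'Asimov' follows, succeeds otherwise.
rejected = re.search(r"Isaac (?!Asimov)", "Isaac Asimov")
accepted = re.search(r"Isaac (?!Asimov)", "Isaac Newton").group()
```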

  • (?<=…)

Matches if the current position in the string is preceded by a match for ... that ends at the current position. This is called a positive lookbehind assertion. (?<=abc)def will find a match in 'abcdef', since the lookbehind will back up 3 characters and check whether the contained pattern matches. The contained pattern must only match strings of some fixed length, meaning that abc or a|b are allowed, but a* and a{3,4} are not. Note that patterns which start with positive lookbehind assertions will not match at the beginning of the string being searched; you will most likely want to use the search() function rather than the match() function:

    >>> import re
    >>> m = re.search('(?<=abc)def', 'abcdef')
    >>> m.group(0)
    'def'

This example searches for a word that follows a hyphen:

    >>> m = re.search(r'(?<=-)\w+', 'spam-egg')
    >>> m.group(0)
    'egg'

Changed in version 3.5: Added support for group references of fixed length.

  • (?<!…)

Matches if the current position in the string is not preceded by a match for .... This is called a negative lookbehind assertion. Similar to positive lookbehind assertions, the contained pattern must only match strings of some fixed length. Patterns which start with negative lookbehind assertions may match at the beginning of the string being searched.

  • (?(id/name)yes-pattern|no-pattern)

Will try to match with yes-pattern if the group with the given id or name exists, and with no-pattern if it does not. no-pattern is optional and can be omitted. For example, (<)?(\w+@\w+(?:\.\w+)+)(?(1)>|$) is a poor email matching pattern, which will match '<user@host.com>' as well as 'user@host.com', but not '<user@host.com' nor 'user@host.com>'.
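The conditional construct can be checked with the deliberately simplistic email pattern from the Python documentation, (<)?(\w+@\w+(?:\.\w+)+)(?(1)>|$). The sketch below assumes that pattern and uses placeholder addresses:

```python
import re

# If group 1 (the opening '<') took part in the match, require a closing '>';
# otherwise require the end of the string.
email = re.compile(r"(<)?(\w+@\w+(?:\.\w+)+)(?(1)>|$)")

candidates = ["<user@host.com>", "user@host.com", "<user@host.com", "user@host.com>"]
accepted = [s for s in candidates if email.match(s)]
```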

The special sequences consist of '\' and a character from the list below. If the ordinary character is not an ASCII digit or an ASCII letter, then the resulting RE will match the second character. For example, \$ matches the character '$'.

  • \number

Matches the contents of the group of the same number. Groups are numbered starting from 1. For example, (.+) \1 matches 'the the' or '55 55', but not 'thethe' (note the space after the group). This special sequence can only be used to match one of the first 99 groups. If the first digit of number is 0, or number is 3 octal digits long, it will not be interpreted as a group match, but as the character with octal value number. Inside the '[' and ']' of a character class, all numeric escapes are treated as characters.
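A minimal check of the numbered backreference, using the strings from the text:

```python
import re

repeated = re.compile(r"(.+) \1")

doubled_word = repeated.search("the the").group()
doubled_number = repeated.search("55 55").group()
no_space = repeated.search("thethe")  # no space between the copies
```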

  • \A

Matches only at the start of the string.

  • \b

Matches the empty string, but only at the beginning or end of a word. A word is defined as a sequence of word characters. Note that formally, \b is defined as the boundary between a \w and a \W character (or vice versa), or between \w and the beginning/end of the string. This means that r'\bfoo\b' matches 'foo', 'foo.', '(foo)', 'bar foo baz' but not 'foobar' or 'foo3'.

By default Unicode alphanumerics are the ones used in Unicode patterns, but this can be changed by using the ASCII flag. Word boundaries are determined by the current locale if the LOCALE flag is used. Inside a character range, \b represents the backspace character, for compatibility with Python's string literals.

  • \B

Matches the empty string, but only when it is not at the beginning or end of a word. This means that r'py\B' matches 'python', 'py3', 'py2', but not 'py', 'py.', or 'py!'. \B is just the opposite of \b, so word characters in Unicode patterns are Unicode alphanumerics or the underscore, although this can be changed by using the ASCII flag. Word boundaries are determined by the current locale if the LOCALE flag is used.
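The word-boundary examples above can be verified with a short sketch (list and variable names are illustrative):

```python
import re

whole_word = re.compile(r"\bfoo\b")
samples = ["foo", "foo.", "(foo)", "bar foo baz", "foobar", "foo3"]
word_hits = [s for s in samples if whole_word.search(s)]

not_at_boundary = re.compile(r"py\B")
py_samples = ["python", "py3", "py2", "py", "py.", "py!"]
inner_hits = [s for s in py_samples if not_at_boundary.search(s)]
```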

  • \d

For Unicode (str) patterns:

Matches any Unicode decimal digit (that is, any character in Unicode character category [Nd]). This includes [0-9], and also many other digit characters. If the ASCII flag is used, only [0-9] is matched.
For 8-bit (bytes) patterns: Matches any decimal digit; this is equivalent to [0-9].

  • \D

Matches any character which is not a decimal digit. This is the opposite of \d. If the ASCII flag is used, this becomes the equivalent of [^0-9].

  • \s

For Unicode (str) patterns: Matches Unicode whitespace characters (which includes [ \t\n\r\f\v], and also many other characters, for example the non-breaking spaces mandated by typography rules in many languages). If the ASCII flag is used, only [ \t\n\r\f\v] is matched.
For 8-bit (bytes) patterns: Matches characters considered whitespace in the ASCII character set; this is equivalent to [ \t\n\r\f\v].

  • \S

Matches any character which is not a whitespace character. This is the opposite of \s. If the ASCII flag is used, this becomes the equivalent of [^ \t\n\r\f\v].

  • \w

For Unicode (str) patterns: Matches Unicode word characters; this includes most characters that can be part of a word in any language, as well as numbers and the underscore. If the ASCII flag is used, only [a-zA-Z0-9_] is matched.
For 8-bit (bytes) patterns: Matches characters considered alphanumeric in the ASCII character set; this is equivalent to [a-zA-Z0-9_]. If the LOCALE flag is used, matches characters considered alphanumeric in the current locale and the underscore.

  • \W

Matches any character which is not a word character. This is the opposite of \w. If the ASCII flag is used, this becomes the equivalent of [^a-zA-Z0-9_]. If the LOCALE flag is used, matches characters that are neither alphanumeric in the current locale nor the underscore.

  • \Z

Matches only at the end of the string.

Most of the standard escapes supported by Python string literals are also accepted by the regular expression parser:

\a      \b      \f      \n
\r      \t      \u      \U
\v      \x      \\

(Note that \b is used to represent word boundaries, and means "backspace" only inside character classes, as in [\b].)

The '\u' and '\U' escape sequences are only recognized in Unicode patterns. In bytes patterns they are errors. Unknown escapes of ASCII letters are reserved for future use and treated as errors.

Octal escapes are included in a limited form. If the first digit is a 0, or if there are three octal digits, it is considered an octal escape. Otherwise, it is a group reference. As for string literals, octal escapes are always at most three digits in length.

Changed in version 3.3: The '\u' and '\U' escape sequences have been added.

Changed in version 3.6: Unknown escapes consisting of '\' and an ASCII letter are now errors.

Module contents

The module defines several functions, constants, and an exception. Some of the functions are simplified versions of the full-featured methods for compiled regular expressions. Most non-trivial applications always use the compiled form.

Changed in version 3.6: Flag constants are now instances of RegexFlag, which is a subclass of enum.IntFlag.

  • re.compile(pattern, flags=0)

Compile a regular expression pattern into a regular expression object, which can be used for matching using its match(), search() and other methods, described below.

The expression's behaviour can be modified by specifying a flags value. Values can be any of the following variables, combined using bitwise OR (the | operator).
The sequence

    prog = re.compile(pattern)
    result = prog.match(string)

is equivalent to

    result = re.match(pattern, string)

but using re.compile() and saving the resulting regular expression object for reuse is more efficient when the expression will be used several times in a single program.

Note

The compiled versions of the most recent patterns passed to re.compile() and the module-level matching functions are cached, so programs that use only a few regular expressions at a time need not worry about compiling regular expressions.
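As a sketch of compiling once with combined flags and reusing the object (the pattern and sample text are made up for illustration):

```python
import re

pattern = r"^spam$"

# Flags are combined with the bitwise OR operator.
prog = re.compile(pattern, re.IGNORECASE | re.MULTILINE)

lines = "Spam\nSPAM\neggs"
compiled_hits = prog.findall(lines)

# The module-level function gives the same result; it looks the compiled
# pattern up in the module's cache on each call.
module_hits = re.findall(pattern, lines, re.IGNORECASE | re.MULTILINE)
```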

  • re.A
  • re.ASCII

Make \w, \W, \b, \B, \d, \D, \s and \S perform ASCII-only matching instead of full Unicode matching. This is only meaningful for Unicode patterns, and is ignored for bytes patterns. Corresponds to the inline flag (?a).

Note that for backward compatibility, the re.U flag still exists (as well as its synonym re.UNICODE and its embedded counterpart (?u)), but these are redundant in Python 3 since matches are Unicode by default for strings (and Unicode matching is not allowed for bytes).

  • re.DEBUG

Display debug information about the compiled expression. There is no corresponding inline flag.

  • re.I
  • re.IGNORECASE

Perform case-insensitive matching; expressions like [A-Z] will also match lowercase letters. Full Unicode matching (such as Ü matching ü) also works unless the re.ASCII flag is used to disable non-ASCII matches. The current locale does not change the effect of this flag unless the re.LOCALE flag is also used. Corresponds to the inline flag (?i).

Note that when the Unicode patterns [a-z] or [A-Z] are used in combination with the IGNORECASE flag, they will match the 52 ASCII letters and 4 additional non-ASCII letters: 'İ' (U+0130, Latin capital letter I with dot above), 'ı' (U+0131, Latin small letter dotless i), 'ſ' (U+017F, Latin small letter long s) and 'K' (U+212A, Kelvin sign). If the ASCII flag is used, only the letters 'a' to 'z' and 'A' to 'Z' are matched.

  • re.L
  • re.LOCALE

Make \w, \W, \b, \B and case-insensitive matching dependent on the current locale. This flag can be used only with bytes patterns. The use of this flag is discouraged because the locale mechanism is very unreliable, it only handles one "culture" at a time, and it only works with 8-bit locales. Unicode matching is already enabled by default in Python 3 for Unicode (str) patterns, and it is able to handle different locales and languages. Corresponds to the inline flag (?L).

Changed in version 3.6: re.LOCALE can be used only with bytes patterns and is not compatible with re.ASCII.

Changed in version 3.7: Compiled regular expression objects with the re.LOCALE flag no longer depend on the locale at compile time. Only the locale at matching time affects the result of matching.

  • re.M
  • re.MULTILINE

When specified, the pattern character '^' matches at the beginning of the string and at the beginning of each line (immediately following each newline); and the pattern character '$' matches at the end of the string and at the end of each line (immediately preceding each newline). By default, '^' matches only at the beginning of the string, and '$' only at the end of the string and immediately before the newline (if any) at the end of the string. Corresponds to the inline flag (?m).

  • re.S
  • re.DOTALL

    Make the '.' special character match any character at all, including a newline; without this flag, '.' will match anything except a newline. Corresponds to the inline flag (?s).

  • re.X
  • re.VERBOSE

This flag allows you to write regular expressions that look nicer and are more readable, by letting you visually separate logical sections of the pattern and add comments. Whitespace within the pattern is ignored, except when in a character class, or when preceded by an unescaped backslash, or within tokens like *?, (?: or (?P<...>. When a line contains a # that is not in a character class and is not preceded by an unescaped backslash, all characters from the leftmost such # through the end of the line are ignored.

This means that the two following regular expression objects that match a decimal number are functionally equal:

    a = re.compile(r"""\d +  # the integral part
                       \.    # the decimal point
                       \d *  # some fractional digits""", re.X)
    b = re.compile(r"\d+\.\d*")

Corresponds to the inline flag (?x).

  • re.search(pattern, string, flags=0)

Scan through string looking for the first location where the regular expression pattern produces a match, and return a corresponding match object. Return None if no position in the string matches the pattern; note that this is different from finding a zero-length match at some point in the string.

  • re.match(pattern, string, flags=0)

If zero or more characters at the beginning of string match the regular expression pattern, return a corresponding match object. Return None if the string does not match the pattern; note that this is different from a zero-length match.

Note that even in MULTILINE mode, re.match() will only match at the beginning of the string and not at the beginning of each line.

If you want to locate a match anywhere in string, use search() instead (see also search() vs. match()).
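The difference between match() and search(), including the MULTILINE caveat above, can be seen in a short sketch (the sample text is invented):

```python
import re

text = "first line\nsecond line"

# match() only tries at the very beginning of the string.
plain_match = re.match(r"second", text)

# MULTILINE does not change that: match() still only tries index 0.
multiline_match = re.match(r"^second", text, re.MULTILINE)

# search() scans forward, and with MULTILINE '^' matches after the newline.
found_at = re.search(r"^second", text, re.MULTILINE).start()
```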

  • re.fullmatch(pattern, string, flags=0)

If the whole string matches the regular expression pattern, return a corresponding match object. Return None if the string does not match the pattern; note that this is different from a zero-length match.

New in version 3.4.

  • re.split(pattern, string, maxsplit=0, flags=0)

Split string by the occurrences of pattern. If capturing parentheses are used in pattern, then the text of all groups in the pattern is also returned as part of the resulting list. If maxsplit is nonzero, at most maxsplit splits occur, and the remainder of the string is returned as the final element of the list.

    >>> re.split(r'\W+', 'Words, words, words.')
    ['Words', 'words', 'words', '']
    >>> re.split(r'(\W+)', 'Words, words, words.')
    ['Words', ', ', 'words', ', ', 'words', '.', '']
    >>> re.split(r'\W+', 'Words, words, words.', 1)
    ['Words', 'words, words.']
    >>> re.split('[a-f]+', '0a3B9', flags=re.IGNORECASE)
    ['0', '3', '9']

If there are capturing groups in the separator and it matches at the start of the string, the result will start with an empty string. The same holds for the end of the string:

    >>> re.split(r'(\W+)', '...words, words...')
    ['', '...', 'words', ', ', 'words', '...', '']

That way, separator components are always found at the same relative indices within the result list.

Empty matches for the pattern split the string only when not adjacent to a previous empty match.

    >>> re.split(r'\b', 'Words, words, words.')
    ['', 'Words', ', ', 'words', ', ', 'words', '.']
    >>> re.split(r'\W*', '...words...')
    ['', '', 'w', 'o', 'r', 'd', 's', '', '']
    >>> re.split(r'(\W*)', '...words...')
    ['', '...', '', '', 'w', '', 'o', '', 'r', '', 'd', '', 's', '...', '', '', '']

Changed in version 3.1: Added the optional flags argument.

Changed in version 3.7: Added support for splitting on a pattern that could match an empty string.

  • re.findall(pattern, string, flags=0)

Return all non-overlapping matches of pattern in string, as a list of strings. The string is scanned left to right, and matches are returned in the order found. If one or more groups are present in the pattern, return a list of groups; this will be a list of tuples if the pattern has more than one group. Empty matches are included in the result.

Changed in version 3.7: Non-empty matches can now start just after a previous empty match.

  • re.finditer(pattern, string, flags=0)

Return an iterator yielding match objects over all non-overlapping matches for the RE pattern in string. The string is scanned left to right, and matches are returned in the order found. Empty matches are included in the result.

Changed in version 3.7: Non-empty matches can now start just after a previous empty match.
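A small sketch of how the return shape of findall() depends on the number of groups, and how finditer() exposes match objects instead (the sample text is made up):

```python
import re

text = "width=10, height=20"

# No groups: a list of matched strings.
numbers = re.findall(r"\d+", text)

# Two groups: a list of tuples, one tuple per match.
pairs = re.findall(r"(\w+)=(\d+)", text)

# finditer() yields match objects, so positions are available too.
key_starts = [(m.group(1), m.start()) for m in re.finditer(r"(\w+)=", text)]
```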

  • re.sub(pattern, repl, string, count=0, flags=0)

Return the string obtained by replacing the leftmost non-overlapping occurrences of pattern in string by the replacement repl. If the pattern is not found, string is returned unchanged. repl can be a string or a function; if it is a string, any backslash escapes in it are processed. That is, \n is converted to a single newline character, \r is converted to a carriage return, and so forth. Unknown escapes of ASCII letters are reserved for future use and treated as errors. Other unknown escapes such as \& are left alone. Backreferences, such as \6, are replaced with the substring matched by group 6 in the pattern. For example:

    >>> re.sub(r'def\s+([a-zA-Z_][a-zA-Z_0-9]*)\s*\(\s*\):',
    ...        r'static PyObject*\npy_\1(void)\n{',
    ...        'def myfunc():')
    'static PyObject*\npy_myfunc(void)\n{'

If repl is a function, it is called for every non-overlapping occurrence of pattern. The function takes a single match object argument, and returns the replacement string. For example:

    >>> def dashrepl(matchobj):
    ...     if matchobj.group(0) == '-': return ' '
    ...     else: return '-'
    >>> re.sub('-{1,2}', dashrepl, 'pro----gram-files')
    'pro--gram files'
    >>> re.sub(r'\sAND\s', ' & ', 'Baked Beans And Spam', flags=re.IGNORECASE)
    'Baked Beans & Spam'

The pattern may be a string or a pattern object.

The optional argument count is the maximum number of pattern occurrences to be replaced; count must be a non-negative integer. If omitted or zero, all occurrences will be replaced. Empty matches for the pattern are replaced only when not adjacent to a previous empty match, so sub('x*', '-', 'abxd') returns '-a-b--d-'.

In string-type repl arguments, in addition to the character escapes and backreferences described above, \g<name> will use the substring matched by the group named name, as defined by the (?P<name>...) syntax. \g<number> uses the corresponding group number; \g<2> is therefore equivalent to \2, but is not ambiguous in a replacement such as \g<2>0. \20 would be interpreted as a reference to group 20, not a reference to group 2 followed by the literal character '0'. The backreference \g<0> substitutes in the entire substring matched by the RE.
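The replacement-string references described above can be sketched as follows (the patterns and inputs are invented for illustration):

```python
import re

# \g<name> refers to a named group.
date = re.compile(r"(?P<year>\d{4})-(?P<month>\d{2})")
reordered = date.sub(r"\g<month>/\g<year>", "2019-08")

# \g<1>0 is group 1 followed by a literal '0'; \10 would mean group 10.
padded = re.sub(r"(\d)", r"\g<1>0", "a1b2")

# \g<0> stands for the whole match.
bracketed = re.sub(r"\w+", r"[\g<0>]", "hi there")
```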

Changed in version 3.1: Added the optional flags argument.

Changed in version 3.5: Unmatched groups are replaced with an empty string.

Changed in version 3.6: Unknown escapes in pattern consisting of '\' and an ASCII letter are now errors.

Changed in version 3.7: Unknown escapes in repl consisting of '\' and an ASCII letter are now errors.

Changed in version 3.7: Empty matches for the pattern are replaced when adjacent to a previous non-empty match.

  • re.subn(pattern, repl, string, count=0, flags=0)

Perform the same operation as sub(), but return a tuple (new_string, number_of_subs_made).

Changed in version 3.1: Added the optional flags argument.

Changed in version 3.5: Unmatched groups are replaced with an empty string.

  • re.escape(pattern)

Escape special characters in pattern. This is useful if you want to match an arbitrary literal string that may have regular expression metacharacters in it. For example:

   >>> print(re.escape('python.exe'))
   python\.exe

   >>> import string
   >>> legal_chars = string.ascii_lowercase + string.digits + "!#$%&'*+-.^_`|~:"
   >>> print('[%s]+' % re.escape(legal_chars))
   [abcdefghijklmnopqrstuvwxyz0123456789!\#\$%\&'\*\+\-\.\^_`\|\~:]+

   >>> operators = ['+', '-', '*', '/', '**']
   >>> print('|'.join(map(re.escape, sorted(operators, reverse=True))))
   /|\-|\+|\*\*|\*

This function must not be used for the replacement string in sub() and subn(); only backslashes should be escaped. For example:

    >>> digits_re = r'\d+'
    >>> sample = '/usr/sbin/sendmail - 0 errors, 12 warnings'
    >>> print(re.sub(digits_re, digits_re.replace('\\', r'\\'), sample))
    /usr/sbin/sendmail - \d+ errors, \d+ warnings

Changed in version 3.3: The '_' character is no longer escaped.

Changed in version 3.7: Only characters that can have special meaning in a regular expression are escaped.

  • re.purge()

    Clear the regular expression cache.

  • exception re.error(msg, pattern=None, pos=None)

Exception raised when a string passed to one of the functions here is not a valid regular expression (for example, it might contain unmatched parentheses) or when some other error occurs during compilation or matching. It is never an error if a string contains no match for a pattern. The error instance has the following additional attributes:

    msg

        The unformatted error message.

    pattern

        The regular expression pattern.

    pos

        The index in pattern where compilation failed (may be None).

    lineno

        The line corresponding to pos (may be None).

    colno

        The column corresponding to pos (may be None).

Changed in version 3.5: Added additional attributes.
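A minimal sketch of catching re.error and reading its attributes. The invalid pattern is made up, and the exact wording of msg is not relied on here because it may differ between Python versions.

```python
import re

caught = None
try:
    re.compile(r"(unclosed")  # missing closing parenthesis
except re.error as exc:
    caught = exc

# The attributes describe what failed and where in the pattern.
failed_pattern = caught.pattern
failed_pos = caught.pos
summary = "{} (line {}, column {})".format(caught.msg, caught.lineno, caught.colno)
```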

Regular expression objects

Compiled regular expression objects support the following methods and attributes:

  • Pattern.search(string[, pos[, endpos]])

Scan through string looking for the first location where this regular expression produces a match, and return a corresponding match object. Return None if no position in the string matches the pattern; note that this is different from finding a zero-length match at some point in the string.
The optional second parameter pos gives the index in the string where the search is to start; it defaults to 0. This is not completely equivalent to slicing the string: the '^' pattern character matches at the real beginning of the string and at positions just after a newline, but not necessarily at the index where the search starts.
The optional parameter endpos limits how far the string will be searched; it is as if the string were endpos characters long, so only the characters from pos to endpos - 1 are searched for a match. If endpos is less than pos, no match will be found; otherwise, if rx is a compiled regular expression object, rx.search(string, 0, 50) is equivalent to rx.search(string[:50], 0).
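
A minimal sketch of how pos differs from slicing, and of how endpos truncates the search; the patterns and sample strings are chosen only for illustration.

```python
import re

rx = re.compile(r"^b")
s = "ab"

# '^' still refers to the real beginning of s, so starting at index 1 fails.
with_pos = rx.search(s, 1)

# A slice is a new string with its own beginning, so '^' matches there.
with_slice = rx.search(s[1:])

# endpos behaves as if the string ended at that index.
c_rx = re.compile(r"c")
limited = c_rx.search("abc", 0, 2)    # only "ab" is searched
unlimited = c_rx.search("abc")

print(with_pos, with_slice, limited, unlimited)
```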

    >>> pattern = re.compile("d")
    >>> pattern.search("dog")     # Match at index 0
    <re.Match object; span=(0, 1), match='d'>
    >>> pattern.search("dog", 1)  # No match; search doesn't include the "d"

  • Pattern.match(string[, pos[, endpos]])

If zero or more characters at the beginning of string match this regular expression, return a corresponding match object. Return None if the string does not match the pattern; note that this is different from a zero-length match.

The optional pos and endpos parameters have the same meaning as for the search() method.

    >>> pattern = re.compile("o")
    >>> pattern.match("dog")      # No match as "o" is not at the start of "dog".
    >>> pattern.match("dog", 1)   # Match as "o" is the 2nd character of "dog".
    <re.Match object; span=(1, 2), match='o'>
If you want to locate a match anywhere in string, use search() instead (see also search() vs. match()).

  • Pattern.fullmatch(string[, pos[, endpos]])

If the whole string matches this regular expression, return a corresponding match object. Return None if the string does not match the pattern; note that this is different from a zero-length match.

The optional pos and endpos parameters have the same meaning as for the search() method.

    >>> pattern = re.compile("o[gh]")
    >>> pattern.fullmatch("dog")      # No match as "o" is not at the start of "dog".
    >>> pattern.fullmatch("ogre")     # No match as not the full string matches.
    >>> pattern.fullmatch("doggie", 1, 3)   # Matches within given limits.
    <re.Match object; span=(1, 3), match='og'>

New in version 3.4.

  • Pattern.split(string, maxsplit=0)

Identical to the split() function, using the compiled pattern.

  • Pattern.findall(string[, pos[, endpos]])

Similar to the findall() function, using the compiled pattern, but also accepts optional pos and endpos parameters that limit the search region, as for search().

  • Pattern.finditer(string[, pos[, endpos]])

Similar to the finditer() function, using the compiled pattern, but also accepts optional pos and endpos parameters that limit the search region, as for search().
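
As an illustrative sketch (the pattern and sample text are assumptions), pos and endpos restrict which part of the string is scanned, while the reported spans remain indices into the original string:

```python
import re

word = re.compile(r"\w+")
text = "alpha beta gamma delta"

all_words = word.findall(text)

# Scan only text[6:16], which covers "beta gamma".
middle = word.findall(text, 6, 16)

# finditer() yields match objects; spans are relative to the full string.
spans = [m.span() for m in word.finditer(text, 6, 16)]

print(all_words, middle, spans)
```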

  • Pattern.sub(repl, string, count=0)

Identical to the sub() function, using the compiled pattern.

  • Pattern.subn(repl, string, count=0)

Identical to the subn() function, using the compiled pattern.

  • Pattern.flags

The regex matching flags. This is a combination of the flags given to compile(), any (?...) inline flags in the pattern, and implicit flags such as UNICODE if the pattern is a Unicode string.

  • Pattern.groups

The number of capturing groups in the pattern.

  • Pattern.groupindex

A dictionary mapping any symbolic group names defined by (?P<id>) to group numbers. The dictionary is empty if no symbolic groups were used in the pattern.

  • Pattern.pattern

The pattern string from which the pattern object was compiled.
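
A short sketch of reading these attributes from a compiled pattern; the date pattern is only an illustrative assumption.

```python
import re

date = re.compile(r"(?P<year>\d{4})-(?P<month>\d{2})-(\d{2})", re.IGNORECASE)

n_groups = date.groups               # counts named and unnamed capturing groups
names = dict(date.groupindex)        # maps group names to group numbers
source = date.pattern                # the original pattern string

# flags combines explicit flags with implicit ones such as UNICODE.
has_ignorecase = bool(date.flags & re.IGNORECASE)
has_unicode = bool(date.flags & re.UNICODE)

print(n_groups, names, source, has_ignorecase, has_unicode)
```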

Changed in version 3.7: Added support for copy.copy() and copy.deepcopy(). Compiled regular expression objects are considered atomic.

Match objects

Match objects always have a boolean value of True. Since match() and search() return None when there is no match, you can test whether there was a match with a simple if statement:

match = re.search(pattern, string)
if match:
    process(match)

Match objects support the following methods and attributes:

  • Match.expand(template)

Return the string obtained by doing backslash substitution on the template string template, as done by the sub() method. Escapes such as \n are converted to the appropriate characters, and numeric backreferences (\1, \2) and named backreferences (\g<1>, \g<name>) are replaced by the contents of the corresponding group.

Changed in version 3.5: Unmatched groups are replaced with an empty string.
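
A minimal sketch of expand() with numeric and named backreferences; the group names first and last are assumptions chosen for the example.

```python
import re

m = re.match(r"(?P<first>\w+) (?P<last>\w+)", "Isaac Newton")

by_number = m.expand(r"\2, \1")                 # numeric backreferences
by_name = m.expand(r"\g<last>, \g<first>")      # named backreferences
with_escape = m.expand(r"\1\n\2")               # \n becomes a real newline

print(by_number)
print(by_name)
print(repr(with_escape))
```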

  • Match.group([group1, …])

Returns one or more subgroups of the match. If there is a single argument, the result is a single string; if there are multiple arguments, the result is a tuple with one item per argument. Without arguments, group1 defaults to zero (the whole match is returned). If a groupN argument is zero, the corresponding return value is the entire matching string; if it is in the inclusive range [1..99], it is the string matching the corresponding parenthesized group. If a group number is negative or larger than the number of groups defined in the pattern, an IndexError exception is raised. If a group is contained in a part of the pattern that did not match, the corresponding result is None. If a group is contained in a part of the pattern that matched multiple times, the last match is returned.

    >>> m = re.match(r"(\w+) (\w+)", "Isaac Newton, physicist")
    >>> m.group(0)       # The entire match
    'Isaac Newton'
    >>> m.group(1)       # The first parenthesized subgroup.
    'Isaac'
    >>> m.group(2)       # The second parenthesized subgroup.
    'Newton'
    >>> m.group(1, 2)    # Multiple arguments give us a tuple.
    ('Isaac', 'Newton')

If the regular expression uses the (?P<name>...) syntax, the groupN arguments may also be strings identifying groups by their group name. If a string argument is not used as a group name in the pattern, an IndexError exception is raised.

A moderately complicated example:

    >>> m = re.match(r"(?P<first_name>\w+) (?P<last_name>\w+)", "Malcolm Reynolds")
    >>> m.group('first_name')
    'Malcolm'
    >>> m.group('last_name')
    'Reynolds'

Named groups can also be referred to by their index:

    >>> m.group(1)
    'Malcolm'
    >>> m.group(2)
    'Reynolds'

If a group matches multiple times, only the last match is accessible:

    >>> m = re.match(r"(..)+", "a1b2c3")  # Matches 3 times.
    >>> m.group(1)                        # Returns only the last match.
    'c3'

  • Match.__getitem__(g)

This is identical to m.group(g). It allows easier access to an individual group from a match:

    >>> m = re.match(r"(\w+) (\w+)", "Isaac Newton, physicist")
    >>> m[0]       # The entire match
    'Isaac Newton'
    >>> m[1]       # The first parenthesized subgroup.
    'Isaac'
    >>> m[2]       # The second parenthesized subgroup.
    'Newton'

New in version 3.6.

  • Match.groups(default=None)

Return a tuple containing all the subgroups of the match, from 1 up to however many groups are in the pattern. The default argument is used for groups that did not participate in the match; it defaults to None.

For example:

    >>> m = re.match(r"(\d+)\.(\d+)", "24.1632")
    >>> m.groups()
    ('24', '1632')

If we make the decimal point and everything after it optional, not all groups might participate in the match. These groups default to None unless the default argument is given:

    >>> m = re.match(r"(\d+)\.?(\d+)?", "24")
    >>> m.groups()      # Second group defaults to None.
    ('24', None)
    >>> m.groups('0')   # Now, the second group defaults to '0'.
    ('24', '0')

  • Match.groupdict(default=None)

Return a dictionary containing all the named subgroups of the match, keyed by the subgroup name. The default argument is used for groups that did not participate in the match; it defaults to None. For example:

    >>> m = re.match(r"(?P<first_name>\w+) (?P<last_name>\w+)", "Malcolm Reynolds")
    >>> m.groupdict()
    {'first_name': 'Malcolm', 'last_name': 'Reynolds'}

  • Match.start([group])
  • Match.end([group])

Return the indices of the start and end of the substring matched by group; group defaults to zero (meaning the whole matched substring). Return -1 if group exists but did not contribute to the match. For a match object m, and a group g that did contribute to the match, the substring matched by group g (equivalent to m.group(g)) is

    m.string[m.start(g):m.end(g)]

Note that m.start(group) will equal m.end(group) if group matched a null string. For example, after m = re.search('b(c?)', 'cba'), m.start(0) is 1, m.end(0) is 2, m.start(1) and m.end(1) are both 2, and m.start(2) raises an IndexError exception.
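
The case described above can be checked directly. This is a small sketch and the variable names are illustrative:

```python
import re

m = re.search(r"b(c?)", "cba")

whole = (m.start(0), m.end(0))         # the matched "b"
empty_group = (m.start(1), m.end(1))   # group 1 matched an empty string

# The slice identity holds for any group that contributed to the match.
same = m.string[m.start(1):m.end(1)] == m.group(1)

try:
    m.start(2)                         # the pattern has no group 2
    raised = False
except IndexError:
    raised = True

print(whole, empty_group, same, raised)
```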

An example that removes remove_this from an email address:

    >>> email = "tony@tiremove_thisger.net"
    >>> m = re.search("remove_this", email)
    >>> email[:m.start()] + email[m.end():]
    'tony@tiger.net'

  • Match.span([group])

For a match m, return the 2-tuple (m.start(group), m.end(group)). Note that if group did not contribute to the match, this is (-1, -1). group defaults to zero, the entire match.
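
A brief sketch of span() for the whole match, for a group that matched, and for an optional group that did not contribute; the pattern is an assumption for illustration.

```python
import re

# Group 2 (the fractional part) is optional and is absent in this input.
m = re.match(r"(\d+)(?:\.(\d+))?", "24 apples")

whole = m.span()            # defaults to group 0
integer_part = m.span(1)
fraction_part = m.span(2)   # group exists but did not contribute

print(whole, integer_part, fraction_part)
```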

  • Match.pos

The value of pos which was passed to the search() or match() method of a regex object. This is the index into the string at which the RE engine started looking for a match.

  • Match.endpos

The value of endpos which was passed to the search() or match() method of a regex object. This is the index into the string beyond which the RE engine will not go.

  • Match.lastindex

The integer index of the last matched capturing group, or None if no group was matched at all. For example, the expressions (a)b, ((a)(b)), and ((ab)) will have lastindex == 1 if applied to the string 'ab', while the expression (a)(b) will have lastindex == 2 if applied to the same string.

  • Match.lastgroup

The name of the last matched capturing group, or None if the group did not have a name, or if no group was matched at all.
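
The lastindex cases listed above, plus a lastgroup lookup, can be sketched as follows; the num and word group names are hypothetical.

```python
import re

s = "ab"
patterns = [r"(a)b", r"((a)(b))", r"((ab))", r"(a)(b)"]
last_indexes = [re.match(p, s).lastindex for p in patterns]

# No capturing group at all: lastindex is None.
no_groups = re.match(r"ab", s).lastindex

# lastgroup reports the name of the last matched group, if it has one.
alt = re.compile(r"(?P<num>\d+)|(?P<word>[a-z]+)")
kind = alt.match("abc").lastgroup
unnamed = re.match(r"(a)b", s).lastgroup

print(last_indexes, no_groups, kind, unnamed)
```

Dispatching on lastgroup in this way is the same technique the tokenizer example later in this article relies on.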

  • Match.re

The regular expression object whose match() or search() method produced this match instance.

  • Match.string

The string passed to match() or search().

Changed in version 3.7: Added support for copy.copy() and copy.deepcopy(). Match objects are considered atomic.

Regular expression examples

Checking for a pair

In this example, we use the following helper function to display match objects a little more gracefully:

def displaymatch(match):
    if match is None:
        return None
    return '<Match: %r, groups=%r>' % (match.group(), match.groups())

Suppose you are writing a poker program where a player's hand is represented as a 5-character string, with each character representing a card: "a" for ace, "k" for king, "q" for queen, "j" for jack, "t" for 10, and "2" through "9" for the card with that value.

To see if a given string is a valid hand, one could do the following:

>>> valid = re.compile(r"^[a2-9tjqk]{5}$")
>>> displaymatch(valid.match("akt5q"))  # Valid.
"<Match: 'akt5q', groups=()>"
>>> displaymatch(valid.match("akt5e"))  # Invalid.
>>> displaymatch(valid.match("akt"))    # Invalid.
>>> displaymatch(valid.match("727ak"))  # Valid.
"<Match: '727ak', groups=()>"

That last hand, "727ak", contained a pair, or two cards of the same value. To match this with a regular expression, one could use a backreference as follows:

>>> pair = re.compile(r".*(.).*\1")
>>> displaymatch(pair.match("717ak"))     # Pair of 7s.
"<Match: '717', groups=('7',)>"
>>> displaymatch(pair.match("718ak"))     # No pairs.
>>> displaymatch(pair.match("354aa"))     # Pair of aces.
"<Match: '354aa', groups=('a',)>"

To find out which card the pair consists of, one could use the group() method of the match object as follows:

>>> pair.match("717ak").group(1)
'7'

# Error because re.match() returns None, which doesn't have a group() method:
>>> pair.match("718ak").group(1)
Traceback (most recent call last):
  File "<pyshell#23>", line 1, in <module>
    re.match(r".*(.).*", "718ak").group(1)
AttributeError: 'NoneType' object has no attribute 'group'

>>> pair.match("354aa").group(1)
'a'

Simulating scanf()

Python does not currently have an equivalent to the C function scanf(). Regular expressions are generally more powerful than scanf() format strings, though also more verbose. The table below offers some more-or-less equivalent mappings between scanf() format tokens and regular expressions.

[Table: scanf() format tokens and their roughly equivalent regular expressions]

To extract the filename and numbers from a string like

/usr/sbin/sendmail - 0 errors, 4 warnings

you would use a scanf() format like

%s - %d errors, %d warnings

The equivalent regular expression would be

(\S+) - (\d+) errors, (\d+) warnings
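
A minimal sketch of applying that regular expression in Python. Unlike scanf(), the captured groups are strings, so the numeric fields are converted explicitly. The variable names are assumptions for illustration.

```python
import re

line = "/usr/sbin/sendmail - 0 errors, 4 warnings"
m = re.match(r"(\S+) - (\d+) errors, (\d+) warnings", line)

if m:
    path = m.group(1)
    # Groups are always strings; convert the %d-style fields by hand.
    n_errors = int(m.group(2))
    n_warnings = int(m.group(3))
    print(path, n_errors, n_warnings)
```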

search() vs. match()

Python offers two different primitive operations based on regular expressions: re.match() checks for a match only at the beginning of the string, while re.search() checks for a match anywhere in the string (this is what Perl does by default).

For example:

>>> re.match("c", "abcdef")    # No match
>>> re.search("c", "abcdef")   # Match
<re.Match object; span=(2, 3), match='c'>

Regular expressions beginning with '^' can be used with search() to restrict the match to the beginning of the string:

>>> re.match("c", "abcdef")    # No match
>>> re.search("^c", "abcdef")  # No match
>>> re.search("^a", "abcdef")  # Match
<re.Match object; span=(0, 1), match='a'>

Note however that in MULTILINE mode match() only matches at the beginning of the string, whereas using search() with a regular expression beginning with '^' will match at the beginning of each line.

>>> re.match('X', 'A\nB\nX', re.MULTILINE)  # No match
>>> re.search('^X', 'A\nB\nX', re.MULTILINE)  # Match
<re.Match object; span=(4, 5), match='X'>

Making a phonebook

split() splits a string into a list delimited by the passed pattern. The method is invaluable for converting textual data into data structures that can be easily read and modified by Python, as demonstrated in the following example that creates a phonebook.

First, here is the input. Normally it would come from a file; here we use triple-quoted string syntax.

>>> text = """Ross McFluff: 834.345.1254 155 Elm Street
...
... Ronald Heathmore: 892.345.3428 436 Finley Avenue
... Frank Burger: 925.541.7625 662 South Dogwood Way
...
...
... Heather Albrecht: 548.326.4584 919 Park Place"""

The entries are separated by one or more newlines. Now we convert the string into a list, with each nonempty line having its own entry:

>>> entries = re.split("\n+", text)
>>> entries
['Ross McFluff: 834.345.1254 155 Elm Street',
'Ronald Heathmore: 892.345.3428 436 Finley Avenue',
'Frank Burger: 925.541.7625 662 South Dogwood Way',
'Heather Albrecht: 548.326.4584 919 Park Place']

Finally, split each entry into a list with first name, last name, telephone number, and address. We use the maxsplit parameter of split() because the address contains spaces, which are our splitting pattern:

>>> [re.split(":? ", entry, 3) for entry in entries]
[['Ross', 'McFluff', '834.345.1254', '155 Elm Street'],
['Ronald', 'Heathmore', '892.345.3428', '436 Finley Avenue'],
['Frank', 'Burger', '925.541.7625', '662 South Dogwood Way'],
['Heather', 'Albrecht', '548.326.4584', '919 Park Place']]

The :? pattern matches the colon after the last name, so that it does not occur in the result list. With a maxsplit of 4, we could separate the house number from the street name:

>>> [re.split(":? ", entry, 4) for entry in entries]
[['Ross', 'McFluff', '834.345.1254', '155', 'Elm Street'],
['Ronald', 'Heathmore', '892.345.3428', '436', 'Finley Avenue'],
['Frank', 'Burger', '925.541.7625', '662', 'South Dogwood Way'],
['Heather', 'Albrecht', '548.326.4584', '919', 'Park Place']]

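Text munging

sub() replaces every occurrence of a pattern with a string or the result of a function. This example demonstrates using sub() with a function to "munge" text, that is, to randomize the order of all the characters in each word of a sentence except for the first and last characters. Because the shuffle is random, the output differs from run to run.

>>> import random
>>> def repl(m):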
...     inner_word = list(m.group(2))
...     random.shuffle(inner_word)
...     return m.group(1) + "".join(inner_word) + m.group(3)
>>> text = "Professor Abdolmalek, please report your absences promptly."
>>> re.sub(r"(\w)(\w+)(\w)", repl, text)
'Poefsrosr Aealmlobdk, pslaee reorpt your abnseces plmrptoy.'
>>> re.sub(r"(\w)(\w+)(\w)", repl, text)
'Pofsroser Aodlambelk, plasee reoprt yuor asnebces potlmrpy.'

Finding all adverbs

findall() matches all occurrences of a pattern, not just the first one as search() does. For example, if a writer wanted to find all of the adverbs in some text, they might use findall() in the following manner:

>>> text = "He was carefully disguised but captured quickly by police."
>>> re.findall(r"\w+ly", text)
['carefully', 'quickly']

Finding all adverbs and their positions

If one wants more information about all matches of a pattern than the matched text, finditer() is useful, as it provides match objects instead of strings. Continuing with the previous example, if a writer wanted to find all of the adverbs and their positions in some text, they would use finditer() in the following manner:

>>> text = "He was carefully disguised but captured quickly by police."
>>> for m in re.finditer(r"\w+ly", text):
...     print('%02d-%02d: %s' % (m.start(), m.end(), m.group(0)))
07-16: carefully
40-47: quickly

Raw string notation

Raw string notation (r"text") keeps regular expressions sane. Without it, every backslash ('\') in a regular expression would have to be prefixed with another one to escape it. For example, the following two lines of code are functionally identical:

>>> re.match(r"\W(.)\1\W", " ff ")
<re.Match object; span=(0, 4), match=' ff '>
>>> re.match("\\W(.)\\1\\W", " ff ")
<re.Match object; span=(0, 4), match=' ff '>

When one wants to match a literal backslash, it must be escaped in the regular expression. With raw string notation, this means r"\\". Without raw string notation, one must use "\\\\", making the following lines of code functionally identical:

>>> re.match(r"\\", r"\\")
<re.Match object; span=(0, 1), match='\\'>
>>> re.match("\\\\", r"\\")
<re.Match object; span=(0, 1), match='\\'>

Writing a tokenizer

A tokenizer or scanner analyzes a string to categorize groups of characters. This is a useful first step in writing a compiler or interpreter.

The text categories are specified with regular expressions. The technique is to combine those into a single master regular expression and to loop over successive matches:

import collections
import re

Token = collections.namedtuple('Token', ['type', 'value', 'line', 'column'])

def tokenize(code):
    keywords = {'IF', 'THEN', 'ENDIF', 'FOR', 'NEXT', 'GOSUB', 'RETURN'}
    token_specification = [
        ('NUMBER',   r'\d+(\.\d*)?'),  # Integer or decimal number
        ('ASSIGN',   r':='),           # Assignment operator
        ('END',      r';'),            # Statement terminator
        ('ID',       r'[A-Za-z]+'),    # Identifiers
        ('OP',       r'[+\-*/]'),      # Arithmetic operators
        ('NEWLINE',  r'\n'),           # Line endings
        ('SKIP',     r'[ \t]+'),       # Skip over spaces and tabs
        ('MISMATCH', r'.'),            # Any other character
    ]
    tok_regex = '|'.join('(?P<%s>%s)' % pair for pair in token_specification)
    line_num = 1
    line_start = 0
    for mo in re.finditer(tok_regex, code):
        kind = mo.lastgroup
        value = mo.group()
        column = mo.start() - line_start
        if kind == 'NUMBER':
            value = float(value) if '.' in value else int(value)
        elif kind == 'ID' and value in keywords:
            kind = value
        elif kind == 'NEWLINE':
            line_start = mo.end()
            line_num += 1
            continue
        elif kind == 'SKIP':
            continue
        elif kind == 'MISMATCH':
            raise RuntimeError(f'{value!r} unexpected on line {line_num}')
        yield Token(kind, value, line_num, column)

statements = '''
    IF quantity THEN
        total := total + price * quantity;
        tax := price * 0.05;
    ENDIF;
'''

for token in tokenize(statements):
    print(token)

The tokenizer produces the following output:

Token(type='IF', value='IF', line=2, column=4)
Token(type='ID', value='quantity', line=2, column=7)
Token(type='THEN', value='THEN', line=2, column=16)
Token(type='ID', value='total', line=3, column=8)
Token(type='ASSIGN', value=':=', line=3, column=14)
Token(type='ID', value='total', line=3, column=17)
Token(type='OP', value='+', line=3, column=23)
Token(type='ID', value='price', line=3, column=25)
Token(type='OP', value='*', line=3, column=31)
Token(type='ID', value='quantity', line=3, column=33)
Token(type='END', value=';', line=3, column=41)
Token(type='ID', value='tax', line=4, column=8)
Token(type='ASSIGN', value=':=', line=4, column=12)
Token(type='ID', value='price', line=4, column=15)
Token(type='OP', value='*', line=4, column=21)
Token(type='NUMBER', value=0.05, line=4, column=23)
Token(type='END', value=';', line=4, column=27)
Token(type='ENDIF', value='ENDIF', line=5, column=4)
Token(type='END', value=';', line=5, column=9)

Frie09

Friedl, Jeffrey. Mastering Regular Expressions. 3rd ed., O'Reilly Media, 2009. The third edition of the book no longer covers Python at all, but the first edition covered writing good regular expression patterns in great detail.


  1. [0-9] ↩
  2. [ \t\n\r\f\v] ↩
  3. [a-zA-Z0-9_] ↩