Python technique: implementing a simple tokenizer with the re module

Time: 2022-05-05

A simple tokenizer

Tokenization is one of the most common tasks in Python string processing. Here we will show how to build a simple expression tokenizer with regular expressions, which scans an expression string from left to right and turns it into a stream of tokens.

Given the following expression string:

text = 'foo = 12 + 5 * 6'

We want to convert it into the following sequence of (token type, value) pairs:

tokens = [('NAME', 'foo'), ('EQ', '='), ('NUM', '12'), ('PLUS', '+'),\
    ('NUM', '5'), ('TIMES', '*'), ('NUM', '6')]

To perform this tokenization, we first need to define the patterns for all possible tokens (a pattern here is a regular expression string describing one syntactic rule to match). Note that a whitespace pattern must be included, otherwise scanning stops as soon as it reaches a character that no pattern matches. Because we also want to attach names such as NAME and EQ to the tokens, we use named capture groups in the regular expressions.

import re

NAME  = r'(?P<NAME>[a-zA-Z_][a-zA-Z_0-9]*)'
# (?P<name>...) is a named capture group: () is an ordinary capture group,
# and ?P<name> attaches a name to it
EQ    = r'(?P<EQ>=)'
NUM   = r'(?P<NUM>\d+)'    # \d matches a digit, + means one or more
PLUS  = r'(?P<PLUS>\+)'    # + must be escaped with \
TIMES = r'(?P<TIMES>\*)'   # * must be escaped with \
WS    = r'(?P<WS>\s+)'     # \s matches whitespace, + means one or more

master_pat = re.compile('|'.join([NAME, EQ, NUM, PLUS, TIMES, WS]))
# | separates the alternative patterns, meaning "or"
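
As a quick check of the whitespace remark above, here is a minimal sketch (reusing the patterns just defined; no_ws_pat is an illustrative name) showing that scanning stops at the first character no pattern matches:

no_ws_pat = re.compile('|'.join([NAME, EQ, NUM, PLUS, TIMES]))  # WS left out
scanner = no_ws_pat.scanner('foo = 12')
print(scanner.match().group())  # foo
print(scanner.match())          # None -- stuck at the space, scanning stops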

Next, we use the scanner() method to perform the tokenization. This method creates a scanner object:

scanner = master_pat.scanner(text)

Then we can call its match() method repeatedly, obtaining one match (one pattern) at a time:

scanner = master_pat.scanner(text)
m = scanner.match() 
print(m.lastgroup, m.group()) # NAME foo
m = scanner.match()
print(m.lastgroup, m.group()) # WS

Of course, calling match() by hand like this is cumbersome. We can drive it with an iterator instead and wrap each result in a named tuple:

from collections import namedtuple

Token = namedtuple('Token', ['type', 'value'])

def generate_tokens(pat, text):
    scanner = pat.scanner(text)
    for m in iter(scanner.match, None):
        # scanner.match is the callable invoked on each iteration;
        # None is the sentinel: iteration stops once match() returns None
        yield Token(m.lastgroup, m.group())
    
for tok in generate_tokens(master_pat, "foo = 12 + 5 * 6"):
    print(tok)

The resulting token stream for the expression string "foo = 12 + 5 * 6" is:

Token(type='NAME', value='foo')
Token(type='WS', value=' ')
Token(type='EQ', value='=')
Token(type='WS', value=' ')
Token(type='NUM', value='12')
Token(type='WS', value=' ')
Token(type='PLUS', value='+')
Token(type='WS', value=' ')
Token(type='NUM', value='5')
Token(type='WS', value=' ')
Token(type='TIMES', value='*')
Token(type='WS', value=' ')
Token(type='NUM', value='6')

Filtering the token stream

Next, we want to filter out the whitespace tokens, which can be done with a generator expression:

tokens = (tok for tok in generate_tokens(master_pat, "foo = 12 + 5 * 6")
          if tok.type != 'WS')
for tok in tokens:
    print(tok)

You can see that the whitespace tokens have been successfully filtered out:

Token(type='NAME', value='foo')
Token(type='EQ', value='=')
Token(type='NUM', value='12')
Token(type='PLUS', value='+')
Token(type='NUM', value='5')
Token(type='TIMES', value='*')
Token(type='NUM', value='6')
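
If whitespace is never needed downstream, the filtering can also be folded into the generator itself. A minimal sketch (the ignore parameter is our own addition, not part of the original recipe):

def generate_tokens(pat, text, ignore=frozenset({'WS'})):
    scanner = pat.scanner(text)
    for m in iter(scanner.match, None):
        if m.lastgroup not in ignore:  # skip any token types listed in ignore
            yield Token(m.lastgroup, m.group())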

Beware of the substring matching trap

The order of the token patterns in the master regular expression (i.e., in "|".join([NAME, EQ, NUM, PLUS, TIMES, WS])) is also very important, because the re module tries the alternatives in the given order when matching. Therefore, if one pattern happens to be a substring of another, longer pattern, the longer pattern must come first. The correct and the wrong order are shown below:

LT = r'(?P<LT><)'
LE = r'(?P<LE><=)'
EQ = r'(?P<EQ>=)'

master_pat = re.compile('|'.join([LE, LT, EQ]))  # correct order
master_pat = re.compile('|'.join([LT, LE, EQ]))  # wrong order

The problem with the second order is that the text '<=' is then matched as LT ('<') followed by EQ ('='), instead of as a single LE ('<=').
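
A short sketch demonstrating the trap (good_pat and bad_pat are illustrative names, reusing the LT, LE, EQ patterns above), with finditer() listing the matches for '<=':

good_pat = re.compile('|'.join([LE, LT, EQ]))  # longer pattern LE first
bad_pat = re.compile('|'.join([LT, LE, EQ]))   # substring pattern LT first

print([(m.lastgroup, m.group()) for m in good_pat.finditer('<=')])
# [('LE', '<=')]
print([(m.lastgroup, m.group()) for m in bad_pat.finditer('<=')])
# [('LT', '<'), ('EQ', '=')]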

We should also watch out for patterns that might become substrings of what another pattern matches, as in the following example:

PRINT = r'(?P<PRINT>print)'
NAME = r'(?P<NAME>[a-zA-Z_][a-zA-Z_0-9]*)'

master_pat = re.compile('|'.join([PRINT, NAME]))  # PRINT placed before NAME

for tok in generate_tokens(master_pat, "printer"):
    print(tok)

As can be seen, print is in fact a substring (prefix) of the identifier that NAME should match, so the tokenization of 'printer' goes wrong:

# Token(type='PRINT', value='print')
# Token(type='NAME', value='er')
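
One common remedy, sketched here as our own suggestion rather than part of the original recipe, is to add a word boundary \b so that PRINT only matches the complete word print and longer identifiers fall through to NAME:

PRINT = r'(?P<PRINT>print\b)'  # \b: only match 'print' as a whole word
NAME = r'(?P<NAME>[a-zA-Z_][a-zA-Z_0-9]*)'
master_pat = re.compile('|'.join([PRINT, NAME]))

for tok in generate_tokens(master_pat, "printer"):
    print(tok)
# Token(type='NAME', value='printer')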

For more advanced tokenization, packages such as pyparsing or PLY are recommended. In particular, tokenizing English natural-language text is generally built into the various NLP packages (typically three steps: splitting on whitespace, handling prefixes and suffixes, and removing stop words). For Chinese natural language processing there are likewise rich tokenization tools (such as the jieba toolkit).

References

  • [1] Martelli A, Ravenscroft A, Ascher D. Python Cookbook. O'Reilly Media, Inc., 2015.