A simple tokenizer
Tokenization is one of the most common tasks in Python string processing. Here we show how to build a simple expression tokenizer with regular expressions, scanning an expression string from left to right and turning it into a stream of tokens.
Given the following expression string:
text = 'foo = 12 + 5 * 6'
we want to convert it into the following sequence of (token type, value) pairs:
tokens = [('NAME', 'foo'), ('EQ', '='), ('NUM', '12'), ('PLUS', '+'),
          ('NUM', '5'), ('TIMES', '*'), ('NUM', '6')]
To do this, we first define a pattern for every possible token (a pattern is a string that describes a syntactic rule to match; here we use regular expressions as patterns). Note that whitespace must get its own pattern, otherwise scanning stops at the first character no pattern matches. Because we also want to attach a name such as NAME or EQ to each token, we use named capture groups in the regular expressions.
import re

NAME  = r'(?P<NAME>[a-zA-Z_][a-zA-Z_0-9]*)'  # (?P<name>...) is a named capture group:
                                             # ?P gives the group a name, () is a capture group
EQ    = r'(?P<EQ>=)'
NUM   = r'(?P<NUM>\d+)'    # \d matches a digit, + means one or more
PLUS  = r'(?P<PLUS>\+)'    # + must be escaped with \
TIMES = r'(?P<TIMES>\*)'   # * must be escaped with \
WS    = r'(?P<WS>\s+)'     # \s matches a whitespace character, + means one or more

master_pat = re.compile('|'.join([NAME, EQ, NUM, PLUS, TIMES, WS]))  # | selects between the patterns, meaning "or"
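As a quick sanity check (our own sketch, not part of the original recipe), each alternative in such a combined pattern reports which named group matched through the match object's lastgroup attribute:

```python
import re

# Two of the patterns from above, combined with "|" (a hypothetical
# mini-version of master_pat, named demo_pat to avoid clashing with it).
NAME = r'(?P<NAME>[a-zA-Z_][a-zA-Z_0-9]*)'
NUM  = r'(?P<NUM>\d+)'
demo_pat = re.compile('|'.join([NAME, NUM]))

m = demo_pat.match('foo')
print(m.lastgroup, m.group())  # NAME foo
m = demo_pat.match('12')
print(m.lastgroup, m.group())  # NUM 12
```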
Next, we call the pattern's scanner() method, which creates a scanner object:
scanner = master_pat.scanner(text)
Repeated calls to the scanner's match() method then return one match per call, matching one pattern at a time:
scanner = master_pat.scanner(text)
m = scanner.match()
print(m.lastgroup, m.group())  # NAME foo
m = scanner.match()
print(m.lastgroup, m.group())  # WS (a single space)
Calling match() by hand like this is tedious. Instead, we can drive the scanner with an iterator and wrap each result in a named tuple:
from collections import namedtuple

Token = namedtuple('Token', ['type', 'value'])

def generate_tokens(pat, text):
    scanner = pat.scanner(text)
    # iter(callable, sentinel): scanner.match is called repeatedly, and the
    # iteration stops as soon as it returns the sentinel value None
    for m in iter(scanner.match, None):
        yield Token(m.lastgroup, m.group())

for tok in generate_tokens(master_pat, 'foo = 12 + 5 * 6'):
    print(tok)
The token stream printed for the expression string "foo = 12 + 5 * 6" is:
Token(type='NAME', value='foo')
Token(type='WS', value=' ')
Token(type='EQ', value='=')
Token(type='WS', value=' ')
Token(type='NUM', value='12')
Token(type='WS', value=' ')
Token(type='PLUS', value='+')
Token(type='WS', value=' ')
Token(type='NUM', value='5')
Token(type='WS', value=' ')
Token(type='TIMES', value='*')
Token(type='WS', value=' ')
Token(type='NUM', value='6')
Filtering the token stream
Next, suppose we want to drop the whitespace tokens. A generator expression does the filtering:
tokens = (tok for tok in generate_tokens(master_pat, 'foo = 12 + 5 * 6')
          if tok.type != 'WS')
for tok in tokens:
    print(tok)
The whitespace tokens have been filtered out:
Token(type='NAME', value='foo')
Token(type='EQ', value='=')
Token(type='NUM', value='12')
Token(type='PLUS', value='+')
Token(type='NUM', value='5')
Token(type='TIMES', value='*')
Token(type='NUM', value='6')
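Putting the pieces together, here is a self-contained sketch of the whole recipe; the tokenize helper and its keep_ws flag are our own additions, not part of the original code:

```python
import re
from collections import namedtuple

Token = namedtuple('Token', ['type', 'value'])

NAME  = r'(?P<NAME>[a-zA-Z_][a-zA-Z_0-9]*)'
EQ    = r'(?P<EQ>=)'
NUM   = r'(?P<NUM>\d+)'
PLUS  = r'(?P<PLUS>\+)'
TIMES = r'(?P<TIMES>\*)'
WS    = r'(?P<WS>\s+)'
master_pat = re.compile('|'.join([NAME, EQ, NUM, PLUS, TIMES, WS]))

def tokenize(text, keep_ws=False):
    # Drive the scanner with iter(callable, sentinel) and, unless keep_ws
    # is set, drop the whitespace tokens on the way out.
    scanner = master_pat.scanner(text)
    for m in iter(scanner.match, None):
        if keep_ws or m.lastgroup != 'WS':
            yield Token(m.lastgroup, m.group())

print(list(tokenize('foo = 12 + 5 * 6')))
```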
Beware of the substring-matching trap
The order of the alternatives in the master pattern (i.e. in "|".join([NAME, EQ, NUM, PLUS, TIMES, WS])) also matters, because the re module tries the patterns in the order given. Therefore, if one pattern happens to be a substring of a longer pattern, the longer pattern must come first. The following shows a correct and an incorrect ordering:
LT = r'(?P<LT><)'
LE = r'(?P<LE><=)'
EQ = r'(?P<EQ>=)'

master_pat = re.compile('|'.join([LE, LT, EQ]))  # correct order: longest pattern first
master_pat = re.compile('|'.join([LT, LE, EQ]))  # wrong order
With the wrong order, the text '<=' is matched as an LT token ('<') followed by an EQ token ('='), instead of as a single LE token ('<=').
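The effect of the two orderings can be demonstrated directly (a small sketch of our own; good_pat and bad_pat are hypothetical names for the two compilations):

```python
import re

LT = r'(?P<LT><)'
LE = r'(?P<LE><=)'
EQ = r'(?P<EQ>=)'

good_pat = re.compile('|'.join([LE, LT, EQ]))  # longest pattern first
bad_pat  = re.compile('|'.join([LT, LE, EQ]))  # substring pattern first

m = good_pat.match('<=')
print(m.lastgroup, m.group())  # LE <=
m = bad_pat.match('<=')
print(m.lastgroup, m.group())  # LT <   (only '<' is consumed)
```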
We should also watch out for patterns that may match a substring of a longer token, as in the following example:
PRINT = r'(?P<PRINT>print)'
NAME  = r'(?P<NAME>[a-zA-Z_][a-zA-Z_0-9]*)'

master_pat = re.compile('|'.join([PRINT, NAME]))

for tok in generate_tokens(master_pat, 'printer'):
    print(tok)
Here the identifier 'printer' is wrongly split into two tokens:
# Token(type='PRINT', value='print')
# Token(type='NAME', value='er')
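One common remedy (our own suggestion, not from the recipe above) is to require a word boundary \b after the keyword, so it cannot match inside a longer identifier:

```python
import re

# fixed_pat is a hypothetical name; the \b forces 'print' to match only
# when it is not immediately followed by another identifier character.
PRINT = r'(?P<PRINT>print\b)'
NAME  = r'(?P<NAME>[a-zA-Z_][a-zA-Z_0-9]*)'
fixed_pat = re.compile('|'.join([PRINT, NAME]))

m = fixed_pat.match('printer')
print(m.lastgroup, m.group())  # NAME printer
m = fixed_pat.match('print')
print(m.lastgroup, m.group())  # PRINT print
```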
For more advanced tokenizing, packages such as PyParsing or PLY are recommended. Tokenizing natural-language English text is usually handled by NLP packages (generally in three steps: splitting on whitespace, handling prefixes and suffixes, and removing stop words). For Chinese word segmentation there are also rich tools, such as the jieba segmentation toolkit.
- Martelli A., Ravenscroft A., Ascher D. Python Cookbook. O'Reilly Media, Inc., 2015.