Parser series by the creator of Python, part 6: adding actions to a PEG grammar


Original title | Adding Actions to a PEG Grammar

Author | Guido van Rossum (creator of Python)

Translator | Pea cat

Notice | This translation is for the purpose of communication and learning, under the CC BY-NC-SA 4.0 license. For readability, the content has been slightly changed.

A grammar becomes much more useful if you can add (some) semantics to the grammar rules. In particular, for the Python parser I'm building, I need to control the AST node returned by each alternative, because the AST format is already specified.

[This is part 6 of my PEG series. For the rest, please refer to the overview of the series.]

Many grammar tools have a convention for adding actions to rules, usually a block of code within {curly braces}. More precisely, actions are attached to alternatives. The code in the action block is usually written in the same language as the rest of the compiler, such as C, with some facility added for referring to the items in the alternative. I didn't add this feature to Python's original pgen, but for this new project I want to use it.

Here is how we can do it for the simplified parser generator developed in this blog series.

In general, the syntax of an action is as follows:

rule: item item item { action 1 } | item item { action 2 }

Because this makes the grammar more verbose, parser generators typically allow a rule to be split across lines, for example:

rule: item item item { action 1 }
    | item item { action 2 }

This makes the grammar parser a bit more complicated, but readability matters more, so I'll support this form.

A perennial question is when to execute the action block. In Yacc/Bison, since there is no backtracking, the action block is executed as soon as the rule is recognized by the parser. Because each action runs exactly once, it is fine for actions to have global side effects (such as updating a symbol table or another compiler data structure).

In a PEG parser, with its unlimited backtracking, we have several options:

  • Delay all actions until all content has been parsed. This is not useful for my purposes because I want to construct an AST during parsing.
  • Execute the action whenever its alternative is recognized, but require the action code to be idempotent (that is, it has the same effect no matter how many times it is executed). This means an action may be executed even though its result is eventually discarded.
  • Cache the result of the action, so the action is executed only the first time its alternative is recognized at a given position.

I'm going to take the third option: since the packrat algorithm already caches parsing results, we can cache the results of the actions in the same way.
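To make the third option concrete, here is a minimal sketch of caching a rule's result, including the value produced by its action, per input position. The names `memoize`, `ToyParser`, and `action_calls` are invented for this illustration and are not taken from the series' code; the tiny two-alternative grammar is likewise only an assumption chosen to force one backtrack.

```python
import functools


def memoize(method):
    """Cache (result, end position) per (rule name, start position)."""
    @functools.wraps(method)
    def wrapper(self):
        key = (method.__name__, self.pos)
        if key not in self.memo:
            result = method(self)
            self.memo[key] = (result, self.pos)
        # Replay the cached outcome, including where the rule stopped.
        result, self.pos = self.memo[key]
        return result
    return wrapper


class ToyParser:
    """Hypothetical parser over a list of token strings."""

    def __init__(self, tokens):
        self.tokens = tokens
        self.pos = 0
        self.memo = {}
        self.action_calls = 0  # how often the action body really ran

    @memoize
    def number(self):
        # number: NUMBER { float(text) }
        if self.pos < len(self.tokens) and self.tokens[self.pos].isdigit():
            text = self.tokens[self.pos]
            self.pos += 1
            self.action_calls += 1  # the action runs here
            return float(text)
        return None

    def start(self):
        # start: number '+' number | number '-' number
        mark = self.pos
        for op in ("+", "-"):
            self.pos = mark  # backtrack before trying each alternative
            left = self.number()
            if (left is not None and self.pos < len(self.tokens)
                    and self.tokens[self.pos] == op):
                self.pos += 1
                right = self.number()
                if right is not None:
                    return left + right if op == "+" else left - right
        self.pos = mark
        return None


parser = ToyParser(["7", "-", "2"])
result = parser.start()
```

For the input `7 - 2` the first alternative fails at the operator and the parser backtracks, yet the action for the leading number is not executed again: the second attempt reuses the cached value, so `action_calls` ends up at 2 rather than 3.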

As for the contents of the {curly braces}, traditionally C code is used, with a convention of $-prefixed symbols to refer to the items of the recognized alternative (for example, $1 refers to the first item) and an assignment to $$ to indicate the result of the action.

In my opinion, this is too old-fashioned (I remember using assignment to the function name in ALGOL-60 to specify the return value), so I will do something more Pythonic: inside the braces you put a single expression, whose value is the value of the action, and references to items are simple names derived from the text of the item.

For example, this is a simple calculator that can be used for addition and subtraction:

start: expr NEWLINE { expr }
expr: expr '+' term { expr + term }
    | expr '-' term { expr - term }
    | term { term }
term: NUMBER { float(number.string) }

When we run this with the input 100+50-38-70, it recognizes each part and computes the answer as ((100+50)-38)-70, which of course is 42.

A small detail: in the action for term, the variable number holds a TokenInfo object, so the action must use its .string attribute to get the token in string form.
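The two points above, left-to-right evaluation and the need for `.string`, can be checked with Python's standard `tokenize` module. This is only a hand-written sketch: the helpers `tokens_of` and `evaluate` are hypothetical, and the loop stands in for the left-recursive `expr` rule rather than showing how the generated parser works.

```python
import io
import tokenize


def tokens_of(source):
    """Return the NUMBER and OP tokens (TokenInfo objects) of the input."""
    readline = io.StringIO(source).readline
    return [tok for tok in tokenize.generate_tokens(readline)
            if tok.type in (tokenize.NUMBER, tokenize.OP)]


def evaluate(source):
    """Evaluate NUMBER (('+' | '-') NUMBER)* from left to right."""
    toks = tokens_of(source)
    number = toks[0]
    # Mirrors the action of: term: NUMBER { float(number.string) }
    value = float(number.string)
    i = 1
    while i < len(toks):
        op, number = toks[i], toks[i + 1]
        term = float(number.string)
        if op.string == "+":
            value = value + term   # expr '+' term { expr + term }
        elif op.string == "-":
            value = value - term   # expr '-' term { expr - term }
        else:
            raise ValueError(f"unexpected operator {op.string!r}")
        i += 2
    return value


answer = evaluate("100+50-38-70\n")
print(answer)  # 42.0
```

Each `number` here is a `TokenInfo` named tuple, so passing it directly to `float()` would fail; its `.string` field carries the text of the token.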

What should we do when the same rule name appears more than once in an alternative? In that case the parser generator gives each occurrence a unique name by appending 1, 2, and so on to the second and later occurrences. For example:

factor: atom '**' atom { atom ** atom1 }
      | atom { atom }
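One way to picture this naming scheme is the sketch below. The function `variable_names` is hypothetical and not taken from the series' code. It assumes that names are lowercased (as `NUMBER` becomes `number` in the calculator grammar) and that quoted literals such as `'**'` get no variable at all, which the article does not specify.

```python
def variable_names(items):
    """Assign a variable name to each item of one alternative.

    The first occurrence of a name is used as is; later occurrences
    get 1, 2, ... appended. Quoted literals are given no name
    (an assumption made for this sketch).
    """
    counts = {}
    names = []
    for item in items:
        if item.startswith(("'", '"')):
            names.append(None)
            continue
        base = item.lower()
        seen = counts.get(base, 0)
        names.append(base if seen == 0 else f"{base}{seen}")
        counts[base] = seen + 1
    return names


print(variable_names(["atom", "'**'", "atom"]))  # ['atom', None, 'atom1']
```

With these names, the action `{ atom ** atom1 }` refers to the left and right operands respectively.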

The implementation of these features in the parser generator is fairly boring, so I invite you to check out the code and play with it yourself. Try this:

python3.8 -m story5.driver story5/calc.txt -g story5.calc.CalcParser

Visualization now supports moving back and forth using the left and right arrow keys!

License for this article and the example code: CC BY-NC-SA 4.0
