Python's creator writes: why was the pgen parser created?


Translator's foreword: Recently, Python's creator started a blog on Medium and published an article on the PEG parser (see my translation). As far as I know, he already has a blog of his own, so why write on Medium? Curious, I opened his old blog.

The last article there was written in May 2018 and, as it happens, it is about the pgen parser, the very parser he criticizes in the new post, where he says pgen will be replaced. In this old article, Guido recalls some of his considerations when he created pgen. Creating a new parser was undoubtedly a wise move at the time, but times have changed and he now has a better choice.

Not long ago, we talked about the plan to remove the GIL from Python, the "operation" planned for the built-in batteries, and the evolution of print. Now its parser is going to be transformed as well. Python is almost 30 years old, and staying this vital is no small feat. Let's wish it well and hope for a better future.

This article is original and first appeared on the official account [Python cat]; do not reprint without authorization.

Original address:

Original title | The origins of pgen

Author | Guido van Rossum (father of Python)

Translator | Tofu pudding cat (Python cat official account)

original text |

Statement | This translation is for the purpose of communication and learning. You are welcome to reprint it, but please keep the source of this article and do not use it for commercial or illegal purposes.

David Beazley's talk at US PyCon 2018, about parser generators, reminded me that I should write about pgen's history. This is a short brain dump (maybe I'll expand on it later).

(Translator's note: I'll venture a guess about "brain dump". It presumably means turning personal memories and details of Python's history into words, a way of storing and fixing them so they can be passed on. My job as translator is to enrich this record and bring it to more Python fans.)

There are actually two pgens: the original, written in C, and a rewrite in Python, which lives under lib2to3/pgen2.

I wrote both. The first one was actually the first code I wrote for Python, although technically I had to write the lexer first (pgen and Python share the lexer, but pgen has no use for most tokens).

The reason I wrote my own parser generator was that the choice of parser generators (that I was familiar with) was pretty sparse at the time. Basically, you either used Yacc (there was a GNU rewrite called Bison, but I'm not sure I knew about it then) or wrote a parser by hand (which is what most people did).

I had used Yacc in college and knew how it worked from the "Dragon book", but for some reason I didn't like it much; IIRC, I found it hard to reason about the limitations of LALR(1) grammars.

(Translator's note: 1. The "Dragon book" is Compilers: Principles, Techniques, and Tools, a landmark textbook in the field of compiler construction. Two other classics, known as the "Tiger book" and the "Whale book", are often mentioned alongside it. 2. IIRC stands for "if I remember correctly".)

I was also familiar with LL(1) parsers and had written some serious recursive-descent LL(1) parsers by hand, which I liked fine. I also knew LL(1) parser generation techniques (again from the Dragon book), and I had one improvement I wanted to try out: using regular expressions (sort of) instead of the standard BNF format.

The Dragon book had also taught me how to convert regular expressions into DFAs, so I combined all these things and pgen was born. [Update: see below for a slightly different version of this reason.]

I either was not familiar with more advanced techniques or thought they were too inefficient. (I think that was true of most people working on parsers at the time.)

As for the lexer, I decided not to use a generator. I rated Lex much lower than Yacc, because the version of Lex I was familiar with would segfault (really!) when I tried to scan a token longer than 255 bytes. I also figured that the indentation format would be hard to teach to a lexer generator.

(Translator's note: 1. A "generator" here is not a generator in the Python sense, but a tool that generates an analyzer. Lex is a tool that generates lexical analyzers; Yacc, short for "Yet Another Compiler-Compiler", generates parsers. 2. "Segfault" is short for segmentation fault, an error caused by an out-of-bounds memory access.)
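To illustrate why indentation is awkward for a purely pattern-based lexer, here is a minimal sketch of the stack bookkeeping a hand-written lexer can do. It is not CPython's tokenizer; the function name `indent_tokens` and the tuple format are assumptions made for this example, and it only handles space-indented, non-blank lines. The idea is similar in spirit to the INDENT and DEDENT tokens that Python's tokenizer emits.

```python
def indent_tokens(lines):
    """Yield ('INDENT',), ('DEDENT',) and ('LINE', text) tuples.

    Hypothetical helper: assumes indentation uses spaces only and
    ignores blank lines.
    """
    stack = [0]  # indentation widths of the blocks that are currently open
    for line in lines:
        text = line.strip()
        if not text:
            continue
        width = len(line) - len(line.lstrip(" "))
        if width > stack[-1]:
            # Deeper than the current block: open a new one.
            stack.append(width)
            yield ("INDENT",)
        while width < stack[-1]:
            # Shallower: close blocks until we are back at an outer level.
            stack.pop()
            yield ("DEDENT",)
        if width != stack[-1]:
            raise IndentationError("dedent does not match any outer level")
        yield ("LINE", text)
    while len(stack) > 1:
        # Close any blocks still open at the end of the input.
        stack.pop()
        yield ("DEDENT",)
```

The state carried in `stack` across lines is exactly what a regular-expression-driven lexer generator has no natural place for, which is one way to read the remark above.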

The story of pgen2 is quite different.

I was employed by a startup in San Mateo (Elemental Security, which closed down in 2007, after I had left and joined Google), where my task was to design a custom language (its goal was to make security judgments about system configurations), and I had considerable autonomy.

I decided to design something a little bit like Python, implemented in Python, and I decided to reuse pgen, but with a Python-based backend and lexer. So I rewrote the algorithms from pgen in Python and went on to build the rest.

Management felt it made sense to open-source the tool and quickly approved it. Some time later (I had probably moved to Google by then?), the tool also turned out to make sense for 2to3. (Because its input format is the same as the original pgen's, it was easy to generate a Python parser with it: I just fed the grammar file to the tool. :-)

Update: why pgen was created, and more stories

I don't remember exactly why, but I just peeked at… and I may have seen this as a new (to me) way to resolve conflicts without adding helper rules.

For example, what that page calls left factoring (replacing A -> X | X Y Z with A -> X B; B -> Y Z | <empty>) I would instead write as A -> X [Y Z].
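As a rough illustration of the two notations, the sketch below hand-codes one parser for each form. The function names and the tuple-based tree shape are assumptions made for this example and have nothing to do with pgen's actual output. Both versions decide with a single token of lookahead; the only difference is that the factored BNF form needs a helper rule B, which shows up as an extra node in the tree.

```python
def parse_factored(tokens):
    """BNF after left factoring:  A -> X B ;  B -> Y Z | <empty>."""
    def parse_b(pos):
        if pos < len(tokens) and tokens[pos] == "Y":   # one token of lookahead
            if tokens[pos + 1 : pos + 2] != ["Z"]:
                raise SyntaxError("expected Z after Y")
            return ("B", ["Y", "Z"]), pos + 2
        return ("B", []), pos                          # the <empty> alternative

    if tokens[:1] != ["X"]:
        raise SyntaxError("expected X")
    b_node, pos = parse_b(1)
    if pos != len(tokens):
        raise SyntaxError("unexpected trailing tokens")
    return ("A", ["X", b_node])


def parse_bracketed(tokens):
    """Bracket notation:  A -> X [Y Z], with no helper rule."""
    if tokens[:1] != ["X"]:
        raise SyntaxError("expected X")
    children, pos = ["X"], 1
    if tokens[1:2] == ["Y"]:                           # same one-token decision
        if tokens[2:3] != ["Z"]:
            raise SyntaxError("expected Z after Y")
        children += ["Y", "Z"]
        pos = 3
    if pos != len(tokens):
        raise SyntaxError("unexpected trailing tokens")
    return ("A", children)
```

Both functions accept exactly the inputs `["X"]` and `["X", "Y", "Z"]` (the token lists must be Python lists for the slice comparisons to work); only the shape of the resulting tree differs.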

If I remember correctly, the parsing engine (the syntax analysis function shown earlier on that page) still works with parsing tables derived from such rules through the "regular expression -> NFA -> DFA" conversion; I think a requirement of having no empty productions was needed here. (Translator's note: "empty productions" corresponds to the <empty> above, that is, the grammar must not contain productions that derive nothing.)
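The "regular expression -> NFA -> DFA" step can be sketched with the textbook subset construction. The NFA below is written out by hand for the single rule A -> X [Y Z]; the state numbering and all function names are assumptions for this example, and real pgen builds its automata from the grammar file rather than from a hard-coded table.

```python
from collections import deque

EPS = None  # label used for epsilon (empty) transitions

# Hand-built NFA for the rule  A -> X [Y Z].
# State 0 is the start state; state 4 is the only accepting state.
NFA = {
    0: [("X", 1)],
    1: [(EPS, 2), (EPS, 4)],   # either enter the optional part or skip it
    2: [("Y", 3)],
    3: [("Z", 4)],
    4: [],
}
NFA_START, NFA_FINAL = 0, 4


def eps_closure(states):
    """Return every NFA state reachable from `states` via epsilon moves only."""
    seen = set(states)
    todo = list(states)
    while todo:
        state = todo.pop()
        for label, target in NFA[state]:
            if label is EPS and target not in seen:
                seen.add(target)
                todo.append(target)
    return frozenset(seen)


def build_dfa():
    """Subset construction: each DFA state is a frozenset of NFA states."""
    start = eps_closure({NFA_START})
    dfa = {}
    queue = deque([start])
    while queue:
        current = queue.popleft()
        if current in dfa:
            continue
        moves = {}
        for state in current:
            for label, target in NFA[state]:
                if label is not EPS:
                    moves.setdefault(label, set()).add(target)
        dfa[current] = {label: eps_closure(t) for label, t in moves.items()}
        queue.extend(dfa[current].values())
    return start, dfa


def accepts(tokens):
    """Run the DFA over `tokens`; accept if we end in a set containing NFA_FINAL."""
    start, dfa = build_dfa()
    state = start
    for tok in tokens:
        if tok not in dfa[state]:
            return False
        state = dfa[state][tok]
    return NFA_FINAL in state
```

After the conversion, the optional part no longer needs an explicit empty alternative: the DFA state reached after X is simply both accepting and able to continue on Y.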

I also remember that the parse tree nodes produced by the parsing engine may have a variable number of children. For example, for the rule above, A -> X [Y Z], the node for A may have one child (X) or three (X Y Z). The code generator then needs a simple check to determine which alternative it has encountered. (This has proved to be a double-edged sword. Later, we added a "parse tree -> AST" step, driven by a separate generator, to simplify the bytecode generator.)
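The kind of check described here might look like the following sketch. `Node` and `compile_a` are hypothetical names invented for this example and are not CPython's actual data structures; the point is only that the consumer of the tree has to branch on the number of children.

```python
from dataclasses import dataclass, field


@dataclass
class Node:
    """A hypothetical parse tree node: a rule name plus its children."""
    type: str
    children: list = field(default_factory=list)


def compile_a(node):
    """Toy 'code generator' step for the rule  A -> X [Y Z]."""
    if node.type != "A":
        raise ValueError(f"expected an A node, got {node.type!r}")
    count = len(node.children)
    if count == 1:                 # only X was present
        return ["emit X"]
    if count == 3:                 # the optional [Y Z] part was present too
        return ["emit X", "emit Y", "emit Z"]
    raise ValueError(f"an A node with {count} children is impossible for this rule")
```

With many rules and many optional or repeated parts, checks like this pile up, which is one way to understand why a separate "parse tree -> AST" step was added later.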

So the reason I used regular expressions was probably to make the grammar easier to read: after applying the rewrites needed to resolve the conflicts, I found the grammar not all that readable (insert a Zen of Python reference here :-), while regular expressions matched better how I thought about the syntax of a typical language (minus the helper rules with strange names, plus [optional] parts and repeated* parts).

Regular expressions do not increase the power of LL(1), nor do they reduce it. And of course, by "regular expressions" I really mean EBNF. I'm not sure whether "EBNF" was a well-defined notation at the time; it may just have meant any extension of BNF.

Converting the EBNF to BNF and working from there would lead to an awkward problem with extra parse tree nodes, so I don't think it would have been an improvement.

If I had to do it all over again, I might choose a more powerful parsing engine, perhaps a version of LALR(1) (for example, yacc/bison). There are places in the grammar where something more powerful than LL(1) would be useful, for example keyword arguments.

In LL(1), the rule "arg: [NAME =] expr" is not valid, because NAME appears in the FIRST set of expr and the LL(1) algorithm cannot handle that.

If I remember correctly, LALR(1) can handle it. However, keyword arguments only appeared years after I wrote the first version of pgen, and I didn't want to redo the parser at that point.
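The conflict in the arg rule can be checked mechanically. The sketch below uses a hypothetical, heavily simplified FIRST set for expr (the real Python grammar is much larger) and a made-up helper name, `optional_conflict`. It only shows the test an LL(1) generator effectively applies to an optional part: if a token can start both the optional part and whatever follows it, one token of lookahead cannot decide between them.

```python
# Hypothetical, heavily simplified FIRST sets; not taken from the real grammar.
FIRST = {
    "expr": {"NAME", "NUMBER", "("},
}


def optional_conflict(optional_first, following_first):
    """Tokens on which an LL(1) parser cannot decide whether to enter an
    optional [...] part or to skip it."""
    return optional_first & following_first


# Rule:  arg: [NAME '='] expr
# The optional part starts with NAME; what follows it is expr.
conflict = optional_conflict({"NAME"}, FIRST["expr"])
print(sorted(conflict))
```

A non-empty result means the rule is not LL(1) as written. Here the overlap is NAME: on seeing a name, the parser cannot tell whether it is the keyword before "=" or the start of a plain expression.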

Update, March 2019: Python 3.8 will remove the C version of pgen and use the rewritten pgen2 version instead. See

(Translator's note: I feel I can add one more "update" on Guido's behalf. He is currently studying PEG parsers as a replacement for pgen. For details, see the new post by Python's creator on replacing the existing parser.)
