Implement a simple compiler

Time: 2021-9-15

Preface

Compilers show up in all sorts of places, from webpack and babel to the internals of frameworks such as Vue; all of them rely on a compiler to some extent. So this time, let's learn the most basic implementation of a compiler.

Target

The goal this time is just one thing: convert LISP-like function calls into JavaScript-style function calls.

The function calling methods of the two languages are compared as follows:

Arithmetic     LISP                     JavaScript
2 + 2          (add 2 2)                add(2, 2)
4 - 2          (subtract 4 2)           subtract(4, 2)
2 + (4 - 2)    (add 2 (subtract 4 2))   add(2, subtract(4, 2))

The former can roughly be described as follows: parentheses denote a function call, and the arguments are separated by spaces.

Let’s assume that our source code is like this

(add 100 (substract 3 2))

that is, 100 + (3 - 2). Let's start.

Thinking

A typical compiler is divided into the following steps:

  1. Parsing - parse the source code text into a more abstract representation, usually an AST (abstract syntax tree).
  2. Transformation - modify the original AST according to specific needs.
  3. Code Generation - generate code from the AST.

Now let's go through them step by step.
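To preview how the pieces fit together, here is a minimal sketch of the whole pipeline; the wrapper name compile is purely illustrative, and each of the four functions is implemented below.

//A minimal sketch of the full pipeline; every function is implemented step by step below
function compile(code) {
    const tokens = tokenizer(code)   //Parsing: lexical analysis
    const ast = parser(tokens)       //Parsing: syntactic analysis
    const newAst = transform(ast)    //Transformation
    return generateCode(newAst)      //Code Generation
}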

Parsing

This step is mainly for source code analysis, which can be subdivided into two steps:

  1. Lexical Analysis - split the source code into many independent fragments at the level of individual words.
  2. Syntactic Analysis - connect the independent fragments into a tree structure to generate the AST.

Lexical Analysis

This step is usually handled by a tokenizer: pass in the source code, and it returns tokens, which I'll call a word array.

For our example, after splitting, it will be expressed as follows:

[
  { type: 'paren', value: '(' },
  { type: 'name', value: 'add' },
  { type: 'number', value: '100' },
  { type: 'paren', value: '(' },
  { type: 'name', value: 'substract' },
  { type: 'number', value: '3' },
  { type: 'number', value: '2' },
  { type: 'paren', value: ')' },
  { type: 'paren', value: ')' }
]

This step can be understood as splitting a sentence with individual words as the unit. As an analogy with natural language, take the sentence

I go to the restaurant for dinner

which can be split into

I, go to, the restaurant, for dinner

that is, a collection of words distinguished by part of speech.

Now that its purpose is clear, let's implement the code, as follows:

function tokenizer(input) {
    //Word array
    const tokens = []

    //Set a pointer
    let current = 0
    //Traverse the source code from 0
    while (current < input.length) {

        let char = input[current]

        //Skip if space
        const spaceRegExp = /\s/
        if (spaceRegExp.test(char)) {
            current++
            continue
        }

        //If it is a bracket, it is added to the result
        if (char === '(' || char === ')') {
            tokens.push({ type: 'paren', value: char })

            //The pointer points to the next bit to start the next cycle
            current++
            continue
        }

        //If it is a lowercase letter (only lowercase letters are supported for now),
        //accumulate consecutive letters and push the whole word into the result
        const letterRegExp = /[a-z]/
        if (letterRegExp.test(char)) {
            let value = ''

            while (letterRegExp.test(char)) {
                value += char
                char = input[++current]
            }

            tokens.push({ type: 'name', value })
            continue
        }

        //If it is a digit, accumulate consecutive digits and push the whole number into the result
        const numberRegExp = /[0-9]/
        if (numberRegExp.test(char)) {
            let value = ''

            while (numberRegExp.test(char)) {
                value += char
                char = input[++current]
            }

            tokens.push({ type: 'number', value })
            continue
        }

        throw new Error('lexical analysis failed: unsupported character type')
    }

    return tokens
}

The general logic is to maintain a pointer and keep scanning for complete words one by one according to the matching rules.
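For example, calling the tokenizer on our source code (a quick usage sketch) reproduces the token array shown earlier:

const tokens = tokenizer('(add 100 (substract 3 2))')
console.log(tokens)
// => [ { type: 'paren', value: '(' }, { type: 'name', value: 'add' }, ... ]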

Syntactic Analysis

The next step is syntax analysis, which associates the newly obtained words and generates a tree structure, that is, an abstract syntax tree.

It will look like this:

{
  "type": "Program",
  "body": [
    {
      "type": "CallExpression",
      "name": "add",
      "params": [
        {
          "type": "NumberLiteral",
          "value": 100
        },
        {
          "type": "CallExpression",
          "name": "substract",
          "params": [
            {
              "type": "NumberLiteral",
              "value": 3
            },
            {
              "type": "NumberLiteral",
              "value": 2
            }
          ]
        }
      ]
    }
  ]
}

This step is equivalent to connecting words into sentences, using a tree structure to express the relationships between them.

For example, functions have types, function names, and parameters. These can theoretically be customized according to the needs of each different language.

Now for the code itself. Because each child node may be of various types, recursion is clearly the most convenient approach. The specific code is as follows.

function parser(tokens) {

    //Set a pointer, starting from 0
    let current = 0

    //Recursion is easier to implement here
    function parse() {
        let token = tokens[current]

        //If it is a number, the numeric node is returned, and the pointer points to the next node
        if (token.type === 'number') {
            current++
            return {
                type: 'NumberLiteral',
                value: +token.value,
            }
        }

        //If left parenthesis
        if (token.type === 'paren' && token.value === '(') {

            //Generate a node of type call expression
            const node = {
                type: 'CallExpression',
                name: '',
                params: [],
            }

            //Point to the next token. Normally, it must be of type name
            token = tokens[++current]
            if (token.type !== 'name') {
                throw new Error('no function name provided')
            }

            node.name = token.value
            //Then point to the next token
            token = tokens[++current]

            //As long as it is not a right parenthesis, it is always added to the parameter
            while (!(token.type === 'paren' && token.value === ')')) {
                node.params.push(parse())
                //Update current pointer
                token = tokens[current]
            }

            //Skip closing bracket
            current++

            return node
        }

        throw new Error('unsupported token type')
    }

    const ast = {
        type: 'Program',
        body: [],
    }

    //Put all the nodes generated by the token into the body (if it is a multi line statement, there will be multiple objects)
    while (current < tokens.length) {
        ast.body.push(parse())
    }

    return ast
}

The general idea is that the parse function has a specific handler for each type of value. When a construct depends on other values (such as the arguments of a call, which may be anything), parse is called recursively, while a pointer is maintained to keep track of the current position.
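Chaining it with the tokenizer (a quick usage sketch) yields the AST shown above:

const ast = parser(tokenizer('(add 100 (substract 3 2))'))
console.log(JSON.stringify(ast, null, 2))
// => the Program / CallExpression tree shown above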

Transformation

In fact, if our only requirement were to generate code, we could jump straight from the AST above to the next step, Code Generation. In that case, though, we would have to generate the corresponding JavaScript code from our own custom tree structure. Since the two grammars really belong to different languages, it is better to first convert our AST into a more standard form.

Of course, there is already a widely used specification: estree. For example, the very common parser acorn follows this standard (webpack and rollup are currently built on it, and @babel/parser also references it). You can explore the generated results at https://astexplorer.net/.
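As an illustration, here is a hedged sketch of parsing the target JavaScript with acorn (assuming acorn has been installed from npm); the resulting AST is estree-compliant:

const acorn = require('acorn')

//Parse the JavaScript we eventually want to produce
const estreeAst = acorn.parse('add(100, substract(3, 2));', { ecmaVersion: 2020 })
console.log(JSON.stringify(estreeAst, null, 2))
//The shape (Program > ExpressionStatement > CallExpression ...) matches what we build below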

So next we consider turning our AST into one that conforms to the estree specification.

To do this, we need to traverse all the nodes, so we define a processing function for each node type, collected in what is called a visitor, roughly like this:

const visitor = {
   Program: function (node, parent) {
      // ...
   },
   NumberLiteral: function (node, parent) {
      // ...
   },
   // ...
}

The parameters each function receives can be chosen according to specific needs. Here, to capture the simplest relationships, we just pass in the current node and its parent node.

In addition, because our requirements are relatively simple, we don't need to care about when exactly a node is visited. If necessary, we could add enter and exit hooks for each node type, for example:

const visitor = {
   Program: {
      enter () {
         // ...
      },
      exit () {
         // ...
      },
   },
   // ...
}

We don’t need it here. Next, let’s implement the specific code.

function traverser(ast, visitor) {
    //Access a single node
    function traverseNode (node, parent) {
        //Execute the current access function
        const method = visitor[node.type]
        method && method(node, parent)

        //If there are child nodes, traverse the child nodes
        switch(node.type) {
            case 'Program':
                traverseArray(node.body, node)
                break
            case 'CallExpression':
                traverseArray(node.params, node)
                break
            case 'NumberLiteral':
                break
            default:
                throw new Error('unsupported node type')
        }
    }

    //Accessing array nodes
    function traverseArray(array, parent) {
        array.forEach(child => {
            traverseNode(child, parent)
        })
    }

    //Access ast root node
    traverseNode(ast, null)
}

Given an AST and a visitor, this function lets us visit every node; what we actually do at each node depends on our needs. We want to convert our syntax into one conforming to the estree standard, so we implement a conversion function on top of it.
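Before the conversion, here is a quick hypothetical check of the traverser: a visitor that simply logs every node type (the snippet is illustrative only).

//A quick sanity check: log each node type as it is visited
traverser(parser(tokenizer('(add 100 (substract 3 2))')), {
    Program(node) { console.log(node.type) },
    CallExpression(node) { console.log(node.type, node.name) },
    NumberLiteral(node) { console.log(node.type, node.value) },
})
// => Program, CallExpression add, NumberLiteral 100, CallExpression substract, ...

Now, the conversion function itself: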

function transform(ast) {
    const newAst = {
        type: 'Program',
        body: [],
    }

    //A property is set here to point to the new ast context
    ast._context = newAst.body
    traverser(ast, {
        NumberLiteral(node, parent) {
            //Convert to an estree-standard Literal node and push it into the parent node's context
            parent._context.push({
                type: 'Literal',
                value: node.value,
            })
        },
        CallExpression(node, parent) {
            //estree-standard call expression node structure
            let expression = {
                type: 'CallExpression',
                callee: {
                    type: 'Identifier',
                    name: node.name,
                },
                arguments: [],
            }

            //Set context to parameter array
            node._context = expression.arguments

            //If the parent node is not a CallExpression, wrap the expression in an ExpressionStatement, per the estree standard
            if (parent.type !== 'CallExpression') {
                expression = {
                    type: 'ExpressionStatement',
                    expression,
                }
            }

            //The current expression is placed in the parent node context
            parent._context.push(expression)
        },
    })

    return newAst
}

The internal implementation of this function depends entirely on our needs. Because we need to convert the format here, we create a new AST, check each node's type, and push the converted form of each node into the new tree.
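Chaining everything so far (a quick usage sketch):

const newAst = transform(parser(tokenizer('(add 100 (substract 3 2))')))
console.log(JSON.stringify(newAst, null, 2))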

After that, we get the following results:

{
  "type": "Program",
  "body": [
    {
      "type": "ExpressionStatement",
      "expression": {
        "type": "CallExpression",
        "callee": {
          "type": "Identifier",
          "name": "add"
        },
        "arguments": [
          {
            "type": "Literal",
            "value": 100
          },
          {
            "type": "CallExpression",
            "callee": {
              "type": "Identifier",
              "name": "substract"
            },
            "arguments": [
              {
                "type": "Literal",
                "value": 3
              },
              {
                "type": "Literal",
                "value": 2
              }
            ]
          }
        ]
      }
    }
  ]
}

Now the tree structure conforms to the estree standard, so we can use other third-party libraries to operate on the tree. For example, escodegen can generate code from an estree AST. Running it on the tree above produces:

add(100, substract(3, 2));
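A minimal sketch of that step, assuming escodegen has been installed from npm:

const escodegen = require('escodegen')

//Generate JavaScript source from the estree-compliant AST built above
console.log(escodegen.generate(newAst))
// => add(100, substract(3, 2));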

Code Generation

Although this step can be handled by other libraries, as shown above, let's look at the principle of generating code from an AST ourselves. The general idea is actually very simple: check the type of the current node and recursively generate code for it and its child nodes.

function generateCode(node) {
    switch(node.type) {
        case 'Program':
            //Generate code for each node of the body, separated by a newline
            return node.body.map(generateCode).join('\n')
        case 'Literal':
            //Literal direct return value
            return node.value
        case 'ExpressionStatement':
            //Expression returns the code generated by the expression, ending with a semicolon
            return generateCode(node.expression) + ';'
        case 'CallExpression':    
            //If it is a function call, the code generated by the parameters is separated by commas
            return `${node.callee.name}(${node.arguments.map(generateCode).join(', ')})`
        default:
            throw new Error('unsupported node type')
    }
}

Putting all of the above steps together, it looks roughly like this:

const code = '(add 100 (substract 3 2))'
const tokens = tokenizer(code)
const ast = parser(tokens)
const newAst = transform(ast)
const result = generateCode(newAst)
console.log(result)
// => add(100, substract(3, 2));

And with that, it's done.

Summary

This time we introduced the general working principle of a compiler. It may look different in different scenarios, but the core idea is always these steps. When facing real-world situations, build on them and improve them!

This article draws heavily on the the-super-tiny-compiler project, which is highly recommended for study.
