Lexical analyzer

Time: 2021-05-03

Wikipedia: in computer science, lexical analysis is the process of converting a sequence of characters into a sequence of tokens.
A program or function that performs lexical analysis is called a lexical analyzer (or lexer).

Given the following source code

add_result = 1 + 2

The results are as follows

NAME   `add_result` 0,  0
SYMBOL `=`          0, 11
INT    `1`          0, 13
SYMBOL `+`          0, 15
INT    `2`          0, 17

In tabular form:

Token type   Literal      Line number   Column number
NAME         add_result   0             0
SYMBOL       =            0             11
INT          1            0             13
SYMBOL       +            0             15
INT          2            0             17

We can easily implement a working lexical analyzer in Go.


Implementation of a lexical analyzer in Go

package main

import (
    "fmt"
    "os"
    "regexp"
    "unicode/utf8"
)

// Regular expressions and the corresponding token type names.
var exprs = []string{`\d+`, `[\p{L}\d_]+`, `[\+\-=]`}
var names = []string{"INT", "NAME", "SYMBOL"}

func main() {
    // Compile each expression with a `^` prefix so matches are anchored
    // to the start of the remaining input.
    rules := []*regexp.Regexp{}
    for i, expr := range exprs {
        rule := regexp.MustCompile("^" + expr)
        rules = append(rules, rule)
        fmt.Println(names[i], rule)
    }

    fmt.Println("--------------------------------")
    for row, code := range os.Args[1:] {
        position := 0 // byte offset into the line
        col := 0      // column number, counted in runes
        for {
            // Skip spaces and tabs.
            for position < len(code) && (code[position] == ' ' || code[position] == '\t') {
                position++
                col++
            }
            if position >= len(code) {
                break
            }
            // Try every rule in order; the first match wins.
            source := ""
            tokenType := -1
            for i, rule := range rules {
                source = rule.FindString(code[position:])
                if source != "" {
                    tokenType = i
                    break
                }
            }
            if tokenType >= 0 {
                fmt.Printf("%s\t`%s`\t%d\t%d\n", names[tokenType], source, row, col)
                position += len(source)
                col += utf8.RuneCountInString(source)
            } else {
                fmt.Printf("error in: %d, %d\n", row, col)
                break
            }
        }
    }
}

Running tests on the command line

➜ go run lexer.go "value = Pi + 100"
INT     ^\d+
NAME    ^[\p{L}\d_]+
SYMBOL  ^[\+\-=]
--------------------------------
NAME    `value` 0   0
SYMBOL  `=`     0   6
NAME    `Pi`    0   8
SYMBOL  `+`     0   11
INT     `100`   0   13
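
If the input contains a character that none of the rules recognize, the error branch is taken instead (again running the same lexer.go):

➜ go run lexer.go "a * b"
INT     ^\d+
NAME    ^[\p{L}\d_]+
SYMBOL  ^[\+\-=]
--------------------------------
NAME    `a`     0   0
error in: 0, 2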

Description of the Go code

Import the required packages:

package main

import (
    "fmt"
    "os"
    "regexp"
    "unicode/utf8"
)
  • fmt for printing output
  • regexp for regular expressions
  • unicode/utf8 for counting UTF-8 runes
  • os for reading the user's command-line input

Define the regular expressions and the token type names:

var exprs = []string{`\d+`, `[\p{L}\d_]+`, `[\+\-=]`}
var names = []string{"INT", "NAME", "SYMBOL"}

Create two string slices to hold the regular expressions and the corresponding token type names. The patterns are written as raw string literals (backquotes) so that backslashes such as \d are not interpreted as string escape sequences.

Initialize the token matching rules:

func main() {
    rules := []*regexp.Regexp{}
    for i, expr := range exprs {
        rule := regexp.MustCompile("^" + expr)
        rules = append(rules, rule)
        fmt.Println(names[i], rule)
    }

It is important to note that every regular expression is given a `^` prefix. The anchor guarantees that a match starts at the leftmost character of the remaining input, which prevents the matcher from "jumping ahead" to a later position.
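
To illustrate what the anchor changes, here is a small standalone check (not part of the lexer; it only uses the standard regexp package):

package main

import (
    "fmt"
    "regexp"
)

func main() {
    unanchored := regexp.MustCompile(`\d+`)
    anchored := regexp.MustCompile(`^\d+`)

    // Without the anchor, the match "jumps" over the leading letters.
    fmt.Printf("%q\n", unanchored.FindString("abc 42")) // "42"
    // With the anchor, nothing matches at the current position.
    fmt.Printf("%q\n", anchored.FindString("abc 42")) // ""
}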

Loop over the input and match tokens:

for row, code := range os.Args[1:] {
    position := 0 // byte offset into the line
    col := 0      // column number, counted in runes
    for {
        // Skip spaces and tabs.
        for position < len(code) && (code[position] == ' ' || code[position] == '\t') {
            position++
            col++
        }
        if position >= len(code) {
            break
        }
        // Try every rule in order; the first match wins.
        source := ""
        tokenType := -1
        for i, rule := range rules {
            source = rule.FindString(code[position:])
            if source != "" {
                tokenType = i
                break
            }
        }
        if tokenType >= 0 {
            fmt.Printf("%s\t`%s`\t%d\t%d\n", names[tokenType], source, row, col)
            position += len(source)
            col += utf8.RuneCountInString(source)
        } else {
            fmt.Printf("error in: %d, %d\n", row, col)
            break
        }
    }
}

By iterating over os.Args[1:], every command-line argument supplied by the user is treated as one line of code to be analyzed; row is the line number and code is the line's text.
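
For example, passing two arguments makes the lexer treat them as two separate lines (output shown without the rule listing that is printed first):

➜ go run lexer.go "a = 1" "b = 2"
NAME    `a`     0   0
SYMBOL  `=`     0   2
INT     `1`     0   4
NAME    `b`     1   0
SYMBOL  `=`     1   2
INT     `2`     1   4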

Skip (ignore) whitespace characters:

// Skip spaces and tabs.
for position < len(code) && (code[position] == ' ' || code[position] == '\t') {
    position++
    col++
}

Because the regular expressions are anchored to the leftmost character, the whitespace characters, which carry no meaning here, have to be skipped explicitly before matching.

Check whether the end of the line has been reached and the loop should stop:

if position >= len(code) {
    break
}

Iterate over the matching rules and try each one:

source := ""
tokenType := -1
for i, rule := range rules {
    source = rule.FindString(code[position:])
    if source != "" {
        tokenType = i
        break
    }
}

The configured rules are tried in order; on the first successful match, the rule's index is stored in tokenType and the matched text in source. If nothing matches, tokenType keeps its default value of -1.
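
Note that the order of the rules matters: the NAME pattern [\p{L}\d_]+ would also match a run of digits, which is why INT is listed before NAME. A quick standalone check:

package main

import (
    "fmt"
    "regexp"
)

func main() {
    name := regexp.MustCompile(`^[\p{L}\d_]+`)
    // The NAME pattern happily consumes pure digits,
    // so the INT rule must be tried before it.
    fmt.Println(name.FindString("100")) // prints: 100
}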

Decide what to do next based on the match result:

if tokenType >= 0 {
    fmt.Printf("%s\t`%s`\t%d\t%d\n", names[tokenType], source, row, col)
    position += len(source)
    col += utf8.RuneCountInString(source)
} else {
    fmt.Printf("error in: %d, %d\n", row, col)
    break
}

If tokenType is not -1, the token type name, literal, line, and column are printed, and position is advanced past the current token.
Note that the column of the next token is obtained by incrementing col with the UTF-8 rune count of the literal rather than its byte length; otherwise the column would be wrong whenever a token contains multi-byte UTF-8 characters.
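
The difference only shows up for tokens that contain multi-byte characters; for example (a standalone sketch with a hypothetical two-rune identifier):

package main

import (
    "fmt"
    "unicode/utf8"
)

func main() {
    source := "变量" // hypothetical identifier: 2 runes, 6 bytes in UTF-8
    fmt.Println(len(source))                    // 6 (byte length, used for position)
    fmt.Println(utf8.RuneCountInString(source)) // 2 (rune count, used for col)
}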

Python users can implement the same lexer just as easily.


Python lexical analyzer

import re
import sys


exprs = [r'\d+', r'\w+', r'[\+\-=]']
names = ['INT', 'NAME', 'SYMBOL']


def main():
    rules = []
    for i, expr in enumerate(exprs):
        rules.append(re.compile('^' + expr))
        print(names[i], rules[-1].pattern)

    print('-' * 32)
    for row, code in enumerate(sys.argv[1:]):
        position = 0
        while True:
            while position < len(code) and (code[position] == ' ' or code[position] == '\t'):
                position += 1
            if position >= len(code):
                break

            source = ''
            tokenType = -1
            for i, rule in enumerate(rules):
                result = rule.findall(code[position:])
                if len(result) > 0:
                    source = result[0]
                    tokenType = i
                    break
            if tokenType >= 0:
                print(f'{names[tokenType]}\t`{source}`\t{row}\t{position}')
                position += len(source)
            else:
                print(f'error in {row}, {position}')
                break


if __name__ == "__main__":
    main()
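
Running the Python version produces the same token stream (assuming the script is saved as lexer.py):

➜ python3 lexer.py "value = Pi + 100"
INT ^\d+
NAME ^\w+
SYMBOL ^[\+\-=]
--------------------------------
NAME    `value` 0   0
SYMBOL  `=`     0   6
NAME    `Pi`    0   8
SYMBOL  `+`     0   11
INT     `100`   0   13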

As a supplement, a C++ implementation is also provided here.


Implementation of a lexical analyzer in C++

#include <cstdio>
#include <locale>
#include <regex>
#include <string>
#include <vector>
#include <codecvt>


// Regular expressions and the corresponding token type names.
std::vector<std::wstring> exprs{L"\\d+", L"\\w+", L"[\\+\\-=]"};
std::vector<std::string> names{"INT", "NAME", "SYMBOL"};


int main(int argc, char *argv[]) {
    std::locale old;
    std::locale::global(std::locale("en_US.UTF-8"));
    std::wstring_convert<std::codecvt_utf8<wchar_t>> codecvt_utf8;

    std::vector<std::wregex> rules;
    for (size_t i = 0, count = exprs.size(); i < count; ++i) {
        rules.push_back(std::wregex(L"^" + exprs[i]));
        printf("%s ^%s\n", names[i].c_str(), codecvt_utf8.to_bytes(exprs[i]).c_str());
    }

    printf("--------------------------------\n");
    for (int row = 0; row < argc - 1; ++row) {
        std::wstring code = codecvt_utf8.from_bytes(argv[row + 1]);
        size_t position = 0;
        while (true) {
            // Skip spaces and tabs.
            while (position < code.size() && (code[position] == L' ' || code[position] == L'\t'))
                position += 1;
            if (position >= code.size())
                break;

            // Try every rule in order; the first match wins.
            auto subcode = code.substr(position);
            std::wsmatch match;
            int tokenType = -1;
            for (size_t i = 0, count = rules.size(); i < count; ++i) {
                if (std::regex_search(subcode, match, rules[i])) {
                    tokenType = static_cast<int>(i);
                    break;
                }
            }

            if (tokenType >= 0) {
                auto source = match.str(0);
                printf("%s\t`%s`\t%d\t%zu\n",
                    names[tokenType].c_str(), codecvt_utf8.to_bytes(source).c_str(), row, position);
                position += source.size();
            } else {
                printf("error in: %d, %zu\n", row, position);
                break;
            }
        }
    }

    std::locale::global(old);
    return 0;
}
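
A possible way to compile and run the C++ version (the file name, the compiler flags, and the availability of the en_US.UTF-8 locale are assumptions; codecvt_utf8 is deprecated since C++17 but still works with C++14):

➜ g++ -std=c++14 lexer.cpp -o lexer
➜ ./lexer "value = Pi + 100"

The output should match the token listing produced by the Go and Python versions.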

