Explore Presto SQL Engine (1) – skillfully use ANTLR

Time:2021-12-1

1、 Background

Since big data was first written into the government work report in 2014, big data has been developed for 7 years. The type of big data also extends from transaction data to interactive data and sensing data. The data scale has also reached the Pb level.

The scale of big data is so large that the acquisition, storage, management and analysis of data go beyond the capabilities of traditional database software tools. In this context, various big data related tools have emerged to meet the needs of various business scenarios. From hive, spark, presto, kylin and Druid of Hadoop ecology to Clickhouse and elasticsearch of non Hadoop Ecology

These big data processing tools have different characteristics and application scenarios, but the external interfaces or operation languages are similar, that is, each component supports SQL language. Only based on different application scenarios and characteristics, their respective SQL dialects are implemented. This requires relevant open source projects to implement SQL parsing by themselves. In this context, the syntax parser generator ANTLR, born in 1989, ushered in the golden age.

2、 Introduction

ANTLR is an open source syntax parser generator, which has a history of more than 30 years. It is an open source project that has stood the test of time. A program from source code to machine executable basically needs three stages: writing, compiling and execution.

In the compilation stage, lexical and grammatical analysis is required. ANTLR focuses on the problem of lexical and syntactic analysis of the source code to produce a tree analyzer. ANTLR supports parsing of almost all mainstream programming languages. fromantlr/grammars-v4As you can see, ANTLR supports dozens of programming languages such as Java, C, Python and SQL. Generally, we do not need to extend programming languages, so in most cases, these language compilation support is more for learning and research, or used in various development tools (NetBeans, IntelliJ) to verify syntax correctness and format code.

For SQL language, ANTLR will be more widely and deeply used, because hive, presto, sparksql, etc. need to customize the execution of SQL, such as realizing distributed query engine, realizing unique features in various big data scenarios, etc.

3、 Implementation of four arithmetic operations based on antlr4

At present, we mainly use antlr4. In the defining antlr4 reference, various interesting application scenarios based on antlr4 are introduced. For example: implement a calculator that supports four operations; Parsing and extracting formatted text such as JSON;

Convert JSON to XML; Extract interfaces from Java source code, etc. This section takes the implementation of four arithmetic calculators as an example to introduce the simple application of antlr4, paving the way for the later implementation of SQL parsing based on antlr4. In fact, supporting digital operation is also the basic ability that each programming language must have.

3.1 self coding implementation

Without antlr4, what should we do if we want to implement four operations? One idea is based on stack implementation. For example, without considering exception handling, the four simple operation codes implemented by ourselves are as follows:

package org.example.calc;
 
import java.util.*;
 
public class CalcByHand {
    //Define operators and prioritize, * / has higher priority
    public static Set<String> opSet1 = new HashSet<>();
    public static Set<String> opSet2 = new HashSet<>();
    static{
        opSet1.add("+");
        opSet1.add("-");
        opSet2.add("*");
        opSet2.add("/");
    }
    public static void main(String[] args) {
        String exp="1+3*4";
        //Split the expression into tokens
        String[] tokens = exp.split("((?<=[\\+|\\-|\\*|\\/])|(?=[\\+|\\-|\\*|\\/]))");
 
        Stack<String> opStack = new Stack<>();
        Stack<String> numStack = new Stack<>();
        int proi=1;
        //Put on different stacks based on type
        for(String token: tokens){
            token = token.trim();
 
            if(opSet1.contains(token)){
                opStack.push(token);
                proi=1;
            }else if(opSet2.contains(token)){
                proi=2;
                opStack.push(token);
            }else{
                numStack.push(token);
                //If the operator before the operand is a high priority operator, the calculated result is put on the stack
                if(proi==2){
                    calcExp(opStack,numStack);
                }
            }
        }
 
        while (!opStack.isEmpty()){
            calcExp(opStack,numStack);
        }
        String finalVal = numStack.pop();
        System.out.println(finalVal);
    }
     
    private static void calcExp(Stack<String> opStack, Stack<String> numStack) {
        double right=Double.valueOf(numStack.pop());
        double left = Double.valueOf(numStack.pop());
        String op = opStack.pop();
        String val;
        switch (op){
            case "+":
                 val =String.valueOf(left+right);
                break;
            case "-":
                 val =String.valueOf(left-right);
                break;
            case "*":
                val =String.valueOf(left*right);
                break;
            case "/":
                val =String.valueOf(left/right);
                break;
            default:
                throw new UnsupportedOperationException("unsupported");
        }
        numStack.push(val);
    }
}

The amount of code is small. The data structure stack feature is used, and the operator priority needs to be controlled by itself. The feature does not support bracket expression or expression assignment. Next, let’s look at the implementation using antlr4.

3.2 implementation based on antlr4

The basic process of programming with antlr4 is fixed, which is usually divided into the following three steps:

  • Based on the requirements, the semantic rules of user-defined syntax are written according to the rules of antlr4 and saved into a file with G4 as the suffix.
  • Use antlr4 tool to process G4 files and generate lexical analyzer, parser code and dictionary files.
  • Write code to inherit the visitor class or implement the listener interface to develop your own business logic code.

Based on the above process, we analyze the details with the help of existing cases.

Step 1: define the syntax file based on antlr4 rules, and the file name is suffixed with G4. For example, the syntax rule file that implements the calculator is named labeledexpr.g4. The contents are as follows:

grammar LabeledExpr; // rename to distinguish from Expr.g4
 
prog:   stat+ ;
 
stat:   expr NEWLINE                # printExpr
    |   ID '=' expr NEWLINE         # assign
    |   NEWLINE                     # blank
    ;
 
expr:   expr op=('*'|'/') expr      # MulDiv
    |   expr op=('+'|'-') expr      # AddSub
    |   INT                         # int
    |   ID                          # id
    |   '(' expr ')'                # parens
    ;
 
MUL :   '*' ; // assigns token name to '*' used above in grammar
DIV :   '/' ;
ADD :   '+' ;
SUB :   '-' ;
ID  :   [a-zA-Z]+ ;      // match identifiers
INT :   [0-9]+ ;         // match integers
NEWLINE:'\r'? '\n' ;     // return newlines to parser (is end-statement signal)
WS  :   [ \t]+ -> skip ; // toss out whitespace

(Note: this case file comes from the defined antlr4 reference)

Briefly interpret the labeledexpr.g4 file. Antlr4 rules are defined based on regular expressions. The understanding of rules is top-down, and each statement ending with a semicolon represents a rule. For example, the first line:grammar LabeledExpr; Indicates that our syntax name is labeledexpr, which needs to be consistent with the file name. Java coding has a similar rule: the class name is consistent with the class file.

Rule progIndicates that prog is one or more stats.

Rule statAdapt three sub rules: empty line, expression expr, assignment expression ID ‘=’ expr.

Expression exprFit five sub rules: multiplication and division, addition and subtraction, integer, ID, bracket expression. Obviously, this is a recursive definition.

Finally, it defines the basic elements of composite rules, such as rulesID: [a-zA-Z]+Indicates that the ID is limited to uppercase and lowercase English strings;INT: [0-9]+; The rule indicating int is one or more numbers between 0-9. Of course, this definition is not strict. More strictly, its length should be limited.

Based on the understanding of regular expressions, the G4 syntax rules of antlr4 are easy to understand.

When defining antlr4 rules, you need to pay attention to a situation that one string may support multiple rules at the same time, such as the following two rules:

ID: [a-zA-Z]+;

FROM: ‘from’;

Obviously, the string “from” satisfies the above two rules at the same time, and the processing method of antlr4 is determined according to the defined order. Here, the ID is defined before the from, so the string from will match the ID rule first.

In fact, antlr4 has completed 50% of the work for us after writing the G4 file in the definition and regulations: it has helped us implement the whole architecture and interface, and the rest of the development work is to carry out specific implementation based on interface or abstract class. There are two ways to process the generated syntax tree, one is visitor mode, the other is listener mode.

3.2.1 using visitor mode

Step 2: use antlr4 tool to parse G4 file and generate code. That is, ANTLR tool parses G4 files and automatically generates basic code for us. The flow chart is as follows:

Explore Presto SQL Engine (1) - skillfully use ANTLR

The command line is as follows:

antlr4 -package org.example.calc -no-listener -visitor .\LabeledExpr.g4

After the command is executed, the generated files are as follows:

$ tree .
.
├── LabeledExpr.g4
├── LabeledExpr.tokens
├── LabeledExprBaseVisitor.java
├── LabeledExprLexer.java
├── LabeledExprLexer.tokens
├── LabeledExprParser.java
└── LabeledExprVisitor.java

First, develop the entry class calc.java. Calc class is the entrance of the whole program. The core code of calling the lexer and parser classes of antlr4 is as follows:

ANTLRInputStream input = new ANTLRInputStream(is);
LabeledExprLexer lexer = new LabeledExprLexer(input);
CommonTokenStream tokens = new CommonTokenStream(lexer);
LabeledExprParser parser = new LabeledExprParser(tokens);
ParseTree tree = parser.prog(); // parse
 
EvalVisitor eval = new EvalVisitor();
eval.visit(tree);

Next, define the class to inherit labeledexprbasevisitor class. The override method is as follows:

Explore Presto SQL Engine (1) - skillfully use ANTLR

As can be seen from the figure, the generated code corresponds to the rule definition. For example, visitaddsub corresponds to the addsub rule, and visitid corresponds to the ID rule. And so on… The code to realize addition and subtraction is as follows:

/** expr op=('+'|'-') expr */
@Override
public Integer visitAddSub(LabeledExprParser.AddSubContext ctx) {
    int left = visit(ctx.expr(0));  // get value of left subexpression
    int right = visit(ctx.expr(1)); // get value of right subexpression
    if ( ctx.op.getType() == LabeledExprParser.ADD ) return left + right;
    return left - right; // must be SUB
}

Quite intuitive. After the code is written, Calc is run. Run Calc’s main function, enter the corresponding operation expression on the interactive command line, and the operation result can be seen by line feed Ctrl + D. For example, 1 + 3 * 4 = 13.

3.2.2 using listener mode

Similarly, we can also use listener mode to implement four operations. The command line is as follows:

antlr4 -package org.example.calc -listener .\LabeledExpr.g4

The execution of this command will also produce framework code for us. Based on the framework code, we can develop the entry class and interface implementation class. First, develop the entry class calc.java. Calc class is the entrance of the whole program. The code of calling the lexer and parser classes of antlr4 is as follows:

ANTLRInputStream input = new ANTLRInputStream(is);
LabeledExprLexer lexer = new LabeledExprLexer(input);
CommonTokenStream tokens = new CommonTokenStream(lexer);
LabeledExprParser parser = new LabeledExprParser(tokens);
ParseTree tree = parser.prog(); // parse
 
ParseTreeWalker walker = new ParseTreeWalker();
walker.walk(new EvalListener(), tree);

As like as two peas, we can see that the calling logic of generating ParseTree is exactly the same. The code to implement the listener is slightly more complex, and the data structure of stack is also required, but only one operand stack is required, and there is no need to control the priority by itself. Take addsub as an example:

@Override
public void exitAddSub(LabeledExprParser.AddSubContext ctx) {
    Double left = numStack.pop();
    Double right= numStack.pop();
    Double result;
    if (ctx.op.getType() == LabeledExprParser.ADD) {
        result = left + right;
    } else {
        result = left - right;
    }
    numStack.push(result);
}

Take the operand directly from the stack for operation.

3.2.3 summary

The definition ANTLR 4 reference clearly explains the difference between listener mode and visitor mode:

Listener mode:

Explore Presto SQL Engine (1) - skillfully use ANTLR

Visitor mode:

Explore Presto SQL Engine (1) - skillfully use ANTLR

  • The listener pattern traverses itself through the walker object without considering the hierarchical relationship of its syntax tree. Vistor needs to control the child nodes accessed by itself. If a child node is omitted, the whole child node cannot be accessed.
  • The listener mode method has no return value, and the vistor mode can set any return value.
  • The access stack of listener mode is clear. Vistor mode is a method call stack. If the implementation fails, it may lead to stackoverflow.

Through this simple example, we drive antlr4 to implement a simple calculator. Learned the application process of antlr4. Understand the definition of G4 syntax file, visitor mode and listener mode. Through antlr4, we generate a parsetree, access the parsetree based on visitor mode and listener mode, and realize four operations.

Based on the above examples, it can be found that without antlr4, our own algorithm can achieve the same function. However, using ANTLR does not care about the parsing process of expression string, but only about the specific business implementation, which is very worry-free and easy.

More importantly, antlr4 provides more imaginative abstract logic than its own implementation, rising to the height of methodology, because it is no longer limited to solving a certain problem, but solving a class of problems. It can be said that compared with the idea of solving problems by hard coding, ANTLR is like the gap between ordinary area formula and calculus in the field of mathematics.

4、 Refer to Presto source code to develop SQL parser

The purpose of using antlr4 to realize four operations is to understand the application mode of antlr4. The following figure shows our real purpose: To study how antlr4 implements SQL statement parsing in presto.

Supporting complete SQL syntax is a huge project. There is a complete sqlbase.g4 file in presto, which defines all SQL syntax supported by presto, including DDL syntax and DML syntax. The document system is relatively large and is not suitable for learning to explore a specific detail.

In order to explore the process of SQL parsing and understand the logic behind SQL execution, I chose to do my own coding experiment on the basis of simply reading relevant materials and documents. To do this, define a small goal: to implement an SQL parser. The parser is used to implement the select field from table syntax and query the specified field from the local CSV data source.

4.1 clipping selectbase.g4 file

Based on the same process as implementing the four arithmetic operators, first define the selectbase.g4 file. With the Presto source code as the reference frame, our selectbase.g4 does not need to be developed by ourselves, but only needs to be tailored based on the Presto G4 file. The cut content is as follows:

grammar SqlBase;
 
tokens {
    DELIMITER
}
 
singleStatement
    : statement EOF
    ;
 
statement
    : query                                                            #statementDefault
    ;
 
query
    :  queryNoWith
    ;
 
queryNoWith:
      queryTerm
    ;
 
queryTerm
    : queryPrimary                                                             #queryTermDefault
    ;
 
queryPrimary
    : querySpecification                   #queryPrimaryDefault
    ;
 
querySpecification
    : SELECT  selectItem (',' selectItem)*
      (FROM relation (',' relation)*)?
    ;
 
selectItem
    : expression  #selectSingle
    ;
 
relation
    :  sampledRelation                             #relationDefault
    ;
 
expression
    : booleanExpression
    ;
 
booleanExpression
    : valueExpression             #predicated
    ;
 
valueExpression
    : primaryExpression                                                                 #valueExpressionDefault
    ;
 
primaryExpression
    : identifier                                                                          #columnReference
    ;
 
sampledRelation
    : aliasedRelation
    ;
 
aliasedRelation
    : relationPrimary
    ;
 
relationPrimary
    : qualifiedName                                                   #tableName
    ;
 
qualifiedName
    : identifier ('.' identifier)*
    ;
 
identifier
    : IDENTIFIER             #unquotedIdentifier
    ;
 
SELECT: 'SELECT';
FROM: 'FROM';
 
fragment DIGIT
    : [0-9]
    ;
 
fragment LETTER
    : [A-Z]
    ;
 
IDENTIFIER
    : (LETTER | '_') (LETTER | DIGIT | '_' | '@' | ':')*
    ;
 
WS
    : [ \r\n\t]+ -> channel(HIDDEN)
    ;
 
// Catch-all for anything we can't recognize.
// We use this to be able to ignore and recover all the text
// when splitting statements with DelimiterLexer
UNRECOGNIZED
    : .
    ;

Compared with more than 700 lines of rules in Presto source code, we cut it to 1 / 10 of its size. The core rule of this file is: select selectitem (‘,’ selectitem) (FROM relation (‘,’ relation))

By understanding the G4 file, we can also more clearly understand the composition of our query statements. For example, usually our most common query data source is a data table. However, in SQL syntax, our query data table is abstracted into relation.

This relationship may come from a specific data table, or a sub query, or a join, or a sample of data, or the unnest of an expression. In the field of big data, such expansion will greatly facilitate data processing.

For example, use unnest syntax to parse complex types of data. SQL is as follows:

Explore Presto SQL Engine (1) - skillfully use ANTLR

Although SQL is complex, its structure division can be clearly understood by understanding G4 files. Returning to the selectbase.g4 file, we also use the antlr4 command to process the G4 file and generate the code:

antlr4 -package org.example.antlr -no-listener -visitor .\SqlBase.g4

This generates the basic framework code. The next step is to handle the business logic by yourself.

4.2 traversing syntax tree to encapsulate SQL structure information

Next, define the node type of the syntax tree based on SQL syntax, as shown in the following figure.

Explore Presto SQL Engine (1) - skillfully use ANTLR

Through this class diagram, you can clearly see the basic elements in SQL syntax.

Then implement its own parsing class astbuilder based on the visitor pattern (here, in order to simplify the problem, it is still cut from the Presto source code). Take the queryspecification rule code as an example:

@Override
public Node visitQuerySpecification(SqlBaseParser.QuerySpecificationContext context)
{
    Optional<Relation> from = Optional.empty();
    List<SelectItem> selectItems = visit(context.selectItem(), SelectItem.class);
 
    List<Relation> relations = visit(context.relation(), Relation.class);
    if (!relations.isEmpty()) {
        // synthesize implicit join nodes
        Iterator<Relation> iterator = relations.iterator();
        Relation relation = iterator.next();
 
        from = Optional.of(relation);
    }
 
    return new QuerySpecification(
            getLocation(context),
            new Select(getLocation(context.SELECT()), false, selectItems),
            from);
}

Through the code, we have parsed the data source and specific fields of the query and encapsulated them in the queryspecification object.

4.3 using statement object to realize data query

Through the previous examples of implementing the four operators, we know that ANTLR parses the statements entered by the user into parsetree. Business developers can resolve parsetree through relevant interfaces by themselves. Presto generates parsetree by parsing the input SQL statement, traverses the parsetree, and finally generates the statement object. The core code is as follows:

SqlParser sqlParser = new SqlParser();
Statement statement = sqlParser.createStatement(sql);

With the statement object, how do we use it? Combined with the previous class diagram, we can find that:

  • A statement of type query has the querybody attribute.
  • Querybody of queryspecification type has a select attribute and a from attribute.

Through this structure, we can clearly obtain the necessary elements for implementing the select query:

  • Get the target table to be queried from the from attribute. Here, the Convention table name is consistent with the CSV file name.
  • Get the target field selectitem to be queried from the select property. The first line of CSV is the title line.

The whole business process is clear. After parsing the SQL statement to generate the statement object, follow the following steps:

  • S1: get the data table and fields of the query.
  • S2: set the name of the data table to the data file and read the data of the data file.
  • S3: format the output field name to the command line.
  • S4: format the output field contents to the command line.

In order to simplify the logic, the code only deals with the main line without exception handling.

/**
 *Get the table name and field name to be queried
 */
QuerySpecification specification = (QuerySpecification) query.getQueryBody();
Table table= (Table) specification.getFrom().get();
List<SelectItem> selectItems = specification.getSelect().getSelectItems();
List<String> fieldNames = Lists.newArrayList();
for(SelectItem item:selectItems){
    SingleColumn column = (SingleColumn) item;
    fieldNames.add(((Identifier)column.getExpression()).getValue());
}
 
/**
 *Determine the data source file of the query based on the table name
 */
String fileLoc = String.format("./data/%s.csv",table.getName());
 
/**
 *Reads the specified field from the CSV file
 */
Reader in = new FileReader(fileLoc);
Iterable<CSVRecord> records = CSVFormat.RFC4180.withFirstRecordAsHeader().parse(in);
List<Row> rowList = Lists.newArrayList();
for(CSVRecord record:records){
    Row row = new Row();
    for(String field:fieldNames){
        row.addColumn(record.get(field));
    }
    rowList.add(row);
}
 
/**
 *Format output to console
 */
int width=30;
String format = fieldNames.stream().map(s-> "%-"+width+"s").collect(Collectors.joining("|"));
System.out.println( "|"+String.format(format, fieldNames.toArray())+"|");
 
int flagCnt = width*fieldNames.size()+fieldNames.size();
String rowDelimiter = String.join("", Collections.nCopies(flagCnt, "-"));
System.out.println(rowDelimiter);
for(Row row:rowList){
    System.out.println( "|"+String.format(format, row.getColumnList().toArray())+"|");
}

The code is for demonstration only, and the abnormal logic is not considered temporarily, such as the query field does not exist, the field name defined in CSV file does not meet the requirements, etc.

4.4 effect display

In our project data directory, the following CSV files are stored:

Explore Presto SQL Engine (1) - skillfully use ANTLR

The sample data of cities.csv file is as follows:

"LatD","LatM","LatS","NS","LonD","LonM","LonS","EW","City","State"
   41,    5,   59, "N",     80,   39,    0, "W", "Youngstown", OH
   42,   52,   48, "N",     97,   23,   23, "W", "Yankton", SD
   46,   35,   59, "N",    120,   30,   36, "W", "Yakima", WA
   42,   16,   12, "N",     71,   48,    0, "W", "Worcester", MA

Run the code to query the data. Use SQL statements to specify fields to query from CSV files. The final effect of SQL query is as follows:

SQL example 1: select city, city from cities

Explore Presto SQL Engine (1) - skillfully use ANTLR

SQL example 2: select name, age from employee

Explore Presto SQL Engine (1) - skillfully use ANTLR

This section describes how to cut G4 rule files based on Presto source code, and then query data from CSV files with SQL statements based on antlr4. The coding experiment based on the cutting of Presto source code can play a certain role in studying the implementation of SQL Engine and understanding Presto source code.

5、 Summary

Based on four arithmetic operators and two cases of using SQL to query CSV data, this paper expounds the application idea and process of antlr4 in project development. The relevant codes can be found ingithubI saw it on the. Understanding the usage of antlr4 can help understand the definition rules and execution process of SQL, and assist in writing efficient SQL statements in business development. At the same time, it is also helpful to understand the compilation principle, define your own DSL and abstract business logic. I feel shallow on paper. I absolutely know that I have to practice it. It is also fun to study the source code implementation in the way described in this paper.

reference material

1、《The Definitive ANTLR4 Reference》

2、Presto official documents

3、ANTLR 4 concise tutorial

4、Calc class source code

5、Evalvisitor class source code

6、Presto source code

Author: vivo Internet development team Shuai Guangying