This article is reproduced from the dbaplus community.
About the author
Du Hongjun, software engineer at Jingdong Digital Technology Co., Ltd., has many years of experience in middleware development and system design, with an in-depth understanding of open source technologies such as Spring and MyBatis. He is currently responsible for SQL parsing development on the ShardingSphere team.
Compared with other programming languages, SQL is relatively simple. Nevertheless, it is still a complete programming language, so parsing SQL grammar is not essentially different from parsing other programming languages (Java, C, Go, and so on).
Any discussion of SQL parsing has to start with text recognition. Text recognition means recognizing each part of the input text according to a given set of rules, then emitting the result in a specific data format. A tree structure is the most common output, commonly known as an abstract syntax tree (AST).
As developers, we work with text recognition every day. After we write code, the compiler first needs to parse it according to the language's grammar, that is, perform text recognition, and generate intermediate code.
SQL parsing is similar to parsing program code: the SQL text is parsed according to SQL grammar, each part of the text is identified, and the result is output as an abstract syntax tree. SQL is itself a programming language, and its grammar is no simpler than that of other languages; it is quite normal for a complex table-creation statement to run past 20 KB.
Whether you parse SQL or the grammar of another programming language, a dedicated parser is needed. Developing one from scratch takes a long time, and since the SQL dialects of the various databases differ, there is no single universal SQL parsing engine. With today's mature third-party libraries, finding a sharp tool beats reinventing everything from scratch. Open source SQL parsers include JSqlParser, FDB, and Druid; general-purpose parser generators include ANTLR and JavaCC.
JSqlParser is a general-purpose SQL parser. It provides one-stop SQL parsing: it transforms SQL into a syntax tree and exposes a visitor interface for programs to traverse the tree. Although convenient to use, it has several disadvantages:
- It cannot generate a parser from just the grammar you need. For the grammar required by data sharding, it is not as lightweight as ANTLR, which lets you write grammar rules to fit your own needs;
- It supports only commonly used standard SQL syntax; statements such as ALTER TABLE and ALTER INDEX, DCL, and the dialects of the various databases are not supported;
- It uses the visitor pattern to completely encapsulate the abstract syntax tree. Peripheral code cannot access the tree directly, so when a full traversal is not needed, the code becomes cumbersome.
FDB and Druid are the same type of tool as JSqlParser: they require no custom SQL grammar and work out of the box, but they lack the flexibility of a customizable grammar.
ANTLR fares better in comparison. It is not a parser purpose-built for SQL but a general parser generator for programming languages: you only need to write a grammar file (a .g4 file), and it automatically generates the parsing code and outputs results in a unified format, which is very simple to process. Because the .g4 file is written by the developer, the generated code can be concise and tailored. When writing grammar rules that apply only to data sharding, large amounts of SQL syntax we do not care about can be simplified away; for other parsing needs such as SQL auditing, a separate ANTLR grammar can be written, adapting the parser to each scenario. JavaCC, like ANTLR, is a parser generator driven by a custom grammar.
No matter which parser is used, the parsing process is the same; it is divided into lexical analysis and syntax analysis.
1. Lexical analysis
First, a lexical parser (lexer) splits the SQL into indivisible lexical units, called tokens. In SQL grammar, tokens are usually divided into keywords, identifiers, literals, operators, and delimiters.
- Keyword: the special words used by the database engine; they are reserved and cannot be used as identifiers;
- Identifier: in SQL, table names, column names, and so on; in a programming language, package names, class names, method names, variable names, attribute names, and so on;
- Literal: strings and numeric values;
- Operator: addition, subtraction, multiplication, division, bit operations, logical operations, etc.;
- Delimiter: commas, semicolons, brackets, etc.
The lexer reads one character at a time. When the current character no longer satisfies the rule being matched, it completes the recognition of one token. For example, when reading SELECT, the first character 'S' satisfies both the keyword rule and the identifier rule; the second character 'E' also satisfies them, and so on, until the seventh character, a space, fails the rule, completing the recognition of one token. SELECT is a keyword defined by the SQL specification, but it also satisfies the identifier rule, so when a token has the shape of an identifier, the lexer needs a priority check and first determines whether it is a keyword. The other rules are relatively simple: characters starting with a digit are read under the numeric literal rule, characters starting with a double or single quotation mark are read under the string literal rule, and operators and delimiters are even easier to recognize. Take the following SQL:
SELECT id, name FROM product WHERE id > 10;
The recognized tokens are as follows:
- Keywords: SELECT, FROM, WHERE
- Identifiers: id, name, product
- Literal: 10
- Operator: >
- Delimiters: , and ;
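The character-by-character scan and the keyword-before-identifier priority check described above can be sketched as a minimal hand-written lexer. This is an illustrative sketch only (the token type names and the simplified operator set are assumptions), not the implementation of ShardingSphere or any real parser:

```python
# A minimal hand-written lexer sketch: scan one character at a time and emit
# (type, text) pairs for the example statement from the article.
KEYWORDS = {"SELECT", "FROM", "WHERE"}

def tokenize(sql):
    tokens, i = [], 0
    while i < len(sql):
        ch = sql[i]
        if ch.isspace():                      # skip whitespace
            i += 1
        elif ch.isalpha():                    # keyword or identifier
            j = i
            while j < len(sql) and (sql[j].isalnum() or sql[j] == "_"):
                j += 1
            word = sql[i:j]
            # an identifier-shaped word is checked against the keyword set first
            kind = "KEYWORD" if word.upper() in KEYWORDS else "IDENTIFIER"
            tokens.append((kind, word))
            i = j
        elif ch.isdigit():                    # numeric literal
            j = i
            while j < len(sql) and sql[j].isdigit():
                j += 1
            tokens.append(("LITERAL", sql[i:j]))
            i = j
        elif ch in ",;":                      # delimiters
            tokens.append(("DELIMITER", ch))
            i += 1
        elif ch in "><=":                     # (simplified) operators
            tokens.append(("OPERATOR", ch))
            i += 1
        else:
            raise ValueError(f"unexpected character: {ch!r}")
    return tokens

tokens = tokenize("SELECT id, name FROM product WHERE id > 10;")
```

Running it on the example statement produces exactly the token groups listed above, in source order.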
2. Syntax analysis
The syntax parser fetches one token at a time from the lexer. If the token satisfies the rule, it continues fetching and matching the next token until the end of the input; if a rule is not satisfied, an error is reported and parsing ends.
The difficulty of syntax parsing lies in looping rules, branch selection, recursive calls, and complex expressions.
When processing a looping rule, once the rule is matched, the parser must keep matching the same rule in a loop; only when the input no longer fits that rule's definition can matching continue with the subsequent rules. Take the CREATE TABLE statement as an example: each table can contain multiple columns, and each column may need to define its name, type, precision, and other parameters.
When a rule has multiple branch paths, lookahead is needed: the parser must try each possible branch to determine the correct path. Take the ALTER TABLE statement as an example.
The syntax for renaming a table is:
ALTER TABLE oldTableName RENAME TO newTableName;
The syntax for dropping a column is:
ALTER TABLE tableName DROP COLUMN columnName;
Both statements begin with ALTER TABLE; merging them yields the following grammar:
ALTER TABLE tableName (RENAME TO newTableName | DROP COLUMN columnName);
After tableName has been matched, the parser must look ahead to decide which of the two branch options is correct.
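The lookahead decision just described can be sketched as a tiny recursive-descent parser for the merged ALTER TABLE rule. All helper names here are hypothetical, written only to illustrate one-token lookahead and branch selection:

```python
# A sketch of branch selection with one-token lookahead. After matching
# ALTER TABLE tableName, the parser peeks at the next token to pick the
# RENAME TO or DROP COLUMN branch.
def parse_alter_table(tokens):
    pos = 0

    def expect(word):
        nonlocal pos
        if pos >= len(tokens) or tokens[pos].upper() != word:
            raise SyntaxError(f"expected {word} at position {pos}")
        pos += 1

    def take_identifier():
        nonlocal pos
        word = tokens[pos]
        pos += 1
        return word

    expect("ALTER"); expect("TABLE")
    table = take_identifier()
    # look ahead one token to choose the branch
    nxt = tokens[pos].upper() if pos < len(tokens) else ""
    if nxt == "RENAME":
        expect("RENAME"); expect("TO")
        return {"op": "rename", "table": table, "to": take_identifier()}
    elif nxt == "DROP":
        expect("DROP"); expect("COLUMN")
        return {"op": "drop_column", "table": table, "column": take_identifier()}
    raise SyntaxError("no branch matches")
```

With one token of lookahead the two alternatives never conflict; deeper ambiguity would require looking further ahead, as the article discusses next.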
When selecting a branch, one branch may turn out to be a subset of another. In that case, even after the short path has matched successfully, the parser must keep trying the long path, and only settle for the short path when the long one fails to match; this is called greedy matching. Without greedy matching, the longest branch rule would never be matched.
When a token does not satisfy an optional rule, the parser tries the next rule, until a match succeeds or matching against the next non-optional rule fails. In the CREATE TABLE statement, a column definition has many options: whether the column is nullable, whether it is a primary key, whether it carries constraints, and so on.
The syntax parser finally transforms the SQL into an abstract syntax tree. Take the following SQL:
SELECT id, name FROM t_user WHERE status = 'ACTIVE' AND age > 18
After parsing, the abstract syntax tree is as shown in the figure below:
To make it easy to understand, keyword tokens in the abstract syntax tree are shown in green, variable tokens in red, and gray nodes indicate parts that need further splitting.
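The shape of that tree can also be sketched in code as nested tuples of the form (node_type, children...). The node names below are illustrative only, not ANTLR's or ShardingSphere's actual output, and a small walker shows how a program traverses such a tree:

```python
# A hand-built sketch of the abstract syntax tree for the query above,
# using nested tuples: (node_type, children...).
ast = ("select_stmt",
       ("select_items", ("column", "id"), ("column", "name")),
       ("from_clause", ("table", "t_user")),
       ("where_clause",
        ("and",
         ("=", ("column", "status"), ("literal", "'ACTIVE'")),
         (">", ("column", "age"), ("literal", "18")))))

def columns(node):
    """Collect every column name by walking the tree depth-first."""
    if isinstance(node, tuple):
        if node[0] == "column":
            return [node[1]]
        return [c for child in node[1:] for c in columns(child)]
    return []
```

A depth-first walk visits the select items first, then the WHERE clause, so the columns come out in source order: id, name, status, age.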
Syntax parsing is more complex than lexical parsing. Lexical rules are comparatively simple: token rules are easy to define, branch selection is rare, and a token can usually be determined by looking ahead a single character. Correct syntax parsing, by contrast, is much harder.
There are many third-party tools for generating abstract syntax trees, and ANTLR is a good choice. It generates Java code for the abstract syntax tree from developer-defined rules and provides a visitor interface. Compared with generated code, a handwritten abstract syntax tree is more efficient at execution time, but the workload is also much larger; in high-performance scenarios, a custom abstract syntax tree is worth considering.
1. Introduction to ANTLR
ANTLR (ANother Tool for Language Recognition) is a language recognition tool written in Java. It generates parsers automatically: from the ANTLR grammar rules written by the user, it directly produces a parser in the target language, and it can emit parsers in Java, Go, C, and other languages.
The parser generated by ANTLR builds an abstract syntax tree from the input text and provides interfaces for traversing the tree to access each part of the text. ANTLR's implementation matches the lexical and syntax analysis described above: the lexer splits the input into tokens according to the grammar rules, while the parser performs semantic analysis on the tokens, optimizes the rules, and eliminates left recursion.
2. ANTLR syntax rules
The main work with ANTLR grammar rules is to define the lexical analysis rules and the syntax analysis rules. By ANTLR convention, lexer rules begin with an uppercase letter and parser rules begin with a lowercase letter. Here is a brief introduction to ANTLR's rules.
First, you need to define the grammar type and name; the name must match the file name. There are four grammar types: lexer, parser, tree, and combined.
- A lexer grammar defines lexical analysis rules;
- A parser grammar defines syntax analysis rules;
- A tree grammar is used to traverse the parse tree;
- A combined grammar can define both syntax analysis rules and lexical analysis rules, with rule names following the conventions above;
- import is used to bring in other grammar rules. Classifying grammar rules with import makes them clearer, and designing rule files with an object-oriented mindset gives them a flavor of polymorphism and inheritance. Note that rules in the current grammar take precedence over imported rules.
The name and body of a rule are separated by a colon and terminated by a semicolon, for example:
NUM: [0-9]+;
The rule's name is NUM; since it begins with an uppercase letter, it is a lexical analysis rule. Its body, [0-9]+, matches any integer.
ANTLR rules are based on BNF notation: '|' indicates a branch alternative, '*' means the preceding element matches zero or more times, and '+' means the preceding element matches at least once.
For the other parts of the grammar notation, interested readers can refer to the official documentation.
To generate a SQL parser with ANTLR, the first step is to define SQL's lexer rules and parser rules, described one by one below.
3. Lexical analysis of ANTLR
As with the SQL parsing principles above, ANTLR's lexical analysis splits SQL into tokens. When ANTLR processes lexer rules, it does not understand their specific meaning: it does not know which rules define keywords and which define identifiers. It simply numbers each rule in reading order, and rules with lower numbers are matched first; on a successful match, the token is returned directly. Therefore, when designing token-splitting rules, the identifier rule must be placed after the keyword rules, ensuring that keyword matching is attempted, and fails, before the identifier rule is tried.
ANTLR implements character matching with a state transition table. It converts the token-splitting rules into a table, reads one character at a time, looks up the table using the class of the current character and the current state, and judges whether the characters read so far match a rule. If they match, the character is accepted and the next one is read; if not, the character is rejected. In the rejection case, if the current state is an accepting state for some token, that token is returned; otherwise an error is reported. This repeats, accepting a character and reading the next, until a token is returned successfully or matching fails with an error.
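The table-driven matching just described can be sketched for a single token rule shaped like ID: [a-zA-Z0-9]+. The state numbers and character classes here are invented for illustration; a real generated lexer compiles all rules into one much larger table:

```python
# A sketch of table-driven matching: the table maps
# (state, character class) -> next state; state 1 accepts an identifier.
def char_class(ch):
    return "alnum" if ch.isalnum() else "other"

# state 0: start; state 1: inside an identifier (accepting state)
TRANSITIONS = {
    (0, "alnum"): 1,
    (1, "alnum"): 1,
}

def match_identifier(text):
    state, consumed = 0, 0
    for ch in text:
        nxt = TRANSITIONS.get((state, char_class(ch)))
        if nxt is None:        # reject the character
            break
        state, consumed = nxt, consumed + 1
    # accept only if we stopped in an accepting state
    return text[:consumed] if state == 1 else None
```

When the rejected character arrives while the machine sits in the accepting state, the token matched so far is returned; if no accepting state was ever reached, the match fails.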
For example, here is a simple set of token-splitting rules for query statements:

lexer grammar SelectLexer;

SELECT: [Ss][Ee][Ll][Ee][Cc][Tt];
FROM: [Ff][Rr][Oo][Mm];
WHERE: [Ww][Hh][Ee][Rr][Ee];
LEFT: [Ll][Ee][Ff][Tt];
RIGHT: [Rr][Ii][Gg][Hh][Tt];
INNER: [Ii][Nn][Nn][Ee][Rr];
JOIN: [Jj][Oo][Ii][Nn];
ON: [Oo][Nn];
BETWEEN: [Bb][Ee][Tt][Ww][Ee][Ee][Nn];
AND: [Aa][Nn][Dd];
OR: [Oo][Rr];
GROUP: [Gg][Rr][Oo][Uu][Pp];
BY: [Bb][Yy];
ORDER: [Oo][Rr][Dd][Ee][Rr];
ASC: [Aa][Ss][Cc];
DESC: [Dd][Ee][Ss][Cc];
IN: [Ii][Nn];
ID: [a-zA-Z0-9]+;
WS: [ \t\r\n]+ -> skip;

It defines case-insensitive keyword rules from SELECT through IN, plus an identifier rule ID, which comes last. The WS rule skips spaces, tabs, and newlines. Every character in the input must be covered by some rule in the lexer, otherwise a failure is reported: removing the WS rule makes SQL that contains whitespace fail with an error.
The error occurs because the sixth, tenth, and eleventh characters of the first line are carriage-return/newline characters, for which the lexer can no longer find a matching rule.
4. Syntax analysis of ANTLR
ANTLR's syntax analysis defines the phrase rules that make up statements. The grammar itself is specified by each database vendor, so parsing SQL only requires translating it into ANTLR grammar rules. Note that the rule definitions for SQL expressions are very complex: besides ordinary mathematical and Boolean expressions, they cover function calls and each database's own date expressions, window functions, and CASE statements.
ANTLR also uses a state transition table to check whether tokens satisfy the grammar rules. The parser calls the lexer to fetch tokens and checks whether they conform to the rules. When multiple alternative branches are encountered, the greedy matching principle applies: the longest-path branch is completed first. If more than one alternative in a branch satisfies the condition, they are matched in declaration order.
Take the following rules as an example:
grammar Test;

ID: [a-zA-Z0-9]+;
WS: [ ]+ -> skip;

testAll: test1 | test2 | test3 | test21;
test1: ID;
test2: ID ID;
test21: ID ID;
test3: ID ID ID;
test4: test1+;
Using the testAll rule, run the following tests:
- With input "A1 A2 A3", the test3 branch is used, rather than (test1 A1) (test1 A2) (test1 A3) or (test2 A1 A2) (test1 A3);
- With input "A1 A2", although the test21 rule can also match, the test2 rule comes before it, so test2 is used;
- With input "A1 A2 #", an error is reported because # cannot be matched.
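The selection behavior in these three tests, prefer the alternative that consumes the whole input, and among equal matches the first one declared, can be simulated with a small sketch. The arity-based matching below is a deliberate simplification of what ANTLR actually does; the rule names mirror the grammar above:

```python
# A sketch of branch selection: try each alternative in declaration order
# and pick the first one that consumes the whole token stream.
ALTERNATIVES = [          # (rule name, number of IDs the rule requires)
    ("test1", 1),
    ("test2", 2),
    ("test3", 3),
    ("test21", 2),
]

def choose_branch(ids):
    for name, arity in ALTERNATIVES:
        if arity == len(ids):     # the branch matches the full input
            return name
    raise SyntaxError("no alternative matches the whole input")
```

For three IDs only test3 consumes everything, so it wins over combinations of shorter rules; for two IDs both test2 and test21 match, and declaration order breaks the tie in favor of test2.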
5. Sharding context extraction
After SQL parsing completes, the last step is to extract the context needed for data sharding. Based on its understanding of the SQL, the engine refines the sharding context by visiting the abstract syntax tree and marks the positions that may need rewriting. The parsing context for sharding includes select items, table information, sharding conditions, auto-increment primary keys, ORDER BY, GROUP BY, and pagination constructs such as LIMIT, ROWNUM, and TOP.
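As a rough sketch of this extraction step, the following walks a toy tree and collects table names and equality conditions. The tree shape and field names are invented for illustration and do not reflect ShardingSphere's actual context model:

```python
# A sketch of extracting a sharding context from a parsed tree:
# visit every node and pull out table names and column = literal conditions.
ast = ("select_stmt",
       ("from_clause", ("table", "t_order")),
       ("where_clause", ("=", ("column", "order_id"), ("literal", "10"))))

def extract_context(node, ctx=None):
    """Walk the tree depth-first, recording tables and equality conditions."""
    if ctx is None:
        ctx = {"tables": [], "conditions": []}
    if isinstance(node, tuple):
        if node[0] == "table":
            ctx["tables"].append(node[1])
        elif node[0] == "=" and node[1][0] == "column":
            ctx["conditions"].append((node[1][1], node[2][1]))
        for child in node[1:]:
            extract_context(child, ctx)
    return ctx
```

A sharding engine would then route the query using the extracted table name and the value bound to the sharding column, here order_id.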
3、 SQL parsing in ShardingSphere
As the core of a database- and table-sharding product, SQL parsing performance and compatibility are its most important metrics.
ShardingSphere's predecessor, Sharding-JDBC, used Druid as its SQL parser before version 1.4.x. Practical tests showed Druid's performance to be far better than that of other parsers.
Starting from version 1.5.x, ShardingSphere adopted a fully self-developed SQL parsing engine. Because its purpose is different, ShardingSphere does not need to turn SQL into a complete abstract syntax tree, nor make a second pass through a visitor. It takes a "half understanding" approach to SQL, refining only the context relevant to data sharding, which further improved both the performance and the compatibility of SQL parsing.
In the latest 3.x version, ShardingSphere is trying out ANTLR as its SQL parsing engine, planning to replace the original engine in the order DDL -> TCL -> DAL -> DCL -> DML -> DQL. The reason for adopting ANTLR is the hope that ShardingSphere's parsing engine can be more compatible with SQL: complex expressions, recursion, subqueries, and similar statements are not something the sharding core cares about, but they do affect how friendly SQL understanding is. In pursuit of ultimate performance, the self-developed engine does not handle these conveniences and reports an error directly when they are used.
Sample tests show that ANTLR parses SQL about three times slower than the self-developed engine. To close the gap, ShardingSphere caches the syntax trees of SQL parsed through prepared statements; using PreparedStatement is therefore recommended for better performance. ShardingSphere will also provide configuration options that let the two parsing engines coexist, so users can decide their own trade-off between SQL parsing compatibility and performance.
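The caching idea can be sketched with a memoized parse function keyed by the SQL text, so a statement reused with different bind parameters is parsed only once. Here parse is a trivial stand-in for a real parsing engine:

```python
# A sketch of parse-result caching keyed by the SQL text: the expensive
# parse runs once per distinct statement, and repeats hit the cache.
from functools import lru_cache

calls = {"count": 0}

@lru_cache(maxsize=1024)
def parse(sql):
    calls["count"] += 1          # count real parses to show the cache working
    return ("stmt", sql)         # placeholder for an abstract syntax tree

parse("SELECT * FROM t_user WHERE id = ?")
parse("SELECT * FROM t_user WHERE id = ?")   # served from the cache
```

Because PreparedStatement keeps bind parameters out of the SQL text (only ? placeholders appear), repeated executions share one cache key instead of re-parsing.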
What are ShardingSphere's plans for the near future? Watch the project and leave a message on GitHub!
ShardingSphere has been continuously refined and developed since 2016 and has been recognized by more and more enterprises and individuals: it has gained 5,000+ stars, 2,000+ forks, and 60+ success cases on GitHub. In addition, more and more enterprises and individuals have joined ShardingSphere's open source project, contributing greatly to its growth and development.
We never stop listening to the needs and suggestions of community partners, and we keep developing new and powerful features to make the project robust and reliable!
Open source is not easy, but we are willing to keep pushing toward the ultimate goal!
So, reader, can you give us a hand? Sharing, forwarding, using, discussing, and joining us are the greatest encouragement!
For more information, please visit the official website: