“Distributed technology topic”: analyzing the parsing and execution process of an SQL statement


The almighty programmer fires off a magical SQL statement. In an instant, disk IO charges like a thousand galloping horses, memory gulps data like a whale drinking the sea, and the CPU spins at 360% load. A moment later, a wonderful result appears.

What does the database actually do behind an SQL statement? This article walks through the parsing and execution process of SQL in plain terms.

1、 Introduction to SQL

SQL is a concise data manipulation language invented in the 1970s for relational databases.

By function, SQL can be divided into the following three types:

Data Definition Language (DDL)

It is mainly used to create databases, tables, and indexes, set field types, and specify storage and compression formats.

Data Control Language (DCL)

It is mainly used to create users and assign role permissions.

Data Manipulation Language (DML)

It mainly covers data insertion, deletion, modification, and query.
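The three categories can be illustrated with Python's built-in sqlite3 module. SQLite does not support DCL statements such as GRANT, so only DDL and DML are shown; the table and column names are invented for illustration:

```python
import sqlite3

conn = sqlite3.connect(":memory:")

# DDL: create a table and an index, specifying field types
conn.execute("CREATE TABLE account (id INTEGER PRIMARY KEY, owner TEXT, balance REAL)")
conn.execute("CREATE INDEX idx_owner ON account(owner)")

# DML: add, modify, and query data
conn.execute("INSERT INTO account (owner, balance) VALUES ('alice', 100.0)")
conn.execute("UPDATE account SET balance = balance + 50 WHERE owner = 'alice'")
rows = conn.execute("SELECT owner, balance FROM account").fetchall()
print(rows)  # [('alice', 150.0)]
```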

Relationship between the business model and SQL: SQL is the carrier of business logic. Consider the following common businesses in the banking or securities industry.

OLAP-type business

A bank calculates a person's credit score and risk rating

A securities company analyzes the stock preferences of an age group

A bank tags its customer base and builds user profiles

OLTP-type business

An individual opening a brokerage account or buying a stock

Various businesses handled at the counter

Various operations by mobile app users

Each business model is eventually transformed into the execution of an SQL statement. The SQL contains the name of the entity table to query, the grouping and sorting fields, the filter conditions, and so on.

2、 SQL lifecycle

In life, we usually set goals first, then make plans, and finally practice.

In the database, an SQL statement is the goal to be completed; the compiler parses the SQL to generate an execution plan; finally, the executor completes the actual data operations in the storage engine.

The detailed life cycle can be divided into connection establishment, lexical and syntax parsing, logical planning, RBO and CBO optimization, execution planning, permission checking, resource scheduling, distributed task execution, and return of the final results.

1. Establish a connection

The client submits an SQL statement to the server using the JDBC or ODBC protocol.
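In Python the equivalent role is played by DB-API drivers. The sketch below uses sqlite3 as a stand-in for a JDBC/ODBC connection; a real client would pass a host, port, and credentials instead of an in-memory database:

```python
import sqlite3

# Establish a connection (stand-in for a JDBC/ODBC handshake)
conn = sqlite3.connect(":memory:")

# Submit an SQL statement to the server through the connection
cursor = conn.execute("SELECT 1 + 1")
result = cursor.fetchone()[0]
print(result)  # 2
conn.close()
```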

2. Lexical and syntax analysis

Lexical analysis is the first stage of the compilation process. Its purpose is to convert the input character stream into tokens that subsequent stages can process.

The lexical analysis program is generally called a lexical analyzer, or scanner.


Syntax analysis is a logical stage of the compilation process. Its task is to combine the token sequence produced by lexical analysis into grammatical phrases and expressions. The syntax analysis program is generally called a parser.

The essence of SQL parsing is language conversion: turning text into data structures that a program can manipulate.
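A minimal sketch of the lexical-analysis step: splitting an SQL string into tokens with a regular expression. This is a toy scanner covering only a few token kinds, not a production lexer:

```python
import re

# Toy token patterns: a few keywords, identifiers, numbers, and operators
TOKEN_RE = re.compile(
    r"\s*(?:(SELECT|FROM|WHERE)|([A-Za-z_]\w*)|(\d+)|(=|,|\*))", re.IGNORECASE
)

def tokenize(sql):
    tokens = []
    pos = 0
    while pos < len(sql):
        m = TOKEN_RE.match(sql, pos)
        if not m:
            break  # unrecognized input: a real lexer would raise an error
        kw, ident, num, op = m.groups()
        if kw:
            tokens.append(("KEYWORD", kw.upper()))
        elif ident:
            tokens.append(("IDENT", ident))
        elif num:
            tokens.append(("NUMBER", num))
        else:
            tokens.append(("OP", op))
        pos = m.end()
    return tokens

print(tokenize("SELECT id FROM users WHERE age = 42"))
```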

Introduction to common SQL compilation tools

Lex is a classic tool for generating lexical analyzers (scanners); it uses regular expressions to describe each lexical unit.

Yacc is a classic tool for generating parsers; it adopts a bottom-up (LALR) parsing method. Almost any programming language's grammar can be fed to yacc to produce a parser for that language, although the generated code is generally obscure and hard to read.

Lex and yacc are typically used together.

ANTLR is a parser generator written in Java, based on a top-down recursive-descent LL(*) algorithm.

The result of SQL parsing is an abstract syntax tree (AST).
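The idea can be illustrated with a toy hand-written parser that turns a trivial `SELECT <cols> FROM <table>` string into a small AST built from nested dicts. The node shape is invented for illustration; a real SQL grammar is far richer:

```python
def parse_select(sql):
    """Toy parser: 'SELECT col[, col]* FROM table' -> AST as nested dicts."""
    words = sql.replace(",", " , ").split()
    assert words[0].upper() == "SELECT", "only SELECT is supported in this sketch"
    from_idx = [w.upper() for w in words].index("FROM")
    columns = [w for w in words[1:from_idx] if w != ","]
    return {
        "type": "select",
        "columns": columns,
        "from": {"type": "table", "name": words[from_idx + 1]},
    }

print(parse_select("SELECT id, name FROM users"))
```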

3. Logical plan

The AST is then converted into a logical plan, which captures the data processing logic and order expressed by the user.

The logical processing order defines how an SQL statement is executed: when each keyword and clause takes effect.


Note that SQL is not executed top to bottom and left to right in the order it is written.

The actual data processing order of a standard SQL statement is listed below:

From < name of left table >

On < join condition >

< type of join > < name of right table >

Where < where conditions >

Group by < group by fields >

Having < having condition >

Select < fields to query >

Distinct < distinct fields >

Order by < order by fields >

Limit < number of rows >

As the list shows, the order of operations is: first select the tables to operate on, then filter with the on and where conditions, then regroup the data according to the business, and finally sort and truncate the result.

4. RBO and CBO optimization

Because users' skill levels differ, SQL scripts written for the same business can vary greatly, so the logical plan may not be the optimal execution path. The logical plan therefore needs to be optimized.

Databases generally have two kinds of query optimizers:

RBO: rule-based optimization

The optimizer determines the execution plan according to a series of rules hard-coded into the database. For example, preferring an index over a full table scan when querying is RBO optimization, and so is predicate pushdown.
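The index-over-full-scan preference can be observed with SQLite's EXPLAIN QUERY PLAN. Strictly speaking SQLite's planner is cost-based, but the before/after effect of adding an index illustrates the rule; the exact plan wording varies by SQLite version:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE t (id INTEGER, name TEXT)")

# Without an index: the planner must scan the whole table
plan = conn.execute("EXPLAIN QUERY PLAN SELECT * FROM t WHERE name = 'x'").fetchall()
print(plan[0][3])  # e.g. "SCAN t"

# With an index on the filter column: the planner prefers the index
conn.execute("CREATE INDEX idx_name ON t(name)")
plan = conn.execute("EXPLAIN QUERY PLAN SELECT * FROM t WHERE name = 'x'").fetchall()
print(plan[0][3])  # e.g. "SEARCH t USING INDEX idx_name (name=?)"
```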

CBO: cost-based optimization

The optimizer generates multiple candidate execution plans by transforming relational expressions according to optimization rules; the CBO then computes the cost of each candidate from statistics and a cost model, and selects the plan with the lowest cost as the actual execution plan.

Statistics include table row counts, execution-path IO, network IO, and CPU usage. Pre-analyzing a table means counting its data distribution and volume in advance, which belongs to CBO optimization.
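SQLite's ANALYZE command is a small-scale version of this pre-analysis: it gathers row counts and value distribution into the sqlite_stat1 table for the planner to consult. A sketch with invented data:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE t (id INTEGER PRIMARY KEY, grp INTEGER)")
conn.execute("CREATE INDEX idx_grp ON t(grp)")
conn.executemany("INSERT INTO t (grp) VALUES (?)", [(i % 10,) for i in range(1000)])

# Pre-analyze: collect the statistics the cost-based planner will use
conn.execute("ANALYZE")
stats = conn.execute("SELECT tbl, idx, stat FROM sqlite_stat1").fetchall()
print(stats)  # e.g. [('t', 'idx_grp', '1000 100')] -> 1000 rows, ~100 per grp value
```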

5. Execution plan

After the logical plan has been optimized by the RBO and CBO, an execution plan is generated.

In a distributed database, the execution plan can be executed in parallel, and the relationships between tasks can be regarded as a directed acyclic graph (DAG).


The execution plan in the figure above is divided into four subtasks, and each task can be submitted to one or more nodes for execution.
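Such a task DAG can be sketched with Python's standard graphlib module, which yields tasks in a dependency-respecting order. The four task names below are invented to mirror the plan described above:

```python
from graphlib import TopologicalSorter

# Each key depends on the tasks in its set: two scans feed a join, which feeds an aggregate
dag = {
    "scan_left": set(),
    "scan_right": set(),
    "join": {"scan_left", "scan_right"},
    "aggregate": {"join"},
}

order = list(TopologicalSorter(dag).static_order())
print(order)  # e.g. ['scan_left', 'scan_right', 'join', 'aggregate']
```

The two scans have no mutual dependency, so a scheduler is free to run them in parallel on different nodes.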

6. Permission control

When submitting SQL, the user attaches personal authentication information. Permission control verifies whether the current user has the DDL, DCL, DML, and other permissions that the SQL requires.
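A permission check can be sketched as a lookup from the authenticated user to the privilege class the statement requires. The users, grants, and verb-to-privilege mapping below are all invented for illustration:

```python
# Privileges granted to each user (invented for illustration)
GRANTS = {
    "analyst": {"DML"},
    "admin": {"DDL", "DCL", "DML"},
}

# Which privilege class each statement verb requires
REQUIRED = {"SELECT": "DML", "INSERT": "DML", "CREATE": "DDL", "GRANT": "DCL"}

def check_permission(user, sql):
    verb = sql.strip().split()[0].upper()
    return REQUIRED.get(verb) in GRANTS.get(user, set())

print(check_permission("analyst", "SELECT * FROM t"))        # True
print(check_permission("analyst", "CREATE TABLE t (id INT)"))  # False
```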

7. Resource scheduling

After the permission check passes, the task is sent to the task queue of each execution node. Each task must apply for the corresponding CPU and memory resources before it can actually run.

With large data volumes, CPU and memory consumption is enormous, which is why concurrent queries over big data are sometimes as slow as a snail.
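Resource admission can be sketched with a semaphore: a task runs only after acquiring a slot from a bounded pool, so excess tasks wait in line. The pool size and task bodies are invented for illustration:

```python
import threading

pool = threading.Semaphore(2)   # pretend the node has 2 executor slots
lock = threading.Lock()
active, peak = 0, 0
results = []

def run_task(n):
    global active, peak
    with pool:                  # block until a CPU/memory slot is free
        with lock:
            active += 1
            peak = max(peak, active)
        results.append(n * n)   # stand-in for real work
        with lock:
            active -= 1

threads = [threading.Thread(target=run_task, args=(i,)) for i in range(5)]
for t in threads: t.start()
for t in threads: t.join()
print(sorted(results), "peak concurrency:", peak)
```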

8. Distributed task execution

As mentioned earlier, the relationships between distributed tasks can be regarded as a DAG. After each subtask completes, its result becomes the input of the next task, until the top-level task finishes.

Generally speaking, the lowest-level tasks (containing a tablescan operator) are IO-intensive, reading the required data from disk. Intermediate tasks (containing filter or join operators) need the CPU for predicate evaluation or heavy computation such as Cartesian products, while sorting operations may require large amounts of memory.
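This operator pipeline can be sketched with Python generators: a tablescan feeds a filter, whose output feeds an aggregate, with each row flowing upward on demand. The operators are heavily simplified for illustration:

```python
def tablescan(rows):
    # IO-intensive leaf task: read rows from "storage"
    for row in rows:
        yield row

def filter_op(source, predicate):
    # CPU-bound intermediate task: keep only rows matching the predicate
    for row in source:
        if predicate(row):
            yield row

def aggregate(source):
    # Top-level task: sum the amounts of the surviving rows
    return sum(row["amount"] for row in source)

storage = [{"amount": 10}, {"amount": 200}, {"amount": 30}]
total = aggregate(filter_op(tablescan(storage), lambda r: r["amount"] < 100))
print(total)  # 40
```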

9. Return the final result

When all tasks are complete and their results are gathered at the root node, the server returns the SQL execution result to the client.

The above analyzes the parsing and execution process of an SQL statement. The “Distributed technology topic” series is carefully compiled by the domestic database Hubble team. The topic will be continuously updated; you are welcome to keep following it.