The almighty programmer fires off a magic SQL statement. In an instant, disk IO surges like a charging army, memory is devoured like a whale swallowing the sea, and the CPU spins up to 360% load. A moment later, the result appears.
What does the database actually do behind an SQL statement? This article walks through the parsing and execution process of SQL in plain terms.
SQL is a concise data manipulation language invented in the 1970s for relational databases.
By function, SQL can be divided into the following three types:
Data Definition Language (DDL)
It is mainly used to create databases, tables, and indexes, set field types, and specify storage and compression formats.
Data Control Language (DCL)
It is mainly used to create users and assign role permissions.
Data Manipulation Language (DML)
It mainly covers inserting, deleting, updating, and querying data.
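The three types can be seen side by side in a small demo using Python's built-in sqlite3 module (the table and data are made up for illustration; note that SQLite itself has no DCL, so user/role statements such as GRANT appear only as a comment):

```python
import sqlite3

conn = sqlite3.connect(":memory:")

# DDL: define a table and its field types.
conn.execute("CREATE TABLE account (id INTEGER PRIMARY KEY, owner TEXT, balance REAL)")

# DCL (not supported by SQLite, shown for reference only):
#   GRANT SELECT ON account TO some_user;

# DML: add, modify, and query data.
conn.execute("INSERT INTO account (owner, balance) VALUES ('alice', 100.0)")
conn.execute("UPDATE account SET balance = balance + 50 WHERE owner = 'alice'")
balance = conn.execute("SELECT balance FROM account WHERE owner = 'alice'").fetchone()[0]
print(balance)  # 150.0
```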
Relationship between business model and SQL: SQL is the distillation of business logic. Consider the following common businesses in the banking and securities industries.
OLAP-type business
A bank calculates a person's credit score and risk rating
A securities firm counts the stock preferences of an age group
A bank labels its customer base and builds user profiles
OLTP-type business
An individual opens a securities account or buys a stock
Various businesses handled at the counter
Various operations by mobile app users
Each business scenario ultimately becomes the execution of an SQL statement. The SQL contains the names of the entity tables to query, the grouping and sorting fields, the filter conditions, and so on.
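For instance, the "stock preference of an age group" business above could reduce to one aggregation SQL. A runnable sketch using sqlite3 with an invented trades table:

```python
import sqlite3

# Hypothetical OLAP-style query: count stock preference within an age group.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE trades (age INTEGER, stock TEXT)")
conn.executemany("INSERT INTO trades VALUES (?, ?)",
                 [(25, 'AAPL'), (27, 'AAPL'), (26, 'TSLA'), (45, 'IBM')])

# Entity table, filter condition, grouping field, and sort order in one SQL.
rows = conn.execute("""
    SELECT stock, COUNT(*) AS n
    FROM trades
    WHERE age BETWEEN 20 AND 30
    GROUP BY stock
    ORDER BY n DESC
""").fetchall()
print(rows)  # [('AAPL', 2), ('TSLA', 1)]
```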
In life, we usually set goals first, then make plans, and finally put them into practice.
In a database, an SQL statement is the goal to be accomplished; the compiler parses the SQL to generate an execution plan; finally, the executor carries out the actual data operations in the storage engine.
The detailed life cycle can be divided into: connection establishment, lexical and syntax parsing, logical planning, RBO and CBO optimization, execution planning, permission checking, resource scheduling, distributed task execution, and returning the final result.
The client submits an SQL statement to the server over the JDBC or ODBC protocol.
Lexical analysis is the first stage of compilation. Its purpose is to convert the input characters into tokens that subsequent stages can process.
The lexical analysis program is generally called a lexical analyzer, or scanner.
Syntax analysis is a logical stage of the compilation process; its task is to combine the token sequence produced by lexical analysis into grammatical phrases and expressions. The syntax analysis program is generally called a parser.
The essence of SQL parsing is language conversion: turning text into a data structure that a computer program can describe and manipulate.
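A minimal, hand-rolled sketch of the lexical-analysis step: regular expressions describe each lexical unit, and the scanner turns SQL text into (type, value) tokens. Generated scanners from lex or ANTLR work on the same principle; the token categories below are invented for illustration.

```python
import re

# Each lexical unit is described by a regular expression.
TOKEN_SPEC = [
    ("NUMBER", r"\d+"),
    ("IDENT",  r"[A-Za-z_][A-Za-z0-9_]*"),
    ("OP",     r"[=<>*,]"),
    ("SKIP",   r"\s+"),
]
MASTER = re.compile("|".join(f"(?P<{name}>{pat})" for name, pat in TOKEN_SPEC))
KEYWORDS = {"SELECT", "FROM", "WHERE"}

def tokenize(sql):
    """Scan the input text into a list of (kind, value) tokens."""
    tokens = []
    for m in MASTER.finditer(sql):
        kind, value = m.lastgroup, m.group()
        if kind == "SKIP":
            continue  # whitespace carries no meaning
        if kind == "IDENT" and value.upper() in KEYWORDS:
            kind = "KEYWORD"  # promote reserved words
        tokens.append((kind, value))
    return tokens

print(tokenize("SELECT id FROM users WHERE age > 18"))
```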
Lex (the lexical compiler) is a classic tool for generating lexical analyzers (scanners); it uses regular expressions to describe each lexical unit.
Yacc is a classic tool for generating parsers; it uses a bottom-up (LALR) parsing method. The grammar of almost any programming language can be translated into a yacc grammar to produce a parser for that language, although the generated code is generally obscure and hard to read.
Lex and yacc are often used together.
ANTLR is a parser generator written in Java, based on a top-down recursive-descent LL algorithm.
The result of SQL parsing is an abstract syntax tree (AST).
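A sketch of the kind of data structure a parser produces: an abstract syntax tree for a simple SELECT. The node classes below are illustrative only, not the types of any real database.

```python
from dataclasses import dataclass
from typing import List, Optional

@dataclass
class Comparison:
    """A WHERE predicate node, e.g. age > 18."""
    left: str
    op: str
    right: str

@dataclass
class Select:
    """Root node of a simple SELECT statement."""
    columns: List[str]
    table: str
    where: Optional[Comparison] = None

# The statement "SELECT id, name FROM users WHERE age > 18" becomes:
ast = Select(columns=["id", "name"], table="users",
             where=Comparison("age", ">", "18"))
print(ast.table, ast.where.op)  # users >
```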
The AST is then converted into a logical plan, which captures the data-processing logic and order the user wrote.
The logical processing order describes how an SQL statement is actually executed, i.e., when each keyword and clause takes effect.
Note that SQL is not executed top to bottom, left to right, in the order it is written.
The actual data processing order of a standard SQL statement is listed below:
From <name of left table>
On <join condition>
<type of join> Join <name of right table>
Where <where conditions>
Group by <group by fields>
Having <having condition>
Select <fields to query>
Distinct
Order by <order by fields>
Limit <row count>
As we can see, the order of operations is: first choose the tables to operate on, then filter with the ON and WHERE conditions, then regroup the data according to the business logic, and finally sort and limit the output.
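The logical order above can be mimicked step by step over in-memory rows. All table data and conditions below are made up for illustration:

```python
# FROM <table>: the source rows.
orders = [
    {"city": "NY", "amount": 10},
    {"city": "NY", "amount": 20},
    {"city": "LA", "amount": 5},
    {"city": "SF", "amount": 1},
]

# WHERE: filter individual rows before grouping.
rows = [r for r in orders if r["amount"] >= 5]

# GROUP BY city, computing SUM(amount) per group.
groups = {}
for r in rows:
    groups[r["city"]] = groups.get(r["city"], 0) + r["amount"]

# HAVING: filter whole groups after aggregation.
groups = {city: total for city, total in groups.items() if total > 5}

# ORDER BY total DESC, then LIMIT 1.
result = sorted(groups.items(), key=lambda kv: kv[1], reverse=True)[:1]
print(result)  # [('NY', 30)]
```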
Because users' skill levels differ, SQL scripts written for the same business can vary greatly, so the logical plan may not describe the optimal execution path. The logical plan therefore needs to be optimized.
Databases generally have two kinds of query optimizers:
RBO: rule-based optimization
The optimizer determines the SQL execution plan according to a series of rules hard-coded into the database. For example, preferring an index scan over a full table scan is RBO optimization, and so is predicate pushdown.
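Predicate pushdown can be sketched as a simple rewrite on the plan tree: a Filter sitting above a Scan is pushed into the Scan so that less data flows upward. The node shapes below are invented for illustration.

```python
def push_down(plan):
    """If a filter sits directly on a scan, move the predicate into the scan.

    A plan node is a tuple: ("filter", pred, child) or ("scan", table, preds).
    """
    if plan[0] == "filter" and plan[2][0] == "scan":
        _, pred, (_, table, preds) = plan
        return ("scan", table, preds + [pred])  # predicate now evaluated at the scan
    return plan  # rule does not apply; leave the plan unchanged

before = ("filter", "age > 18", ("scan", "users", []))
after = push_down(before)
print(after)  # ('scan', 'users', ['age > 18'])
```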
CBO: cost-based optimization
The optimizer applies transformation rules to the relational expression to generate multiple candidate execution plans, then uses statistics and a cost model to compute the cost of each candidate, and selects the plan with the lowest cost as the one actually executed.
Statistics include table row counts, IO along the execution path, network IO, and CPU usage. Pre-analyzing tables (counting their data distribution and volume in advance) supports CBO optimization.
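A toy cost model in the spirit of CBO: estimate each candidate plan's cost from pre-collected statistics and pick the cheapest. The statistics and cost constants are invented for illustration.

```python
# Pre-analyzed statistics: table size and predicate selectivity.
STATS = {"rows": 1_000_000, "selectivity": 0.001}

def cost_full_scan(stats):
    # Sequential read of every row, at unit cost per row.
    return stats["rows"] * 1.0

def cost_index_scan(stats):
    # Only the matching rows are read, but random IO is pricier per row.
    matching = stats["rows"] * stats["selectivity"]
    return matching * 4.0

plans = {"full_scan": cost_full_scan(STATS),
         "index_scan": cost_index_scan(STATS)}
best = min(plans, key=plans.get)
print(best, plans[best])  # index_scan 4000.0
```

With a highly selective predicate the index scan wins; if selectivity were close to 1, the same model would choose the full scan, which is exactly why the statistics matter.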
After the logical plan is optimized by RBO and CBO, an execution plan will be generated.
In a distributed database, the execution plan can be executed in parallel, and the relationships between tasks can be viewed as a directed acyclic graph (DAG).
The execution plan in the figure above is divided into four subtasks, and each task can be submitted to one or more nodes for execution.
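Driving such a four-task DAG can be sketched with the standard-library graphlib module: each task runs only after its inputs are ready, and its result feeds the next task. Task names and functions are invented, and a real engine would run independent tasks in parallel rather than sequentially.

```python
from graphlib import TopologicalSorter

# name -> (task function, list of upstream task names)
tasks = {
    "scan_a": (lambda deps: [1, 2, 3], []),
    "scan_b": (lambda deps: [2, 3, 4], []),
    "join":   (lambda deps: sorted(set(deps["scan_a"]) & set(deps["scan_b"])),
               ["scan_a", "scan_b"]),
    "root":   (lambda deps: sum(deps["join"]), ["join"]),
}

# Run tasks in dependency order; each task receives its inputs' results.
order = TopologicalSorter({name: spec[1] for name, spec in tasks.items()})
results = {}
for name in order.static_order():
    fn, dep_names = tasks[name]
    results[name] = fn({d: results[d] for d in dep_names})

print(results["root"])  # 5
```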
When submitting the SQL, the user attaches identity authentication information. Permission control verifies whether the current user has the DDL, DCL, DML, and other permissions required by the SQL.
After the permission check passes, the tasks are sent to the task queues of the execution nodes. Each task must apply for the corresponding CPU and memory resources before it can actually run.
With large data volumes, CPU and memory consumption is enormous, which is why concurrent queries over big data are sometimes as slow as a snail.
As mentioned earlier, the relationships between distributed tasks form a directed acyclic graph (DAG). After each subtask completes, its result becomes the input of the next task, until the top-level task finishes.
Generally speaking, the lowest-level tasks (containing a TableScan operator) are IO-intensive and read the required data from disk. Intermediate tasks (containing Filter or Join operators) need the CPU for predicate evaluation or heavy computation such as Cartesian products, while some sort operations require a lot of memory.
When all tasks are completed and their results are gathered at the root node, the server returns the SQL execution result to the client.
The above is an analysis of the parsing and execution process of an SQL statement. The "Distributed Technology Topics" series is carefully compiled by the domestic database Hubble team; the series will be continuously updated, and you are welcome to follow along.