Analysis of the underlying operation principle of SQL query

Time:2021-7-22

SQL language everywhere。 SQL is not only the exclusive skill of technical personnel, it seems that everyone can write SQL, just as everyone is a product manager. If you do background development, crud is a routine. If you do data warehouse development, writing SQL may occupy most of your working time. When we understand the select syntax, we also need to understand the underlying principle of select execution. Only in this way can we have a deeper understanding of SQL. This article will gradually decompose the execution process of SQL, hoping to help you.

Data preparation

The purpose of this paper is to explain the execution process of SQL query. It does not involve too complex SQL operations. It mainly involves two tablescitizenandcityThe specific data are as follows:

CREATE TABLE citizen ( 
    name CHAR ( 20 ), 
    city_id INT ( 10 ) 
);


CREATE TABLE city (
    city_id INT ( 10 ), 
    city_name CHAR ( 20 ) 
);

INSERT INTO city
VALUES
    (1, "Shanghai"),
    (2, "Beijing"),
    (3) "Hangzhou");
    
    
INSERT INTO citizen
VALUES
("tom",3),
("jack",2),
("robin",1),
("jasper",3),
("kevin",1),
("rachel",2),
("trump",3),
("lilei",1),
("hanmeiei",1);

Query execution order

The query statements involved in this paper are as follows: join the city table and the city table, and then filter out the city_ name != ” Shanghai “data, and then according to the city_ Name groups the cities whose total number of people in each city is more than 2. The details are as follows:

Query statement

SELECT 
    city.city_name AS "City",
    COUNT(*) AS "citizen_cnt"
FROM citizen
  JOIN city ON citizen.city_id = city.city_id 
WHERE city.city_ name != ' 'shanghai '
GROUP BY city.city_name
HAVING COUNT(*) >= 2
ORDER BY city.city_name ASC
LIMIT 2

Implementation steps

The writing order of the above SQL query statement is as follows:

SELECT ... FROM ... WHERE ... GROUP BY ... HAVING ... ORDER BY ...

But the execution sequence is not like this. The specific execution sequence is as follows:

  • 1. Get data(From, Join)
  • 2. Filtering data(Where)
  • 3. Grouping(Group by)
  • 4. Group filtering(Having)
  • 5. Return to the query field(Select)
  • 6. Sorting and pagination(Order by & Limit / Offset)

Scream tip: This paper aims to explain the general underlying principle of SQL execution, and does not consider its optimization technology, such as predicate push down, projection push down, and so on.

The underlying principles of execution

In fact, the SQL execution order mentioned above is the so-called underlying principle. When we execute a select statement, each step will produce aVirtual tableThe virtual table is used as input in the next step. Note that these processes are transparent to users.

You can notice that select starts from the step from. In this stage, if multiple tables are used to join, the following steps will be followed:

Get data(From, Join)

  • First, the Cartesian product is obtained by cross join, which is equivalent to the virtual table vt1-1;
  • Then, the virtual table vt1-1 is used as input and the virtual table vt1-2 is output;
  • Add external rows. If we use left join, right link or full join, external rows will be involved, that is, external rows will be added to virtual table vt1-2 to get virtual table vt1-3

Filtering data(Where)

After the above steps, we get a final virtual table VT1, on which we use where filter to filter out the data that does not meet the conditions, so as to get the virtual table vt2.

Grouping(Group by)

After where filtering, we get vt2. Next, the group by operation is performed to get the middle virtual table VT3.

Group filtering(Having)

On the basis of virtual table VT3, we use having to filter out the aggregate data that does not meet the conditions, and get vt4.

Return to query field(Select)

When we have finished the condition filtering, we can filter the fields extracted from the table, that is, enter the select and distinct phases. First, the target fields are extracted in the select phase, and then the duplicate rows are filtered out in the distinct phase to get the middle virtual tables vt5-1 and vt5-2, respectively.

Sorting and pagination(Order by & Limit / Offset)

After we extract the desired field data, we can sort according to the specified field, that is, the order by stage, and get the virtual table VT6. Finally, on the basis of VT6, take out the record of the specified row, that is, the limit stage, and get the final result, which corresponds to the virtual table vt7

Detailed implementation step analysis

Step 1: get data(From, Join)

FROM citizen
JOIN city 

The first step in this process is to execute the statements in the from clause and then the join clause. The result of these operations is the Cartesian product of two tables.

name city_id city_id city_name
tom 3 1 Shanghai
tom 3 2 Beijing
tom 3 3 Hangzhou
jack 2 1 Shanghai
jack 2 2 Beijing
jack 2 3 Hangzhou
robin 1 1 Shanghai
robin 1 2 Beijing
robin 1 3 Hangzhou
jasper 3 1 Shanghai
jasper 3 2 Beijing
jasper 3 3 Hangzhou
kevin 1 1 Shanghai
kevin 1 2 Beijing
kevin 1 3 Hangzhou
rachel 2 1 Shanghai
rachel 2 2 Beijing
rachel 2 3 Hangzhou
trump 3 1 Shanghai
trump 3 2 Beijing
trump 3 3 Hangzhou
lilei 1 1 Shanghai
lilei 1 2 Beijing
lilei 1 3 Hangzhou
hanmeiei 1 1 Shanghai
hanmeiei 1 2 Beijing
hanmeiei 1 3 Hangzhou

After the execution of from and join, the required rows will be filtered according to the on condition of join

ON citizen.city_id = city.city_id
name city_id city_id city_name
tom 3 3 Hangzhou
jack 2 2 Beijing
robin 1 1 Shanghai
jasper 3 3 Hangzhou
kevin 1 1 Shanghai
rachel 2 2 Beijing
trump 3 3 Hangzhou
lilei 1 1 Shanghai
hanmeiei 1 1 Shanghai

Step 2: filtering data(Where)

After you get the row that meets the condition, it is passed to the where clause. This evaluates each row using a conditional expression. If the row does not evaluate to true, it is removed from the collection.

WHERE city.city_ name != ' 'shanghai '
name city_id city_id city_name
tom 3 3 Hangzhou
jack 2 2 Beijing
jasper 3 3 Hangzhou
rachel 2 2 Beijing
trump 3 3 Hangzhou

Step 3: grouping(Group by)

The next step is to execute the group by clause, which groups the rows with the same value. After that, all select expressions are evaluated by group, not by row.

GROUP BY city.city_name
GROUP_CONCAT(citizen.name) city_id city_name
jack,rachel 2 Beijing
tom,jasper,trump 3 Hangzhou

Step 4: Group Filtering(Having)

The grouped data is filtered by the predicates contained in the having clause

HAVING COUNT(*) >= 2

Step 5: return to query field(Select)

In this step, the processor will evaluate what the query result will print and whether there are some functions to run on the data, such as distinct, Max, sqrt, date, lower, etc. In this case, the select clause only prints the city name and the count (*) value of its corresponding group, and uses the identifier “city” as the city_ The alias of the name column.

SELECT 
    city.city_name AS "City",
    COUNT(*) AS "citizen_cnt"
city citizen_cnt
Beijing 2
Hangzhou 3

Step 6: sorting and pagination(Order by & Limit / Offset)

The final processing step of the query involves the sorting of the result set and the output size. In our example, we arrange it in ascending alphabetical order and output two data results.

ORDER BY city.city_name ASC
LIMIT 2
city citizen_cnt
Beijing 2
Hangzhou 3

summary

This paper mainly analyzes the execution order and underlying principle of SQL statements, the basic SQL query will be divided into six steps. Combined with specific examples, this paper gives the detailed results of each step, so as to have a deeper understanding of the underlying principles of its implementation.

Official account “big data technology and multi warehouse”, reply to “information” to receive big data package.