How to convert external links to internal links?

Time:2021-9-20

Summary:If you can convert an outer join to an inner join, you can simplify the query optimization process.

This article is shared from Huawei cloud community《Conversion of gaussdb (DWS) external connection to internal connection》, author: Wang Xiaojuan 8.

In the process of query optimization, the connection order between internally connected tables can be exchanged at will. The conditions involving only a single table in the where or on conditions can be pushed down to the table as the filter conditions of the table; For external connections, the connection order of tables cannot be exchanged at will, and the constraints cannot be pushed down at will. If you can convert an outer join to an inner join, you can simplify the query optimization process.

Conditions for external connection to be converted to internal connection

For convenience of description, two nouns are introduced:

  • Not empty side: the side where all data in the external connection is output. For example, the left table of the left external connection and the right table of the right external connection
  • Empty side: the side of the external connection that will be filled with empty values. For example, the right table of left external connection, the left table of right external connection, and the left table and right table of all external connections

You can convert an external connection to an internal connection as long as one of the following conditions is met:

  • There is a “strict” constraint in the where condition, and the constraint refers to the columns in the table on the air side. In this way, the predicate can filter out the null values generated by the null side, so that the final result is equal to the inner connection.

Refer to the definition of “strict” in PostgreSQL technology insider – in-depth exploration of query optimization as follows:

The precise definition of “strict” is that for a function, operator or expression, if the input parameter is null, the output must also be null. It can be said that the function, operator or expression is strict; But broadly speaking, for a function, operator or expression, if the input parameter is null and the output result is null or false, it is considered that the function or operator is strict. If there is such a strict operator, function or expression in the constraints, because the input is null and the output is null or false, tuples containing null values will be filtered out.

This article followsIntroduction to connection types of guassdb (DWS)The tables and data in this paper are illustrated.

Example 1: query the math scores of students whose math scores are not empty

postgres=# select s.id, s.name, ms.score from student s left join math_score ms on (s.id = ms.id) where ms.score is not null;
 id | name  | score 
----+-------+-------
  1 | Tom   |    80
  2 | Lily  |    75
  4 | Perry |    95
(3 rows)
postgres=# explain select s.id, s.name, ms.score from student s left join math_score ms on (s.id = ms.id) where ms.score is not null;
                                 QUERY PLAN                                 
----------------------------------------------------------------------------
  id |               operation                | E-rows | E-width | E-costs 
 ----+----------------------------------------+--------+---------+---------
   1 | ->  Streaming (type: GATHER)           |     30 |     126 | 36.59   
   2 |    ->  Hash Join (3,4)                 |     30 |     126 | 28.59   
   3 |       ->  Seq Scan on student s        |     30 |     122 | 14.14   
   4 |       ->  Hash                         |     29 |       8 | 14.14   
   5 |          ->  Seq Scan on math_score ms |     30 |       8 | 14.14   
 
 Predicate Information (identified by plan id)
 ---------------------------------------------
   2 --Hash Join (3,4)
         Hash Cond: (s.id = ms.id)
   5 --Seq Scan on math_score ms
         Filter: (score IS NOT NULL)
(14 rows)

Example 2: query the math scores of students whose math scores are higher than 80

postgres=# select s.id, s.name, ms.score from student s left join math_score ms on (s.id = ms.id) where ms.score > 80;
 id | name  | score 
----+-------+-------
  4 | Perry |    95
(1 row)
postgres=# explain select s.id, s.name, ms.score from student s left join math_score ms on (s.id = ms.id) where ms.score > 80;
                                 QUERY PLAN                                 
----------------------------------------------------------------------------
  id |               operation                | E-rows | E-width | E-costs 
 ----+----------------------------------------+--------+---------+---------
   1 | ->  Streaming (type: GATHER)           |     10 |     126 | 36.44   
   2 |    ->  Hash Join (3,4)                 |     10 |     126 | 28.44   
   3 |       ->  Seq Scan on student s        |     30 |     122 | 14.14   
   4 |       ->  Hash                         |     10 |       8 | 14.18   
   5 |          ->  Seq Scan on math_score ms |     10 |       8 | 14.18   
 
 Predicate Information (identified by plan id)
 ---------------------------------------------
   2 --Hash Join (3,4)
         Hash Cond: (s.id = ms.id)
   5 --Seq Scan on math_score ms
         Filter: (score > 80)
(14 rows)

In the above two examples, the conditions where ms.score is not null and where ms.score > 80. If the entered score is null, the constraint returns false, meeting the broad “strict” definition. Therefore, external connections can be eliminated and converted into internal connections. The query plan above has also been verified. Moreover, this external connection elimination can be automatically processed by the query optimizer of the database.

  • In the on connection condition, if the values in the non empty side column are a subset of the non empty side column, and the values on the non empty side are not null. Typically, the columns on the non empty side are foreign keys, but the columns on the empty side are primary keys, and there is a primary foreign key reference relationship between them.
CREATE TABLE student(
  id INTEGER primary key,
  name varchar(50)
);

CREATE TABLE math_score(
  ID integer, -- because gaussdb (DWS) does not support foreign keys, the foreign key definition is omitted here, but the value of this column is guaranteed to be a subset of the ID column in the student table
  score INTEGER
);

INSERT INTO student VALUES(1, 'Tom');
INSERT INTO student VALUES(2, 'Lily');
INSERT INTO student VALUES(3, 'Tina');
INSERT INTO student VALUES(4, 'Perry');

INSERT INTO math_score VALUES(1, 80);
INSERT INTO math_score VALUES(2, 75);
INSERT INTO math_score VALUES(4, 95);

The results of the following external connection are the same as those of the internal connection:

postgres=# select ms.id, s.name, ms.score from student s right join math_score ms on (s.id = ms.id);
 id | name  | score
----+-------+-------
  1 | Tom   |    80
  2 | Lily  |    75
  4 | Perry |    95
(3 rows)

postgres=# explain select ms.id, s.name, ms.score from student s right join math_score ms on (s.id = ms.id);
                               QUERY PLAN                               
-------------------------------------------------------------------------
  id |              operation              | E-rows | E-width | E-costs
 ----+-------------------------------------+--------+---------+---------
   1 | ->  Streaming (type: GATHER)        |     30 |     126 | 36.59  
   2 |    ->  Hash Left Join (3, 4)        |     30 |     126 | 28.59  
   3 |       ->  Seq Scan on math_score ms |     30 |       8 | 14.14  
   4 |       ->  Hash                      |     29 |     122 | 14.14  
   5 |          ->  Seq Scan on student s  |     30 |     122 | 14.14  

 Predicate Information (identified by plan id)
 ---------------------------------------------
   2 --Hash Left Join (3, 4)
         Hash Cond: (ms.id = s.id)
(12 rows)

postgres=# select ms.id, s.name, ms.score from student s join math_score ms on (s.id = ms.id);
 id | name  | score
----+-------+-------
  1 | Tom   |    80
  2 | Lily  |    75
  4 | Perry |    95
(3 rows)

postgres=# explain select ms.id, s.name, ms.score from student s join math_score ms on (s.id = ms.id);
                                 QUERY PLAN                                
----------------------------------------------------------------------------
  id |               operation                | E-rows | E-width | E-costs
 ----+----------------------------------------+--------+---------+---------
   1 | ->  Streaming (type: GATHER)           |     30 |     126 | 36.59  
   2 |    ->  Hash Join (3,4)                 |     30 |     126 | 28.59  
   3 |       ->  Seq Scan on student s        |     30 |     122 | 14.14  
   4 |       ->  Hash                         |     29 |       8 | 14.14  
   5 |          ->  Seq Scan on math_score ms |     30 |       8 | 14.14  

 Predicate Information (identified by plan id)
 ---------------------------------------------
   2 --Hash Join (3,4)
         Hash Cond: (s.id = ms.id)
(12 rows)

Because gaussdb (DWS) does not support foreign keys, this condition that can convert external connection elimination to internal connection cannot be recognized by the optimizer and can be automatically converted, but it can help developers manually identify and eliminate external connections when writing SQL daily.

An interesting rewriting example

There is a use case as follows:

Select count(1)
from student s left join math_score ms on (s.id = ms.id)
where s.id = 2
and ms.score > 70;

postgres=# Select count(1)
postgres-#  from student s left join math_score ms on (s.id = ms.id)
postgres-# where s.id = 2
postgres-# and ms.score > 70;
 count 
-------
     0
(1 row)

postgres=# explain Select count(1)
postgres-#  from student s left join math_score ms on (s.id = ms.id)
postgres-# where s.id = 2
postgres-# and ms.score > 70;
                                             QUERY PLAN                                              
-----------------------------------------------------------------------------------------------------
  id |                            operation                            | E-rows | E-width | E-costs 
 ----+-----------------------------------------------------------------+--------+---------+---------
   1 | ->  Aggregate                                                   |      1 |       8 | 26.51   
   2 |    ->  Streaming (type: GATHER)                                 |      1 |       8 | 26.51   
   3 |       ->  Aggregate                                             |      1 |       8 | 22.51   
   4 |          ->  Nested Loop (5,6)                                  |      3 |       0 | 22.49   
   5 |             ->  Index Only Scan using student_pkey on student s |      1 |       4 | 8.27    
   6 |             ->  Seq Scan on math_score ms                       |      1 |       4 | 14.21   
 
     Predicate Information (identified by plan id)    
 -----------------------------------------------------
   5 --Index Only Scan using student_pkey on student s
         Index Cond: (id = 2)
   6 --Seq Scan on math_score ms
         Filter: ((score > 70) AND (id = 2))
(15 rows)

As can be seen from the above plan, the left outer connection in SQL has been optimized as a cross connection because there is no join condition on operator nest loop No. 4.

Finally, this query is to calculate the total row number of two table join results. For cross joins, the number of rows joined by two tables is equal to the product of the number of rows in the left table and the number of rows in the right table. Therefore, this query can be modified to the following equivalent query:

explain select lcount * rcount as count
from (select count(1) lcount from student where id = 2) s,
     (select count(1) rcount from math_score where score > 70 and id = 2) ms;

postgres=# select lcount * rcount as count
postgres-# from (select count(1) lcount from student where id = 2) s,
postgres-#      (select count(1) rcount from math_score where score > 70 and id = 2) ms;
 count 
-------
     1
(1 row)

postgres=# explain select lcount * rcount as count
postgres-# from (select count(1) lcount from student where id = 2) s,
postgres-#      (select count(1) rcount from math_score where score > 70 and id = 2) ms;
                                              QUERY PLAN                                              
------------------------------------------------------------------------------------------------------
  id |                            operation                             | E-rows | E-width | E-costs 
 ----+------------------------------------------------------------------+--------+---------+---------
   1 | ->  Streaming (type: GATHER)                                     |      1 |      16 | 26.56   
   2 |    ->  Nested Loop (3,7)                                         |      1 |      16 | 22.56   
   3 |       ->  Aggregate                                              |      1 |       8 | 8.29    
   4 |          ->  Streaming(type: BROADCAST)                          |      1 |       8 | 8.29    
   5 |             ->  Aggregate                                        |      1 |       8 | 8.28    
   6 |                ->  Index Only Scan using student_pkey on student |      1 |       0 | 8.27    
   7 |       ->  Materialize                                            |      1 |       8 | 14.25   
   8 |          ->  Aggregate                                           |      1 |       8 | 14.23   
   9 |             ->  Streaming(type: BROADCAST)                       |      1 |       8 | 14.23   
  10 |                ->  Aggregate                                     |      1 |       8 | 14.22   
  11 |                   ->  Seq Scan on math_score                     |      1 |       0 | 14.21   
 
    Predicate Information (identified by plan id)   
 ---------------------------------------------------
   6 --Index Only Scan using student_pkey on student
         Index Cond: (id = 2)
  11 --Seq Scan on math_score
         Filter: ((score > 70) AND (id = 2))
(20 rows)

Through this rewriting, the aggregation operation can be pushed to each subtree of the nested loop. When the amount of data in each subtree of the nested loop is relatively large, the aggregation can greatly reduce the result set and the amount of data participating in the join, so as to improve the performance.

The following examples are left for reflection:

Select sum(score)
From student s left join math_score ms on (s.id = ms.id)
Where s.id = 2
  And ms.score > 70;

Click focus to learn about Huawei cloud’s new technologies for the first time~