Introduction:This article mainly introduces what is a correlated subquery and how to rewrite the correlated subquery into a SQL query with ordinary semantics.
This article mainly introduces what is a correlated subquery and how to rewrite the correlated subquery into a SQL query with ordinary semantics.
In the background introduction we will talk about the semantics of common correlated subqueries, the benefits of correlated subquery syntax, and the challenges to database systems when it is executed. In the second chapter, we will mainly introduce how to rewrite the correlated subquery into a common query form, that is, disassociation. In Chapter 3 we will introduce optimization methods in disassociation.
A background introduction
A correlated subquery refers to a subquery that is associated with an outer query. Specifically, the column contained in the outer query is used in this subquery.
Because of this flexibility of using associated columns, writing SQL queries in the form of sub-queries can often greatly simplify SQL and make the semantics of SQL queries easier to understand. Below we illustrate this with a few examples using the tpch schema. tpch schema is a typical database of order system, including customer table, orders table, lineitem table, etc., as shown below:
If we want to query "all the information of customers who have never placed an order", then we can use the associated subquery as a filter condition. The sql written out using the correlated subquery is as follows. You can see that the not exists subquery here uses the column c_custkey outside the column.
-- Information of all customers who have never placed an order select c_custkey from customer where not exists ( select * from orders where o_custkey = c_custkey )
If it is not written in the above form, we need to consider the left join of the customer and orders tables first, and then filter out the rows that are not joined. At the same time, we also need each row of the markorder, so that those that are originally null. The query sql is as follows:
-- Information of all customers who have never placed an order select c_custkey from customer left join ( select distinct o_custkey from orders ) on o_custkey = c_custkey where o_custkey is null
From this simple example, you can see that the use of correlated subqueries reduces the difficulty of sql writing while improving readability.
In addition to using correlated columns in exists/in subqueries, correlated subqueries can also appear in where as the values required for filter conditions. For example, in tpch q17, a subquery is used to obtain an aggregate value as a filter condition.
-- tpch q17 SELECT Sum(l1.extendedprice) / 7.0 AS avg_yearly FROM lineitem l1, part p WHERE p.partkey = l1.partkey AND p.brand = 'Brand#44' AND p.container = 'WRAP PKG' AND l1.quantity < (SELECT 0.2 * Avg(l2.quantity) FROM lineitem l2 WHERE l2.partkey = p.partkey);
In addition to appearing in where, correlated subqueries can appear anywhere a single row (scalar) is allowed, such as in a select list. If we need to do a report to summarize some customer information, and want to query the total order amount for each customer, we can use the following sql containing a correlated subquery.
-- Customers and the corresponding total consumption select c_custkey, ( select sum(o_totalprice) from orders where o_custkey = c_custkey ） from customer
For a more complex example, we want to query each customer and the corresponding total amount of orders that have been signed before a certain date. Using correlated subqueries requires only a few minor changes as follows:
select c_custkey, ( select sum(o_totalprice) from orders where o_custkey = c_custkey and '2020-05-27' > ( select max(l_receiptdate) from lineitem where l_orderkey = o_orderkey ) ） from customer
After reading these examples, I believe that everyone has felt the convenience brought by using correlated subqueries. But at the same time correlated subqueries also bring implementation challenges. In order to compute the value of the associated result (the output of the subquery), an iterative implementation is required.
Take the previously discussed tpch 17 as an example:
SELECT Sum(l1.extendedprice) / 7.0 AS avg_yearly FROM lineitem l1, part p WHERE p.partkey = l1.partkey AND p.brand = 'Brand#44' AND p.container = 'WRAP PKG' AND l1.quantity < (SELECT 0.2 * Avg(l2.quantity) FROM lineitem l2 WHERE l2.partkey = p.partkey);
The subquery part here uses the column p.partkey of the outer query.
SELECT 0.2 * Avg(l2.quantity) FROM lineitem l2 WHERE l2.partkey = p.partkey -- p.partkey is the column of the outer query
The optimizer represents this query as a logical tree as shown below:
If the database system does not support viewing the logical tree, you can use the explain command to view the physical plan. The general output is as follows:
+---------------+ | Plan Details | +---------------+ 1- Output[avg_yearly] avg_yearly := expr 2 -> Project expr := (`sum` / DOUBLE '7.0') 3 - Aggregate sum := `sum`(`extendedprice`) 4 -> Filter[p.`partkey` = l1.`partkey` AND `brand` = 'Brand#51' AND `container` = 'WRAP PACK' AND `quantity` < `result`] 5 - CorrelatedJoin[[p.`partkey`]] 6 - CrossJoin 7 - TableScan[tpch:lineitem l1] 8 - TableScan[tpch:part p] 9 - Scalar 10 -> Project result := (DOUBLE '0.2' * `avg`) 11 - Aggregate avg := `avg`(`quantity`) 12 -> Filter[(p.`partkey` = l2`partkey`)] 13 - TableScan[tpch:lineitem l2]
We call the operator that connects the outer query and the subquery CorrelatedJoin (also known as lateral join, dependent join, etc.). Its left subtree is called the outer query (input), and the right subtree is called the subquery. (subquery). The outer column that appears in the subquery is called the associated column. In the chestnut, the associated column is p.partkey.
The corresponding logical plan and related definitions in the example are shown in the following figure. The rows 6-8 in the explain return result are external queries, rows 9-13 are sub-queries, and the associated part is the filter in row 12 of the sub-query.
The output of this operator is equivalent to the result of an iterative execution. That is, the value of each row of the associated column of the left subtree is brought into the right subtree for calculation and a row of results is returned. Somewhat similar to the subquery as a user defined function (udf), the value of the associated column of the outer query is used as the input parameter of this udf. It should be noted that we need the subquery to be deterministic, that is, for the associated columns with the same value, the results returned by each subquery run should be deterministic.
In the chestnut in the above figure, if the outer query has a row with a p.partkey value of 25, then the output of the correlatedjoin corresponding to this row is the result of the following query:
-- When p.partkey = 25, the corresponding subquery is SELECT 0.2 * Avg(l2.quantity) FROM lineitem l2 WHERE l2.partkey = 25
It should be noted that if the result of the evaluation is an empty set, a row of null is returned. And if running the subquery returns more than one row of results, a runtime error should be reported. In the logical plan, use the node enforcesinglerow to constrain.
It can be found from the above introduction that the CorrelatedJoin operator breaks the previous top-down execution mode of the logic tree. Ordinary logical trees are executed from the leaf node to the root node, but the right subtree of CorreltedJoin will be repeatedly executed with the value of the row brought into the left subtree.
The output of correlatedjoinnode is to add a column to the result of the outer query, but it can be seen that the complexity of this iterative execution method is the same as the complexity of the cross join on the part of the tree before the outer query and the subquery are associated.
At the same time, such an iterative execution method is a big challenge for distributed database systems. Because the logic of scheduling at execution time needs to be modified. And we can see that if such an execution method does not cache the results, a lot of repeated results will be calculated.
The optimization rules of the traditional optimizer do not specifically deal with the Correlatedjoin node. In order to support the iterative form of correlated subquery, the Correlatedjoin will be equivalently converted in the initial stage of the optimizer, and the converted logical tree will be joined using join. Common operators such as aggregation are used to calculate the results of correlated subquery. This process is called decorrelation/unnesting. In the next chapter, we mainly introduce the common ways of disassociation.
Two common disassociation methods
For convenience, we only discuss scalar correlated subqueries in this chapter, which are correlated subqueries that output a column of values.
Before discussing how to disassociate, let us summarize that the output of correlated subqueries has the following characteristics:
- The computed result of the correlated join operator is to add a column to the outer query.
- The result of the added column is calculated by bringing the value of the associated column of the outer query into the subquery. When the calculation result exceeds one line, an error is reported, and if the calculation result is empty, null is added.
- Unlike join operators, correlated joins do not change other columns of the outer query (and do not bloat many rows).
The key to disassociating is to make the subquery obtain the value of the corresponding row of the outer query.
In the plan, it is to push the correleted join operator down to the right below the part where the association is generated. When the left and right subtrees of the correlated join operator have no associated columns, the correlated join operator can be converted into a join operator. In this way, the subquery obtains the value of the associated column by joining with the outer query, so that it can be calculated from top to bottom and return to the original calculation method. As shown in the figure below, the rest subquery in the figure below is the subquery part before the association generation part. When the correlated join is pushed below the part where the correlation is generated, it can be converted into a normal join.
Those operators pushed by correlated join need to be rewritten to maintain equivalence (subquery becomes subquery' in the chestnut in the above figure).
1 Pushdown rule
The paper Orthogonal Optimization of Subqueries and Aggregation  gives equivalent transformation rules for pushing down correlatedjoin_ operators below other operators (filter, project, aggregation, union, etc.). However, the correlatedjoin_ operator in the paper will filter the number of rows of the outer query, similar to the inner join (called in the paper). Here we discuss the more general relatedjoin (called in the paper) similar to left join, and discuss what rewrites need to be done to ensure that the number of rows in the outer query is not filtered.
Due to space limitations, below we only introduce the equivalence rules that are pushed down to filter, project, and aggregation operators.
For simplicity, we have removed enforcesinglerow from the logical tree.
Convert 1 Convert to join when not associated
Recalling the above, the left subtree of the correlated join operator is input, and the right subtree is subquery. When the left and right subtrees of the correlated join are not related, the result of the subquery is the same for each row of the outer query at this time.
We can convert the correlated join into a normal leftjoin operator without join criteria.
Note: Enforcesinglerow needs to be added to the subquery to ensure that the join semantics are the same as correlatedjoin (without causing input inflation).
Convert 2 to join when simple associative condition
When the uppermost node in the right subtree of a correlated join is an associated filter and the lower one is not associated, the filter can be directly placed in the condition of the left join, or it can be understood as a filter lift. As shown below:
Transform 3 pushes down through the filter
Correlatedjoin* in the paper can be directly pushed through the filter. If the correlated join needs to be pushed down, the filter needs to be rewritten and rewritten as a project with case when. It should output null when the row of the subquery does not satisfy the filter's condition.
Transition 4 Push down through the project
The correlated join pushes down the project, and the output column of the input needs to be added to the project.
Transition 5 pushes down through aggregation
When the correlated join is pushed down to the aggregation with group by, the aggregation needs to be rewritten.
Rewritten to add all columns of the outer query to the group by column of aggregation. This requires that the external query must have a key, if not, a temporary key needs to be generated. The operator that can be generated is the assignuniqueid operator in the graph.
If aggregation is global, additional processing is required. As shown below:
When the correlated join is pushed down to the global aggregation, the input column (and key) needs to be added to the aggregation as the group by column. This pushdown rule also requires a premise, that is, the aggregation function needs to satisfy the characteristic agg(Ø)=agg(null). This means that the aggragtion function needs to evaluate the empty set and null the same.
Note: In the syntax of mysql and AnalyticDB for MySQL (a cloud native data warehouse developed by Alibaba Cloud , compatible with mysql syntax, hereinafter referred to as ADB), sum, avg, etc. do not meet this feature. The average of an empty set is 0, and taking the average of any set containing null values results in a null not 0. Therefore, each row in the mark subquery needs to be specially treated for the empty set, which will not be explained here.
The paper Orthogonal Optimization of Subqueries and Aggregation  repeatedly uses the above rules to push down correlatedjoins until correlatedjoins can be converted into ordinary joins.
Bringing it into the chestnut of the previous tpch q17, we first push the correlated join to the project in the subquery, and the query becomes:
Then push down through this agg and rewrite this agg, as shown below:
Here we ignore avg(Ø)!=avg(null). If you consider this situation, you need to mark all the rows in the subquery, and after the correlated join, the empty set is specially processed according to the result of the subquery combined with the value of mark (the value of the marked row is changed from null to 0). Interested readers can refer to the final plan for q17 in the next sheet.
Then directly call the previous rule 2, and mention this filter. In this way, the query becomes a normal unrelated query.
2 Result multiplexing
Recalling what was said in the previous section, the query result of the subquery is calculated after bringing in the value of the associated column of each row, so it is obvious that the result calculated by the associated column with the same value into the subquery is exactly the same. In the example above, for the same p.partkey, the results of the subqueries output by correlatedjoin are equal. As shown in the following figure, if the external query partkey is 25, the associated subquery generated is exactly the same, so the result is naturally the same.
In 2015, Newmann's paper Unnesting Arbitrary Queries introduced a method which is to first take distinct from the associated column in the external query, and then perform a left join between the value returned by the correlated join and the original external query according to the associated column, as shown in the following figure :
The condition of not distinct join here corresponds to <=> in mysql, and the result of null<=>null is true, which can be joined together.
In the previous example, as shown in the figure below, the associated column partkey of the external query is first distinguished, and then the calculation result of the subquery is brought in, and finally the corresponding result is connected to the original external query through join.
If the above transformation is done, then we can consider that the associated column of the new input is always distinct. And now the relatedjoin* operator can allow input columns to be filtered. In addition to not performing repeated sub-query calculations for the same column, the advantages of this are mainly the following two:
- The new outer query will always have the key, because the distinct has passed.
- The correlatedjoin* operator is simpler to push down because it filters the columns of the outer query (no assignuniqueid required, no need to keep all rows).
After the above conversion, and then applying the previous equivalent pushdown rule, we can push correlatedjoin* down to a place where the left and right subtrees are not related, thus rewriting it as inner join.
If disassociation is performed according to the method of Unnesting Arbitrary Queries , a part of the input result needs to be multiplexed, and this multiplexing requires the support of the execution engine. It should be noted that when the system does not support multiplexing, we need to execute the input subtree twice (as shown in the figure below). At this time, the result of the input subtree needs to be deterministic, otherwise the solution cannot be solved by this method. association.
Optimization of three-correlation subquery
In the ADB optimizer, the logical plan will be matched and transformed according to each transformation rule, which means that after the association is dissociated, it is not necessary to care about the efficiency of the plan generated by the disassociation, and it will be directly handed over to the subsequent optimization rules. But the reality is not so good, because the follow-up rules are not complete, and the relationship between the outer query and the subquery is lost after disassociation, we hope to optimize the plan as much as possible when disassociating.
1 exists/in/filter associated subquery
In the previous chapters, for simplicity, we only discussed scalar subqueries. Because exists/in these subqueries can be rewritten as scalar subqueries. For example, rewrite exists to count(*) > 0
However, it can be seen that if the return result of the subquery is used to filter the rows of the outer query, it will actually simplify the whole process of disassociation. Therefore, we perform special processing on subqueries such as exists/in, and distinguish them during grammar parsing. In the process of disassociation, if you can use the semijoin/antijoin operator to disassociate, disassociate directly, otherwise it will be converted into a scalar subquery, that is, the form of correlatedjoin.
2 Raising the associated conditions
Seeing this, you will find that with the push down of correlatedjoin, the logical tree will become more complex, so we will carry out the lifting of the correlation operator inside the subquery before pushing down. When the logic is that the higher the operator that generates the association, the correlatedjoin can be pushed to the lower part of the association earlier. For example the following query:
SELECT t1.c2 FROM t1 WHERE t1.c2 < ( SELECT 0.2 * max(t2.x) FROM t2 WHERE t2.c1 = t2.c1 GROUP BY t2.c1 );
Here, since t2.c1 = t2.c1 can be pushed to agg (because this is a condition on the group by column for the subquery), we can perform the following transformation. First lift the associated filter (sometimes it needs to be rewritten), so just push the correlatedjoin through the filter and call transformation 2.
A more specific example is the tpch q17 mentioned above. The case where the scalar subquery is used in the filter condition can also be further rewritten.
The following figure shows the logical plan after pushing down the correlated join and rewriting it to the left join according to the theory mentioned earlier.
Since this scalar subquery is used as a filter condition, in this case, the outer query corresponding to the subquery returning null as no result will definitely be filtered out. So correlatedjoin can be directly converted to correlatedjoin*, plus the filter is lifted, we can get the following plan. The advantage of this rewriting is that agg (early agg) can be performed before join. The downside is that if you do not handle it carefully, it is easy to cause semantic inequality and count bugs.
3 Cost-dependent subquery optimization
Use the window operator to disassociate
Looking back on what we have said so far, the most impressive thing is that the correlatedjoin operator adds a column to the outer query. And his behavior is similar to the window operator. The semantics of the window operator is not to change the number of input lines, but to add a value calculated in the window frame to each line. So we can use the window operator for disassociation. If you are interested, you can refer to these two papers Enhanced Subquery Optimizations in Oracle and WinMagic: Subquery Elimination Using Window Aggregation.
The rewrite of window disassociation is that when the outer query contains all the tables and conditions in the subquery, you can directly use window to splicing the results of the subquery to the outer query. His advantage is that it saves a lot of tablescan. For example tpch q2. The following can be rewritten:
This can be rewritten as window because the outer query contains all the tables and conditions of the inner query. And the agg function min also satisfies the feature agg(Ø)=agg(null) (if not, you need to mark the line and rewrite the output with case when).
It can be seen that the number of tablescans is greatly reduced after rewriting. Further, the optimization rules behind the optimizer will perform the order exchange of join and window according to the primarykey information and the characteristics of the agg function to further reduce the amount of data input by the window operator (filter-join pushdown).
These benefits are covered in many articles. Let's discuss the bad aspects of this rewriting here:
- For example, when pk fails to show that the function that provides /agg is sensitive to duplicates, the window operator will block the push-down of filter-join, thus interrupting the joingraph and causing the intermediate result of join to become larger.
- If rewritten as a join of two subtrees, filter-join can be pushed down to one of the subtrees. After the window is rewritten, filter-join cannot be pushed down.
- Under the execution model of the pipeline/& using cte, the benefits obtained by scanning the table are limited.
- The traditional optimizer's optimization processing/optimization rules for join&agg are much better/richer than for window.
In summary, when to use window to rewrite correlated subqueries requires estimation of benefits and costs.
Pushdown of CorrelatedJoin in outer query
Before pushing the correlatedJoin down to the subquery direction, we will push down the correlatedjoin in the outer query (such as pushing the cross join, etc.).
This is done because correlatedJoin will never cause data bloat, so should theoretically be done earlier. But in fact, correlatejoin may also cut joingraph after pushing down, causing similar problems to window rewriting.
4 Use of Equivalent Columns
If there is a column equivalent to the outer column in the subquery, you can first use this column to rewrite the associated column in the subquery to reduce the associated place to simplify the query. Here is a simple example.
Select t1.c2 From t1 Where t1.c3 < ( Select min(t2.c3) From t2 Where t1.c1 = t2.c1 group by t1.c1 ) -- use t2.c1 instead of t1.ct in the subquery to simplify Select t1.c2 From t1 Where t1.c3 < ( Select min(t2.c3) From t2 Where t1.c1 = t2.c1 group by t2.c1 )
5 Optimization rules related to subqueries
On the one hand, the properties of the operator correaltedjoin give us some information for optimization. Here are some examples:
- The number of rows after the correaltedjoin operator is the same as the number of rows in the left subtree.
- The output of enforcesinglerow is 1 line.
- The associated columns of the outer query determine (function dependency) the newly added output columns of the correaltedjoin.
- The key generated by assignuniqueid has unique attributes, etc., which can be used to simplify aggregation and group by later.
- Sorts in subqueries can be trimmed.
On the other hand, in the rewriting of sub-queries, the simplification of sub-queries can be performed through attribute deduction. for example:
- If the original outer query is unique, there is no need to add the uniqueid column.
- If the output of the child node of enforcesinglerow is always 1 row, it can be clipped.
- The subquery associated with the project, as shown in the figure below, is rewritten as the exists subquery in some cases.
select t1.orderkey, ( select min(t1.orderkey) from orders t2 where t2.orderkey > t1.orderkey ) from orders t1 order by 1
6 Points to pay attention to
The most important things to pay attention to in subquery disassociation are two places, one is to ensure that only one column is added to the outer query, and the other is to deal with null values.
Often mentioned in the literature is a place where classical disassociation is prone to error. For example, in the following query, we have a precondition that t1.c3 are all less than 0. In this case, the associated conditions involved in the subquery should not have any filtering degree. And rewritten as inner join will filter out some rows. Semantically not equivalent.
Select t1.c2 From t1 Where t1.c3 < ( Select COUNT (*) From t2 Where t1.c1 = t2.c1 )
Another error-prone place is LeftMarkJoin in the paper Unnesting Arbitrary Queries , which outputs the same semantics as in. Simply put, it is the result of the following query.
select t1.c1 in ( select t2.c1 from t2) from t1
The logical plan corresponding to this query is as follows:
The output result is the result of adding a column of in to the result of the left subtree. The result of in has three possibilities: true, false and null.
In a distributed environment, repartitioning and placing this operator can easily cause calculation errors related to the null value.
To give a simple example, when leftmarkjoin is the execution mode of repartition, the data in the left table and the right table will be redistributed and reshuffled according to the hash value of c1. Then the null rows in t1.c1 will be shuffled to the same executor. At this time, if no data in the right table is shuffled to this machine, then this executor does not know whether to output null or false for these rows of null. Because the result of null in an empty set is false, and the result of null in any non-empty set is null. At this point, the executor does not know whether the right table is empty.
Efficiency after disassociation
At the beginning, we mentioned the iterative execution method. Here we need to explain that for some correlated subqueries, even if the correlation is resolved into operators such as join/agg, the cost of a cross join is required to calculate the query result.
For example, in the following two queries, the first one is the appearance of our common associated subquery, which can be converted into the form of inner join + early agg. After the second disassociation, it will become a left join on non-equivalent condition (the cost is the same as the cross join).
-- sql 1 SELECT t1.c1 FROM t1 WHERE t1.c2 > ( SELECT min(t2.c2) FROM t2 WHERE t2.c1 = t1.c1 ); -- sql 2 SELECT t1.c1 FROM t1 WHERE t1.c2 > ( SELECT min(t2.c2) FROM t2 WHERE t2.c1 > t1.c1 );
The plan after sq1 is disassociated is as follows:
The plan after sql2 disassociation is as follows:
For sql1, it is semantically understood that each row of the outer query brought into the subquery does not overlap the rows scanned, so the cost is the same as the equivalent condition of innerjoin on. In addition, the result of min in the subquery corresponding to the same outer row is the same, and early agg can be applied to further optimize.
For sql2, it is semantically understood that each row of the outer query must be brought into the subquery to scan all the rows in order to determine what the output of the subquery should be under the condition t2.c1 > t1.c1 . This cost cannot be saved by optimization in order to calculate the result. But the output is always the same for the same t1.c1, and the multiplexing of results in Unnesting Arbitrary Queries  can still yield optimizations.
 Galindo-Legaria, César and Milind Joshi. "Orthogonal Optimization of Subqueries and Aggregations." ACM SIGMOD Record 30.2 (2001): 571-581.
 Neumann, Thomas and Alfons Kemper. "Unnesting Arbitrary Queries." Business, Technology, and Network Database Systems (BTW 2015) (2015).
 Bellamkonda, Srikanth, et al. "Enhanced Subquery Optimization in Oracle." VLDB Foundation Proceedings 2.2 (2009): 1366-1377
 Zuzarte, Calisto, et al. "Winmagic: Eliminating Subqueries Using Windowed Aggregation." Proceedings of the 2003 ACM SIGMOD International Conference on Data Management. 2003.
 Neumann, Thomas, Viktor Leis and Alfons Kemper. "The Complete Story of Connectivity (inHyPer)." Business, Technology and Network Database Systems (BTW 2017) (2017).
 Galindo-Legaria, CA, Parameterized Queries and Nested Equivalents. Technical Report, Microsoft, 2001. MSR-TR-2000-31, 2000.
This article is original content of Alibaba Cloud and may not be reproduced without permission.