Overview & background
MySQL had long been criticized for not implementing hash join. The 8.0.18 release finally brings this feature, which is gratifying. I have sometimes wondered why MySQL went so long without hash join support. Partly, I think, it is because MySQL is mostly used in simple OLTP scenarios and is widely deployed in Internet applications, so the demand was never that urgent. On the other hand, development used to rely entirely on the community, which limited how fast MySQL could evolve; since Oracle acquired MySQL, the pace of releases has clearly accelerated.
The hash join algorithm itself is not complex. The complex part is in the optimizer: deciding whether to choose hash join at all, and which table to use as the outer (build) side versus the inner (probe) side. In any case, with hash join available, the optimizer now has one more option when picking a join algorithm. MySQL is pragmatic, and I believe this enhancement also answers a common question: some features are missing not because of inability, but because of prioritization.
Before 8.0.18, MySQL only supported the nested-loop join algorithm. The simplest form is the simple nested-loop join, and MySQL has made several optimizations on top of it: block nested-loop join, index nested-loop join, and batched key access (BKA). With these optimizations, the urgency of hash join was alleviated to a certain extent. We will cover these join optimizations in a separate section below; first, let's talk about hash join.
Hash join algorithm
The nested-loop join algorithm is simply a double loop: traverse the outer table (driving table), and for each outer row, traverse the inner table, check whether the join condition holds, and emit the record to the next execution node if it does. Algorithmically this costs O(M × N). Hash join is an optimization for equi-join scenarios. The basic idea is to load the outer table into memory and build a hash table on it; then a single pass over the inner table is enough to complete the join and output the matching records. If all the build-side data fits in memory, the logic is simple; this variant is generally called CHJ (classic hash join), and MariaDB implemented this hash join algorithm earlier. If the data cannot all fit in memory, it has to be loaded and joined in batches. The following sections describe how these join algorithms are implemented.
Generally, hash join consists of two phases: the build phase, which creates the hash table, and the probe phase, which probes it.
Build: traverse the outer table, using the join condition columns as the key and the required query columns as the value, to create the hash table. This involves choosing which table becomes the outer (build) table: the optimizer estimates the sizes of the two tables (result sets) involved in the join and picks the smaller one, so that the hash table is more likely to fit in the limited memory.
Probe: after the hash table is built, the inner table is traversed row by row. For each inner-table record, a hash value is computed on the join condition columns and looked up in the hash table; on a match the joined record is output, otherwise the row is skipped. Once all inner-table records have been traversed, the whole process is finished. Refer to the figure below for the process, which comes from the official MySQL blog.
On the left is the build phase and on the right the probe phase; the equi-join condition is on the country id column, the countries table is the outer (build) table, and the persons table is the inner (probe) table.
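As an illustration of the build/probe flow described above (a sketch, not MySQL's actual implementation), here is a classic hash join in Python; the countries/persons rows are made-up toy data loosely modelled on the example:

```python
from collections import defaultdict

def classic_hash_join(outer_rows, inner_rows, outer_key, inner_key):
    """Classic (in-memory) hash join: build on the outer input, probe with the inner."""
    # Build phase: hash the outer table on the join key.
    buckets = defaultdict(list)
    for row in outer_rows:
        buckets[outer_key(row)].append(row)
    # Probe phase: a single pass over the inner table.
    for row in inner_rows:
        for match in buckets.get(inner_key(row), ()):
            yield match + row  # emit the joined record

# Toy rows loosely modelled on the countries/persons example
countries = [(1, "Norway"), (2, "Sweden")]           # (country_id, name)
persons = [("Alice", 1), ("Bob", 2), ("Carol", 1)]   # (name, country_id)
result = list(classic_hash_join(countries, persons,
                                outer_key=lambda c: c[0],
                                inner_key=lambda p: p[1]))
```

Note that the inner table is traversed exactly once, regardless of how many outer rows there are, which is exactly the advantage over a simple nested loop.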
On-Disk Hash Join
The limitation of CHJ is that the whole outer table must fit in memory. In MySQL, the memory a join may use is controlled by the join_buffer_size parameter. If the memory required exceeds join_buffer_size, CHJ has no choice but to split the outer table into several chunks, build a hash table for each chunk in turn, and traverse the inner table once to probe each chunk. If the outer table is split into N chunks, the inner table is scanned N times, which is rather weak.

In MySQL 8.0, when the memory required by a join exceeds join_buffer_size, the build phase first hash-partitions the outer table, writing temporary partition files to disk; then, in the probe phase, the inner table is partitioned with the same hash function. Because the hash function is the same, rows with the same key (the same join condition values) must land in partitions with the same number. Next, CHJ is performed between the outer and inner partitions with matching numbers; once every partition pair has been joined, the whole join is finished. The cost of this algorithm is roughly one extra write and one extra read of each table's data, which is much better than scanning the inner table N times as before.
In the figure, the upper left shows outer-table partitioning, the upper right shows inner-table partitioning, and the bottom shows the build + probe process on each partition pair.
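A rough sketch of this partitioned approach, with in-memory lists standing in for the temporary disk files (the partition count and rows are invented for illustration):

```python
from collections import defaultdict

N_PARTITIONS = 4  # invented; real partition counts depend on data size and memory

def hash_partition(rows, key):
    # In MySQL 8.0 each partition would be a temporary file on disk.
    parts = [[] for _ in range(N_PARTITIONS)]
    for row in rows:
        parts[hash(key(row)) % N_PARTITIONS].append(row)
    return parts

def on_disk_hash_join(outer_rows, inner_rows, okey, ikey):
    # The same hash function on both sides guarantees that equal join keys
    # land in partitions with the same number.
    outer_parts = hash_partition(outer_rows, okey)
    inner_parts = hash_partition(inner_rows, ikey)
    for opart, ipart in zip(outer_parts, inner_parts):
        # Classic hash join on each matching partition pair.
        buckets = defaultdict(list)
        for row in opart:
            buckets[okey(row)].append(row)
        for row in ipart:
            for match in buckets.get(ikey(row), ()):
                yield match + row

countries = [(1, "Norway"), (2, "Sweden"), (3, "Denmark")]
persons = [("Alice", 1), ("Bob", 2), ("Carol", 1)]
rows = list(on_disk_hash_join(countries, persons,
                              lambda c: c[0], lambda p: p[1]))
```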
Grace Hash Join
Mainstream databases such as Oracle, SQL Server and PostgreSQL have long supported hash join, and their algorithms are similar. Here we introduce the Grace hash join algorithm used by Oracle. The overall process resembles MySQL's hash join, with one major difference. When join_buffer_size is insufficient, MySQL splits the build input into chunks and runs the CHJ process on each. In extreme cases, however, if the data distribution is skewed, a large amount of data may hash into one bucket, so that a single partition still does not fit in join_buffer_size. MySQL's handling is to read as many records as fit into memory to build a hash table, probe it with the corresponding inner-table partition, then clear the hash table and repeat, until all the data of that partition has been processed. This is the same logic CHJ uses when join_buffer_size is insufficient.
When Grace hash join encounters this situation, it recursively repartitions the oversized partition until the pieces are small enough for a hash table to fit in memory. There is still an extreme case, though: if all the join key values are identical, no amount of re-hashing can split them apart, and at that point Grace hash join also degenerates to the same approach as MySQL.
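The recursive repartitioning idea can be sketched as follows; JOIN_BUFFER_ROWS, FANOUT and MAX_DEPTH are made-up stand-ins for the real memory limit, partition fan-out and recursion cap:

```python
from collections import defaultdict

JOIN_BUFFER_ROWS = 2  # stand-in for join_buffer_size, measured in rows
FANOUT = 2            # partitions produced per repartitioning step
MAX_DEPTH = 8         # guard: identical keys can never be split apart

def grace_hash_join(outer, inner, okey, ikey, depth=0):
    if len(outer) <= JOIN_BUFFER_ROWS or depth >= MAX_DEPTH:
        # Build side fits in memory (or keys are too skewed to split further):
        # fall back to a classic hash join, matching the degenerate case.
        buckets = defaultdict(list)
        for row in outer:
            buckets[okey(row)].append(row)
        for row in inner:
            for match in buckets.get(ikey(row), ()):
                yield match + row
        return
    # Otherwise repartition both sides with a level-dependent hash and recurse.
    oparts = [[] for _ in range(FANOUT)]
    iparts = [[] for _ in range(FANOUT)]
    for row in outer:
        oparts[hash((okey(row), depth)) % FANOUT].append(row)
    for row in inner:
        iparts[hash((ikey(row), depth)) % FANOUT].append(row)
    for opart, ipart in zip(oparts, iparts):
        yield from grace_hash_join(opart, ipart, okey, ikey, depth + 1)

outer = [(i, f"country{i}") for i in range(4)]   # 4 rows > JOIN_BUFFER_ROWS
inner = [(f"person{i}", i % 4) for i in range(8)]
joined = list(grace_hash_join(outer, inner, lambda o: o[0], lambda p: p[1]))
```

If every key is identical, every level of repartitioning puts all rows into the same partition, so the MAX_DEPTH guard eventually triggers the in-memory fallback, mirroring the degeneration described above.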
Hybrid Hash Join
The difference between hybrid hash join and Grace hash join is that if memory can hold enough partition data, hybrid hash join caches as much of it as possible, so those partitions need not be written to external storage and read back into memory for the build phase. This is an optimization for when memory is relatively abundant compared with the partition sizes, reducing disk read/write I/O. OceanBase's hash join currently adopts this join mode.
MySQL join algorithm optimization
Before MySQL 8.0.18, for a long time MySQL had no hash join; the main join algorithm was the nested-loop join. The simple nested-loop join is obviously inefficient: it needs N full scans of the inner table, for an overall cost of N × M, where N is the number of outer-table records and M represents the cost of one scan of the inner table. For this reason, MySQL made several optimizations on top of the simple nested-loop join; the pictures below all come from the web.
Block Nested-Loop Join
MySQL uses batching: the join buffer (join_buffer_size) caches a batch of outer-table records at a time, and while traversing the inner table, each inner record is checked against the whole batch. This reduces the number of inner-table scans; if the inner table is large, it also indirectly relieves read I/O pressure.
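A minimal sketch of the batching idea, with JOIN_BUFFER_ROWS standing in for join_buffer_size:

```python
JOIN_BUFFER_ROWS = 2  # rows per batch; stand-in for join_buffer_size

def block_nested_loop_join(outer, inner, match):
    """Buffer a batch of outer rows, then scan the inner table once per batch
    instead of once per outer row."""
    for start in range(0, len(outer), JOIN_BUFFER_ROWS):
        batch = outer[start:start + JOIN_BUFFER_ROWS]
        for irow in inner:            # one inner-table scan per batch
            for orow in batch:
                if match(orow, irow):
                    yield orow + irow

outer = [(1, "a"), (2, "b"), (3, "c")]
inner = [(1, "x"), (3, "y")]
res = list(block_nested_loop_join(outer, inner,
                                  lambda o, i: o[0] == i[0]))
```

With 3 outer rows and a 2-row buffer, the inner table is scanned twice instead of three times; the saving grows with the outer table.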
Index Nested-Loop Join
If there is an index on the inner table's join columns, then for each outer-table record we no longer need a full scan of the inner table; a single B-tree lookup suffices, and the overall time complexity drops to N × O(log M). Compared with hash join, which performs one hash-table lookup per outer record, note that the hash table also has a build cost and must handle memory shortage, so hash join is not necessarily better than INLJ.
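As an illustration, a sorted list plus binary search can stand in for the inner table's B-tree index (this is a sketch with toy data, not InnoDB's actual structure):

```python
import bisect

def index_nested_loop_join(outer, inner, okey, ikey):
    """The sorted key list plus binary search plays the role of the inner
    table's B-tree index: each outer row costs one O(log M) lookup instead
    of a full inner-table scan."""
    entries = sorted((ikey(row), row) for row in inner)
    keys = [k for k, _ in entries]
    for orow in outer:
        k = okey(orow)
        i = bisect.bisect_left(keys, k)   # "B-tree" lookup
        while i < len(keys) and keys[i] == k:  # range scan over duplicates
            yield orow + entries[i][1]
            i += 1

persons = [("Alice", 1), ("Bob", 2)]                      # outer
countries = [(1, "Norway"), (2, "Sweden"), (1, "Norge")]  # inner, duplicate key 1
res = list(index_nested_loop_join(persons, countries,
                                  lambda p: p[1], lambda c: c[0]))
```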
Batched Key Access
Index nested-loop join uses the index on the join columns to match via B-tree lookups, which reduces the cost of traversing the inner table. But if the join column is a non-primary-key (secondary) column, it means many lookups back to the base table and a lot of random I/O. BKA's optimization is to sort a batch of qualifying entries by primary key first, so that when going back to the base table the accesses are roughly in primary-key order, alleviating the cost of random I/O. BKA actually relies on the MRR (Multi-Range Read) feature to sort by primary key before accessing the data; the size of the primary-key sort buffer is controlled by the read_rnd_buffer_size parameter.
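A toy sketch of the MRR idea behind BKA: secondary-index lookups yield primary keys, and sorting a batch of them before fetching from the base table makes the access pattern roughly sequential (all names and data here are hypothetical, and a dict stands in for the clustered index):

```python
def bka_fetch(secondary_index, base_table, join_keys, batch_size=2):
    """Sketch of BKA via MRR: secondary-index lookups produce primary keys;
    sorting a batch of them before visiting the base table turns random
    reads into a roughly sequential pass."""
    batch = []
    for k in join_keys:
        batch.extend(secondary_index.get(k, ()))  # PKs matching this join key
        if len(batch) >= batch_size:
            for pk in sorted(batch):              # MRR: sort PKs, then fetch
                yield base_table[pk]
            batch = []
    for pk in sorted(batch):                      # flush the final batch
        yield base_table[pk]

# Hypothetical data: a secondary index (join column -> primary keys)
# and a base table keyed by primary key.
secondary_index = {"red": [30, 10], "blue": [20]}
base_table = {10: (10, "red"), 20: (20, "blue"), 30: (30, "red")}
fetched = list(bka_fetch(secondary_index, base_table, ["red", "blue"]))
```

Note how primary keys 30 and 10 are fetched as 10 then 30: within each batch the base table is visited in key order, which is the whole point of the sort.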
Since MySQL 8.0, a lot of the server-layer code has been refactored. Although the optimizer still lags far behind Oracle's, it keeps improving. With hash join supported, the MySQL optimizer has more choices and can produce better execution paths for SQL, especially in equi-join scenarios. MySQL's earlier join optimizations, such as BNL, INLJ and BKA, are useful, but they cannot replace hash join. A good database should have rich basic capabilities: let the optimizer identify the appropriate scenario, then bring out the corresponding capability to answer the request in the most efficient way.