[Reprint] Three join implementation principles in Spark

Time: 2020-04-15
Hash Join

First, look at the SQL statement: select * from order, item where item.id = order.i_id. The two tables involved in the join are order and item, and the join keys are item.id and order.i_id respectively. Suppose the join uses the hash join algorithm; the whole process goes through three steps:

  • Determine the build table and probe table: this distinction matters. The build table is built into a hash table keyed by the join key; the probe table then uses its join key to look up qualifying rows in that hash table and perform the join. Spark decides which table plays which role: in general the smaller table is used as the build table and the larger table as the probe table.
  • Build the hash table: read the build table (item) row by row, hash each row by its join key (item.id) into the corresponding bucket (similar to how a HashMap works), and produce a hash table. The hash table is cached in memory; if it does not fit, it is dumped to disk.
  • Probe and match: after the hash table is built, scan the probe table (order) row by row and, using the same hash function (in Spark, effectively the same partitioner), look up rows in the hash table whose hashed join key matches. On a successful match, join the two rows together. A minimal single-machine sketch of these three steps follows the diagram below.

[Figure: hash join schematic]
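To make the three steps concrete, here is a minimal single-machine sketch in Scala. The case classes and field names mirror the SQL above (item.id, order.i_id), the data is made up for illustration, and this is only a conceptual model, not Spark's actual implementation.

```scala
// Minimal single-machine hash join sketch: item is the build table, order is the probe table.
// Rows are modeled as simple case classes; the data is hypothetical.
case class Item(id: Int, name: String)
case class Order(oid: Int, iId: Int, amount: Double)

object HashJoinSketch {
  def main(args: Array[String]): Unit = {
    val items  = Seq(Item(1, "book"), Item(2, "pen"))
    val orders = Seq(Order(100, 1, 9.9), Order(101, 2, 1.5), Order(102, 1, 19.8))

    // 1. Build: hash the small table by its join key (item.id).
    //    groupBy produces a Map[joinKey -> rows], i.e. the in-memory hash table.
    val hashTable: Map[Int, Seq[Item]] = items.groupBy(_.id)

    // 2. Probe: scan the large table once and look each row's join key (order.i_id)
    //    up in the hash table; every match yields one joined row.
    val joined = for {
      o <- orders
      i <- hashTable.getOrElse(o.iId, Seq.empty)
    } yield (i, o)

    joined.foreach(println)   // each element is one joined (item, order) pair
  }
}
```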

Two additional points:

1. Performance of hash join: as the schematic above shows, hash join scans each of the two tables only once, so its complexity is O(a + b). That is far faster than the O(a * b) cost of a brute-force Cartesian product.

2. Why should the build table be the smaller table whenever possible? The hash table that is built must be accessed frequently, so ideally it should fit entirely in memory. This is also why hash join is only suitable for scenarios where at least one of the joined tables is small.

Having looked at the core of hash join, let's see how this single-machine algorithm is carried out in a distributed big-data setting. There are currently two mature approaches: broadcast hash join and shuffle hash join.

Broadcast Hash Join

Broadcast hash join broadcasts the small table to every partition node holding the large table and performs a hash join against that node's partition records in parallel. Broadcasting is only suitable when the small table is small enough to be broadcast directly.

In execution, it can be divided into the following two steps:

1. Broadcast stage: broadcast the small table to all hosts where the large table resides. It can be distributed through the driver or peer-to-peer.

2. Hash join stage: on each executor, run a single-machine hash join with the small table as the build side and the large table as the probe side;

Note that in Spark the default size limit for a table to be broadcast is 10 MB (controlled by the parameter spark.sql.autoBroadcastJoinThreshold). A usage sketch follows.
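For reference, here is a small usage sketch. It assumes a SparkSession and two hypothetical Parquet paths for the item and order tables; the broadcast function in org.apache.spark.sql.functions asks Spark to broadcast the small side explicitly.

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.broadcast

// Sketch: explicitly broadcasting the small table; the paths and schemas are hypothetical.
val spark = SparkSession.builder().appName("broadcast-join-sketch").getOrCreate()
val item  = spark.read.parquet("/path/to/item")   // small table
val order = spark.read.parquet("/path/to/order")  // large table

// broadcast() hints Spark SQL to ship `item` to every executor and run a
// broadcast hash join, regardless of the autoBroadcastJoinThreshold setting.
val joined = order.join(broadcast(item), order("i_id") === item("id"))
joined.explain()   // the physical plan should show a BroadcastHashJoin node
```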

[Figure: broadcast hash join schematic]

Shuffle Hash Join

When one of the joined tables is very small, broadcast hash join is undoubtedly the most efficient choice. However, as the small table grows, the memory, bandwidth and other resources needed for broadcasting inevitably become too large, which is why there is a default limit of 10 MB.

Therefore, when the small table grows beyond that, another kind of hash join is needed: shuffle hash join.

Shuffle hash join partitions both tables by the join key. Because rows with the same key are guaranteed to land in the same partition, the join of large tables is divided into many joins of small partitions, making full use of the cluster's parallel resources. The sketch below illustrates this partitioning principle.
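The "same key, same partition" principle can be shown with a tiny Scala sketch; the toy modulo partitioner below only illustrates the idea and is not Spark's actual partitioner.

```scala
// Toy illustration of hash partitioning: rows of both tables with the same
// join key are always assigned the same partition id, so each partition can
// be joined locally and independently.
val numPartitions = 4
def partitionOf(joinKey: Int): Int = {
  val mod = joinKey.hashCode % numPartitions
  if (mod < 0) mod + numPartitions else mod   // keep the partition id non-negative
}

val itemKeys  = Seq(1, 2, 3, 7)
val orderKeys = Seq(1, 1, 3, 7, 7)
itemKeys.foreach(k  => println(s"item  key $k -> partition ${partitionOf(k)}"))
orderKeys.foreach(k => println(s"order key $k -> partition ${partitionOf(k)}"))
```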

In execution, it can be divided into the following two steps:

1. Shuffle stage: partition both tables by the join key so that records with the same join key are redistributed to the same node; the data of the two tables is thus spread across all nodes in the cluster.

2. Hash join stage: each partition node independently runs the single-machine hash join algorithm on its local data.
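As a configuration sketch, reusing the hypothetical spark, item and order values from the broadcast example above: spark.sql.join.preferSortMergeJoin is an internal setting and the exact behaviour depends on your Spark version, so treat this as an assumption to verify rather than a guaranteed recipe.

```scala
// Sketch: steering Spark SQL toward shuffle hash join instead of broadcast
// or sort merge join. Verify these settings against your Spark version.
spark.conf.set("spark.sql.autoBroadcastJoinThreshold", "-1")     // disable broadcast hash join
spark.conf.set("spark.sql.join.preferSortMergeJoin", "false")    // prefer shuffle hash join when eligible

val joined = order.join(item, order("i_id") === item("id"))
joined.explain()   // look for a ShuffledHashJoin node in the physical plan
```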

[Figure: shuffle hash join schematic]

As noted above, hash join is suitable when at least one of the tables is small. What if two large tables need to be joined? That calls for sort merge join.

Sort-Merge Join

Spark SQL uses a different algorithm for joining two large tables: sort merge join. The whole process is divided into three steps:

1. Shuffle stage: partition both large tables by the join key, so their data is distributed across the whole cluster for distributed parallel processing.

2. Sort stage: on each partition node, sort the data of the two tables separately.

3. Merge stage: perform the join on the two sorted partitions. The merge itself is simple: traverse the two ordered sequences in parallel; when the join keys match, merge and output the rows, otherwise advance the side with the smaller key. A minimal sketch of this merge step follows.
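The merge step on a single partition can be sketched in a few lines of Scala; the two input vectors stand in for the already-sorted partition data of the two tables, with made-up keys and payloads.

```scala
// Merge step of sort merge join over two sequences already sorted by join key.
// Each element is (joinKey, payload); the data is hypothetical.
def mergeJoin(left: Vector[(Int, String)], right: Vector[(Int, String)]): Vector[(Int, String, String)] = {
  val out = Vector.newBuilder[(Int, String, String)]
  var i = 0
  var j = 0
  while (i < left.length && j < right.length) {
    val lk = left(i)._1
    val rk = right(j)._1
    if (lk == rk) {
      // emit all right-side rows sharing this key for the current left row
      var jj = j
      while (jj < right.length && right(jj)._1 == lk) {
        out += ((lk, left(i)._2, right(jj)._2))
        jj += 1
      }
      i += 1                      // advance left; right stays put for a possible duplicate left key
    } else if (lk < rk) i += 1    // advance the side with the smaller key
    else j += 1
  }
  out.result()
}

val itemPart  = Vector((1, "book"), (3, "pen"), (7, "ink"))
val orderPart = Vector((1, "o100"), (1, "o102"), (7, "o103"))
mergeJoin(itemPart, orderPart).foreach(println)  // (1,book,o100), (1,book,o102), (7,ink,o103)
```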

[Figure: sort merge join schematic]

On closer analysis, the cost of sort merge join is no lower than that of shuffle hash join; the sorting adds extra work. Why, then, does Spark SQL choose sort merge join when joining two large tables?

This is related to Spark's shuffle implementation. Spark currently uses a sort-based shuffle, so partition data is already ordered by key after the shuffle. In theory, therefore, the data needs no extra sort after the shuffle and can be merged directly.

Conclusion: how to optimize

From the analysis above, it is clear that each join algorithm has its own applicable scenarios. When optimizing, besides choosing the appropriate join algorithm for the business scenario, the following points are also worth noting:

1. When designing the data warehouse, try to avoid join queries between large tables.

2. Depending on available memory and bandwidth, Spark SQL's parameter spark.sql.autoBroadcastJoinThreshold can be increased so that more joins are actually executed as broadcast hash joins.
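For example (a minimal sketch; the 100 MB figure is arbitrary and should be tuned to the cluster's memory and bandwidth budget):

```scala
// Raise the broadcast threshold so tables up to ~100 MB (illustrative value)
// are still broadcast and joined with broadcast hash join.
spark.conf.set("spark.sql.autoBroadcastJoinThreshold", (100 * 1024 * 1024).toString)
```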