Join operation is a very common data processing operation. As a unified big data processing engine, spark provides very rich join scenarios. This article will introduce five join strategies provided by spark, hoping to help you. This paper mainly includes the following contents
- Factors affecting join operation
- Five strategies of join execution in spark
- How spark chooses the join strategy
Factors affecting join operation
The size of the dataset
The size of the data set participating in the join will directly affect the execution efficiency of the join operation. It also affects the selection of join mechanism and the efficiency of join execution.
The condition of join
The condition of a join involves a logical comparison between fields. According to the conditions of join, join can be divided into two categoriesEquivalent connectionandNon equivalent connection. Equivalent connection involves one or more equality conditions that need to be satisfied at the same time. Apply each equivalence condition between the attributes of two input datasets. When using other operators (the operation connector is not=)It is called non equivalent connection.
Type of join
After applying join conditions between records in the input dataset, the join type affects the result of the join operation. There are mainly the following types of join:
- Internal connection(Inner Join): output only records that match connection conditions from the input dataset.
- External connection(Outer Join)It is also divided into left outer link, right outer link and all outer link.
- Semi connection(Semi Join): the right table is only used to filter the data in the left table and does not appear in the result set.
- Cross connect(Cross Join): cross join returns all the rows in the left table, and each row in the left table is combined with all the rows in the right table. Cross join is also called Cartesian product.
Five strategies of join execution in spark
Spark provides five join mechanisms to perform specific join operations. The five join mechanisms are as follows:
- Shuffle Hash Join
- Broadcast Hash Join
- Sort Merge Join
- Cartesian Join
- Broadcast Nested Loop Join
Shuffle Hash Join
brief introduction
When the amount of table data to be joined is large, you can select shuffle hash join. In this way, the large table can be adjustedRepartition according to the key of joinTo ensure that every same join key is sent to the same partition. As shown in the figure below:
As shown in the figure above, the basic steps of shuffle hash join are as follows:
- First of all, two tables participating in join are re partitioned according to the join key. This process involves shuffle. The purpose is to send the data of the same join key to the same partition, so as to facilitate the join in the partition.
- Secondly, for each partition after shuffle, the partition data of the small table will be constructed into a hash table, and then match with the partition data records of the large table according to the join key.
Conditions and characteristics
- Only equivalent join is supported, and the join key does not need to be sorted
- All join types except full outer joins are supported
- It is a memory intensive operation to build a hash map on a small table. If the data on one side of the hash table is large, it may cause oom
- Set the parameter tospark.sql.join.prefersortmergeJoin (default true)Set to false
Broadcast Hash Join
brief introduction
Also known asMap end join. When a table is small, we usually choose broadcast hash join to avoid the overhead of shuffle and improve performance. For example, when joining a fact table and a dimension table, the data of the dimension table is usually very small, so broadcast hash join can be used to broadcast the dimension table. In this way, the shuffle of data can be avoided (shuffle operation in spark is very time-consuming), so as to improve the efficiency of join. Before broadcast join, spark needs to send the data on the executor side to the driver side first, and then the driver side broadcasts the data to the executor side. If we need to broadcast more data, it will cause oom on the driver side. The details are as follows:
Broadcast hash join mainly includes two stages
- BroadcastStage: the small table is cached in the executor
- Hash JoinPhase: execute the hash join in each executor
Conditions and characteristics
- Only equivalent join is supported, and the join key does not need to be sorted
- All join types except full outer joins are supported
- Compared with other join mechanisms, broadcast hash join is more efficient. However, broadcast hash join is a network intensive operation (data redundancy transmission). In addition, it needs to cache data in the driver side. Therefore, when the data volume of a small table is large, oom will occur
- The amount of data in the broadcast table is less thanspark.sql.autoBroadcastJoinThresholdValue, the default is 10MB (10485760)
- The size threshold of the broadcasted table cannot exceed 8GB. The source code of spark2.4 is as follows:BroadcastExchangeExec.scala
longMetric("dataSize") += dataSize
if (dataSize >= (8L << 30)) {
throw new SparkException(
s"Cannot broadcast the table that is larger than 8GB: ${dataSize >> 30} GB")
}
- The base table cannot be broadcasted. For example, when the left table is connected, only the right table can be broadcasted. In the form of: fact_ table.join (broadcast(dimension_ Table), can not be usedbroadcastPrompt, when the condition is met, it will automatically switch to the join mode.
Sort Merge Join
brief introduction
The join mechanism is the default of spark, which can be adjusted by parametersspark.sql.join.preferSortMergeJoinTo configure, the default value is true, that is, sort merge join is preferred. This method is generally used when two large tables are joined. Sort merge join can reduce the data transmission in the cluster. This method does not load all the data into the memory first, and then hashjoin, but it needs to sort the join keys before joining. Specific figure:
Sort merge join mainly includes three stages
- Shuffle Phase: shuffle the two large tables according to the join key
- Sort Phase: sort the data in each partition
- Merge Phase: join the sorted partition data from different tables, and merge the data sets by traversing the elements and connecting the rows with the same join key value
Conditions and characteristics
- Only equivalent connection is supported
- All join types are supported
- Join keys are sorted
- parameterspark.sql.join . prefersortmergejoin (true by default)Set to true
Cartesian Join
brief introduction
If two tables participating in join in spark do not specify the join key (on condition), Cartesian product join will be generated. The result of this join is actually the product of the number of two rows.
condition
- Only internal connections are supported
- Support equivalent and unequal connection
- Opening parameters spark.sql.crossJoin .enabled=true
Broadcast Nested Loop Join
brief introduction
In this way, when there is no suitable join mechanism to choose, the join strategy will be selected finally. The priority is:Broadcast Hash Join > Sort Merge Join > Shuffle Hash Join > cartesian Join > Broadcast Nested Loop Join.
If the connection between Cartesian and broadcast nested loop join is internal connection or non equivalent connection, the broadcast nested loop policy will be preferred. When the connection is non equivalent and a table can be broadcast, Cartesian join will be selected.
Conditions and characteristics
- Support equivalent and non equivalent connection
-
All join types are supported. The main optimization points are as follows:
- Broadcast left table when right outer connection
- When the left outer connection, broadcast the right table
- When internal connection, broadcast the left and right tables
How spark chooses the join strategy
Equivalent connection
If there are join hints, follow the order below
- 1. Broadcast hint: select broadcast hash join if the join type supports it
- 2. Sort merge hint: if the join key is sorted, select sort merge join
- 3. Shuffle hash hint: if the join type supports it, select shuffle hash join
- 4. Shuffle replicate NL hint: if it is inner join, select Cartesian product
If there are no join hints, check the following rules one by one
- 1. If the join type is supported and one of the tables can be broadcast(spark.sql.autoBroadcastJoinThresholdValue, the default is 10MB), then select broadcast hash join
- 2. If the parameterspark.sql.join . prefersortmergejoin is set to false, and a table is small enough (you can build a hash map), select shuffle hash join
- 3. If the join keys are sorted, select sort merge join
- 4. For inner connection, select Cartesian join
- 5. If oom may occur or there is no optional execution strategy, broadcast nested loop join will be selected finally
Non equivalent connection
There are join hints, in the following order
- 1. Broadcast hint: select broadcast nested loop join
- 2. Shuffle replicate NL hint: if it is an internal connection, select Cartesian product join
If there are no join hints, check the following rules one by one
- 1. If a table is small enough to be broadcasted, select broadcast nested loop join
- 2. If it is an internal connection, select Cartesian product join
- 3. If oom may occur or there is no optional execution strategy, broadcast nested loop join will be selected
Source code fragment of join policy selection
object JoinSelection extends Strategy
with PredicateHelper
with JoinSelectionHelper {
def apply(plan: LogicalPlan): Seq[SparkPlan] = plan match {
case j @ ExtractEquiJoinKeys(joinType, leftKeys, rightKeys, nonEquiCond, left, right, hint) =>
def createBroadcastHashJoin(onlyLookingAtHint: Boolean) = {
getBroadcastBuildSide(left, right, joinType, hint, onlyLookingAtHint, conf).map {
buildSide =>
Seq(joins.BroadcastHashJoinExec(
leftKeys,
rightKeys,
joinType,
buildSide,
nonEquiCond,
planLater(left),
planLater(right)))
}
}
def createShuffleHashJoin(onlyLookingAtHint: Boolean) = {
getShuffleHashJoinBuildSide(left, right, joinType, hint, onlyLookingAtHint, conf).map {
buildSide =>
Seq(joins.ShuffledHashJoinExec(
leftKeys,
rightKeys,
joinType,
buildSide,
nonEquiCond,
planLater(left),
planLater(right)))
}
}
def createSortMergeJoin() = {
if (RowOrdering.isOrderable(leftKeys)) {
Some(Seq(joins.SortMergeJoinExec(
leftKeys, rightKeys, joinType, nonEquiCond, planLater(left), planLater(right))))
} else {
None
}
}
def createCartesianProduct() = {
if (joinType.isInstanceOf[InnerLike]) {
Some(Seq(joins.CartesianProductExec(planLater(left), planLater(right), j.condition)))
} else {
None
}
}
def createJoinWithoutHint() = {
createBroadcastHashJoin(false)
.orElse {
if (!conf.preferSortMergeJoin) {
createShuffleHashJoin(false)
} else {
None
}
}
.orElse(createSortMergeJoin())
.orElse(createCartesianProduct())
.getOrElse {
val buildSide = getSmallerSide(left, right)
Seq(joins.BroadcastNestedLoopJoinExec(
planLater(left), planLater(right), buildSide, joinType, nonEquiCond))
}
}
createBroadcastHashJoin(true)
.orElse { if (hintToSortMergeJoin(hint)) createSortMergeJoin() else None }
.orElse(createShuffleHashJoin(true))
.orElse { if (hintToShuffleReplicateNL(hint)) createCartesianProduct() else None }
.getOrElse(createJoinWithoutHint())
if (canBuildLeft(joinType)) BuildLeft else BuildRight
}
def createBroadcastNLJoin(buildLeft: Boolean, buildRight: Boolean) = {
val maybeBuildSide = if (buildLeft && buildRight) {
Some(desiredBuildSide)
} else if (buildLeft) {
Some(BuildLeft)
} else if (buildRight) {
Some(BuildRight)
} else {
None
}
maybeBuildSide.map { buildSide =>
Seq(joins.BroadcastNestedLoopJoinExec(
planLater(left), planLater(right), buildSide, joinType, condition))
}
}
def createCartesianProduct() = {
if (joinType.isInstanceOf[InnerLike]) {
Some(Seq(joins.CartesianProductExec(planLater(left), planLater(right), condition)))
} else {
None
}
}
def createJoinWithoutHint() = {
createBroadcastNLJoin(canBroadcastBySize(left, conf), canBroadcastBySize(right, conf))
.orElse(createCartesianProduct())
.getOrElse {
Seq(joins.BroadcastNestedLoopJoinExec(
planLater(left), planLater(right), desiredBuildSide, joinType, condition))
}
}
createBroadcastNLJoin(hintToBroadcastLeft(hint), hintToBroadcastRight(hint))
.orElse { if (hintToShuffleReplicateNL(hint)) createCartesianProduct() else None }
.getOrElse(createJoinWithoutHint())
case _ => Nil
}
}
summary
This paper mainly introduces five join strategies provided by spark, and illustrates three important join strategies. Firstly, the factors influencing join are sorted out, then five join strategies of spark are introduced, and the specific meaning and trigger conditions of each join strategy are described. Finally, the corresponding source code fragment of join strategy selection is given. I hope this article can help you.
Official account “big data technology and multi warehouse”, reply to “information” to receive big data package.