Analysis of Spark's five join strategies

Time: 2021-01-26

Join is a very common data processing operation. As a unified big data processing engine, Spark supports a rich set of join scenarios. This article introduces the five join strategies Spark provides and covers the following topics:

  • Factors that affect a join operation
  • The five join execution strategies in Spark
  • How Spark chooses a join strategy

Factors that affect a join operation

The size of the dataset

The size of the datasets participating in the join directly affects both which join mechanism is selected and how efficiently the join executes.

The join condition

A join condition is a logical comparison between fields of the two inputs. Based on the condition, joins fall into two categories: equi-joins and non-equi-joins. An equi-join involves one or more equality conditions that must hold simultaneously, each applied between attributes of the two input datasets. When any other comparison operator is used (i.e. the operator is not =), the join is a non-equi-join.
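For illustration, here is a minimal Scala sketch of the two categories using the DataFrame API. The orders, customers and limits DataFrames and their columns are made up for this example.

  // Illustrative only: the DataFrames below are invented for this sketch.
  import org.apache.spark.sql.SparkSession

  object JoinConditionExample {
    def main(args: Array[String]): Unit = {
      val spark = SparkSession.builder().master("local[*]").appName("join-conditions").getOrCreate()
      import spark.implicits._

      val orders    = Seq((1, 100, 25.0), (2, 101, 80.0)).toDF("order_id", "customer_id", "amount")
      val customers = Seq((100, "Alice"), (101, "Bob")).toDF("id", "name")
      val limits    = Seq(("Alice", 50.0), ("Bob", 50.0)).toDF("name", "credit_limit")

      // Equi-join: the condition is an equality between columns of the two inputs.
      val equi = orders.join(customers, orders("customer_id") === customers("id"))

      // Non-equi-join: the condition uses an operator other than "=" (here, ">").
      val nonEqui = orders.join(limits, orders("amount") > limits("credit_limit"))

      equi.show()
      nonEqui.show()
      spark.stop()
    }
  }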

The join type

The join type determines which records appear in the result after the join condition has been applied to the input datasets. The main join types are listed below (a short DataFrame example follows the list):

  • Inner join: outputs only the records from the input datasets that satisfy the join condition.
  • Outer join: further divided into left outer join, right outer join, and full outer join.
  • Semi join: the right table is only used to filter rows of the left table and does not appear in the result set.
  • Cross join: returns every row of the left table combined with every row of the right table, and is also known as the Cartesian product.
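A short DataFrame sketch of these join types, using two tiny made-up tables (the column names are arbitrary):

  // Illustrative only: two small invented tables to show the join types above.
  import org.apache.spark.sql.SparkSession

  object JoinTypeExample {
    def main(args: Array[String]): Unit = {
      val spark = SparkSession.builder().master("local[*]").appName("join-types").getOrCreate()
      import spark.implicits._

      val left  = Seq((1, "a"), (2, "b"), (3, "c")).toDF("id", "l")
      val right = Seq((2, "x"), (3, "y"), (4, "z")).toDF("id", "r")

      left.join(right, Seq("id"), "inner").show()       // only the matching ids: 2 and 3
      left.join(right, Seq("id"), "left_outer").show()  // all left rows, nulls where no match
      left.join(right, Seq("id"), "full_outer").show()  // all rows from both sides
      left.join(right, Seq("id"), "left_semi").show()   // left rows that have a match, right columns dropped
      left.crossJoin(right).show()                      // Cartesian product: 3 x 3 = 9 rows

      spark.stop()
    }
  }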

The five join execution strategies in Spark

Spark provides five join mechanisms for executing a join:

  • Shuffle Hash Join
  • Broadcast Hash Join
  • Sort Merge Join
  • Cartesian Join
  • Broadcast Nested Loop Join

Shuffle Hash Join

Brief introduction

Shuffle hash join can be chosen when the tables being joined are large. In this approach, the tables are repartitioned by the join key, which guarantees that rows with the same join key are sent to the same partition, as shown in the figure below:

[Figure: shuffle hash join execution]

As shown in the figure above, shuffle hash join proceeds in two basic steps (a plain-Scala sketch of the build/probe step follows this list):

  • First, the two tables participating in the join are repartitioned by the join key. This step involves a shuffle; its purpose is to send rows with the same join key to the same partition so that the join can be performed within each partition.
  • Second, for each partition after the shuffle, the smaller table's partition data is built into a hash table, which is then probed with the larger table's partition records by join key.
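The per-partition build/probe step can be sketched in plain Scala. This is a conceptual illustration only, not Spark's implementation, and the rows are invented.

  // Conceptual sketch of the build/probe step inside one partition (not Spark's actual code).
  object PartitionHashJoinSketch {
    def main(args: Array[String]): Unit = {
      // Rows that landed in the same partition because they share the same join-key hash.
      val smallSide = Seq((1, "dim-a"), (2, "dim-b"))       // (joinKey, payload) of the small table
      val largeSide = Seq((1, 10.0), (1, 12.5), (2, 7.0))   // (joinKey, payload) of the large table

      // Build phase: hash the smaller side into an in-memory map keyed by the join key.
      val hashTable: Map[Int, Seq[String]] =
        smallSide.groupBy(_._1).map { case (k, rows) => k -> rows.map(_._2) }

      // Probe phase: stream the larger side and look each join key up in the hash table.
      val joined = for {
        (key, fact) <- largeSide
        dim         <- hashTable.getOrElse(key, Seq.empty)
      } yield (key, fact, dim)

      joined.foreach(println)   // (1,10.0,dim-a), (1,12.5,dim-a), (2,7.0,dim-b)
    }
  }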

Conditions and characteristics

  • Only equi-joins are supported, and the join keys do not need to be sorted
  • All join types except full outer join are supported
  • Building a hash map over the smaller side is a memory-intensive operation; if that side is too large, it may cause an OOM
  • The parameter spark.sql.join.preferSortMergeJoin (default true) must be set to false (see the configuration sketch after this list)
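A configuration sketch that nudges the planner toward shuffle hash join; the table contents are made up, and the shuffle_hash hint assumes Spark 3.x.

  // Illustrative settings and data; the hint name assumes Spark 3.x.
  import org.apache.spark.sql.SparkSession

  object ShuffleHashJoinExample {
    def main(args: Array[String]): Unit = {
      val spark = SparkSession.builder().master("local[*]").appName("shuffle-hash-join").getOrCreate()
      import spark.implicits._

      // Disable the sort-merge preference and automatic broadcasting so the hash-based path is considered.
      spark.conf.set("spark.sql.join.preferSortMergeJoin", "false")
      spark.conf.set("spark.sql.autoBroadcastJoinThreshold", "-1")

      val orders    = (1 to 10000).map(i => (i % 100, i)).toDF("customer_id", "order_id")
      val customers = (1 to 100).map(i => (i, s"name-$i")).toDF("customer_id", "name")

      // Spark 3.x also accepts an explicit hint for this strategy.
      val joined = orders.join(customers.hint("shuffle_hash"), "customer_id")
      joined.explain()   // the physical plan should show ShuffledHashJoin
      spark.stop()
    }
  }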

Broadcast Hash Join

Brief introduction

Broadcast hash join is also known as a map-side join. When one table is small, it is usually the preferred choice because it avoids the overhead of a shuffle and improves performance. For example, when joining a fact table with a dimension table, the dimension table is usually small enough to broadcast. This avoids shuffling the data (shuffle is a very expensive operation in Spark) and therefore speeds up the join. Before the broadcast, Spark first sends the data from the executors to the driver, and the driver then broadcasts it to the executors; if the data to be broadcast is large, this can cause an OOM on the driver. The process is illustrated below:

[Figure: broadcast hash join execution]

Broadcast hash join consists of two stages:

  • Broadcast stage: the small table is broadcast to and cached on each executor
  • Hash join stage: a hash join is executed within each executor

Conditions and characteristics

  • Only equi-joins are supported, and the join keys do not need to be sorted
  • All join types except full outer join are supported
  • Broadcast hash join is more efficient than the other join mechanisms. However, it is a network-intensive operation (the data is transmitted redundantly), and the broadcast data must also be cached on the driver side, so a "small" table that is in fact large can cause an OOM
  • The broadcast table must be smaller than spark.sql.autoBroadcastJoinThreshold, which defaults to 10MB (10485760 bytes)
  • The broadcast table may not exceed 8GB; the Spark 2.4 source in BroadcastExchangeExec.scala enforces this:
  longMetric("dataSize") += dataSize
  if (dataSize >= (8L << 30)) {
    throw new SparkException(
      s"Cannot broadcast the table that is larger than 8GB: ${dataSize >> 30} GB")
  }
  • The base table cannot be broadcast; for example, in a left outer join only the right table can be broadcast. A broadcast can be requested explicitly, as in fact_table.join(broadcast(dimension_table)); even without the broadcast hint, Spark switches to this join mode automatically when the conditions are met (a short example follows)
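A minimal sketch of the fact/dimension pattern described above; the DataFrames and their sizes are invented, and the plans are only expected to contain BroadcastHashJoin when the conditions above hold.

  // Illustrative only: made-up fact and dimension tables.
  import org.apache.spark.sql.SparkSession
  import org.apache.spark.sql.functions.broadcast

  object BroadcastHashJoinExample {
    def main(args: Array[String]): Unit = {
      val spark = SparkSession.builder().master("local[*]").appName("broadcast-hash-join").getOrCreate()
      import spark.implicits._

      val factTable      = (1 to 10000).map(i => (i % 1000, i * 1.0)).toDF("dim_id", "measure")
      val dimensionTable = (1 to 1000).map(i => (i, s"dim-$i")).toDF("dim_id", "label")

      // Explicit hint: ask Spark to broadcast the small dimension table.
      val hinted = factTable.join(broadcast(dimensionTable), "dim_id")

      // Without a hint, Spark broadcasts automatically when the table is smaller than
      // spark.sql.autoBroadcastJoinThreshold (10MB by default).
      val automatic = factTable.join(dimensionTable, "dim_id")

      hinted.explain()      // the physical plan should contain BroadcastHashJoin
      automatic.explain()
      spark.stop()
    }
  }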

Sort Merge Join

Brief introduction

Sort merge join is Spark's default join mechanism. It is controlled by the parameter spark.sql.join.preferSortMergeJoin, which defaults to true, meaning that sort merge join is preferred. This strategy is generally used when two large tables are joined, and it can reduce data movement across the cluster. Instead of loading all the data into memory and then performing a hash join, it sorts both sides by the join key before joining. The process is illustrated below:

[Figure: sort merge join execution]

Sort merge join consists of three stages (a plain-Scala sketch of the merge step follows this list):

  • Shuffle phase: the two large tables are repartitioned (shuffled) by the join key
  • Sort phase: the data within each partition is sorted by the join key
  • Merge phase: the sorted partitions of the two tables are joined by traversing both sides and connecting the rows whose join key values are equal
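The merge phase can be sketched in plain Scala over two partitions that are already sorted by the join key. This is a conceptual illustration only, not Spark's implementation, and the data is made up.

  // Conceptual sketch of the merge step over two sorted inputs (not Spark's actual code).
  object SortMergeSketch {
    def main(args: Array[String]): Unit = {
      // Both sides are already sorted by the join key, as they would be after the sort phase.
      val leftSorted  = Seq((1, "l1"), (2, "l2"), (2, "l3"), (4, "l4"))
      val rightSorted = Seq((2, "r1"), (3, "r2"), (4, "r3"))

      val out = scala.collection.mutable.ArrayBuffer.empty[(Int, String, String)]
      var i = 0
      var j = 0
      while (i < leftSorted.length && j < rightSorted.length) {
        val lKey = leftSorted(i)._1
        val rKey = rightSorted(j)._1
        if (lKey < rKey) i += 1            // advance whichever side has the smaller key
        else if (lKey > rKey) j += 1
        else {
          // Equal keys: emit the combinations of the two runs that share this key.
          val lRun = leftSorted.drop(i).takeWhile(_._1 == lKey)
          val rRun = rightSorted.drop(j).takeWhile(_._1 == rKey)
          for ((_, l) <- lRun; (_, r) <- rRun) out += ((lKey, l, r))
          i += lRun.length
          j += rRun.length
        }
      }
      out.foreach(println)   // (2,l2,r1), (2,l3,r1), (4,l4,r3)
    }
  }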

Conditions and characteristics

  • Only equi-joins are supported
  • All join types are supported
  • The join keys must be sortable
  • The parameter spark.sql.join.preferSortMergeJoin (true by default) is set to true (see the example after this list)
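An illustrative way to observe sort merge join; the data is made up, and the merge hint assumes Spark 3.x.

  // Illustrative only: disable broadcasting so neither side qualifies for broadcast hash join.
  import org.apache.spark.sql.SparkSession

  object SortMergeJoinExample {
    def main(args: Array[String]): Unit = {
      val spark = SparkSession.builder().master("local[*]").appName("sort-merge-join").getOrCreate()
      import spark.implicits._

      spark.conf.set("spark.sql.autoBroadcastJoinThreshold", "-1")

      val a = (1 to 100000).map(i => (i, s"a-$i")).toDF("id", "a")
      val b = (1 to 100000).map(i => (i, s"b-$i")).toDF("id", "b")

      // With preferSortMergeJoin left at its default (true), the plan should use SortMergeJoin.
      a.join(b, "id").explain()

      // Spark 3.x also accepts an explicit hint ("merge", "shuffle_merge" or "mergejoin").
      a.join(b.hint("merge"), "id").explain()
      spark.stop()
    }
  }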

Cartesian Join

Brief introduction

If the two tables participating in a Spark join do not specify a join key (no on condition), a Cartesian product join is produced. The number of rows in the result is the product of the row counts of the two tables.

Conditions

  • Only inner joins are supported
  • Both equi-joins and non-equi-joins are supported
  • The parameter spark.sql.crossJoin.enabled must be set to true (see the example after this list)
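A minimal cross join sketch; note that spark.sql.crossJoin.enabled already defaults to true in Spark 3.x, so the explicit setting mainly matters on Spark 2.x. The data is made up.

  // Illustrative only: tiny invented tables.
  import org.apache.spark.sql.SparkSession

  object CartesianJoinExample {
    def main(args: Array[String]): Unit = {
      val spark = SparkSession.builder()
        .master("local[*]")
        .appName("cartesian-join")
        .config("spark.sql.crossJoin.enabled", "true")
        .getOrCreate()
      import spark.implicits._

      val colors = Seq("red", "green").toDF("color")
      val sizes  = Seq("S", "M", "L").toDF("size")

      // Explicit cross join: 2 x 3 = 6 rows, planned as a CartesianProduct.
      colors.crossJoin(sizes).show()

      // A join without a join key falls back to the same Cartesian product.
      colors.join(sizes).explain()
      spark.stop()
    }
  }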

Broadcast Nested Loop Join

Brief introduction

Broadcast nested loop join is the strategy of last resort, chosen when no other join mechanism applies. The overall priority is: Broadcast Hash Join > Sort Merge Join > Shuffle Hash Join > Cartesian Join > Broadcast Nested Loop Join.

When the choice is between Cartesian join and broadcast nested loop join (inner or non-equi joins), broadcast nested loop join is preferred if one of the tables can be broadcast; if neither table can be broadcast and the join is an inner join, Cartesian join is selected.

Conditions and characteristics

  • Both equi-joins and non-equi-joins are supported
  • All join types are supported. The main optimization is the choice of which side to broadcast:

    • For a right outer join, the left table is broadcast
    • For a left outer join, the right table is broadcast
    • For an inner join, either table can be broadcast (an example follows this list)
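An illustrative non-equi join that would typically fall back to broadcast nested loop join because one side is small enough to broadcast; the DataFrames are invented.

  // Illustrative only: a range-lookup (non-equi) join with a small, broadcastable right side.
  import org.apache.spark.sql.SparkSession

  object BroadcastNestedLoopJoinExample {
    def main(args: Array[String]): Unit = {
      val spark = SparkSession.builder().master("local[*]").appName("bnl-join").getOrCreate()
      import spark.implicits._

      val events = (1 to 10000).map(i => (i, i % 500)).toDF("event_id", "score")
      val bands  = Seq((0, 100, "low"), (100, 400, "mid"), (400, 500, "high")).toDF("lo", "hi", "band")

      // Non-equi condition: no equi-join strategy applies, and `bands` is small enough to broadcast,
      // so the planner should pick BroadcastNestedLoopJoin.
      val tagged = events.join(bands, events("score") >= bands("lo") && events("score") < bands("hi"))
      tagged.explain()
      spark.stop()
    }
  }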

How Spark chooses a join strategy

Equi-joins

If join hints are present, they are considered in the following order (the hint syntax in SQL is shown after the list):

  • 1. Broadcast hint: select broadcast hash join if the join type supports it
  • 2. Sort merge hint: select sort merge join if the join keys are sortable
  • 3. Shuffle hash hint: select shuffle hash join if the join type supports it
  • 4. Shuffle replicate NL hint: select Cartesian product join if it is an inner join
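The same hints expressed in SQL syntax (Spark 3.x); the views t1 and t2 are placeholders created only for this sketch.

  // Illustrative only: t1 and t2 are temporary views invented for this example.
  import org.apache.spark.sql.SparkSession

  object JoinHintExample {
    def main(args: Array[String]): Unit = {
      val spark = SparkSession.builder().master("local[*]").appName("join-hints").getOrCreate()
      import spark.implicits._

      (1 to 100).map(i => (i, s"a-$i")).toDF("id", "a").createOrReplaceTempView("t1")
      (1 to 100).map(i => (i, s"b-$i")).toDF("id", "b").createOrReplaceTempView("t2")

      spark.sql("SELECT /*+ BROADCAST(t2) */ * FROM t1 JOIN t2 ON t1.id = t2.id").explain()
      spark.sql("SELECT /*+ MERGE(t2) */ * FROM t1 JOIN t2 ON t1.id = t2.id").explain()
      spark.sql("SELECT /*+ SHUFFLE_HASH(t2) */ * FROM t1 JOIN t2 ON t1.id = t2.id").explain()
      spark.sql("SELECT /*+ SHUFFLE_REPLICATE_NL(t2) */ * FROM t1 JOIN t2 ON t1.id = t2.id").explain()
      spark.stop()
    }
  }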

If there are no join hints, the following rules are checked one by one:

  • 1. If the join type supports it and one of the tables is small enough to broadcast (smaller than spark.sql.autoBroadcastJoinThreshold, 10MB by default), select broadcast hash join
  • 2. If the parameter spark.sql.join.preferSortMergeJoin is set to false and one table is small enough to build a hash map from, select shuffle hash join
  • 3. If the join keys are sortable, select sort merge join
  • 4. If it is an inner join, select Cartesian join
  • 5. If an OOM might occur or no other strategy is applicable, fall back to broadcast nested loop join

Non-equi-joins

If join hints are present, they are considered in the following order:

  • 1. Broadcast hint: select broadcast nested loop join
  • 2. Shuffle replicate NL hint: select Cartesian product join if it is an inner join

If there are no join hints, the following rules are checked one by one:

  • 1. If one table is small enough to broadcast, select broadcast nested loop join
  • 2. If it is an inner join, select Cartesian product join
  • 3. If an OOM might occur or no other strategy is applicable, fall back to broadcast nested loop join

Source code fragment for join strategy selection

  object JoinSelection extends Strategy
    with PredicateHelper
    with JoinSelectionHelper {
    def apply(plan: LogicalPlan): Seq[SparkPlan] = plan match {

      case j @ ExtractEquiJoinKeys(joinType, leftKeys, rightKeys, nonEquiCond, left, right, hint) =>
        def createBroadcastHashJoin(onlyLookingAtHint: Boolean) = {
          getBroadcastBuildSide(left, right, joinType, hint, onlyLookingAtHint, conf).map {
            buildSide =>
              Seq(joins.BroadcastHashJoinExec(
                leftKeys,
                rightKeys,
                joinType,
                buildSide,
                nonEquiCond,
                planLater(left),
                planLater(right)))
          }
        }

        def createShuffleHashJoin(onlyLookingAtHint: Boolean) = {
          getShuffleHashJoinBuildSide(left, right, joinType, hint, onlyLookingAtHint, conf).map {
            buildSide =>
              Seq(joins.ShuffledHashJoinExec(
                leftKeys,
                rightKeys,
                joinType,
                buildSide,
                nonEquiCond,
                planLater(left),
                planLater(right)))
          }
        }

        def createSortMergeJoin() = {
          if (RowOrdering.isOrderable(leftKeys)) {
            Some(Seq(joins.SortMergeJoinExec(
              leftKeys, rightKeys, joinType, nonEquiCond, planLater(left), planLater(right))))
          } else {
            None
          }
        }

        def createCartesianProduct() = {
          if (joinType.isInstanceOf[InnerLike]) {
            Some(Seq(joins.CartesianProductExec(planLater(left), planLater(right), j.condition)))
          } else {
            None
          }
        }

        def createJoinWithoutHint() = {
          createBroadcastHashJoin(false)
            .orElse {
              if (!conf.preferSortMergeJoin) {
                createShuffleHashJoin(false)
              } else {
                None
              }
            }
            .orElse(createSortMergeJoin())
            .orElse(createCartesianProduct())
            .getOrElse {
              val buildSide = getSmallerSide(left, right)
              Seq(joins.BroadcastNestedLoopJoinExec(
                planLater(left), planLater(right), buildSide, joinType, nonEquiCond))
            }
        }

        createBroadcastHashJoin(true)
          .orElse { if (hintToSortMergeJoin(hint)) createSortMergeJoin() else None }
          .orElse(createShuffleHashJoin(true))
          .orElse { if (hintToShuffleReplicateNL(hint)) createCartesianProduct() else None }
          .getOrElse(createJoinWithoutHint())

    
      // Non-equi-join case: choose which side to build/broadcast for broadcast nested loop join
      case logical.Join(left, right, joinType, condition, hint) =>
        val desiredBuildSide = if (canBuildLeft(joinType) && canBuildRight(joinType)) {
          getSmallerSide(left, right)
        } else {
          if (canBuildLeft(joinType)) BuildLeft else BuildRight
        }

        def createBroadcastNLJoin(buildLeft: Boolean, buildRight: Boolean) = {
          val maybeBuildSide = if (buildLeft && buildRight) {
            Some(desiredBuildSide)
          } else if (buildLeft) {
            Some(BuildLeft)
          } else if (buildRight) {
            Some(BuildRight)
          } else {
            None
          }

          maybeBuildSide.map { buildSide =>
            Seq(joins.BroadcastNestedLoopJoinExec(
              planLater(left), planLater(right), buildSide, joinType, condition))
          }
        }

        def createCartesianProduct() = {
          if (joinType.isInstanceOf[InnerLike]) {
            Some(Seq(joins.CartesianProductExec(planLater(left), planLater(right), condition)))
          } else {
            None
          }
        }

        def createJoinWithoutHint() = {
          createBroadcastNLJoin(canBroadcastBySize(left, conf), canBroadcastBySize(right, conf))
            .orElse(createCartesianProduct())
            .getOrElse {
              Seq(joins.BroadcastNestedLoopJoinExec(
                planLater(left), planLater(right), desiredBuildSide, joinType, condition))
            }
        }

        createBroadcastNLJoin(hintToBroadcastLeft(hint), hintToBroadcastRight(hint))
          .orElse { if (hintToShuffleReplicateNL(hint)) createCartesianProduct() else None }
          .getOrElse(createJoinWithoutHint())
      case _ => Nil
    }
  }

Summary

This article introduced the five join strategies provided by Spark and illustrated the three most important ones with diagrams. It first reviewed the factors that influence a join, then described each of the five strategies, what they mean, and the conditions under which they are triggered, and finally presented the corresponding source code fragment for join strategy selection. I hope this article helps you.
