[spark series 3] spark 3.0.1 AQE analysis


Introduction to AQE

As the Spark configuration shows, AQE dates back as far as Spark 1.6; in the Spark 2.x era, Intel's big data team developed a prototype and put it into practice; in the Spark 3.0 era, Databricks and Intel jointly contributed the new AQE to the community.

Configuration of AQE in spark 3.0.1

| Configuration item | Default value | Official note | Analysis |
| --- | --- | --- | --- |
| spark.sql.adaptive.enabled | false | Enables adaptive query execution | Set to true here |
| spark.sql.adaptive.coalescePartitions.enabled | true | Coalesces contiguous shuffle partitions (guided by spark.sql.adaptive.advisoryPartitionSizeInBytes) | See analysis 1 |
| spark.sql.adaptive.coalescePartitions.initialPartitionNum | (none) | The initial number of shuffle partitions before coalescing; defaults to the value of spark.sql.shuffle.partitions | See analysis 2 |
| spark.sql.adaptive.coalescePartitions.minPartitionNum | (none) | The minimum number of shuffle partitions after coalescing; defaults to the default parallelism of the Spark cluster | See analysis 3 |
| spark.sql.adaptive.advisoryPartitionSizeInBytes | 64MB | The advisory size of a shuffle partition, used when coalescing partitions and when handling skewed joins | See analysis 3 |
| spark.sql.adaptive.skewJoin.enabled | true | Whether to adaptively handle data skew in joins | |
| spark.sql.adaptive.skewJoin.skewedPartitionFactor | 5 | Skew-detection factor; a partition is skewed only if it exceeds both this factor times the median and skewedPartitionThresholdInBytes | See analysis 4 |
| spark.sql.adaptive.skewJoin.skewedPartitionThresholdInBytes | 256MB | Skew-detection threshold; must be exceeded together with skewedPartitionFactor | See analysis 4 |
| spark.sql.adaptive.logLevel | debug | Log level for adaptive-execution plan changes | Raise to info to observe plan changes |
| spark.sql.adaptive.nonEmptyPartitionRatioForBroadcastJoin | 0.2 | If the ratio of non-empty partitions is lower than this value, the join will not be converted to a broadcast join | See analysis 5 |

Analysis 1

In OptimizeSkewedJoin.scala we can see ADVISORY_PARTITION_SIZE_IN_BYTES, which is where spark.sql.adaptive.advisoryPartitionSizeInBytes is referenced (OptimizeSkewedJoin is a physical-plan rule):

  /**
   * The goal of skew join optimization is to make the data distribution more even. The target size
   * to split skewed partitions is the average size of non-skewed partition, or the
   * advisory partition size if avg size is smaller than it.
   */
  private def targetSize(sizes: Seq[Long], medianSize: Long): Long = {
    val advisorySize = conf.getConf(SQLConf.ADVISORY_PARTITION_SIZE_IN_BYTES)
    val nonSkewSizes = sizes.filterNot(isSkewed(_, medianSize))
    // It's impossible that all the partitions are skewed, as we use median size to define skew.
    math.max(advisorySize, nonSkewSizes.sum / nonSkewSizes.length)
  }

Among them:

  1. nonSkewSizes is the set of the task's non-skewed partition sizes
  2. targetSize returns max(average size of non-skewed partitions, advisorySize), where advisorySize is the value of spark.sql.adaptive.advisoryPartitionSizeInBytes; so targetSize is not necessarily the spark.sql.adaptive.advisoryPartitionSizeInBytes value
  3. medianSize is the median of the task's partition sizes
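
A standalone Python sketch of the targetSize logic (an illustration, not Spark's code; the constants mirror the defaults from the table above):

```python
# Simplified stand-ins for the AQE configs (defaults from the table above).
ADVISORY_PARTITION_SIZE = 64 * 1024 * 1024      # advisoryPartitionSizeInBytes
SKEWED_PARTITION_FACTOR = 5                     # skewJoin.skewedPartitionFactor
SKEWED_PARTITION_THRESHOLD = 256 * 1024 * 1024  # skewedPartitionThresholdInBytes

def is_skewed(size, median_size):
    # Both conditions must hold, mirroring OptimizeSkewedJoin.isSkewed.
    return (size > median_size * SKEWED_PARTITION_FACTOR
            and size > SKEWED_PARTITION_THRESHOLD)

def target_size(sizes, median_size):
    # Average of the non-skewed partitions, floored by the advisory size.
    non_skew = [s for s in sizes if not is_skewed(s, median_size)]
    return max(ADVISORY_PARTITION_SIZE, sum(non_skew) // len(non_skew))
```

For example, with three 10MB partitions and one 2GB partition, the 2GB partition is skewed and the target becomes the 64MB advisory size, since the non-skewed average (10MB) is smaller.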

Analysis 2


def numShufflePartitions: Int = {
    if (adaptiveExecutionEnabled && coalesceShufflePartitionsEnabled) {
      getConf(COALESCE_PARTITIONS_INITIAL_PARTITION_NUM).getOrElse(defaultNumShufflePartitions)
    } else {
      defaultNumShufflePartitions
    }
  }
Starting from Spark 3.0.1, if AQE and shuffle partition coalescing are both enabled, spark.sql.adaptive.coalescePartitions.initialPartitionNum is used as the initial number of shuffle partitions. When there are multiple shuffle stages, increasing this number can effectively improve the effect of shuffle partition coalescing.
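
The selection logic can be sketched in Python against a plain config dict (an illustration, not Spark's SQLConf):

```python
# Toy sketch of how the initial shuffle partition number is chosen: with AQE
# and coalescing both enabled, initialPartitionNum (if set) wins, otherwise
# spark.sql.shuffle.partitions is used.
def num_shuffle_partitions(conf):
    aqe = conf.get("spark.sql.adaptive.enabled", False)
    coalesce = conf.get("spark.sql.adaptive.coalescePartitions.enabled", True)
    default = conf.get("spark.sql.shuffle.partitions", 200)
    if aqe and coalesce:
        return conf.get(
            "spark.sql.adaptive.coalescePartitions.initialPartitionNum", default)
    return default
```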

Analysis 3

In CoalesceShufflePartitions.scala, CoalesceShufflePartitions (a physical-plan rule) performs the following actions:

 if (!shuffleStages.forall(_.shuffle.canChangeNumPartitions)) {
      plan
    } else {
      // `ShuffleQueryStageExec#mapStats` returns None when the input RDD has 0 partitions,
      // we should skip it when calculating the `partitionStartIndices`.
      val validMetrics = shuffleStages.flatMap(_.mapStats)

      // We may have different pre-shuffle partition numbers, don't reduce shuffle partition number
      // in that case. For example when we union fully aggregated data (data is arranged to a single
      // partition) and a result of a SortMergeJoin (multiple partitions).
      val distinctNumPreShufflePartitions =
        validMetrics.map(stats => stats.bytesByPartitionId.length).distinct
      if (validMetrics.nonEmpty && distinctNumPreShufflePartitions.length == 1) {
        // We fall back to Spark default parallelism if the minimum number of coalesced partitions
        // is not set, so to avoid perf regressions compared to no coalescing.
        val minPartitionNum = conf.getConf(SQLConf.COALESCE_PARTITIONS_MIN_PARTITION_NUM)
          .getOrElse(session.sparkContext.defaultParallelism)
        val partitionSpecs = ShufflePartitionsUtil.coalescePartitions(
          validMetrics.toArray,
          advisoryTargetSize = conf.getConf(SQLConf.ADVISORY_PARTITION_SIZE_IN_BYTES),
          minNumPartitions = minPartitionNum)
        // This transformation adds new nodes, so we must use `transformUp` here.
        val stageIds = shuffleStages.map(_.id).toSet
        plan.transformUp {
          // even for shuffle exchange whose input RDD has 0 partition, we should still update its
          // `partitionStartIndices`, so that all the leaf shuffles in a stage have the same
          // number of output partitions.
          case stage: ShuffleQueryStageExec if stageIds.contains(stage.id) =>
            CustomShuffleReaderExec(stage, partitionSpecs, COALESCED_SHUFFLE_READER_DESCRIPTION)
        }
      } else {
        plan
      }
    }

In other words:

  1. If the partitioning was specified by the user, e.g. via a repartition call, spark.sql.adaptive.coalescePartitions.minPartitionNum has no effect and partition coalescing is skipped
  2. If multiple shuffle stages are involved and they have different partition counts, spark.sql.adaptive.coalescePartitions.minPartitionNum likewise has no effect and partition coalescing is skipped
  3. See the ShufflePartitionsUtil.coalescePartitions analysis below
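
These two skip conditions can be sketched as follows (hypothetical helper, not Spark code):

```python
# Coalescing only proceeds if every shuffle may change its partition number
# (no user-specified repartition) and all shuffles report the same number of
# shuffle partitions.
def can_coalesce(can_change_num_partitions, partition_counts):
    if not all(can_change_num_partitions):
        return False  # e.g. the user called repartition(n)
    return len(partition_counts) > 0 and len(set(partition_counts)) == 1
```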

Analysis 4

In OptimizeSkewedJoin.scala we see:

  /**
   * A partition is considered as a skewed partition if its size is larger than the median
   * partition size * ADAPTIVE_EXECUTION_SKEWED_PARTITION_FACTOR and also larger than
   * SKEW_JOIN_SKEWED_PARTITION_THRESHOLD.
   */
  private def isSkewed(size: Long, medianSize: Long): Boolean = {
    size > medianSize * conf.getConf(SQLConf.SKEW_JOIN_SKEWED_PARTITION_FACTOR) &&
      size > conf.getConf(SQLConf.SKEW_JOIN_SKEWED_PARTITION_THRESHOLD)
  }

  1. OptimizeSkewedJoin is a physical-plan rule. It uses isSkewed to decide whether a partition is skewed: a partition counts as skewed only if it satisfies both SKEW_JOIN_SKEWED_PARTITION_FACTOR and SKEW_JOIN_SKEWED_PARTITION_THRESHOLD
  2. medianSize is the median of the task's partition sizes
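
As a standalone illustration (plain Python, not Spark code; the constants mirror the two defaults), the median-based skew check applied to a list of partition sizes:

```python
SKEWED_PARTITION_FACTOR = 5                     # skewJoin.skewedPartitionFactor
SKEWED_PARTITION_THRESHOLD = 256 * 1024 * 1024  # skewedPartitionThresholdInBytes

def median_size(sizes):
    # Middle element of the sorted sizes, floored at 1, as in OptimizeSkewedJoin.
    s = sorted(sizes)
    return max(s[len(s) // 2], 1)

def skewed_partitions(sizes):
    # Indices of partitions that satisfy BOTH skew conditions.
    med = median_size(sizes)
    return [i for i, size in enumerate(sizes)
            if size > med * SKEWED_PARTITION_FACTOR
            and size > SKEWED_PARTITION_THRESHOLD]
```

Note that a large partition alone is not enough: four uniform 300MB partitions all exceed the 256MB threshold but none exceeds five times the median, so none is treated as skewed.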

Analysis 5

The reOptimize method is invoked from AdaptiveSparkPlanExec.getFinalPhysicalPlan; reOptimize re-runs optimization on the logical plan.

private def reOptimize(logicalPlan: LogicalPlan): (SparkPlan, LogicalPlan) = {
    val optimized = optimizer.execute(logicalPlan)
    val sparkPlan = context.session.sessionState.planner.plan(ReturnAnswer(optimized)).next()
    val newPlan = applyPhysicalRules(sparkPlan, preprocessingRules ++ queryStagePreparationRules)
    (newPlan, optimized)
  }

There is a DemoteBroadcastHashJoin rule in the optimizer:

@transient private val optimizer = new RuleExecutor[LogicalPlan] {
    // TODO add more optimization rules
    override protected def batches: Seq[Batch] = Seq(
      Batch("Demote BroadcastHashJoin", Once, DemoteBroadcastHashJoin(conf))
    )
  }

DemoteBroadcastHashJoin decides whether a join may still be planned as a broadcast join:

case class DemoteBroadcastHashJoin(conf: SQLConf) extends Rule[LogicalPlan] {

  private def shouldDemote(plan: LogicalPlan): Boolean = plan match {
    case LogicalQueryStage(_, stage: ShuffleQueryStageExec) if stage.resultOption.isDefined
      && stage.mapStats.isDefined =>
      val mapStats = stage.mapStats.get
      val partitionCnt = mapStats.bytesByPartitionId.length
      val nonZeroCnt = mapStats.bytesByPartitionId.count(_ > 0)
      partitionCnt > 0 && nonZeroCnt > 0 &&
        (nonZeroCnt * 1.0 / partitionCnt) < conf.nonEmptyPartitionRatioForBroadcastJoin
    case _ => false
  }

  def apply(plan: LogicalPlan): LogicalPlan = plan.transformDown {
    case j @ Join(left, right, _, _, hint) =>
      var newHint = hint
      if (!hint.leftHint.exists(_.strategy.isDefined) && shouldDemote(left)) {
        newHint = newHint.copy(leftHint =
          Some(hint.leftHint.getOrElse(HintInfo()).copy(strategy = Some(NO_BROADCAST_HASH))))
      }
      if (!hint.rightHint.exists(_.strategy.isDefined) && shouldDemote(right)) {
        newHint = newHint.copy(rightHint =
          Some(hint.rightHint.getOrElse(HintInfo()).copy(strategy = Some(NO_BROADCAST_HASH))))
      }
      if (newHint.ne(hint)) {
        j.copy(hint = newHint)
      } else {
        j
      }
  }
}

shouldDemote decides whether the broadcast join should be demoted:

  1. The plan must be a ShuffleQueryStageExec whose map output statistics are available
  2. If the ratio of non-empty partitions is lower than spark.sql.adaptive.nonEmptyPartitionRatioForBroadcastJoin, the join side is demoted, i.e. a SortMergeJoin will not be converted into a broadcast join
  3. This easily happens in group-by scenarios in SQL, where many shuffle partitions end up empty
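
A Python sketch of that ratio test (a simplification of shouldDemote, not the actual rule):

```python
NON_EMPTY_PARTITION_RATIO = 0.2  # nonEmptyPartitionRatioForBroadcastJoin default

def should_demote(bytes_by_partition):
    # Demote (i.e. forbid broadcast hash join) when too few partitions hold data.
    partition_cnt = len(bytes_by_partition)
    non_zero_cnt = sum(1 for b in bytes_by_partition if b > 0)
    return (partition_cnt > 0 and non_zero_cnt > 0
            and non_zero_cnt / partition_cnt < NON_EMPTY_PARTITION_RATIO)
```

For example, a fully aggregated side whose data landed in 1 of 10 partitions (ratio 0.1) is demoted, while a side with data in every partition is not.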

ShufflePartitionsUtil.coalescePartitions analysis (the core code for merging partitions)

See coalescePartitions:

def coalescePartitions(
      mapOutputStatistics: Array[MapOutputStatistics],
      advisoryTargetSize: Long,
      minNumPartitions: Int): Seq[ShufflePartitionSpec] = {
    // If `minNumPartitions` is very large, it is possible that we need to use a value less than
    // `advisoryTargetSize` as the target size of a coalesced task.
    val totalPostShuffleInputSize = mapOutputStatistics.map(_.bytesByPartitionId.sum).sum
    // The max at here is to make sure that when we have an empty table, we only have a single
    // coalesced partition.
    // There is no particular reason that we pick 16. We just need a number to prevent
    // `maxTargetSize` from being set to 0.
    val maxTargetSize = math.max(
      math.ceil(totalPostShuffleInputSize / minNumPartitions.toDouble).toLong, 16)
    val targetSize = math.min(maxTargetSize, advisoryTargetSize)

    val shuffleIds = mapOutputStatistics.map(_.shuffleId).mkString(", ")
    logInfo(s"For shuffle($shuffleIds), advisory target size: $advisoryTargetSize, " +
      s"actual target size $targetSize.")

    // Make sure these shuffles have the same number of partitions.
    val distinctNumShufflePartitions =
      mapOutputStatistics.map(stats => stats.bytesByPartitionId.length).distinct
    // The reason that we are expecting a single value of the number of shuffle partitions
    // is that when we add Exchanges, we set the number of shuffle partitions
    // (i.e. map output partitions) using a static setting, which is the value of
    // `spark.sql.shuffle.partitions`. Even if two input RDDs are having different
    // number of partitions, they will have the same number of shuffle partitions
    // (i.e. map output partitions).
    assert(
      distinctNumShufflePartitions.length == 1,
      "There should be only one distinct value of the number of shuffle partitions " +
        "among registered Exchange operators.")

    val numPartitions = distinctNumShufflePartitions.head
    val partitionSpecs = ArrayBuffer[CoalescedPartitionSpec]()
    var latestSplitPoint = 0
    var coalescedSize = 0L
    var i = 0
    while (i < numPartitions) {
      // We calculate the total size of i-th shuffle partitions from all shuffles.
      var totalSizeOfCurrentPartition = 0L
      var j = 0
      while (j < mapOutputStatistics.length) {
        totalSizeOfCurrentPartition += mapOutputStatistics(j).bytesByPartitionId(i)
        j += 1
      }

      // If including the `totalSizeOfCurrentPartition` would exceed the target size, then start a
      // new coalesced partition.
      if (i > latestSplitPoint && coalescedSize + totalSizeOfCurrentPartition > targetSize) {
        partitionSpecs += CoalescedPartitionSpec(latestSplitPoint, i)
        latestSplitPoint = i
        // reset postShuffleInputSize.
        coalescedSize = totalSizeOfCurrentPartition
      } else {
        coalescedSize += totalSizeOfCurrentPartition
      }
      i += 1
    }
    partitionSpecs += CoalescedPartitionSpec(latestSplitPoint, numPartitions)
    partitionSpecs
  }

  1. totalPostShuffleInputSize first computes the total size of the shuffle data
  2. maxTargetSize is max(totalPostShuffleInputSize / minNumPartitions, 16), where minNumPartitions is the value of spark.sql.adaptive.coalescePartitions.minPartitionNum
  3. targetSize is min(maxTargetSize, advisoryTargetSize), where advisoryTargetSize is the value of spark.sql.adaptive.advisoryPartitionSizeInBytes; so that value is only advisory and is not necessarily the actual targetSize
  4. The while loop merges adjacent partitions: partitions are accumulated into one coalesced partition until adding the next one would exceed targetSize
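
The four steps above can be sketched as a self-contained Python function (an approximation of ShufflePartitionsUtil.coalescePartitions; the input is the per-shuffle bytesByPartitionId arrays, the output is (start, end) reducer ranges):

```python
def coalesce_partitions(bytes_by_partition_per_shuffle, advisory_target, min_partitions):
    # Step 1: total shuffle data size across all shuffles.
    total = sum(sum(p) for p in bytes_by_partition_per_shuffle)
    # Step 2: cap the target so we keep at least min_partitions partitions
    # (ceil division; 16 is the same arbitrary floor as in Spark).
    max_target = max(-(-total // min_partitions), 16)
    # Step 3: the advisory size is only advisory.
    target = min(max_target, advisory_target)
    # All shuffles are assumed to have the same partition count (Spark asserts this).
    num_partitions = len(bytes_by_partition_per_shuffle[0])
    specs, start, size = [], 0, 0
    # Step 4: greedily merge adjacent partitions up to the target size.
    for i in range(num_partitions):
        cur = sum(s[i] for s in bytes_by_partition_per_shuffle)
        if i > start and size + cur > target:
            specs.append((start, i))
            start, size = i, cur
        else:
            size += cur
    specs.append((start, num_partitions))
    return specs
```

For example, one shuffle with partition sizes [10, 30, 20, 10, 50] and a 64-byte advisory target coalesces into the ranges (0, 3) and (3, 5).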

OptimizeSkewedJoin.optimizeSkewJoin analysis (the core code for data-skew optimization)

See optimizeSkewJoin:

def optimizeSkewJoin(plan: SparkPlan): SparkPlan = plan.transformUp {
    case smj @ SortMergeJoinExec(_, _, joinType, _,
        s1 @ SortExec(_, _, ShuffleStage(left: ShuffleStageInfo), _),
        s2 @ SortExec(_, _, ShuffleStage(right: ShuffleStageInfo), _), _)
        if supportedJoinTypes.contains(joinType) =>
      assert(left.partitionsWithSizes.length == right.partitionsWithSizes.length)
      val numPartitions = left.partitionsWithSizes.length
      // Use the median size of the actual (coalesced) partition sizes to detect skewed partitions.
      val leftMedSize = medianSize(left.partitionsWithSizes.map(_._2))
      val rightMedSize = medianSize(right.partitionsWithSizes.map(_._2))
      logDebug(
        s"""
          |Optimizing skewed join.
          |Left side partitions size info:
          |${getSizeInfo(leftMedSize, left.partitionsWithSizes.map(_._2))}
          |Right side partitions size info:
          |${getSizeInfo(rightMedSize, right.partitionsWithSizes.map(_._2))}
          """.stripMargin)
      val canSplitLeft = canSplitLeftSide(joinType)
      val canSplitRight = canSplitRightSide(joinType)
      // We use the actual partition sizes (may be coalesced) to calculate target size, so that
      // the final data distribution is even (coalesced partitions + split partitions).
      val leftActualSizes = left.partitionsWithSizes.map(_._2)
      val rightActualSizes = right.partitionsWithSizes.map(_._2)
      val leftTargetSize = targetSize(leftActualSizes, leftMedSize)
      val rightTargetSize = targetSize(rightActualSizes, rightMedSize)

      val leftSidePartitions = mutable.ArrayBuffer.empty[ShufflePartitionSpec]
      val rightSidePartitions = mutable.ArrayBuffer.empty[ShufflePartitionSpec]
      val leftSkewDesc = new SkewDesc
      val rightSkewDesc = new SkewDesc
      for (partitionIndex <- 0 until numPartitions) {
        val isLeftSkew = isSkewed(leftActualSizes(partitionIndex), leftMedSize) && canSplitLeft
        val leftPartSpec = left.partitionsWithSizes(partitionIndex)._1
        val isLeftCoalesced = leftPartSpec.startReducerIndex + 1 < leftPartSpec.endReducerIndex

        val isRightSkew = isSkewed(rightActualSizes(partitionIndex), rightMedSize) && canSplitRight
        val rightPartSpec = right.partitionsWithSizes(partitionIndex)._1
        val isRightCoalesced = rightPartSpec.startReducerIndex + 1 < rightPartSpec.endReducerIndex

        // A skewed partition should never be coalesced, but skip it here just to be safe.
        val leftParts = if (isLeftSkew && !isLeftCoalesced) {
          val reducerId = leftPartSpec.startReducerIndex
          val skewSpecs = createSkewPartitionSpecs(
            left.mapStats.shuffleId, reducerId, leftTargetSize)
          if (skewSpecs.isDefined) {
            logDebug(s"Left side partition $partitionIndex is skewed, split it into " +
              s"${skewSpecs.get.length} parts.")
            leftSkewDesc.addPartitionSize(leftActualSizes(partitionIndex))
          }
          skewSpecs.getOrElse(Seq(leftPartSpec))
        } else {
          Seq(leftPartSpec)
        }

        // A skewed partition should never be coalesced, but skip it here just to be safe.
        val rightParts = if (isRightSkew && !isRightCoalesced) {
          val reducerId = rightPartSpec.startReducerIndex
          val skewSpecs = createSkewPartitionSpecs(
            right.mapStats.shuffleId, reducerId, rightTargetSize)
          if (skewSpecs.isDefined) {
            logDebug(s"Right side partition $partitionIndex is skewed, split it into " +
              s"${skewSpecs.get.length} parts.")
            rightSkewDesc.addPartitionSize(rightActualSizes(partitionIndex))
          }
          skewSpecs.getOrElse(Seq(rightPartSpec))
        } else {
          Seq(rightPartSpec)
        }

        for {
          leftSidePartition <- leftParts
          rightSidePartition <- rightParts
        } {
          leftSidePartitions += leftSidePartition
          rightSidePartitions += rightSidePartition
        }
      }

      logDebug("number of skewed partitions: " +
        s"left ${leftSkewDesc.numPartitions}, right ${rightSkewDesc.numPartitions}")
      if (leftSkewDesc.numPartitions > 0 || rightSkewDesc.numPartitions > 0) {
        val newLeft = CustomShuffleReaderExec(
          left.shuffleStage, leftSidePartitions, leftSkewDesc.toString)
        val newRight = CustomShuffleReaderExec(
          right.shuffleStage, rightSidePartitions, rightSkewDesc.toString)
        smj.copy(
          left = s1.copy(child = newLeft), right = s2.copy(child = newRight), isSkewJoin = true)
      } else {
        smj
      }
  }

  1. The SortMergeJoinExec pattern shows that this applies to sort-merge joins
  2. assert(left.partitionsWithSizes.length == right.partitionsWithSizes.length) ensures that the two sides of the join have the same number of partitions
  3. Compute the median partition size of each side of the join: leftMedSize and rightMedSize
  4. Compute the target partition size of each side of the join: leftTargetSize and rightTargetSize
  5. Loop over the partitions of both sides and check each for skew; a partition is split only if it is skewed and has not already been coalesced, otherwise it is left alone
  6. createSkewPartitionSpecs works as follows:
    1. Get the per-map output sizes of the skewed partition of each join side
    2. Split them into multiple slices of roughly targetSize
  7. If data skew was found, the result is wrapped in a CustomShuffleReaderExec for the subsequent tasks; eventually the compute method of ShuffledRowRDD matches the PartialReducerPartitionSpec case to read the data, which automatically enables "spark.sql.adaptive.fetchShuffleBlocksInBatch" batch fetching
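
A simplified Python sketch of steps 6 and 7 (hypothetical helpers, not Spark's exact splitSizeListByTargetSize): split one skewed reducer's per-map output sizes into ranges of roughly the target size, then pair every left split with every right split as in the nested loop above:

```python
def split_by_target_size(map_sizes, target):
    # Group the per-map output sizes of one skewed reducer into
    # (startMapIndex, endMapIndex) ranges of roughly `target` bytes each;
    # each range becomes one split read by a separate task.
    splits, start, acc = [], 0, 0
    for i, s in enumerate(map_sizes):
        if i > start and acc + s > target:
            splits.append((start, i))
            start, acc = i, s
        else:
            acc += s
    splits.append((start, len(map_sizes)))
    return splits

def pair_skew_splits(left_parts, right_parts):
    # Mirror of the nested for-comprehension above: every left split is paired
    # with every right split, so the task count is len(left) * len(right).
    return [(l, r) for l in left_parts for r in right_parts]
```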

Where are OptimizeSkewedJoin / CoalesceShufflePartitions invoked?

See AdaptiveSparkPlanExec:

@transient private val queryStageOptimizerRules: Seq[Rule[SparkPlan]] = Seq(
    ReuseAdaptiveSubquery(conf, context.subqueryCache),
    CoalesceShufflePartitions(context.session),
    // The following two rules need to make use of 'CustomShuffleReaderExec.partitionSpecs'
    // added by `CoalesceShufflePartitions`. So they must be executed after it.
    OptimizeSkewedJoin(conf),
    OptimizeLocalShuffleReader(conf)
  )

As can be seen, both rules are invoked in AdaptiveSparkPlanExec, and CoalesceShufflePartitions runs before OptimizeSkewedJoin.
AdaptiveSparkPlanExec is inserted by InsertAdaptiveSparkPlan,
and InsertAdaptiveSparkPlan is invoked from QueryExecution.

InsertAdaptiveSparkPlan decides whether to apply AQE via its shouldApplyAQE and supportAdaptive methods:

private def shouldApplyAQE(plan: SparkPlan, isSubquery: Boolean): Boolean = {
    conf.getConf(SQLConf.ADAPTIVE_EXECUTION_FORCE_APPLY) || isSubquery || {
      plan.find {
        case _: Exchange => true
        case p if !p.requiredChildDistribution.forall(_ == UnspecifiedDistribution) => true
        case p => p.expressions.exists(_.find {
          case _: SubqueryExpression => true
          case _ => false
        }.isDefined)
      }.isDefined
    }
  }

private def supportAdaptive(plan: SparkPlan): Boolean = {
    // TODO migrate dynamic-partition-pruning onto adaptive execution.
    sanityCheck(plan) &&
      !plan.logicalLink.exists(_.isStreaming) &&
      !plan.expressions.exists(_.find(_.isInstanceOf[DynamicPruningSubquery]).isDefined) &&
      plan.children.forall(supportAdaptive)
  }

If the above conditions are not met, AQE is not enabled. It can also be force-enabled by setting spark.sql.adaptive.forceApply to true (the documentation flags this as an internal configuration).

Note:

In Spark 3.0.1, the old configuration spark.sql.adaptive.shuffle.targetPostShuffleInputSize has been deprecated in favor of spark.sql.adaptive.advisoryPartitionSizeInBytes.

