Spark SQL source code – physical execution plan node operations

Time: 2021-05-11

SparkStrategies: from logical to physical

As an implementation-independent query optimization framework, Catalyst only defines the interface for turning an optimized logical execution plan into an actual physical execution plan; unlike the Analyzer and Optimizer, it does not provide an implementation for this step.

This article walks through the operator implementations behind each physical execution plan node in the Spark SQL component. Mapping an optimized logical plan to physical operators is the job of the SparkStrategies class: building on the Strategy interface provided by Catalyst, it implements a set of strategies that recognize LogicalPlan subclasses and replace them with the appropriate SparkPlan subclasses.

The inheritance hierarchy of SparkPlan is shown below. The following sections introduce the implementations of its subclasses.

(Figure: SparkPlan class hierarchy)

SparkPlan

There are three main kinds of nodes: LeafNode, UnaryNode, and BinaryNode.

Implementation classes:

(Figure: SparkPlan implementation classes)

SparkPlan provides four methods for subclasses to override:

// TODO: Move to `DistributedPlan`
/** Specifies how data is partitioned across different nodes in the cluster. */
def outputPartitioning: Partitioning = UnknownPartitioning(0) // TODO: WRONG WIDTH!

/** Specifies any partition requirements on the input data for this operator. */
def requiredChildDistribution: Seq[Distribution] =
  Seq.fill(children.size)(UnspecifiedDistribution)

def execute(): RDD[Row]

def executeCollect(): Array[Row] = execute().collect()

The Distribution and Partitioning classes are used to describe how data is distributed. The following categories exist; their names are largely self-explanatory.

(Figure: Distribution and Partitioning class hierarchy)
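For reference, a rough Scala sketch of those categories is below; the constructor shapes are inferred from the code quoted later in this article, the Expression and SortOrder traits are mere placeholders, and the real Catalyst definitions are richer:

// Rough sketch only: placeholder types standing in for the real Catalyst classes.
trait Expression
trait SortOrder

// Distribution: what an operator requires of its input data.
sealed trait Distribution
case object UnspecifiedDistribution extends Distribution                  // no requirement
case object AllTuples extends Distribution                                // all rows in one partition
case class ClusteredDistribution(clustering: Seq[Expression]) extends Distribution
case class OrderedDistribution(ordering: Seq[SortOrder]) extends Distribution

// Partitioning: how an operator's output is actually laid out.
sealed trait Partitioning
case class UnknownPartitioning(numPartitions: Int) extends Partitioning
case class HashPartitioning(expressions: Seq[Expression], numPartitions: Int) extends Partitioning
case class RangePartitioning(ordering: Seq[SortOrder], numPartitions: Int) extends Partitioning
case object SinglePartition extends Partitioning
case object BroadcastPartitioning extends Partitioning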

LeafNode

ExistingRdd

Let's first introduce the concepts of Row and GenericRow.

Row represents one row of output data and provides getXxx(i: Int) accessor methods:

trait Row extends Seq[Any] with Serializable

The supported data types include Int, Long, Double, Float, Boolean, Short, Byte, and String. A column's value is read by ordinal, and isNullAt(i: Int) should be checked before reading.

The corresponding MutableRow class provides setXxx(i: Int, value: Any) methods, which modify (set) the value at a given ordinal.

GenericRow is a convenient implementation of Row backed by an array:

class GenericRow(protected[catalyst] val values: Array[Any]) extends Row

Accordingly, value access and null checks are translated into index lookups on that array.

It also has a corresponding GenericMutableRow class whose values can be modified (set).
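As a minimal illustration of this array-backed behavior (a simplified stand-in, not the real Catalyst class), ordinal access and null checks reduce to array indexing:

// Simplified stand-in for GenericRow, for illustration only.
class SimpleRow(values: Array[Any]) {
  def apply(i: Int): Any = values(i)
  def isNullAt(i: Int): Boolean = values(i) == null
  def getInt(i: Int): Int = values(i).asInstanceOf[Int]
  def getString(i: Int): String = values(i).asInstanceOf[String]
}

val row = new SimpleRow(Array("alice", 30, null))
if (!row.isNullAt(1)) println(row.getInt(1))   // prints 30
println(row.isNullAt(2))                        // true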

ExistingRdd converts an RDD bound to a case class into an RDD[Row], extracting the case class's output attributes via reflection. The companion object that performs the conversion, and the case class itself, are as follows:

object ExistingRdd {
  def convertToCatalyst(a: Any): Any = a match {
    case s: Seq[Any] => s.map(convertToCatalyst)
    case p: Product => new GenericRow(p.productIterator.map(convertToCatalyst).toArray)
    case other => other
  }

  // Maps an RDD[A] to an RDD[Row] by converting each row of A.
  def productToRowRdd[A <: Product](data: RDD[A]): RDD[Row] = {
    // TODO: Reuse the row, don't use map on the product iterator.  Maybe code gen?
    data.map(r => new GenericRow(r.productIterator.map(convertToCatalyst).toArray): Row)
  }

  def fromProductRdd[A <: Product : TypeTag](productRdd: RDD[A]) = {
    ExistingRdd(ScalaReflection.attributesFor[A], productToRowRdd(productRdd))
  }
}

case class ExistingRdd(output: Seq[Attribute], rdd: RDD[Row]) extends LeafNode {
  def execute() = rdd
}
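A hypothetical usage sketch (the Person case class and the local SparkContext are invented for illustration; it assumes ExistingRdd and its dependencies from the Spark SQL execution package are on the classpath):

import org.apache.spark.{SparkConf, SparkContext}

// Hypothetical case class used only for illustration.
case class Person(name: String, age: Int)

val sc = new SparkContext(new SparkConf().setAppName("existing-rdd-demo").setMaster("local"))
val people = sc.parallelize(Seq(Person("alice", 30), Person("bob", 25)))

// Reflection extracts the attributes (name: String, age: Int) and each Person
// becomes a GenericRow via productToRowRdd.
val plan = ExistingRdd.fromProductRdd(people)
val rows = plan.execute()   // RDD[Row], ready for downstream physical operators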

UnaryNode

Aggregate

An implicit conversion is declared to extend RDDs with partition-local operations:

/* Implicit conversions */
import org.apache.spark.rdd.PartitionLocalRDDFunctions._

It groups the input data by groupingExpressions and computes the aggregateExpressions for each group; the child parameter is the input data source.

case class Aggregate(
    partial: Boolean,
    groupingExpressions: Seq[Expression],
    aggregateExpressions: Seq[NamedExpression],
    child: SparkPlan)(@transient sc: SparkContext)

At construction time, the partial parameter indicates whether the aggregation is performed only locally within each partition, or whether the data must first be distributed across partitions according to the grouping expressions. The logic is as follows:

override def requiredChildDistribution =
  if (partial) {
    // Partial (local) aggregation: no particular distribution is required.
    UnspecifiedDistribution :: Nil
  } else {
    if (groupingExpressions == Nil) {
      // No grouping expressions: all tuples must end up in a single partition.
      AllTuples :: Nil
    } else {
      // Otherwise the data is clustered across the cluster by the grouping expressions.
      ClusteredDistribution(groupingExpressions) :: Nil
    }
  }

The most important execute() method:

def execute() = attachTree(this, "execute") {
  // An implicit conversion here produces PartitionLocalRDDFunctions.
  val grouped = child.execute().mapPartitions { iter =>
    val buildGrouping = new Projection(groupingExpressions)
    iter.map(row => (buildGrouping(row), row.copy()))
  }.groupByKeyLocally()  // The result here is an RDD[(K, Seq[V])].

  val result = grouped.map { case (group, rows) =>
    // Resolve the concrete implementations of the aggregateExpressions:
    // traverse the aggregateExpressions and create a new instance of each.
    val aggImplementations = createAggregateImplementations()

    // Pull out all the functions so we can feed each row into them.
    val aggFunctions = aggImplementations.flatMap(_ collect { case f: AggregateFunction => f })

    rows.foreach { row =>
      aggFunctions.foreach(_.update(row))
    }
    buildRow(aggImplementations.map(_.apply(group)))
  }

  // TODO: THIS BREAKS PIPELINING, DOUBLE COMPUTES THE ANSWER, AND USES TOO MUCH MEMORY...
  if (groupingExpressions.isEmpty && result.count == 0) {
    // When there is no output to the Aggregate operator, we still output an empty row.
    val aggImplementations = createAggregateImplementations()
    sc.makeRDD(buildRow(aggImplementations.map(_.apply(null))) :: Nil)
  } else {
    result
  }
}

The inheritance hierarchy of AggregateExpression is shown below. This code lives in aggregates.scala in the catalyst expressions package.

(Figure: AggregateExpression class hierarchy)

Its subclasses implement AggregateFunction, which exposes an update(input: Row) operation; a subclass's update consumes each input row and mutates its internal aggregation state.
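A simplified, standalone sketch of that update-per-row pattern (these are not the real Catalyst classes): each function holds mutable state that update accumulates, and the final value is read once all rows of a group have been consumed.

// Simplified stand-ins for illustration; not the real Catalyst AggregateFunction classes.
trait SimpleAggregateFunction {
  def update(input: Seq[Any]): Unit   // called once per input row of the group
  def result: Any                     // read after all rows have been consumed
}

class CountFunction extends SimpleAggregateFunction {
  private var count = 0L
  def update(input: Seq[Any]): Unit = count += 1
  def result: Any = count
}

class SumFunction(ordinal: Int) extends SimpleAggregateFunction {
  private var sum = 0.0
  def update(input: Seq[Any]): Unit = sum += input(ordinal).asInstanceOf[Double]
  def result: Any = sum
}

// Feeding one group's rows through the functions, as Aggregate.execute() does:
val rows = Seq(Seq[Any]("a", 1.0), Seq[Any]("a", 2.5))
val functions = Seq(new CountFunction, new SumFunction(1))
rows.foreach(r => functions.foreach(_.update(r)))
functions.map(_.result)   // List(2, 3.5)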

DebugNode

DebugNode calls execute() on the child SparkPlan it wraps and then prints the elements of the resulting child RDD one by one for inspection.

case class DebugNode(child: SparkPlan) extends UnaryNode

Exchange

case class Exchange(newPartitioning: Partitioning, child: SparkPlan) extends UnaryNode

It applies a new partitioning strategy to the output of a SparkPlan.

The execute() method:

def execute() = attachTree(this, "execute") {
  newPartitioning match {
    case HashPartitioning(expressions, numPartitions) =>
      // Apply the expressions to every row of every partition of the child RDD.
      val rdd = child.execute().mapPartitions { iter =>
        val hashExpressions = new MutableProjection(expressions)
        val mutablePair = new MutablePair[Row, Row]()  // Works like a reusable Tuple2.
        iter.map(r => mutablePair.update(hashExpressions(r), r))
      }
      val part = new HashPartitioner(numPartitions)
      // Build a ShuffledRDD.
      val shuffled = new ShuffledRDD[Row, Row, MutablePair[Row, Row]](rdd, part)
      shuffled.setSerializer(new SparkSqlSerializer(new SparkConf(false)))
      shuffled.map(_._2)  // Keep the second element of each pair (the row itself).

    case RangePartitioning(sortingExpressions, numPartitions) =>
      // TODO: RangePartitioner should take an Ordering.
      implicit val ordering = new RowOrdering(sortingExpressions)

      val rdd = child.execute().mapPartitions { iter =>
        val mutablePair = new MutablePair[Row, Null](null, null)
        iter.map(row => mutablePair.update(row, null))
      }
      val part = new RangePartitioner(numPartitions, rdd, ascending = true)
      val shuffled = new ShuffledRDD[Row, Null, MutablePair[Row, Null]](rdd, part)
      shuffled.setSerializer(new SparkSqlSerializer(new SparkConf(false)))
      shuffled.map(_._1)

    case SinglePartition =>
      child.execute().coalesce(1, shuffle = true)

    case _ => sys.error(s"Exchange not implemented for $newPartitioning")
    // TODO: Handle BroadcastPartitioning.
  }
}
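For intuition, the HashPartitioning branch does roughly what an explicit hash repartition of a keyed RDD does with Spark's public API: derive a key per row, shuffle by that key, then drop the key. A rough analogy (not the Exchange code itself; hashExchange and the use of the first column as the key are invented for illustration):

import org.apache.spark.HashPartitioner
import org.apache.spark.SparkContext._   // pair-RDD implicits on older Spark versions
import org.apache.spark.rdd.RDD

// Rough analogy: pair each row with a key, shuffle by hash of the key, then
// keep only the row, just as Exchange keeps the second element of each pair.
def hashExchange(rows: RDD[Seq[Any]], numPartitions: Int): RDD[Seq[Any]] =
  rows
    .map(row => (row.head, row))                      // key = evaluated expression (here: first column)
    .partitionBy(new HashPartitioner(numPartitions))  // shuffle by key
    .map(_._2)                                        // drop the key, keep the row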

Filter

case class Filter(condition: Expression, child: SparkPlan) extends UnaryNode

def execute() = child.execute().mapPartitions { iter =>
  iter.filter(condition.apply(_).asInstanceOf[Boolean])
}

Generate

case class Generate(
    generator: Generator,
    join: Boolean,
    outer: Boolean,
    child: SparkPlan)
  extends UnaryNode

First, Generator is a subclass of Expression; its inheritance structure is as follows:

(Figure: Generator class hierarchy)

A Generator consumes an input Row and produces zero or more output Rows; the makeOutput() policy is implemented by each subclass.

The Explode class turns every value v in the input collection (an ArrayType or MapType value) into a GenericRow(Array(v)), so its output is a sequence of single-column rows.

Back to the Generate operator:

The join boolean specifies whether the final output should be joined with the original input tuple.

The outer boolean only takes effect when join is true; when outer is true, every input row produces at least one output row.

In general, the Generate operator behaves much like flatMap in functional programming.
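A plain-Scala illustration of that analogy: each input element may yield zero, one, or many output elements, just as a Generator does per row.

// Each element "explodes" into zero or more elements, like Explode over an array column.
val nested = Seq(Seq(1, 2), Seq(), Seq(3))
nested.flatMap(xs => xs.map(x => x * 10))   // List(10, 20, 30): the empty inner Seq produces nothing

The execute() implementation of Generate: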

def execute() = {
  if (join) {
    child.execute().mapPartitions { iter =>
      val nullValues = Seq.fill(generator.output.size)(Literal(null))
      // Used to produce rows with no matches when outer = true.
      val outerProjection =
        new Projection(child.output ++ nullValues, child.output)

      val joinProjection =
        new Projection(child.output ++ generator.output, child.output ++ generator.output)
      val joinedRow = new JoinedRow

      iter.flatMap { row =>
        val outputRows = generator(row)
        if (outer && outputRows.isEmpty) {
          outerProjection(row) :: Nil
        } else {
          outputRows.map(or => joinProjection(joinedRow(row, or)))
        }
      }
    }
  } else {
    child.execute().mapPartitions(iter => iter.flatMap(generator))
  }
}

Project

case class Project(projectList: Seq[NamedExpression], child: SparkPlan) extends UnaryNode

Project execution:

def execute() = child.execute().mapPartitions { iter =>
  @transient val reusableProjection = new MutableProjection(projectList)
  iter.map(reusableProjection)
}

The MutableProjection class is a subclass of Row => Row. It is constructed from a Seq[Expression] and can also take an input schema (a Seq[Attribute]). A MutableProjection maps a Row to a new Row, transforming its columns according to the expressions (and the schema, if provided).
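Conceptually, a projection is just a function from an input row to an output row. A minimal stand-in (not the real MutableProjection, which evaluates Catalyst expressions and reuses a mutable output row) could look like this:

// Simplified stand-in: a projection as a list of per-column functions over the input row.
class SimpleProjection(exprs: Seq[Seq[Any] => Any]) extends (Seq[Any] => Seq[Any]) {
  def apply(input: Seq[Any]): Seq[Any] = exprs.map(e => e(input))
}

val project = new SimpleProjection(Seq(
  row => row(0),                          // keep column 0
  row => row(1).asInstanceOf[Int] + 1     // column 1 + 1
))
project(Seq("alice", 30))   // List(alice, 31)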

Sample

case class Sample(fraction: Double, withReplacement: Boolean, seed: Int, child: SparkPlan) extends UnaryNode

def execute() = child.execute().sample(withReplacement, fraction, seed)

Sample operation of RDD:

def sample(withReplacement: Boolean, fraction: Double, seed: Int): RDD[T] = {
  require(fraction >= 0.0, "Invalid fraction value: " + fraction)
  if (withReplacement) {
    new PartitionwiseSampledRDD[T, T](this, new PoissonSampler[T](fraction), seed)
  } else {
    new PartitionwiseSampledRDD[T, T](this, new BernoulliSampler[T](fraction), seed)
  }
}

The resulting PartitionwiseSampledRDD samples each partition of the parent RDD independently.

PoissonSampler and BernoulliSampler are the two implementations of RandomSampler.
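A small usage example of the public sample API (illustrative only; the fraction and seed are arbitrary):

import org.apache.spark.{SparkConf, SparkContext}

val sc = new SparkContext(new SparkConf().setAppName("sample-demo").setMaster("local"))
val data = sc.parallelize(1 to 1000)

// Bernoulli sampling: each element is kept independently with probability ~0.1.
val without = data.sample(withReplacement = false, fraction = 0.1, seed = 42)
// Poisson sampling: elements may appear multiple times.
val withRep = data.sample(withReplacement = true, fraction = 0.1, seed = 42)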

Sort

case class Sort(
    sortOrder: Seq[SortOrder],
    global: Boolean,
    child: SparkPlan)
  extends UnaryNode

It places a requirement on the child's data distribution:

override def requiredChildDistribution =
  if (global) OrderedDistribution(sortOrder) :: Nil
  else UnspecifiedDistribution :: Nil

The SortOrder class is an implementation of UnaryExpression that defines the sort direction (ascending or descending) for tuples. The class merely declares the sort policy for its child expression; it extends Expression so that it can be applied within an expression subtree.

case class SortOrder(child: Expression, direction: SortDirection) extends UnaryExpression

// RowOrdering extends Ordering[Row].
@transient
lazy val ordering = new RowOrdering(sortOrder)

def execute() = attachTree(this, "sort") {
  // TODO: Optimize sorting operation?
  child.execute()
    .mapPartitions(iterator => iterator.map(_.copy()).toArray.sorted(ordering).iterator,
      preservesPartitioning = true)
}

An implicit conversion is involved here: sorted is a method on Array, and ordering is a RowOrdering, which extends Ordering[Row], i.e. scala.math.Ordering[T].
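The same mechanism in plain Scala: sorted takes an implicit (or explicit) Ordering, so any custom Ordering, like RowOrdering here, can drive array sorting. An illustrative example:

// An Ordering that compares "rows" (here Seq[Any]) by their second column, descending.
val bySecondDesc: Ordering[Seq[Any]] =
  Ordering.by((row: Seq[Any]) => row(1).asInstanceOf[Int]).reverse

val rows = Array(Seq[Any]("a", 2), Seq[Any]("b", 5), Seq[Any]("c", 1))
rows.sorted(bySecondDesc)   // Array(List(b, 5), List(a, 2), List(c, 1))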

StopAfter

case class StopAfter(limit: Int, child: SparkPlan)(@transient sc: SparkContext) extends UnaryNode

StopAfter is essentially a LIMIT operation:

override def executeCollect() = child.execute().map(_.copy()).take(limit)
def execute() = sc.makeRDD(executeCollect(), 1)  // Parallelism is set to 1.

makeRDD essentially calls new ParallelCollectionRDD[T], where seq is the Array[T] returned by take() and numSlices is 1:

/** Distribute a local Scala collection to form an RDD. */
def parallelize[T: ClassTag](seq: Seq[T], numSlices: Int = defaultParallelism): RDD[T] = {
  new ParallelCollectionRDD[T](this, seq, numSlices, Map[Int, Seq[String]]())
}

TopK

case class TopK(limit: Int, sortOrder: Seq[SortOrder], child: SparkPlan)
               (@transient sc: SparkContext) extends UnaryNode

TopK can be understood as a combination of Sort and StopAfter:

@transient
lazy val ordering = new RowOrdering(sortOrder)

override def executeCollect() = child.execute().map(_.copy()).takeOrdered(limit)(ordering)
def execute() = sc.makeRDD(executeCollect(), 1)

takeOrdered(num)(ordering) actually triggers the RDD's top() operation:

def top(num: Int)(implicit ord: Ordering[T]): Array[T] = {
  mapPartitions { items =>
    val queue = new BoundedPriorityQueue[T](num)
    queue ++= items
    Iterator.single(queue)
  }.reduce { (queue1, queue2) =>
    queue1 ++= queue2
    queue1
  }.toArray.sorted(ord.reverse)
}

BoundedPriorityQueue is a data structure in the Spark util package that wraps a priority queue. Its optimization is bounding the queue's size: when an element is added to a full queue, it is compared against the heap and only replaces an existing element if appropriate. This makes it well suited to top-K scenarios.

Therefore each partition only builds a single bounded priority queue of size num before any sorting happens (only the top num elements are ever needed); after the per-partition queues are merged, the real sort is performed and the top num elements are returned.
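Usage-wise, the same top-K behavior is available directly on any RDD (an illustrative example with invented data):

import org.apache.spark.{SparkConf, SparkContext}

val sc = new SparkContext(new SparkConf().setAppName("topk-demo").setMaster("local"))
val scores = sc.parallelize(Seq(7, 42, 3, 99, 15))

scores.top(2)                                  // Array(99, 42): largest first
scores.takeOrdered(2)(Ordering[Int].reverse)   // equivalent: a bounded queue per partition, then merge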

BinaryNode

BroadcastNestedLoopJoin

case class BroadcastNestedLoopJoin(
    streamed: SparkPlan, broadcast: SparkPlan, joinType: JoinType, condition: Option[Expression])
    (@transient sc: SparkContext)
  extends BinaryNode

This is a relatively complex join operation:

def execute() = {
  // First execute the SparkPlan to be broadcast, then broadcast its result.
  val broadcastedRelation =
    sc.broadcast(broadcast.execute().map(_.copy()).collect().toIndexedSeq)

  val streamedPlusMatches = streamed.execute().mapPartitions { streamedIter =>
    val matchedRows = new mutable.ArrayBuffer[Row]
    val includedBroadcastTuples =
      new mutable.BitSet(broadcastedRelation.value.size)
    val joinedRow = new JoinedRow

    streamedIter.foreach { streamedRow =>
      var i = 0
      var matched = false

      while (i < broadcastedRelation.value.size) {
        // TODO: One bitset per partition instead of per row.
        val broadcastedRow = broadcastedRelation.value(i)
        if (boundCondition(joinedRow(streamedRow, broadcastedRow)).asInstanceOf[Boolean]) {
          matchedRows += buildRow(streamedRow ++ broadcastedRow)
          matched = true
          includedBroadcastTuples += i
        }
        i += 1
      }

      if (!matched && (joinType == LeftOuter || joinType == FullOuter)) {
        matchedRows += buildRow(streamedRow ++ Array.fill(right.output.size)(null))
      }
    }
    Iterator((matchedRows, includedBroadcastTuples))
  }

  val includedBroadcastTuples = streamedPlusMatches.map(_._2)
  val allIncludedBroadcastTuples =
    if (includedBroadcastTuples.count == 0) {
      new scala.collection.mutable.BitSet(broadcastedRelation.value.size)
    } else {
      streamedPlusMatches.map(_._2).reduce(_ ++ _)
    }

  val rightOuterMatches: Seq[Row] =
    if (joinType == RightOuter || joinType == FullOuter) {
      broadcastedRelation.value.zipWithIndex.filter {
        case (row, i) => !allIncludedBroadcastTuples.contains(i)
      }.map {
        // TODO: Use projection.
        case (row, _) => buildRow(Vector.fill(left.output.size)(null) ++ row)
      }
    } else {
      Vector()
    }

  // TODO: Breaks lineage.
  sc.union(
    streamedPlusMatches.flatMap(_._1), sc.makeRDD(rightOuterMatches))
}
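Stripped of the outer-join bookkeeping, the core pattern is: broadcast the small side, then nested-loop over it inside each partition of the streamed side. A rough inner-join-only analogy with plain RDDs (the data and the local SparkContext are invented for illustration; this is not the operator's code):

import org.apache.spark.{SparkConf, SparkContext}

val sc = new SparkContext(new SparkConf().setAppName("bnlj-demo").setMaster("local"))

val large = sc.parallelize(Seq(("a", 1), ("b", 2), ("c", 3)))
val smallSide = sc.broadcast(Seq(("a", "x"), ("b", "y")))   // small relation, collected and broadcast

// Nested loop per partition: every streamed row is checked against every broadcast row.
val joined = large.mapPartitions { iter =>
  iter.flatMap { case (k, v) =>
    smallSide.value.collect { case (k2, w) if k2 == k => (k, v, w) }
  }
}
joined.collect()   // Array((a,1,x), (b,2,y))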

CartesianProduct

case class CartesianProduct(left: SparkPlan, right: SparkPlan) extends BinaryNode

It invokes the RDD cartesian operation:

def execute() =
  left.execute().map(_.copy()).cartesian(right.execute().map(_.copy())).map {
    case (l: Row, r: Row) => buildRow(l ++ r)
  }

SparkEquiInnerJoin

case class SparkEquiInnerJoin(
    leftKeys: Seq[Expression],
    rightKeys: Seq[Expression],
    left: SparkPlan,
    right: SparkPlan) extends BinaryNode

This join applies when the left and right children are partitioned identically (the same number of partitions) and each side supplies its own join keys.

The code is largely self-explanatory; the local join uses methods from PartitionLocalRDDFunctions.

def execute() = attachTree(this, "execute") {
  val leftWithKeys = left.execute().mapPartitions { iter =>
    val generateLeftKeys = new Projection(leftKeys, left.output)  // The schema is passed in.
    iter.map(row => (generateLeftKeys(row), row.copy()))
  }

  val rightWithKeys = right.execute().mapPartitions { iter =>
    val generateRightKeys = new Projection(rightKeys, right.output)
    iter.map(row => (generateRightKeys(row), row.copy()))
  }

  // Do the join.
  // joinLocally is a method of PartitionLocalRDDFunctions.
  val joined = filterNulls(leftWithKeys).joinLocally(filterNulls(rightWithKeys))

  // Drop join keys and merge input tuples.
  joined.map { case (_, (leftTuple, rightTuple)) => buildRow(leftTuple ++ rightTuple) }
}

/**
 * Filters any rows where any of the join keys is null, ensuring three-valued
 * logic for the equi-join conditions.
 */
protected def filterNulls(rdd: RDD[(Row, Row)]) =
  rdd.filter {
    case (key: Seq[_], _) => !key.exists(_ == null)
  }

The PartitionLocalRDDFunctions method is shown below. It introduces no shuffle, and the two RDDs must have the same number of partitions.

def joinLocally[W](other: RDD[(K, W)]): RDD[(K, (V, W))] = {
  cogroupLocally(other).flatMapValues {
    case (vs, ws) => for (v <- vs.iterator; w <- ws.iterator) yield (v, w)
  }
}
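As a rough sketch of the same "no shuffle, equal partition count" idea using only the public RDD API (an analogy, not the PartitionLocalRDDFunctions implementation; localJoin and the concrete types are assumptions):

import org.apache.spark.rdd.RDD

// Joins two already co-partitioned RDDs partition-by-partition, with no shuffle.
// Both inputs must have the same number of partitions, and matching keys must be
// co-located -- the same precondition SparkEquiInnerJoin relies on.
def localJoin(left: RDD[(String, Int)], right: RDD[(String, Int)]): RDD[(String, (Int, Int))] =
  left.zipPartitions(right) { (leftIter, rightIter) =>
    val rightByKey = rightIter.toSeq.groupBy(_._1)
    leftIter.flatMap { case (k, v) =>
      rightByKey.getOrElse(k, Nil).map { case (_, w) => (k, (v, w)) }
    }
  }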

Other

Union

This operator extends SparkPlan directly:

case class Union(children: Seq[SparkPlan])(@transient sc: SparkContext) extends SparkPlan

A UnionRDD is built from the execution results (RDDs) of each SparkPlan in the given sequence:

def execute() = sc.union(children.map(_.execute()))