[transfer] detailed explanation of the parsing process of spark SQL initial experience


I haven’t updated my blog for a long time. I learned some R language and machine learning content and took some notes. Later, I will put them on my blog for everyone to share. I planned to update spark SQL a month ago because some other things were delayed. Today, I’ll simply write something, spark1 2 is coming out soon. I don’t know if the change will be great. It is said that a lot of new functions have been added. I’m looking forward to it

First of all, the code of this version is 1.1, which is 1.0.

Spark supports two modes. One is to write SQL directly in spark. You can query objects through SQL, similar to Net, another HQL that supports hive. Either way, the steps mentioned below will be different. The difference is the specific implementation process. Let’s talk about this process.

SQL parses into logicplan

Use the shortcut key Ctrl + Shift + n of idea to open the sqlquerysuite file for debugging.

  def sql(sqlText: String): SchemaRDD = {
    if (dialect == "sql") {
      new SchemaRDD(this, parseSql(sqlText))
    } else {
      sys.error(s"Unsupported SQL dialect: $dialect")

It can be seen from here that the first step is to parse the SQL and finally convert it into a schemardd. Click to enter the parsesql function and find that the process of parsing SQL is in the sqlparser class. In the apply method of sqlparser, we can see this code in the else statement.

//Parse the input and return success if it conforms to the query mode
      phrase(query)(new lexical.Scanner(input)) match {
        case Success(r, x) => r
        case x => sys.error(x.toString)

Here we mainly focus on query.

  protected lazy val query: Parser[LogicalPlan] = (
    select * (
        UNION ~ ALL ^^^ { (q1: LogicalPlan, q2: LogicalPlan) => Union(q1, q2) } |
        INTERSECT ^^^ { (q1: LogicalPlan, q2: LogicalPlan) => Intersect(q1, q2) } |
        EXCEPT ^^^ { (q1: LogicalPlan, q2: LogicalPlan) => Except(q1, q2)} |
        UNION ~ opt(DISTINCT) ^^^ { (q1: LogicalPlan, q2: LogicalPlan) => Distinct(Union(q1, q2)) }
    | insert | cache

There are many operators that you can’t understand. Please go to the following website to learn. It can be seen here that the SQL statements it currently supports are only select and insert.


Let’s continue to look at select.

//~ > only keep the mode opt on the right. Optional ~ synthesize opt (distinct) ~ projects in order~
    opt(from) ~ opt(filter) ~
    opt(grouping) ~
    opt(having) ~
    opt(orderBy) ~
        val base = r.getOrElse(NoRelation)
        val withFilter = f.map(f => Filter(f, base)).getOrElse(base)
        val withProjection =
          g.map {g =>
            Aggregate(assignAliases(g), assignAliases(p), withFilter)
          }.getOrElse(Project(assignAliases(p), withFilter))
        val withDistinct = d.map(_ => Distinct(withProjection)).getOrElse(withProjection)
        val withHaving = h.map(h => Filter(h, withDistinct)).getOrElse(withDistinct)
        val withOrder = o.map(o => Sort(o, withHaving)).getOrElse(withHaving)
        val withLimit = l.map { l => Limit(l, withOrder) }.getOrElse(withOrder)

It can be seen that its parsing of SQL is consistent with our common SQL writing method. There is recursion further down, which is not so easy to understand. I won’t go on here. When parsing hive, I will focus on it. I think we still use more data sets from hive. After all, hive is so stable.

From here, we can know that the first step is to parse SQL into a logicplan through parser.

Conversion process from logicplan to RDD

OK, let’s go back to the code just now, and then we should look at schemardd.

  override def compute(split: Partition, context: TaskContext): Iterator[Row] =
    firstParent[Row].compute(split, context).map(_.copy())

  override def getPartitions: Array[Partition] = firstParent[Row].partitions

  override protected def getDependencies: Seq[Dependency[_]] =
    List(new OneToOneDependency(queryExecution.toRdd))

If schemardd is an RDD, its three most important attributes: compute function, partition and dependency are all in it. We won’t look at other functions.

It’s strange that the RDD we created from new can’t be relied on. What’s the queryexecution? Click in and see the code jump to the schemarddlike inherited by schemardd.

lazy val queryExecution = sqlContext.executePlan(baseLogicalPlan)

protected[sql] def executePlan(plan: LogicalPlan): this.QueryExecution =
    new this.QueryExecution { val logical = plan }

Put these two short pieces of code together. The executeplan method is to create a queryexecution. Let’s continue to look at the queryexecution class.

lazy val analyzed = ExtractPythonUdfs(analyzer(logical))
    lazy val optimizedPlan = optimizer(analyzed)
    lazy val sparkPlan = {
    //Add shuffle operation when necessary
    lazy val executedPlan: SparkPlan = prepareForExecution(sparkPlan)
    lazy val toRdd: RDD[Row] = executedPlan.execute()

It can be seen from here that the logicplan has gone through five steps of conversion. It needs to be processed by analyzer and optimizer, and then converted to sparkplan. Before execution, it needs to be processed by prepareforexecution, and finally it calls the execute method to convert to rdd

Now let’s talk about what these things are doing step by step.

First of all, let’s take a look at anayzer, which is inherited from ruleexecutor. Here’s an aside. When Michael armtrust, the author of spark SQL, introduced catalyst on spark submit in 2013, he said that it is very difficult to optimize the execution of an SQL as a whole. All of them are designed as this optimization method based on a small rule, which is simple and easy to maintain.

OK, let’s take a look at the apply method of ruleexecutor.

def apply(plan: TreeType): TreeType = {
    var curPlan = plan
    //The rules also deal with plans in batches
    batches.foreach { batch =>
      val batchStartPlan = curPlan
      var iteration = 1
      var lastPlan = curPlan
      var continue = true

      // Run until fix point (or the max number of iterations as specified in the strategy.
      while (continue) {
        //The plan is processed one by one from left to right with the small rules of batch
        curPlan = batch.rules.foldLeft(curPlan) {
          case (plan, rule) =>
            val result = rule(plan)
        iteration += 1
        //If the specified number of iterations is exceeded, it will exit
        if (iteration > batch.strategy.maxIterations) {
              continue = false
        //The plan that has been processed successfully will change. If it is the same as the plan contacted in the last processing, it indicates that there is no room for optimization and can be ended. This is the fixed point mentioned above
        if (curPlan.fastEquals(lastPlan)) {
          continue = false
        lastPlan = curPlan


After reading the ruleexecutor, let’s continue to look at analyzer. Next, I only post the code of batches, and the rest should be read by ourselves.

val batches: Seq[Batch] = Seq(
    //When encountering the logicplan inherited from the multiinstancerelations interface, if the ID is found to be repeated later, they are forced to generate a new globally unique ID
    //Inmemoryrelation, logicregion, parquetrelation and sparklogicplan are involved
    Batch("MultiInstanceRelations", Once,
    //If case insensitive, change all attributes to lowercase
    Batch("CaseInsensitiveAttributeReferences", Once,
      (if (caseSensitive) Nil else LowercaseAttributeReferences :: Nil) : _*),
    //This cow wants to iterate 100 times.
    Batch("Resolution", fixedPoint,
      //Parsing the attributes generated from the operation of child nodes is generally caused by aliases, such as a.id
      ResolveReferences ::
      //Resolving table names through catalog
      ResolveRelations ::
      //In the select language, the attribute of order by is often not written in the front. When querying, you also need to find out these fields and delete them after sorting
      ResolveSortReferences ::
      //As mentioned earlier
      NewRelationInstances ::
      //Clear attributes that are mistaken for aliases, such as sum (score) as a. in fact, it should be sum (score)
      //When it is parsed, it is parsed into project (SEQ (alias (G: generator,)), Just return to the generator directly
      ImplicitGenerate ::
      //Handle *, such as select *, count (*)
      StarExpansion ::
      //Analytic function
      ResolveFunctions ::
      //Resolve global aggregation functions, such as select sum (score) from table
      GlobalAggregates ::
      //Parse the following filter condition of the clause (having > sum, for example)
      UnresolvedHavingClauseAttributes ::
      //Typecoercionrules are hive's type conversion rules
      typeCoercionRules :_*),
    //Check whether the attributes of all nodes have been processed. If there are still unresolved attributes, an error will be reported here!
    Batch("Check Analysis", Once,
    //Clear the redundant operators, now subquery and lowercaseschema,
    //The first is the subquery. The second hivecontext query tree converts all the subnodes into lowercase
    Batch("AnalysisOperators", fixedPoint,

It can be seen that analyzer parses the unresolved logicplan into resolved, and parses the table name, field, function and alias.

Let’s continue to look at optimizer. In terms of words, it is used for optimization, but in terms of code, it is more to filter some garbage statements we write, and does not do any actual optimization.

object Optimizer extends RuleExecutor[LogicalPlan] {
  val batches =
      //Recursively merge two adjacent limits
    Batch("Combine Limits", FixedPoint(100),
      CombineLimits) ::
    Batch("ConstantFolding", FixedPoint(100),
      //Replace null value
      //Replace some simple constant expressions, such as 1 in (1,2) and directly return a true
      //Simplify like statements and avoid full table scanning. Currently, '% demo%', '% demo','demo * ','demo' is supported
      //Simplify the filtering conditions, such as replacing true and score > 0 with score > 0
      //Simplify the filter, such as where 1 = 1 or where 1 = 2. The former directly removes the filter, and the latter query is unnecessary
      //Simplify conversion. For example, if the data types of two comparison fields are the same, conversion is not required
      //Simplify case conversion, for example, from upper (upper ('a ')) to upper ('a')
      SimplifyCaseConversionExpressions) ::
    Batch("Filter Pushdown", FixedPoint(100),
      //Recursively merge two adjacent filter conditions
      //Replace the filter from the expression with, filter first, then get the expression, and remove the alias attribute in the filter
      //Typical example: select * from (select a, B from table) where a = 1
      //Replace with select * from (select a, B from table where a = 1)
      //Filter those in the on condition of the join that can be filtered in the original table first
      //For example, select a, B from X join y on x.id = y.id and X.A > 0 and y.b > 0
      //This statement can be rewritten as select a, B from x where X.A > 0 join (select * from y where y.b > 0) on x.id = y.id
      //Remove some useless columns
      ColumnPruning) :: Nil

It’s really well intentioned. It seems that we should pay attention to it when writing SQL. You see how much effort people spend to optimize our bad SQL… If I don’t optimize… If you write badly, go slow!

Next, let’s change this sentence to Planner (optimized plan) Next () let’s take a look at sparkplanner first.

protected[sql] class SparkPlanner extends SparkStrategies {
    val sparkContext: SparkContext = self.sparkContext

    val sqlContext: SQLContext = self

    def codegenEnabled = self.codegenEnabled

    def numPartitions = self.numShufflePartitions
    //Convert logicplan into actual operation. The specific operation class is at org apache. spark. sql. Under the execution package
    val strategies: Seq[Strategy] =
      //Convert cache, set and export commands into actual commands
      CommandStrategy(self) ::
      //Convert limit to takeordered operation
      TakeOrdered ::
      //The name is a bit confusing, that is, the conversion aggregation operation
      HashAggregation ::
      //Left semi join only displays the information of the table on the left when the connection condition is established
      //For example, select * from table1 left semi join table2 on (table1. Student_no = table2. Student_no);
      //It only displays students in table1_ No is the information in Table 2, which can be used to replace the exist statement
      LeftSemiJoin ::
      //Equivalent connection operation, some optimized contents, if the size of the table is less than spark sql. Bytes set by autobroadcastjointhreshold
      //It is automatically converted to broadcasthashjoin, that is, caching the table, similar to hive's map join (the order is to judge the right table first and then the right table).
      //The default value of this parameter is 10000
      //In addition, during internal connection, the size of the left table and the right table will be judged. The shuffle will take the data, the large table will not move, and the data will be pulled from the small table for calculation
      HashJoin ::
      //Execute the select statement in memory for filtering and caching
      InMemoryScans ::
      //Parquet related operations
      ParquetOperations ::
      //Basic operation
      BasicOperators ::
      //Unconditional connections or inner connections are Cartesian products
      CartesianProduct ::
      //Connect nestedloop to broadcast connection
      BroadcastNestedLoopJoin :: Nil


This step is to convert the logical plan into a physical plan, or an execution plan. There are many concepts I haven’t heard of before. I didn’t know until I checked it onlinedatabaseThere are still so many statements about the implementation plan of the database, which needs to be understood by those who specialize in the database. The remaining two steps are prepareforexecute and execute.

The prepareforexecution operation is to check whether the distribution in the physical plan meets the requirements of partitioning. If it does not meet the requirements, you need to re partition and add a shuffle operation. This is not understood for the time being and needs to be studied carefully in the future. Finally, call the execute method of sparkplan. Here we will talk about the tree structure.

[transfer] detailed explanation of the parsing process of spark SQL initial experience

SQL parses out a binary tree structure, which is used for both logical plans and physical plans. Therefore, in the code, you can see that the specific implementation classes of logicplan and sparkplan inherit the three types of nodes in the above figure.

The execute method of sparkplan that is not a leafnode will have such a sentence as child Execute (), because it needs to execute the execution of child nodes to return data. The execution process is a pre order traversal.

Finally, this process is also represented by a diagram, which is convenient for memory.

[transfer] detailed explanation of the parsing process of spark SQL initial experience

(1) A parser is used to convert SQL statements into unresolved logicplan. At present, there are two kinds of parsers, sqlparser and hiveql.

(2) Resolve the unresolved content in logicplan into resolved through analyzer, including table name, function, field, alias, etc.

(3) Filter out some garbage SQL statements through optimizer.

(4) Use strategies to convert logical plans into physical plans that can be executed. Specific classes include sparkstrategies and hivestrategies.

(5) Use the prepareforexecution method to check before execution.

(6) Traversal first, and call the execute method of the execution plan tree.