I haven't updated this blog for a long time. I've been learning some R and machine learning and taking notes, which I'll post here later to share. I planned to write about Spark SQL a month ago but got delayed by other things, so today I'll just write something short. Spark 1.2 is coming out soon; I don't know how big the changes will be, but it's said that a lot of new features have been added, and I'm looking forward to it.
First of all, note that the code discussed here is from version 1.1.
Spark SQL supports two modes. One lets you write SQL directly in Spark to query objects, somewhat like LINQ in .NET; the other supports Hive's HQL. Whichever mode you use, the steps described below differ only in their concrete implementation. Let's walk through the process.
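Before diving into the internals, here is a minimal usage sketch of the "sql" dialect path this article follows (Spark 1.1 style; the Person class and the people table are made up purely for illustration):

```scala
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.sql.SQLContext

// Hypothetical schema used only for this sketch
case class Person(name: String, age: Int)

object SqlModeDemo {
  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(new SparkConf().setAppName("sql-demo").setMaster("local"))
    val sqlContext = new SQLContext(sc)
    import sqlContext.createSchemaRDD          // implicitly turns RDD[Person] into a SchemaRDD

    val people = sc.parallelize(Seq(Person("a", 20), Person("b", 16)))
    people.registerTempTable("people")

    // The "sql" dialect entry point that the rest of this article walks through:
    val adults = sqlContext.sql("SELECT name FROM people WHERE age > 18")
    adults.collect().foreach(println)

    // The HQL path would go through org.apache.spark.sql.hive.HiveContext instead.
    sc.stop()
  }
}
```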
Parsing SQL into a LogicalPlan
In IDEA, use Ctrl + Shift + N to open the SQLQuerySuite file, which is a convenient place to set breakpoints and debug.
def sql(sqlText: String): SchemaRDD = {
  // Only the "sql" dialect is handled here; parseSql turns the text into a LogicalPlan
  if (dialect == "sql") {
    new SchemaRDD(this, parseSql(sqlText))
  } else {
    sys.error(s"Unsupported SQL dialect: $dialect")
  }
}
From this we can see that the first step is to parse the SQL text, and the result is finally wrapped into a SchemaRDD. Clicking into the parseSql function shows that the actual parsing happens in the SqlParser class. In SqlParser's apply method, in the else branch, we can see this code:
// Parse the input; if it matches the query production, return the parsed plan, otherwise report an error
phrase(query)(new lexical.Scanner(input)) match {
case Success(r, x) => r
case x => sys.error(x.toString)
}
Here we mainly focus on query.
protected lazy val query: Parser[LogicalPlan] = (
  select * (
    UNION ~ ALL ^^^ { (q1: LogicalPlan, q2: LogicalPlan) => Union(q1, q2) } |
    INTERSECT ^^^ { (q1: LogicalPlan, q2: LogicalPlan) => Intersect(q1, q2) } |
    EXCEPT ^^^ { (q1: LogicalPlan, q2: LogicalPlan) => Except(q1, q2) } |
    UNION ~ opt(DISTINCT) ^^^ { (q1: LogicalPlan, q2: LogicalPlan) => Distinct(Union(q1, q2)) }
  )
  | insert | cache
)
There are quite a few combinator operators here that can be hard to read; the Scala parser-combinator documentation below explains them. You can also see that the top-level statements currently supported are only select and insert (plus the cache command).
http://www.scala-lang.org/api/2.10.4/index.html#scala.util.parsing.combinator.Parsers$Parser
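If those combinators still look cryptic, here is a tiny standalone example (nothing to do with Spark, just Scala's parser combinators from the standard library) showing what ~, ~>, opt, ^^, ^^^ and | do:

```scala
import scala.util.parsing.combinator.JavaTokenParsers

// A toy parser for "hello <name>" with an optional trailing "!"
object ToyParser extends JavaTokenParsers {
  // ~  sequences two parsers and keeps both results
  // ~> keeps only the right-hand result (<~ keeps only the left)
  // ^^ transforms the parsed result, ^^^ replaces it with a constant, | tries alternatives
  def greeting: Parser[String] =
    ("hello" ~> ident) ~ opt("!" ^^^ "(excited)") ^^ {
      case name ~ suffix => name + suffix.getOrElse("")
    }

  def main(args: Array[String]): Unit = {
    println(parseAll(greeting, "hello spark !"))  // Success: spark(excited)
    println(parseAll(greeting, "hello spark"))    // Success: spark
  }
}
```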
Let’s continue to look at select.
// ~> keeps only the result of the parser on its right, opt(...) makes a parser optional,
// and ~ chains parsers in sequence: opt(DISTINCT) ~ projections ~ opt(from) ~ ...
protected lazy val select: Parser[LogicalPlan] =
  SELECT ~> opt(DISTINCT) ~ projections ~
  opt(from) ~ opt(filter) ~
  opt(grouping) ~
  opt(having) ~
  opt(orderBy) ~
  opt(limit) ^^ {
    case d ~ p ~ r ~ f ~ g ~ h ~ o ~ l =>
      val base = r.getOrElse(NoRelation)
      val withFilter = f.map(f => Filter(f, base)).getOrElse(base)
      val withProjection =
        g.map { g =>
          Aggregate(assignAliases(g), assignAliases(p), withFilter)
        }.getOrElse(Project(assignAliases(p), withFilter))
      val withDistinct = d.map(_ => Distinct(withProjection)).getOrElse(withProjection)
      val withHaving = h.map(h => Filter(h, withDistinct)).getOrElse(withDistinct)
      val withOrder = o.map(o => Sort(o, withHaving)).getOrElse(withHaving)
      val withLimit = l.map { l => Limit(l, withOrder) }.getOrElse(withOrder)
      withLimit
  }
You can see that the way it parses SQL matches the way we normally write it: the FROM clause forms the base relation, and then Filter, Project/Aggregate, Distinct, Having, Sort and finally Limit are wrapped around it from the inside out. Further down there is recursion that is not so easy to follow, so I won't go deeper here; I'll focus on it when covering the Hive parsing path, since I think most of us work with data sets in Hive anyway. After all, Hive is so stable.
From here we know that the first step is to parse the SQL into a LogicalPlan through the parser.
From LogicalPlan to RDD
OK, let's go back to the earlier code; the next thing to look at is SchemaRDD.
override def compute(split: Partition, context: TaskContext): Iterator[Row] =
  firstParent[Row].compute(split, context).map(_.copy())

override def getPartitions: Array[Partition] = firstParent[Row].partitions

override protected def getDependencies: Seq[Dependency[_]] =
  List(new OneToOneDependency(queryExecution.toRdd))
Since SchemaRDD is an RDD, its three most important attributes are all here: the compute function, the partitions and the dependencies. We won't look at the other methods.
Strangely, the RDD we just created with new has no parent RDD to depend on, so what is this queryExecution? Clicking in, the code jumps to SchemaRDDLike, which SchemaRDD inherits from.
lazy val queryExecution = sqlContext.executePlan(baseLogicalPlan)

protected[sql] def executePlan(plan: LogicalPlan): this.QueryExecution =
  new this.QueryExecution { val logical = plan }
Putting these two short snippets together: the executePlan method simply creates a QueryExecution. Let's continue and look at the QueryExecution class.
lazy val analyzed = ExtractPythonUdfs(analyzer(logical))
lazy val optimizedPlan = optimizer(analyzed)
lazy val sparkPlan = {
  SparkPlan.currentContext.set(self)
  planner(optimizedPlan).next()
}
// Add shuffle operations where necessary
lazy val executedPlan: SparkPlan = prepareForExecution(sparkPlan)
lazy val toRdd: RDD[Row] = executedPlan.execute()
From this we can see that the LogicalPlan goes through five steps of conversion: it is processed by the analyzer and the optimizer, then turned into a SparkPlan; before execution it is processed by prepareForExecution, and finally the execute method is called to turn it into an RDD.
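These intermediate results are all lazy vals on QueryExecution, so a handy way to watch the whole chain is to print a SchemaRDD's queryExecution (a sketch reusing the sqlContext and people table from the earlier example; the exact output format may vary by version):

```scala
val schemaRDD = sqlContext.sql("SELECT name FROM people WHERE age > 18")

// toString of the QueryExecution shows the parsed logical plan, the analyzed plan,
// the optimized plan and the chosen physical plan in one shot.
println(schemaRDD.queryExecution)

// Executing the physical plan (the toRdd step above) is what ultimately happens
// when an action is called on the SchemaRDD:
schemaRDD.collect().foreach(println)
```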
Now let’s talk about what these things are doing step by step.
First, let's look at the Analyzer, which extends RuleExecutor. A quick aside: when Michael Armbrust, the author of Spark SQL, introduced Catalyst at Spark Summit 2013, he said that optimizing the execution of a whole SQL statement in one go is very hard, so everything is designed as many small rules; this approach is simple and easy to maintain.
OK, let's look at RuleExecutor's apply method.
def apply(plan: TreeType): TreeType = {
  var curPlan = plan

  // The rules are applied to the plan batch by batch
  batches.foreach { batch =>
    val batchStartPlan = curPlan
    var iteration = 1
    var lastPlan = curPlan
    var continue = true

    // Run until fix point (or the max number of iterations as specified in the strategy).
    while (continue) {
      // Apply the batch's small rules to the plan one by one, from left to right
      curPlan = batch.rules.foldLeft(curPlan) {
        case (plan, rule) =>
          val result = rule(plan)
          result
      }
      iteration += 1
      // Stop once the batch's maximum number of iterations has been exceeded
      if (iteration > batch.strategy.maxIterations) {
        continue = false
      }
      // If the plan no longer changes compared with the previous iteration, there is nothing
      // left to optimize and we can stop; this is the fix point mentioned above
      if (curPlan.fastEquals(lastPlan)) {
        continue = false
      }
      lastPlan = curPlan
    }
  }
  curPlan
}
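To make the "many small rules" idea concrete, here is a minimal sketch of what such a rule can look like, modeled on the built-in SimplifyFilters (package paths as in Spark 1.1; this is an illustration, not a rule that ships with Spark):

```scala
import org.apache.spark.sql.catalyst.expressions.Literal
import org.apache.spark.sql.catalyst.plans.logical.{Filter, LogicalPlan}
import org.apache.spark.sql.catalyst.rules.Rule
import org.apache.spark.sql.catalyst.types.BooleanType

// A rule only pattern-matches the plan shapes it cares about and rewrites them;
// transform leaves every other node in the tree untouched.
object RemoveTrueFilters extends Rule[LogicalPlan] {
  def apply(plan: LogicalPlan): LogicalPlan = plan transform {
    // WHERE true is a no-op, so drop the Filter node and keep its child
    case Filter(Literal(true, BooleanType), child) => child
  }
}
```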
Having read RuleExecutor, let's go back to the Analyzer. Below I only paste the batches definition; read the rest yourself.
val batches: Seq[Batch] = Seq(
  // For LogicalPlans that mix in MultiInstanceRelation, force a new, globally unique
  // expression ID whenever a duplicate ID is detected later
  // InMemoryRelation, LocalRelation, ParquetRelation and SparkLogicalPlan are involved
  Batch("MultiInstanceRelations", Once,
    NewRelationInstances),
  // If case-insensitive, convert all attributes to lowercase
  Batch("CaseInsensitiveAttributeReferences", Once,
    (if (caseSensitive) Nil else LowercaseAttributeReferences :: Nil) : _*),
  // This batch runs to a fixed point, up to 100 iterations
  Batch("Resolution", fixedPoint,
    // Resolve attributes produced by the operations of child nodes, usually introduced by aliases such as a.id
    ResolveReferences ::
    // Resolve table names through the catalog
    ResolveRelations ::
    // ORDER BY may reference attributes that are not in the SELECT list; they have to be
    // fetched for sorting and dropped again afterwards
    ResolveSortReferences ::
    // Same rule as mentioned earlier
    NewRelationInstances ::
    // When a projection is parsed into Project(Seq(Alias(g: Generator, _))), unwrap the alias
    // and return the generator directly
    ImplicitGenerate ::
    // Handle *, e.g. select *, count(*)
    StarExpansion ::
    // Resolve functions
    ResolveFunctions ::
    // Resolve global aggregate functions, e.g. select sum(score) from table
    GlobalAggregates ::
    // Resolve attributes referenced in a HAVING clause, e.g. having sum(score) > 400
    UnresolvedHavingClauseAttributes ::
    // typeCoercionRules are Hive's type conversion rules
    typeCoercionRules :_*),
  // Check whether the attributes of all nodes have been resolved; if anything is still
  // unresolved, an error is reported here!
  Batch("Check Analysis", Once,
    CheckResolution),
  // Remove redundant operators, currently Subquery and LowerCaseSchema:
  // the first is for subqueries, the second is used by HiveContext to convert the
  // query tree's child nodes to lowercase
  Batch("AnalysisOperators", fixedPoint,
    EliminateAnalysisOperators)
)
So the Analyzer turns an unresolved LogicalPlan into a resolved one, resolving table names, fields, functions and aliases.
Let's continue with the Optimizer. The name suggests optimization, but looking at the code it mostly cleans up the sloppy statements we write rather than doing any deep optimization.
object Optimizer extends RuleExecutor[LogicalPlan] {
  val batches =
    // Recursively merge two adjacent LIMITs
    Batch("Combine Limits", FixedPoint(100),
      CombineLimits) ::
    Batch("ConstantFolding", FixedPoint(100),
      // Propagate nulls, replacing expressions that can only evaluate to null
      NullPropagation,
      // Fold simple constant expressions, e.g. 1 in (1, 2) is replaced directly with true
      ConstantFolding,
      // Simplify LIKE patterns that don't need a real regex match; currently
      // '%demo%', '%demo', 'demo%' and 'demo' are handled
      LikeSimplification,
      // Simplify boolean conditions, e.g. true and score > 0 becomes score > 0
      BooleanSimplification,
      // Simplify filters such as where 1 = 1 or where 1 = 2: the former drops the filter
      // entirely, the latter means the query cannot return anything
      SimplifyFilters,
      // Remove unnecessary casts, e.g. when the target type is already the expression's type
      SimplifyCasts,
      // Simplify case-conversion expressions, e.g. upper(upper('a')) becomes upper('a')
      SimplifyCaseConversionExpressions) ::
    Batch("Filter Pushdown", FixedPoint(100),
      // Recursively merge two adjacent filter conditions
      CombineFilters,
      // Push filters below projections: filter first, then project, replacing aliased
      // attributes in the filter with the underlying expressions
      // Typical example: select * from (select a, b from table) where a = 1
      // is rewritten as select * from (select a, b from table where a = 1)
      PushPredicateThroughProject,
      // Push predicates in a join's ON condition that only touch one table down to that table
      // e.g. select a, b from x join y on x.id = y.id and x.a > 0 and y.b > 0
      // can be rewritten as select a, b from (select * from x where x.a > 0)
      // join (select * from y where y.b > 0) on x.id = y.id
      PushPredicateThroughJoin,
      // Prune columns that are not needed
      ColumnPruning) :: Nil
}
Well intentioned indeed. It seems we should still pay attention when writing SQL; look how much effort goes into optimizing our bad SQL. Without this optimization, badly written queries would just run slowly!
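To see this in action, feed the engine a deliberately sloppy statement and compare the plans before and after optimization (again a sketch reusing the earlier sqlContext and people table; whether the subquery syntax parses exactly like this depends on the version):

```scala
// "WHERE 1 = 1" and filtering outside the subquery are exactly the kinds of things
// ConstantFolding, SimplifyFilters and PushPredicateThroughProject clean up.
val sloppy = sqlContext.sql(
  "SELECT name FROM (SELECT name, age FROM people) t WHERE 1 = 1 AND age > 18")

// In the printed QueryExecution, the optimized logical plan should no longer contain
// the trivial filter, and age > 18 should sit below the inner projection.
println(sloppy.queryExecution)
```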
Next, let's look at this line: planner(optimizedPlan).next(). First, a look at SparkPlanner.
protected[sql] class SparkPlanner extends SparkStrategies {
  val sparkContext: SparkContext = self.sparkContext
  val sqlContext: SQLContext = self
  def codegenEnabled = self.codegenEnabled
  def numPartitions = self.numShufflePartitions

  // Convert the LogicalPlan into actual operations; the concrete operator classes live
  // under the org.apache.spark.sql.execution package
  val strategies: Seq[Strategy] =
    // Convert cache, set and explain statements into actual commands
    CommandStrategy(self) ::
    // Convert a LIMIT on top of an ORDER BY into a TakeOrdered operation
    TakeOrdered ::
    // The name is a bit confusing: this converts aggregation operations
    HashAggregation ::
    // LEFT SEMI JOIN returns only the left table's rows for which the join condition holds
    // e.g. select * from table1 left semi join table2 on (table1.student_no = table2.student_no)
    // shows only the rows of table1 whose student_no also appears in table2; it can replace an exists/in subquery
    LeftSemiJoin ::
    // Equi-join, with some optimizations: if a table is smaller than the number of bytes set by
    // spark.sql.autoBroadcastJoinThreshold, it is automatically converted to a BroadcastHashJoin,
    // i.e. the small table is broadcast and cached, similar to Hive's map join
    // (the right table is checked first, then the left). The default value of this parameter is 10000.
    // In addition, for an inner join the sizes of the left and right tables are compared and,
    // after the shuffle, the hash table is built from the smaller side
    HashJoin ::
    // Scan tables that have been cached in memory, applying the select's filters on them
    InMemoryScans ::
    // Parquet-related operations
    ParquetOperations ::
    // Basic operations
    BasicOperators ::
    // Joins without a join condition (and inner joins whose condition can't be handled above)
    // become Cartesian products
    CartesianProduct ::
    // Everything else falls back to BroadcastNestedLoopJoin, a broadcast nested-loop join
    BroadcastNestedLoopJoin :: Nil
  ......
}
This step converts the logical plan into a physical plan, i.e. an execution plan. There are many concepts here I had never heard of before; only after searching online did I realize how much there is to say about database execution plans, which is really the territory of database specialists. The remaining two steps are prepareForExecution and execute.
The prepareForExecution step checks whether the data distribution in the physical plan satisfies the partitioning requirements of each operator; if not, it repartitions by adding a shuffle operation. I don't understand this part well yet and need to study it carefully later. Finally, the execute method of the SparkPlan is called. Here we should talk about the tree structure.
![Spark SQL plan tree structure](https://imgs.developpaper.com/imgs/sa8a2grt5e.png)
The SQL is parsed into a tree structure, and the same structure is used for both the logical plan and the physical plan. That's why, in the code, the concrete implementation classes of LogicalPlan and SparkPlan all inherit from the three kinds of nodes shown in the figure above.
For any SparkPlan that is not a LeafNode, the execute method contains a call like child.execute(), because it needs its child nodes to execute first and hand back their data. The execution is a pre-order traversal of this tree.
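As a toy illustration of that structure (these are not the real SparkPlan classes, just a mock of the idea):

```scala
import org.apache.spark.rdd.RDD

// A leaf node produces an RDD directly; a non-leaf node's execute() first calls
// child.execute() and then applies its own operation on top of the returned RDD.
trait ToyPlan { def execute(): RDD[Seq[Any]] }

case class ToyLeaf(data: RDD[Seq[Any]]) extends ToyPlan {                          // like LeafNode
  override def execute(): RDD[Seq[Any]] = data
}

case class ToyFilter(pred: Seq[Any] => Boolean, child: ToyPlan) extends ToyPlan {  // like UnaryNode
  override def execute(): RDD[Seq[Any]] = child.execute().filter(pred)
}

case class ToyUnion(left: ToyPlan, right: ToyPlan) extends ToyPlan {               // like BinaryNode
  override def execute(): RDD[Seq[Any]] = left.execute().union(right.execute())
}
```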
Finally, the whole process is summarized in one diagram, which makes it easier to remember.
![Spark SQL parsing process](https://imgs.developpaper.com/imgs/ay1btedrfg.png)
(1) A parser converts the SQL statement into an unresolved LogicalPlan. There are currently two parsers, SqlParser and HiveQl.
(2) The analyzer resolves the unresolved parts of the LogicalPlan (table names, functions, fields, aliases and so on) into a resolved LogicalPlan.
(3) The optimizer cleans up the garbage in our SQL statements.
(4) Strategies convert the logical plan into an executable physical plan; the concrete classes are SparkStrategies and HiveStrategies.
(5) The prepareForExecution method does the final checks and preparation before execution.
(6) The execution plan tree is traversed in pre-order, calling each node's execute method.