[Calcite] A preliminary exploration of the Apache Calcite framework and its core concepts

Time: 2021-12-24

1. Introduction

What is Calcite? Described in one sentence, Calcite is a foundational framework for optimized query processing over heterogeneous data sources.

Over the past decade, many specialized data processing engines have emerged, such as HBase, Flink, Elasticsearch, and so on. These engines have unique advantages in their respective fields, and in today's complex business scenarios it is difficult to adopt just one of them and abandon the others. As an engine matures, it usually considers introducing SQL support to reduce the learning cost for users, but how to avoid reinventing the wheel each time becomes a big problem. Against this background, Calcite was born: it provides a standard SQL layer, a variety of query optimizations, and the ability to connect to various data sources, while leaving data storage and data management to the engines themselves. At the same time, Calcite has a well-designed pluggable architecture, so we can use only some of its capabilities to build our own SQL engine without depending on all of Calcite. Calcite has therefore become the go-to solution for the SQL layers of many big data frameworks. Our compute engine team also implemented an in-house SQL validation layer based on Calcite: when users submit Flink SQL jobs, the SQL first goes through a round of semantic validation, then the metadata obtained from validation is used to build the job template, and the task is finally submitted to the Flink engine for execution.

Note: at the time of writing, the latest official release of Calcite is v1.27.0, and Flink 1.12 uses Calcite v1.26.0. This article is written against Calcite 1.20, but none of the core content is affected by the version.

2. Core architecture

[Figure: Calcite core architecture]

Figure from the "Apache Calcite" paper

The box in the middle summarizes the core architecture of Calcite. First, Calcite parses and validates an SQL query through the SQL parser and validator to obtain an abstract syntax tree (AST). Since Calcite does not contain a storage layer, it provides another mechanism for defining table schemas and views, the catalog, as the storage space for metadata (in addition, Calcite provides an adapter mechanism to connect to external storage engines and obtain metadata, which is beyond the scope of this article). Calcite then generates the corresponding relational expression tree through the optimizer and optimizes it according to specific rules. The optimizer is the most important piece of logic in Calcite and consists of three components: rules, a metadata provider (the catalog), and the planner engine. These components are explained in detail later in the article.

As we can see from the architecture diagram, the biggest feature (and advantage) of Calcite is that it separates the logic of SQL processing, validation, and optimization, and deliberately omits some key components, such as data storage, the algorithms for processing data, and the repository for storing metadata. Second, the best thing about Calcite is its pluggable mechanism: each big data framework can pick all or some of Calcite's modules to build its own SQL processing engine. For example, Hive implements its own SQL parsing and uses only Calcite's optimization capabilities, while Storm and Flink built their SQL engines entirely on Calcite, as shown in the following table:

[Table: SQL engines built on Calcite]

Table from the "Apache Calcite" paper

2.1 Four stages

[Figure: the four phases of Calcite]

Figure from "Introduction to Calcite"

The operation of the Calcite framework is mainly divided into four stages:

  1. Parse: use the parser generated by JavaCC for lexical and syntactic analysis to obtain the AST;
  2. Validate: validate the AST against the metadata;
  3. Optimize: convert the AST into a tree of relational expressions and optimize it according to specific rules (heuristic or cost-based);
  4. Execute: convert the logical execution plan into engine-specific execution logic, such as Flink's DataStream.

Since step 4 is coupled to the specific engine, we mainly focus on the first three stages.

2.2 Four components

Around this running process, the core framework of Apache Calcite can be divided into four components:

  1. SQL parser: converts SQL text that conforms to the syntax rules into an AST (SQL text → SqlNode). Calcite provides a default parser, but you can also generate a custom parser based on JavaCC;
  2. Catalog: defines and records the metadata and namespaces of SQL, for later access and validation;
  3. SQL validator: validates the AST against the metadata provided by the catalog; the concrete implementation is in SqlValidatorImpl;
  4. Query optimizer: involves many concepts. It first converts the AST into a logical execution plan (SqlNode → RelNode), and then optimizes the logical execution plan with rules.

3. SQL parser

As mentioned above, the SQL parser cuts the SQL text into tokens and generates the AST. Each node of the AST is represented in Calcite by a SqlNode, and a SqlNode can regenerate SQL text through its unparse method. For ease of explanation, we introduce an SQL statement and observe how it changes inside Calcite to understand the principles. For the later validation and optimization sections, we will introduce other SQL statements as each scenario requires.

INSERT INTO sink_table SELECT s.id, name, age FROM source_table s JOIN dim_table d ON s.id=d.id WHERE s.id>1;
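The statement above can be fed to Calcite's default parser to obtain the AST root. Below is a minimal sketch using the core SqlParser API (the default parser configuration is assumed; real projects usually tune quoting and case-sensitivity options):

import org.apache.calcite.sql.SqlNode;
import org.apache.calcite.sql.parser.SqlParseException;
import org.apache.calcite.sql.parser.SqlParser;

public class ParseDemo {
  public static void main(String[] args) throws SqlParseException {
    String sql = "INSERT INTO sink_table "
        + "SELECT s.id, name, age FROM source_table s "
        + "JOIN dim_table d ON s.id = d.id WHERE s.id > 1";
    // Create a parser with the default configuration.
    SqlParser parser = SqlParser.create(sql, SqlParser.configBuilder().build());
    SqlNode ast = parser.parseStmt();    // the root node is a SqlInsert
    System.out.println(ast.getKind());   // INSERT
    System.out.println(ast);             // toString() unparses back to SQL text
  }
}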

3.1 SqlNode

SqlNode is the abstraction of all AST nodes; a concrete node may represent an operator, a constant, or an identifier. Many implementation classes are derived from the SqlNode base class, as follows:

[Figure: SqlNode class hierarchy]

After parsing, an INSERT statement is transformed into a SqlInsert, and a SELECT into a SqlSelect. Taking the above SQL text as an example, parsing produces the following structure:

[Figure: AST of the example statement]

The following explains some of the common core structures of SqlNode, following the figure.

3.1.1 SqlInsert

First, this is a DML statement whose action is INSERT, so the root of the whole AST is a SqlInsert. SqlInsert has several member variables that record the different parts of the INSERT statement:

  • targetTable: records the table to insert into, i.e. sink_table, represented in the AST as a SqlIdentifier;
  • source: identifies the data source. The data source of this INSERT statement is a SELECT clause, represented in the AST as a SqlSelect;
  • columnList: the columns to insert into. It is null here because the columns are not explicitly specified in the INSERT statement; they are derived dynamically during the validation phase.
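Continuing the parser sketch above, these members can be inspected directly; the getter names below are from Calcite's SqlInsert and SqlSelect classes (the surrounding wiring is illustrative):

// 'ast' is the SqlNode returned by parser.parseStmt() in the earlier sketch;
// SqlInsert, SqlSelect, etc. live in org.apache.calcite.sql.
SqlInsert insert = (SqlInsert) ast;
SqlNode targetTable = insert.getTargetTable();      // SqlIdentifier: sink_table
SqlSelect select = (SqlSelect) insert.getSource();  // the SELECT clause
System.out.println(insert.getTargetColumnList());   // null: no explicit columns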

3.1.2 SqlSelect

SqlSelect is the data-source part of the INSERT statement generated by the parser. Its core structure is as follows:

  • selectList: the queried columns that immediately follow the SELECT keyword; it is a SqlNodeList. In this example, because the columns are explicitly specified and there are no function calls, the SqlNodeList contains three SqlIdentifiers;
  • from: the data source of the SELECT statement. In this example the data source is the join of source_table and dim_table, so this is a SqlJoin;
  • where: the WHERE clause, a function call expressing a condition, i.e. a SqlBasicCall. Its operator is the binary operator >, resolved to a SqlBinaryOperator; the two operands are s.id (a SqlIdentifier) and 1 (a SqlNumericLiteral).

3.1.3 SqlJoin

SqlJoin is the FROM part of the SqlSelect generated by the parser:

  • left: the left table of the join. Because we use an alias, this is a SqlBasicCall whose operator is AS, resolved to a SqlAsOperator; the two operands are source_table (a SqlIdentifier) and s (a SqlIdentifier);
  • joinType: the type of the join. All join types that can be parsed are defined in org.apache.calcite.sql.JoinType; here joinType is resolved to a SqlLiteral whose value is JoinType.INNER;
  • right: the right table of the join. Because we use an alias, this is a SqlBasicCall whose operator is AS, resolved to a SqlAsOperator; the two operands are dim_table (a SqlIdentifier) and d (a SqlIdentifier);
  • conditionType: represents the ON keyword, a SqlLiteral;
  • condition: like the WHERE clause in 3.1.2, a condition function call, i.e. a SqlBasicCall. Its operator is the binary operator =, resolved to a SqlBinaryOperator; the two operands are s.id (a SqlIdentifier) and d.id (a SqlIdentifier).

3.1.4 SqlIdentifier

SqlIdentifier translates to "identifier"; it identifies all table names, field names, and view names (* is also recognized as a SqlIdentifier). Basically all SQL parsing and validation bottoms out at the SqlIdentifier layer, so SqlIdentifier can be considered the most basic structural unit of a SqlNode. SqlIdentifier holds a string list, names, to store the actual values. Because of fully qualified names such as s.id, names can occupy multiple slots: names[0] stores s and names[1] stores id.

3.1.5 SqlBasicCall

SqlBasicCall covers all function calls and operations, such as the AS and CAST keywords as well as the various operators. It has two core member variables, operator and operands, which record the operator and the operands of the call; the operator identifies its type through SqlKind.
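To make these two structures concrete, here is a sketch that inspects the WHERE condition s.id > 1 of the running example (continuing the earlier parsing sketch; all classes and accessors are from the Calcite core API):

// 'select' is the SqlSelect extracted in the earlier sketch.
SqlBasicCall where = (SqlBasicCall) select.getWhere();
System.out.println(where.getOperator().getKind());  // SqlKind.GREATER_THAN

SqlIdentifier left = (SqlIdentifier) where.operand(0);
System.out.println(left.names);                     // [s, id]: names[0]=s, names[1]=id

SqlNumericLiteral right = (SqlNumericLiteral) where.operand(1);
System.out.println(right.toValue());                // "1"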

3.2 JavaCC

Calcite does not reinvent the wheel for lexical and syntactic analysis; instead it uses the mainstream framework JavaCC, combined with the FreeMarker template engine, to generate an LL(k) parser. JavaCC (Java Compiler Compiler) is a parser generator written in Java, and the files it generates are pure Java source files. As long as users write JavaCC source files according to the JavaCC syntax specification and then run the JavaCC plugin for code generation, they obtain a Java-based parser for the target language.

FreeMarker is a template rendering engine. By combining built-in templates with user-defined syntax extensions, it can quickly generate the grammar description file we want.

In Calcite, Parser.jj is the built-in template file, the .ftl file is the user-defined extension template, and config.fmpp declares the data model. Calcite first generates the final Parser.jj file through the fmpp-maven-plugin, and then uses the javacc-maven-plugin to generate the corresponding Java implementation code. The flow is as follows:

[Figure: parser code generation flow]
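The generated parser class is then plugged into Calcite through the parserFactory option. A minimal sketch wiring in Calcite's own generated default parser (a custom parser generated from an extended template, e.g. Flink's FlinkSqlParserImpl, would be substituted the same way):

import org.apache.calcite.sql.parser.SqlParser;
import org.apache.calcite.sql.parser.impl.SqlParserImpl;

// SqlParserImpl is the parser that JavaCC generates from Calcite's own
// Parser.jj template; every generated parser exposes a FACTORY for this.
SqlParser.Config config = SqlParser.configBuilder()
    .setParserFactory(SqlParserImpl.FACTORY)
    .build();
SqlParser parser = SqlParser.create("SELECT 1", config);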

4. Catalog

The catalog stores the metadata and namespaces of the whole SQL; metadata validation is carried out through the catalog component. The key structures in the catalog are as follows:

  • Schema: the namespace of tables and functions, a multi-level (tree) structure. Although the Schema interface stores metadata, it only exposes methods for lookup and resolution; users generally implement this interface to define their own metadata registration logic;
  • SchemaPlus: an extension of the Schema interface that provides additional methods to explicitly add tables. The designers intend users to register metadata through SchemaPlus, but not to implement SchemaPlus themselves; the implementation class provided by Calcite should be used directly;
  • CalciteSchema: wraps a user-defined schema;
  • Table: the most basic unit of metadata, usually obtained through a Schema's getTable;
  • RelDataType: represents the type of the result (row) returned by a scalar or relational expression;
  • RelDataTypeField: represents the structure of a column (field);
  • RelDataTypeSystem: provides limit information about types, such as precision and length;
  • RelDataTypeFactory: an abstract factory that defines methods to instantiate SQL, Java, and collection types; the types it creates implement the RelDataType interface.

These structures can be roughly divided into three categories:

  • Metadata organization and namespaces;
  • Table metadata information;
  • Type system.

The catalog structure of Calcite is complex, but we can understand it from this perspective: it is a set of abstractions that Calcite makes over metadata at different granularities. The most fine-grained is RelDataTypeField, which represents the name and type of a single field. Multiple RelDataTypeFields form a RelDataType, which represents the type of a row or of the result of a scalar expression. Next comes the complete metadata of a table, i.e. Table. Finally, these pieces of metadata need to be organized and stored for management, hence Schema.
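A minimal sketch of these granularities, describing the example source_table as a Table whose row type is assembled from RelDataTypeFields (the class name SourceTable is invented for illustration):

import org.apache.calcite.rel.type.RelDataType;
import org.apache.calcite.rel.type.RelDataTypeFactory;
import org.apache.calcite.schema.impl.AbstractTable;
import org.apache.calcite.sql.type.SqlTypeName;

// Hypothetical table backing the running example's source_table.
public class SourceTable extends AbstractTable {
  @Override public RelDataType getRowType(RelDataTypeFactory typeFactory) {
    // Each add() contributes one RelDataTypeField (name + type);
    // build() assembles them into the row's RelDataType.
    return typeFactory.builder()
        .add("id", SqlTypeName.BIGINT)
        .add("name", SqlTypeName.VARCHAR)
        .add("age", SqlTypeName.INTEGER)
        .build();
  }
}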

5. SQL validator

The validation process provided by Calcite is extremely complex, but in summary it does one main thing: check whether the metadata of each SqlNode is correct, including:

  • Verify that the table names exist;
  • Verify that each selected column exists in the corresponding table and that the matched column name is unique; for example, when joining multiple tables that have fields with the same name, an error is reported if a selected column does not specify its table name;
  • For an INSERT, validate the target columns against the data source, e.g. column count, types, permissions;
  • ……


The validator provided by Calcite is closely related to the catalog mentioned above. Calcite defines a CatalogReader for accessing metadata (table schemas) during validation, and wraps the metadata at runtime in two core structures: SqlValidatorNamespace and SqlValidatorScope.

  • SqlValidatorNamespace: describes the relation returned by an SQL query. An SQL query can be divided into multiple parts, such as the select list, table names, etc., and each part has a corresponding SqlValidatorNamespace;
  • SqlValidatorScope: can be considered the working context of each SqlNode during validation. When an expression is validated, it is resolved through the resolve method of a SqlValidatorScope; on success, the corresponding SqlValidatorNamespace describing the result type is returned.

On this basis, Calcite provides the SqlValidator interface, which contains all the core validation logic, together with the built-in default implementation class SqlValidatorImpl, which is defined as follows:

public class SqlValidatorImpl implements SqlValidatorWithHints {
    // ...
  
  /** Entry point through which validation accesses the catalog metadata. */
  final SqlValidatorCatalogReader catalogReader;
  
  /**
   * Maps {@link SqlNode query node} objects to the {@link SqlValidatorScope}
   * scope created from them.
   */
  protected final Map<SqlNode, SqlValidatorScope> scopes =
      new IdentityHashMap<>();

  /**
   * Maps a {@link SqlSelect} node to the scope used by its WHERE and HAVING
   * clauses.
   */
  private final Map<SqlSelect, SqlValidatorScope> whereScopes =
      new IdentityHashMap<>();

  /**
   * Maps a {@link SqlSelect} node to the scope used by its GROUP BY clause.
   */
  private final Map<SqlSelect, SqlValidatorScope> groupByScopes =
      new IdentityHashMap<>();

  /**
   * Maps a {@link SqlSelect} node to the scope used by its SELECT and HAVING
   * clauses.
   */
  private final Map<SqlSelect, SqlValidatorScope> selectScopes =
      new IdentityHashMap<>();

  /**
   * Maps a {@link SqlSelect} node to the scope used by its ORDER BY clause.
   */
  private final Map<SqlSelect, SqlValidatorScope> orderScopes =
      new IdentityHashMap<>();

  /**
   * Maps a {@link SqlSelect} node that is the argument to a CURSOR
   * constructor to the scope of the result of that select node
   */
  private final Map<SqlSelect, SqlValidatorScope> cursorScopes =
      new IdentityHashMap<>();

  /**
   * The name-resolution scope of a LATERAL TABLE clause.
   */
  private TableScope tableScope = null;

  /**
   * Maps a {@link SqlNode node} to the
   * {@link SqlValidatorNamespace namespace} which describes what columns they
   * contain.
   */
  protected final Map<SqlNode, SqlValidatorNamespace> namespaces =
      new IdentityHashMap<>();
  
  // ...
}

As you can see, SqlValidatorImpl contains many scope maps (SqlNode → SqlValidatorScope) as well as the namespaces map (SqlNode → SqlValidatorNamespace); validation is essentially the process of resolving each SqlValidatorNamespace within its SqlValidatorScope. In addition, SqlValidatorImpl has a member catalogReader, i.e. the SqlValidatorCatalogReader mentioned above, which provides SqlValidatorImpl with its entry point for accessing metadata.
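Rather than assembling a SqlValidatorImpl by hand, the easiest way to drive the parse → validate → convert pipeline is Calcite's Frameworks/Planner facade. A minimal sketch, assuming the example table is registered on the root schema (SourceTable is the hypothetical table sketch from section 4):

import org.apache.calcite.plan.RelOptUtil;
import org.apache.calcite.rel.RelRoot;
import org.apache.calcite.schema.SchemaPlus;
import org.apache.calcite.sql.SqlNode;
import org.apache.calcite.sql.parser.SqlParser;
import org.apache.calcite.tools.FrameworkConfig;
import org.apache.calcite.tools.Frameworks;
import org.apache.calcite.tools.Planner;

public class ValidateDemo {
  public static void main(String[] args) throws Exception {
    // The root schema plays the role of the catalog.
    SchemaPlus rootSchema = Frameworks.createRootSchema(true);
    rootSchema.add("source_table", new SourceTable());

    FrameworkConfig config = Frameworks.newConfigBuilder()
        .defaultSchema(rootSchema)
        // The default parser upper-cases unquoted identifiers,
        // so match names case-insensitively in this sketch.
        .parserConfig(SqlParser.configBuilder().setCaseSensitive(false).build())
        .build();
    Planner planner = Frameworks.getPlanner(config);

    SqlNode parsed = planner.parse("SELECT id, name FROM source_table WHERE id > 1");
    SqlNode validated = planner.validate(parsed);  // resolves names against the catalog
    RelRoot root = planner.rel(validated);         // SqlNode -> RelNode (see section 6)
    System.out.println(RelOptUtil.toString(root.rel));
  }
}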

6. Query optimizer

The query optimizer is the most complex component and involves many concepts. The query optimizer first converts the SqlNode tree into a RelNode tree (SqlToRelConverter), optimizes it with a series of relational algebra optimization rules (RelOptRule), and finally converts it into a physical plan executable by the target engine.

6.1 SqlNode to RelNode

SQL is a DSL based on relational algebra, and the RelNode interface is Calcite's abstract representation of relational algebra: all relational algebra structures implement the RelNode interface.

SqlNode is a node parsed from the perspective of SQL syntax, whereas RelNode is the abstract structure of a relational expression: it represents the logical structure from the perspective of relational algebra and determines, during the subsequent optimization process, how the query is executed. When a SqlNode is first converted into RelNode, the tree is composed of a series of logical nodes (LogicalProject, LogicalJoin, etc.); the optimizer later converts these logical nodes into physical nodes, which have different implementations for different computing and storage engines, such as JdbcJoin, SparkJoin, etc. The following table shows a mapping between SQL, relational algebra, and Calcite structures:

SQL | Relational algebra | Calcite
select | Projection | LogicalProject + RexInputRef + RexLiteral + …
where | Selection | LogicalFilter + RexCall
union | Union | LogicalUnion
inner join without on | Cartesian product | LogicalJoin
inner join with on | Natural join | LogicalJoin + RexCall
as | Rename | RexInputRef

Note: the Calcite column only lists the more common cases rather than a strict one-to-one mapping with the first two columns.
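For intuition, printing the logical plan of the SELECT part of the running example with RelOptUtil.toString would look roughly like this (a hand-written approximation; the exact field indexes and attributes depend on the registered schema):

LogicalProject(id=[$0], name=[$1], age=[$2])
  LogicalFilter(condition=[>($0, 1)])
    LogicalJoin(condition=[=($0, $3)], joinType=[inner])
      LogicalTableScan(table=[[source_table]])
      LogicalTableScan(table=[[dim_table]])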

The result of a simple SQL example processed by the query optimizer is shown below:

[Figure: a simple query and its relational expression tree]

Figure from "Introduction to Calcite"

6.2 RelNode optimization

The key classes and interfaces involved here are as follows:

  • RelOptNode: represents an expression node that can be operated on by the planner;
  • RelNode: relational algebra; the abstract structure of a relational expression, inheriting the RelOptNode interface. As described above, it represents the logical structure from the perspective of relational algebra and determines how the query is executed during optimization; a freshly converted tree consists of logical nodes (LogicalProject, LogicalJoin, etc.) that the optimizer later turns into engine-specific physical nodes (JdbcJoin, SparkJoin, etc.);
  • RexNode: row expression, representing an operation to be performed on a single row, usually contained in a RelNode. For example, the Project class has a member List<RexNode> exps representing the projected fields (the query fields in SQL), and the Filter class has a member RexNode condition representing the concrete filter condition; likewise, a RexNode can act as a join condition, a sort field, and so on;
  • RelTrait: a trait represents a physical characteristic of a RelNode that does not change the execution result. The three main traits are Convention (the calling convention of a single type of data source; every relational expression must run on the same type of data source), RelCollation (the data ordering defined by the relational expression), and RelDistribution (the data distribution characteristics);
  • Convention: a kind of trait, representing a single type of data source;
  • Converter: a RelNode implements this interface to indicate that it can convert a value from one trait to another;
  • RelOptRule: rules used to optimize query plans. Rules fall into two categories: converters, which inherit ConverterRule and convert one convention into another without changing semantics, and transformers, which match query plans and optimize them;
  • RelOptPlanner: there are two kinds of planners: HepPlanner, a heuristic optimizer that traverses the RelNode tree matching each node against the registered rules and applies a rule whenever it matches, and VolcanoPlanner, a cost-based optimizer that in each iteration selects the plan with the lowest cost;
  • RelOptCluster: the runtime context of the planner.

After converting SqlNode into RelNode, we can optimize the RelNode with rules of relational algebra. These "rules" are represented in Calcite as RelOptRule.

RelNode has physical traits, represented by RelTrait. One of the most important is the calling convention (Convention), which can be understood as a protocol specific to a data engine or data source: RelNodes of the same convention can be connected to each other directly, while those of different conventions must be converted through a Converter (matched via a ConverterRule).

Common optimization rules include:

  • Cut out unused fields;
  • Merge projections;
  • Convert subqueries to joins;
  • Reorder joins;
  • Push down projections;
  • Push down filters;
  • ……
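A sketch of applying such rules with the heuristic planner (the two rules chosen here ship with Calcite; the selection is illustrative):

import org.apache.calcite.plan.hep.HepPlanner;
import org.apache.calcite.plan.hep.HepProgram;
import org.apache.calcite.plan.hep.HepProgramBuilder;
import org.apache.calcite.rel.RelNode;
import org.apache.calcite.rel.rules.FilterJoinRule;
import org.apache.calcite.rel.rules.ProjectMergeRule;

// 'logicalPlan' is the RelNode tree obtained from planner.rel(...).
static RelNode optimize(RelNode logicalPlan) {
  HepProgram program = new HepProgramBuilder()
      .addRuleInstance(FilterJoinRule.FILTER_ON_JOIN)  // push filters into joins
      .addRuleInstance(ProjectMergeRule.INSTANCE)      // merge adjacent projections
      .build();
  HepPlanner hepPlanner = new HepPlanner(program);
  hepPlanner.setRoot(logicalPlan);
  return hepPlanner.findBestExp();  // apply rules until no more matches
}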

7. Application scenarios

Thanks to Calcite's pluggable design, many SQL parsing engines, such as Flink's, have been built as secondary development on top of Calcite. This section lists some extension work and ideas based on Calcite.

7.1 Extend SQL syntax parsing

Implement the lexer and parser based on JavaCC. For example, on top of Calcite's native parser templates, Flink customized and extended the parsing of SqlCreateTable and SqlCreateView to support parsing CREATE TABLE ... and CREATE VIEW ..., together with the corresponding extended Java classes.

7.2 Extend the metadata structures used for validation

By extending the Schema, Table, and other interfaces, you can customize when and in what format metadata is injected. For example, Flink establishes the data dependencies of nested views through imperative calls (if view A depends on view B's data, you need to manually call the API to resolve view B first), while some frameworks read metadata in batches and build their own topology to resolve data dependencies. Regarding the metadata format, Flink implements QueryOperationCatalogViewTable based on the Table interface to represent tables and views, and designs a unified TableSchema for it to collect the RelDataType information.
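A minimal sketch of such an extension point: a custom Schema exposing the example tables through getTableMap (DemoSchema is invented for illustration; SourceTable is the table sketch from section 4, reused for both tables only to keep the example short):

import com.google.common.collect.ImmutableMap;
import java.util.Map;
import org.apache.calcite.schema.Table;
import org.apache.calcite.schema.impl.AbstractSchema;

// A real implementation would typically fetch table definitions from an
// external metadata service instead of hard-coding them.
public class DemoSchema extends AbstractSchema {
  @Override protected Map<String, Table> getTableMap() {
    return ImmutableMap.of(
        "source_table", new SourceTable(),
        "dim_table", new SourceTable());
  }
}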

7.3 Extend type resolution

When you encounter complex types that Calcite does not support natively, you can extend type resolution by extending RelDataTypeFactory and the other related type interfaces.

7.4 Extend rule-based optimization

You can define custom rules and register them in the planner by calling the addRuleInstance method of HepProgramBuilder; the planner will then match the custom rules during RelNode optimization and apply them whenever they match.
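A skeleton of a custom rule (MyFilterRule is a hypothetical name; the operand/onMatch pattern is the standard RelOptRule contract), which would be registered via addRuleInstance(MyFilterRule.INSTANCE) exactly like the built-in rules in the earlier HepPlanner sketch:

import org.apache.calcite.plan.RelOptRule;
import org.apache.calcite.plan.RelOptRuleCall;
import org.apache.calcite.rel.logical.LogicalFilter;

public class MyFilterRule extends RelOptRule {
  public static final MyFilterRule INSTANCE = new MyFilterRule();

  private MyFilterRule() {
    // Match any LogicalFilter, whatever its inputs are.
    super(operand(LogicalFilter.class, any()), "MyFilterRule");
  }

  @Override public void onMatch(RelOptRuleCall call) {
    LogicalFilter filter = call.rel(0);
    // Inspect 'filter' and, if a cheaper equivalent exists, emit it:
    // call.transformTo(newNode);
  }
}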

In terms of process complexity, these scenarios rank as SQL syntax parsing > metadata validation > type resolution > rule optimization. In actual extension work, however, I think the order of difficulty is the opposite: although SQL parsing and metadata validation are very complex processes, they are well encapsulated; type resolution has many adaptation points to consider; and rule optimization, which requires a deep grounding in SQL and some theoretical knowledge, is the hardest to extend in practice.

8. References

  1. Calcite concept notes
  2. Detailed explanation of the Apache Calcite processing flow
  3. https://zhuanlan.zhihu.com/p/…
  4. RelNode and RexNode
  5. Introduction to Apache Calcite
  6. Apache Calcite: A Foundational Framework for Optimized Query Processing Over Heterogeneous Data Sources