Introduction to the technical architecture of Apache Trafodion

Time: 2019-10-21

About Trafodion

Trafodion is a fully open-source and free relational database built on Hadoop/HBase. It fully supports ANSI SQL and provides ACID transaction guarantees. Unlike a traditional relational database, Trafodion achieves high scalability by exploiting the horizontal scaling of the underlying Hadoop stack. Traditional databases such as MySQL struggle once data volumes reach the petabyte level. Trafodion, on the other hand, can use HBase's scalability to grow computing and storage capacity across ordinary Linux servers and support big data applications.

For example, as their data volume keeps growing, MySQL users often have to resort to front-end and back-end caching, database and table sharding, read-write splitting, and similar techniques. These techniques bring many drawbacks of their own. For example, in a sharded architecture, join operations cannot be performed across shards. After adopting such complex techniques, the system structure becomes complicated and the cost of maintenance and development rises. This is a problem many customers face.

From the perspective of application development, Trafodion works just like MySQL: it is also a relational database, and its basic functionality is the same. So a classic LAMP web application can just as easily be built as LATP (Linux, Apache, Trafodion, PHP). With Trafodion, when the business expands, the growing data volume can be handled by adding nodes. The application does not need any changes and does not need to consider complex techniques such as sharding or read-write splitting. This greatly reduces system complexity.

This is just one possible application of Trafodion, which is also well suited as a real-time big data analysis platform, because it supports not only real-time analysis but also real-time data ingestion, such as tens of thousands of random inserts per second. This is a necessary capability for building real-time analysis. Stinger or Impala can provide real-time queries, but cannot support real-time data insertion.

Take real-time traffic analysis as an example. With technologies such as Stinger or Impala, a query may complete within a minute, but data can only be loaded periodically. If loading happens once an hour, the sample being analyzed is an hour old and the analysis results lose their timeliness; by then, users may already have been stuck in traffic for an hour.

Readers interested in Trafodion usage scenarios can refer to other articles in this series. This article briefly introduces Trafodion's technical architecture to help readers understand the basic principles of its internal operation.

Readers can also refer to https://wiki.trafodion.org/wi….

Overall architecture

Trafodion’s architecture can be divided into three layers: the ODBC access layer, the SQL compilation and execution layer, and the data access and storage layer. The overall structure is as follows:

(Figure: overall architecture of Trafodion)

Client applications access Trafodion through JDBC/ODBC. Trafodion’s access layer is responsible for client connections. The access layer allocates a master executor for each client connection, and this master is responsible for executing all query requests on that connection and returning results. For a simple query, the master process itself acts as the SQL execution layer; for a complex query that accesses large amounts of data and performs complex operations, the master starts a set of ESP (Executor Server Process) processes for large-scale parallel execution. ESP processes can stay resident in memory to avoid startup cost, but an ESP that is idle for a long time exits and releases its resources. Each ESP returns its execution results to the master, which combines them and returns the final result to the client. When the master or an ESP needs to access the data layer, it performs transaction management through DTM (the Distributed Transaction Manager); under DTM’s control, it calls the HBase client API to read and write data. The following sections describe each layer in more detail.

Trafodion’s access layer

The main components of the access layer are the DCS master and mxosrvr. The DCS master process runs on a single node of the Trafodion cluster and listens for client connection requests. After receiving a request, the DCS master selects an mxosrvr on some node of the cluster as the client’s execution agent, based on the cluster’s workload balance. The DCS master returns the selected mxosrvr’s information to the client, and the client then connects directly to that mxosrvr. From then on, all of the client’s requests are handled by that mxosrvr. This is similar to Oracle’s dedicated server mode.

When multiple clients request connections, the DCS master spreads the clients evenly across different mxosrvr processes, making balanced use of every compute node in the cluster. In addition, each client has a dedicated mxosrvr responsible for executing its subsequent requests, ensuring fast responses to client queries. Some database systems have only a single ODBC access point, and under high concurrency requests queue up. With the model above, each client is served exclusively by one access point, and these access points are evenly distributed across the nodes of the cluster, so the capability of every compute node can be fully exploited.

To reduce latency, Trafodion pre-starts a certain number of mxosrvr processes on each node. This way, a client connection request can be handled without starting a new mxosrvr process. However, Trafodion does not pre-start a large number of mxosrvrs, to avoid wasting resources when there are few connection requests. When the number of client requests exceeds the number of pre-started mxosrvr processes, the DCS master starts new mxosrvrs for the new connection requests, supporting highly concurrent client connections.

The DCS master is the single access point for all clients, so Trafodion provides HA protection for it. When the DCS master fails or its node crashes, Trafodion restarts a new DCS master on another healthy node in the cluster and uses floating IP technology to ensure that clients can continue to connect. The whole process is completely transparent to clients.

Trafodion’s HA mechanism is quite complex and deserves a detailed introduction in a separate article; it is not described further here.

SQL compilation and execution layer

Once a client connection is accepted, each ODBC client has a dedicated mxosrvr. This mxosrvr is the master process responsible for executing the user’s queries. The execution of a user query proceeds roughly as follows:

First, mxosrvr invokes the compiler module to compile and optimize the SQL statement. Trafodion has a very mature SQL compiler. After 20 years of continuous enhancement and improvement, it has developed a powerful cost-based optimizer that can generate the best execution plan for a user’s SQL, such as the best join order. In addition, the compiler has an execution plan cache: if the plan for a statement is already in the cache, it is returned immediately, saving the compilation cost.

The execution plan tells the master how to execute the user query. For a simple query, the plan can be completed by the master alone. For a complex query, the master starts multiple ESP processes according to the plan and executes the query in parallel. Trafodion’s executor is a parallel processing engine with an MPP architecture; most of its execution operators support parallelism, such as parallel joins and parallel aggregation.

Trafodion compiler

The main responsibility of the Trafodion compiler is to turn SQL text into an optimal execution plan. It consists of the following parts:

Parser: the parser uses Bison to parse SQL text and generate a syntax tree. The parser is also responsible for maintaining the execution plan cache; if the input SQL text already hits the cache at this step, the cached execution plan is returned directly.

Binder: the binder further analyzes the syntax tree, similar to the semantic analysis in a conventional compiler, and performs additional checks on syntactically valid SQL, for example whether the referenced tables exist and whether column data types match. The binder also maintains the execution plan cache.

Normalizer: the normalizer performs logical optimization on the syntax tree produced by the binder. It implements traditional rule-based optimizations, such as pushing down predicates, rewriting subqueries as semi-joins, and converting DISTINCT into GROUP BY; a small illustration follows.
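For instance, here is the kind of logically equivalent rewrite the normalizer performs (a minimal sketch; the table and column names are hypothetical):

select distinct region from sales.item_sales;

-- logically equivalent form after the DISTINCT-to-GROUP-BY rewrite:
select region from sales.item_sales group by region;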

Analyzer: the analyzer augments the syntax tree to help the optimizer decide whether certain rules can be applied. For example, there may be several ways to access an underlying data partition: directly through the base table, or through an index. The analyzer collects the indexes of a table and adds them to the syntax tree for the optimizer to choose from.

Optimizer: arguably one of the core technologies Trafodion is proudest of. The optimizer adopts the Cascades framework and is cost-based. The Cascades framework is also very easy to extend: developers can add new rules to introduce new optimizations. In essence, the optimizer is a search over a problem space. For the same query, rules can generate many equivalent execution plans; for example, a simple rule such as A JOIN B => B JOIN A can be applied to produce two different equivalent plans.

The optimizer applies various rules to the syntax tree, generating different execution plans that form a search space. Within this search space it compares the cost of each plan to find the best one. Because the number of rules is large, the number of equivalent plans grows exponentially, producing a huge search space, so exhaustively comparing plans one by one is not realistic. A traditional optimizer framework based on dynamic programming works bottom-up, which makes it hard to prune the search space. Cascades, by contrast, adopts a top-down search strategy and can easily use a branch-and-bound algorithm to prune branches, that is, to stop optimizing a branch early. For example, if the cost of a branch already exceeds the best total cost found so far, that branch is not searched further.

Cascades also maintains a memo data structure that remembers branches already searched, which further improves search efficiency.

In addition, over many years of practice the Trafodion optimizer has accumulated many heuristic rules that further reduce the search space.

Finally, the optimizer supports a multi-pass mode. For a simple query, it first enables very few rules and limits the search space to a very small range, so an optimal plan can be found efficiently. For a complex query, it enters a second pass, enables all rules, and searches further for a better execution plan.

Pre-code generator: after the optimizer selects the optimal execution plan, and before the physical plan is generated, the pre-code generator applies some physical optimization strategies such as constant folding. For example, given the WHERE condition a = 5 AND b = a, the predicate b = a can be replaced by b = 5; a sketch follows.
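A minimal sketch of this constant propagation (table and column names are hypothetical):

select * from sales.item_sales where a = 5 and b = a;

-- after constant folding and propagation, effectively executed as:
select * from sales.item_sales where a = 5 and b = 5;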

Generator: finally, the generator translates the plan into a physical execution plan that the Trafodion executor can run. An important step here is the optimization of scalar expressions, that is, expressions that evaluate to a scalar value, such as a + b + c. Trafodion uses LLVM to compile most scalar expressions into native machine code at runtime, further improving execution speed. This is similar to JIT compilation, where parts of Java bytecode are compiled into machine instructions to speed up Java programs.

Cost module: the Trafodion compiler also has a long-tuned and calibrated cost module to estimate the cost of each SQL operator. Cost calculation needs to understand the distribution of the data stored in a table, which is supported by histograms computed by scanning and sampling the table data. The cost module obtains the data distribution from the histograms and computes cardinalities. It also considers CPU, memory consumption, message communication, disk I/O, and other factors to compute a cost vector for each SQL operator, providing a relatively accurate cost estimate.

These components work together as described above. After a SQL statement passes through the parser and the normalizer, it is fed to the optimizer for cost-based optimization; the cost estimation module derives the data distribution from histograms, estimates costs according to the characteristics of each operator, and feeds those costs to the optimizer. Based on these inputs, the optimizer finally generates an optimal execution plan.

Trafodion executor

Trafodion’s executor is a parallel executor with an MPP architecture. It works in a data-driven fashion: as soon as data is ready it can be returned to the user, without waiting for the entire query to finish, which improves response time. The executor is composed of different SQL operators, and data flows between operators through IPC without saving intermediate results to disk. If intermediate data grows too large to fit in RAM, an operator overflows the data to disk, so Trafodion’s query execution is not limited by physical memory size.

Concurrent execution

The biggest advantage of the Trafodion executor is its excellent parallelism. Most SQL operators can execute in parallel, including GROUP BY, join, and insert.

Here is a small example of how Trafodion executes a simple SUM(col1) aggregation in parallel: the master starts an ESP process on each node of the cluster, and each ESP performs the SUM aggregation over the data partitions stored on its node. The ESPs execute concurrently, and the results are sent back to the master for final combination. For aggregation, Trafodion can also push the operation down to the data access layer: instead of returning every row of a partition to the ESP to be counted one by one, the underlying data access layer performs the aggregation and sends only the aggregate result to the ESP, which returns it to the master.
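As a minimal sketch (the table name is hypothetical), the query is written exactly as on a single-node database, and the parallel plan the optimizer chooses, including any ESP layers, can be inspected with Trafodion’s EXPLAIN statement:

select sum(col1) from sales.item_sales;

-- inspect the chosen (possibly parallel) plan:
explain select sum(col1) from sales.item_sales;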

Now consider Trafodion’s joins. Trafodion supports all join types: inner join, outer join, non-equijoin, semi-join, full join, and so on. For implementation, nested-loop join, merge join, and hash join are all supported, and every join algorithm can execute in parallel. Trafodion supports multiple parallel join methods, and the optimizer selects the best one.

First, we introduce the two most familiar parallel join algorithms: broadcast and repartition.

Broadcast parallel join (hash join)

In a broadcast join, one table is relatively small and fits entirely in the memory of a single node. In this case, Trafodion broadcasts the small table to all nodes. This parallel execution method is used for hash joins: the ESP on each node loads the small table into memory and builds a hash table, then reads the large table’s partitions on that node in order and performs the hash join.
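As a hypothetical example (table and column names are made up), a query like the following could use a broadcast hash join when dim_store is small enough to fit in memory:

-- each ESP builds an in-memory hash table from the broadcast copy of
-- dim_store and probes it with its local partitions of sales_fact:
select s.store_name, sum(f.amount)
from sales_fact f join dim_store s on f.store_id = s.store_id
group by s.store_name;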

Repartition parallel join

In a repartition join, both tables are too large to fit in a single node’s memory. In this case, the execution plan generated by the optimizer automatically uses two layers of ESPs. After the first layer reads the data, it repartitions the rows of both join tables by the join column, so that rows with the same join-column value are collected into the same second-layer ESP, which performs the join. All the second-layer ESPs then return their join results to the master for combination.

These two methods are often used in Hadoop applications, where they are known as map-side join and reduce-side join. Both require considerable network and memory overhead. When possible, the Trafodion optimizer can intelligently select the following parallel join methods instead:

Matching partitions join

If the two tables participating in the join are both partitioned on the join column, a local join can be performed directly in the ESP on each node, because no data from other nodes is needed. This is the ideal situation; a sketch follows.
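A minimal hypothetical sketch: both tables below are keyed, and therefore partitioned, on the leading column customer_id, so a join on customer_id can proceed without moving data between nodes.

create table customers
  ( customer_id int not null primary key,
    name varchar(40) );

create table orders
  ( customer_id int not null,
    order_id int not null,
    amount numeric(10,2),
    primary key (customer_id, order_id) );

-- each ESP joins its local partitions of the two tables directly:
select c.name, sum(o.amount)
from customers c join orders o on c.customer_id = o.customer_id
group by c.name;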

Inner child parallel access (for nested-loop join)

This method applies only to nested-loop joins. Suppose TBLA is the outer table and TBLB the inner table, and TBLA has two partitions, so two ESPs are started. ESP1 reads rows one by one from TBLA partition 1 and, for each row, reads the matching rows from TBLB to perform the join; ESP2 does the same for partition 2. This type of join uses less memory than the broadcast method, although multiple ESPs may contend when reading the inner table. It also supports non-equijoins.

Trafodion’s MPP executor also includes many other advanced technologies, such as HP’s patented MDAM, adaptive segmentation, and Skew Buster, which can significantly speed up query execution and reduce latency, achieving sub-second real-time response. For reasons of space, MDAM and the other technologies are not described here; the Trafodion team will publish dedicated articles introducing these patented technologies separately.

Data access layer

When the executor reads and writes the underlying database tables, it calls the services of the data access layer. Trafodion’s data is stored in HBase tables. HBase itself supports random reads and writes but does not support ACID transactions, so the data access layer must cooperate with DTM (the Distributed Transaction Manager) to provide transaction-protected reads and writes. Transactions are covered in more detail in a later section.

DTM wraps the HBase API and adds the necessary transaction support; the rest of the read/write logic is the same as native HBase. Therefore, transactions aside, the data access layer is a standard HBase client, accessing HBase through the HBase client API. HBase is the core of Trafodion’s data access and storage layer, and it is also the most important difference between Trafodion and a traditional database. With HBase, Trafodion provides excellent horizontal scalability and strong reliability, which traditional databases lack.

Trafodion supports three kinds of underlying tables: Trafodion tables, Hive tables, and native HBase tables. The data access layer is responsible for access to these three storage types.

Trafodion table

A Trafodion table is a database table created directly by a user with Trafodion’s DDL statements. Underneath it is an HBase table, so some mapping and encoding is required from the Trafodion table to the HBase table.

Mapping

That is, how a Trafodion database table is mapped to an HBase table. Consider the following Trafodion table created by DDL:

create table sales.item (item_id int not null,
                         item_name char(10),
                         primary key (item_id));

The first question is how the relational table name maps to an HBase table name. The mapping rule is very simple: a Trafodion table is stored in HBase under the name TRAFODION.<schema-name>.<table-name>. The item table in the example is therefore mapped to the HBase table TRAFODION.SALES.ITEM.

Next is how the columns of a Trafodion table map to HBase’s storage model. An HBase table has column families, each column family can have an arbitrary number of column qualifiers, and each row has a rowkey and a timestamp. These four dimensions define a data cell. So how does Trafodion’s two-dimensional table map onto a storage model like HBase’s?

Trafodion combines the table’s primary key columns into HBase’s rowkey. Each column is mapped to an HBase column qualifier, and the timestamp is used for transaction management. In the current release, all column data is stored in the same column family. Support for multiple column families is already on Trafodion’s roadmap, so this mapping will change in the future.

Encoding

HBase has no data types, while Trafodion tables support the usual SQL data types. A CHAR column is stored as a string; for example, the character ‘A’ is encoded as its ASCII code 0x41. If the SQL data type is INTEGER, Trafodion writes the binary value directly when storing into HBase: the integer 1 is stored as the four bytes 0x00, 0x00, 0x00, 0x01, and the corresponding 8-byte long type occupies 8 bytes.

Trafodion handles type encoding and decoding automatically; applications do not need to do it themselves.

Data partitioning

HBase automatically partitions data through region splits, but in some cases, such as sequential inserts of time-series data, a large number of reads and writes concentrate on a single region, leaving one region server more heavily loaded than the others. Trafodion therefore supports salting: when creating a table, if the SALT keyword is specified, Trafodion automatically adds a hash prefix to the rowkey and pre-splits the table, ensuring that data is evenly distributed across the cluster. Users can also omit the SALT keyword and rely on the underlying HBase to partition the data automatically.
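A minimal sketch of a salted table (table and column names are hypothetical), using Trafodion’s SALT clause:

create table sensor_data
  ( sensor_id int not null,
    ts timestamp not null,
    reading double precision,
    primary key (sensor_id, ts) )
  salt using 8 partitions;

-- the table is pre-split into 8 regions; the hash prefix added to the
-- rowkey spreads otherwise-sequential inserts across all of them.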

Accessing native HBase tables

Trafodion can also access native HBase tables directly, providing two access modes: cell-per-row and rowwise-per-row.

When an HBase table is accessed in cell-per-row mode, each HBase cell becomes one row in the SQL result set. When accessed in rowwise-per-row mode, each HBase row becomes one row in the SQL result set.

Suppose table1 has two rows with two cells each: [(row1, CF1:col1, v1), (row1, CF1:col2, v2), (row2, CF1:col1, d1), (row2, CF1:col2, d2)].

Cell-per-row access:

select * from hbase."_CELL_"."table1";

This returns four rows of data:

value
(row1, CF1:Col1, v1)
(row1, CF1:Col2, v2)
(row2, CF1:Col1, d1)
(row2, CF1:Col2, d2)

Rowwise-per-row access:

select * from hbase."_ROW_"."table1";

This returns two rows of data:

rowkey   value
row1     CF1:Col1:v1  CF1:Col2:v2
row2     CF1:Col1:d1  CF1:Col2:d2

Accessing native Hive tables

Trafodion can also directly access native Hive tables. Using the special schema "hive", users can query them directly with SQL statements, for example:

select * from hive.hive.table1;

The SQL engine recognizes the special schema "hive.hive", reads table1’s metadata from Hive’s Metastore, and then accesses the Hive table data on HDFS directly through libhdfs. DTM is therefore bypassed, and Trafodion does not provide transaction protection for access to native Hive tables.

About transactions

The name Trafodion is Welsh for “transactions”, so transaction processing is a very important aspect of Trafodion. A transaction is a group of queries treated as a unit: it consists of several operations, starting with BEGIN and ending with COMMIT or ABORT.
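A minimal sketch of an explicit transaction, reusing the sales.item table from the mapping example above (the inserted values are made up; Trafodion spells these statements BEGIN WORK, COMMIT WORK, and ROLLBACK WORK):

begin work;
insert into sales.item values (1, 'widget');
update sales.item set item_name = 'gadget' where item_id = 1;
commit work;
-- or, to abandon the changes instead: rollback work;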

Trafodion uses a two-phase commit protocol to guarantee the integrity of distributed transactions. Every node runs TM processes, and all TM processes are peers, which avoids the scalability and single-point-of-failure problems of a single transaction manager. Under high concurrency, active transactions are managed separately by the TMs on different nodes, providing high scalability.

Native HBase only guarantees single-row ACID transactions. Trafodion developed its current transactions-on-HBase mechanism based on the open-source project hbase-trx (https://github.com/hbase-trx/…), providing ACID guarantees across rows and tables. hbase-trx uses an MVCC mechanism to provide the snapshot isolation level. The original hbase-trx only supports HBase 0.94 and takes an invasive approach that heavily modifies HBase’s base code. The Trafodion team absorbed the basic ideas of hbase-trx, reimplemented it using HBase coprocessors, and supports HBase 0.98. The log implementation was also improved to keep data safe under various failure scenarios.

At present, the Trafodion DTM team is working with the Institute of Computing Technology of the Chinese Academy of Sciences on a new transaction-on-HBase algorithm, stateful stateless concurrency control (SSCC). For the principles of SSCC, readers can refer to the open-source project Domino: https://github.com/domino-suc… It is expected to be available in Trafodion R1.2. SSCC provides an isolation level higher than snapshot isolation and supports stateless write operations very efficiently, providing a higher degree of concurrency. Stateless writes are very common in web applications, so with this mechanism Trafodion can provide strong support for such applications.

Summary

Trafodion is a large, complex system, and a short article cannot fully explain its internal workings. I only hope to give readers a general picture with the simplest possible description. As an open-source project, Trafodion welcomes everyone to study the source code and improve it together.

Through this article, I hope that readers will agree with the following key points:

  • Trafodion has a mature SQL compiler with cost-based optimization
  • Trafodion has an advanced MPP parallel execution engine
  • Trafodion has an innovative transaction implementation
  • Trafodion has a mature ODBC/JDBC access layer

Trafodion is built on HBase and inherits all of HBase’s advantages, providing users with excellent horizontal scalability.

Many technical topics are not covered in this article, such as Trafodion’s HA implementation, security system, and NoSQL support. The Trafodion team will strive to improve the documentation. You are also welcome to download the Trafodion source code to use and study it, and to contribute your own understanding and analysis.

Author: Liu Ming, chief architect of Yijingjie
