At present, there are many SQL engines built on Hadoop. They fall into two broad categories:
(1) Converting SQL to MapReduce. The typical representative is Apache Hive. Systems of this kind scale well and tolerate faults well, but their performance is low. To make up for the shortcomings of SQL on MapReduce, Google proposed Tenzing (see the references). Unlike Hive, Tenzing draws on the strengths of both MapReduce and databases. First, it optimizes traditional MapReduce to improve performance (for example, map tasks need not write to disk, and reduce tasks need not sort). Building on MapReduce also gives Tenzing good scalability and fault tolerance:
“Thanks to MapReduce, Tenzing scales to thousands of cores and petabytes of data on cheap, unreliable hardware. We worked closely with the MapReduce team to implement and take advantage of MapReduce optimizations.”
Second, it borrows from traditional databases and embeds a cost-based optimizer to optimize the SQL query plan.
(2) Borrowing from distributed databases. Typical representatives are Google Dremel, Apache Drill and Cloudera Impala. Systems of this kind offer high performance (compared with Hive and similar systems) but weaker scalability (both in cluster size and in the range of SQL they support) and weaker fault tolerance. As the Dremel paper puts it:
“Dremel is not intended as a replacement for MR and is often used in conjunction with it to analyze outputs of MR pipelines or rapidly prototype larger computations.”
In other words, Dremel is not meant to replace MR but to make up for its shortcomings. It is usually used to analyze data produced by MR; that data is small in volume, and processing it places low demands on SQL expressiveness and framework fault tolerance.
Apache Tajo (see the references for details, including the Tajo slides and the Tajo paper) is an open source distributed data warehouse built on YARN by the Database Laboratory of Korea University, and it is a second-level Apache project. Tajo’s design is similar to Tenzing’s: it draws on the strengths of both MapReduce and databases, so it keeps Hive’s good scalability and fault tolerance while delivering much higher performance than Hive.
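To make the first category concrete, here is a minimal sketch of how a query such as `SELECT dept, COUNT(*) FROM emp GROUP BY dept` can be expressed as a map phase, a shuffle, and a reduce phase. The table, the rows, and all function names are assumptions made up for illustration; this is not how Hive or Tenzing is actually implemented, only the general shape of the translation.

```python
from collections import defaultdict

# Hypothetical input rows for a made-up table emp(name, dept).
ROWS = [
    ("alice", "eng"),
    ("bob", "eng"),
    ("carol", "ops"),
]


def map_phase(rows):
    """Emit (dept, 1) for every row, as a map task would."""
    for _name, dept in rows:
        yield dept, 1


def shuffle(pairs):
    """Group values by key, standing in for the shuffle step."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups


def reduce_phase(groups):
    """Sum the counts for each key, as a reduce task would."""
    return {key: sum(values) for key, values in groups.items()}


def count_by_dept(rows):
    """Run the three phases in order for the GROUP BY / COUNT(*) query."""
    return reduce_phase(shuffle(map_phase(rows)))


if __name__ == "__main__":
    print(count_by_dept(ROWS))
```

In a real MapReduce job each phase runs in parallel across machines and intermediate results are written out between phases, which is where much of the overhead discussed above comes from.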
2. Tajo design architecture
Tajo adopts a master-worker architecture, as follows:
(1) TajoMaster: provides the query service to clients and manages the QueryMasters.
(2) QueryMaster: responsible for parsing, optimizing and executing a single query. It works with multiple TaskRunner workers to complete the computation for that query.
As shown in the figure below, Tajo uses traditional database techniques for its SQL layer, including SQL parsing, query plan generation, query plan optimization and query execution. Unlike a traditional database, however, Tajo borrows the design ideas of MapReduce when executing a query plan: it transforms the plan into a series of tasks, so executing a query plan amounts to executing those tasks. Each task is a unit of computation. Like a map task or a reduce task, it can be re-executed and it reports progress. In this way, Tajo can directly reuse MapReduce mechanisms such as fault tolerance and speculative execution. In addition, Tajo uses YARN for resource management.
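The idea that a query plan becomes a series of re-executable tasks with progress reporting can be sketched as follows. This is only an illustration under stated assumptions: `Task`, `run_with_retry`, `make_flaky_sum` and `execute_plan` are hypothetical names, failures are simulated, and none of this reflects Tajo’s actual classes or scheduling logic.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass
class Task:
    """A hypothetical unit of computation that is safe to run again."""
    task_id: str
    work: Callable[[], int]
    attempts: int = 0


def run_with_retry(task, max_attempts=3):
    """Re-run a failed task until it succeeds or attempts run out."""
    last_error = None
    while task.attempts < max_attempts:
        task.attempts += 1
        try:
            return task.work()
        except RuntimeError as err:
            last_error = err
    raise RuntimeError(
        f"task {task.task_id} failed after {max_attempts} attempts"
    ) from last_error


def make_flaky_sum(values, failures):
    """Return work that fails `failures` times, then sums `values`."""
    state = {"remaining": failures}

    def work():
        if state["remaining"] > 0:
            state["remaining"] -= 1
            raise RuntimeError("simulated worker failure")
        return sum(values)

    return work


def execute_plan(tasks):
    """Run every task of a plan and report progress after each one."""
    results = {}
    for done, task in enumerate(tasks, start=1):
        results[task.task_id] = run_with_retry(task)
        print(f"progress: {done}/{len(tasks)} tasks complete")
    return results


if __name__ == "__main__":
    plan = [
        Task("scan-0", make_flaky_sum([1, 2, 3], failures=1)),
        Task("scan-1", make_flaky_sum([4, 5], failures=0)),
    ]
    print(execute_plan(plan))
```

Because each task is self-contained and repeatable, a failure costs only one task’s worth of work rather than the whole query, which is the property that lets a task-based engine inherit MapReduce-style fault tolerance.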
In a previous blog post, “Apache Tez: a computing framework supporting DAG jobs running on YARN”, I introduced Tez and mentioned Hive + Tez. Hive optimized by Tez is a very promising project. In addition, the Tajo team has mentioned that Tez may serve as Tajo’s underlying computing framework in the future:
“Besides, Tez has some overlapping functions with Tajo. However, Tez is in the pre-alpha stage and may be a prototype. When Tez becomes feasible, Tajo could use Tez as an underlying framework according to the applicability. However, Tajo will still use its row/native columnar execution engine and its optimizer. Tajo may be potentially the first application of Tez.”
It is systems like Tenzing or Tajo that can really replace Hive, not systems like Dremel or Impala. The latter are far worse than Hive / Tenzing / Tajo in scalability, SQL expressiveness (mainly because of their nested storage model) and fault tolerance. As the Dremel paper describes, Dremel is usually used in combination with MR; its design motivation is not to replace MR but to make computation more efficient in certain scenarios. In addition, Dremel and Impala are computing systems that need computing resources, yet they are not integrated with YARN, the rapidly developing resource management system. This means that if you use a system like Impala, you have to build a separate dedicated cluster and cannot share resources.
Even if Impala matures, as long as Hive’s replacements (such as Tajo) are not mature, most companies will for a long time still rely mainly on Hive for big data processing (this is where Hortonworks’ Hive + Tez will be useful), while Impala will only be used to further process Hive’s output or for particular kinds of applications (after all, systems of this kind have limited SQL expressiveness and poor fault tolerance and scalability).
As for Tajo, its development activity is currently very low. Only a few people in the Database Laboratory of Korea University are working on it, and it is still a long way from being usable. However, it has taken the first step: it has become an Apache project so that more people can participate.
- Tajo’s slides
- Tajo: A Distributed Data Warehouse System on Large Clusters
- Tenzing: A SQL Implementation On The MapReduce Framework
- Dremel: Interactive Analysis of Web-Scale Datasets