[work] Presto research



The team needed an OLAP engine. After comparing various open-source systems, we chose Presto. I may write follow-up articles on source-code walkthroughs, secondary development, problems encountered, and tuning.

Our requirement was a unified SQL engine supporting multiple backend components, plus upper-layer features such as resource queues, permission control, and SQL monitoring and management. And of course, the faster the better.

Hive is too old. Hive 3 added a lot (LLAP, Tez engine optimizations, more mature transaction features, etc.), but supporting third-party components such as Elasticsearch is inconvenient: the StorageHandler extension interface it designed is clumsy and rigid. In the end we only use its Metastore. No matter how much Hive is optimized, it seems doomed to be the unlucky baseline that everyone casually claims a 10x performance win over.

The key point is that Hive tables on HDFS (and HBase/Kudu) do not get much support compared with other components, and the metadata design is too heavy: partition and file-block information for entire tables used to be cached and broadcast across the cluster. The 4.x line seems to be improving this, but the main problem remains that it is too slow.

SparkSQL is the industry leader in batch processing and we have used it for several years, but it is not well suited to being the SQL product of a data warehouse. It has no fine-grained permission control (integrating with Sentry/Ranger is not easy, and the project does not do it officially), implements only query-oriented SQL with no extensible permission/role model, and offers no resource isolation. Scheduling is also a single point (to be fair, so is Presto's under normal circumstances, but Presto schedules in a pipelined fashion with tasks in iterator mode, whereas Spark generates all tasks as static sets at once; even a LIMIT 10 creates heavy metadata pressure). All of this makes it a poor fit as the external JDBC endpoint of a large cluster. A DataSource extension interface is provided, but third-party connectors depend on individuals or other teams rather than an official library, so the quality feels hard to guarantee. Fundamentally, Spark is a batch-processing framework, not a database. Even though the SparkSQL community has invested enormous effort in unified Dataset optimization, and a single statement may well win on performance, the overall functional evaluation falls short.
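The difference between pipelined (iterator-style) execution and batch (materialize-everything) execution of a LIMIT can be illustrated with a toy Python sketch. This is purely conceptual, neither Presto nor Spark code:

```python
# Toy contrast between pipelined execution (pull rows one at a time,
# stop as soon as the limit is satisfied) and batch execution
# (materialize the full result set first, then cut it down).

reads = {"pipelined": 0, "batch": 0}

def scan(n_rows, mode):
    """Simulated table scan that counts how many rows it actually emits."""
    for i in range(n_rows):
        reads[mode] += 1
        yield i

def pipelined_limit(rows, limit):
    # Iterator mode: upstream work beyond `limit` rows never happens.
    out = []
    for row in rows:
        out.append(row)
        if len(out) == limit:
            break
    return out

def batch_limit(rows, limit):
    # Batch mode: everything is materialized before the limit applies.
    return list(rows)[:limit]

pipelined_limit(scan(100_000, "pipelined"), 10)  # scan emits only 10 rows
batch_limit(scan(100_000, "batch"), 10)          # scan emits all 100,000
```

The LIMIT 10 case above corresponds to the first call: the pipelined scan stops after 10 rows, while the batch-style scan still reads everything before the limit is applied.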

Others, such as Druid, ClickHouse, Kylin, and Greenplum, are either single-purpose tools or scale poorly, so they were not considered.

Introduction to Presto

  • Memory-based parallel computing: a distributed, interactive SQL query engine (it also supports spilling to disk)
  • Massively parallel processing (MPP) architecture
  • Multi-node pipelined execution; performance no worse than Spark, and 10x+ faster than Hive
  • Composed of one coordinator and multiple workers
  • Supports a large number of components behind a unified SQL entry point; you only need to configure a catalog per source (a key point for us)
  • Rich database features: queues, priorities, permissions, resource isolation, web UI, etc.
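As a sketch of that single configuration step: a data source is registered by dropping a properties file under `etc/catalog/` that names its connector. Host names and ports below are placeholders, and the property names follow the 0.25x-era docs as I recall them, so verify against your version:

```properties
# etc/catalog/hive.properties -- registers a "hive" catalog
connector.name=hive-hadoop2
hive.metastore.uri=thrift://metastore-host:9083

# etc/catalog/es.properties -- registers an "es" catalog (Elasticsearch)
connector.name=elasticsearch
elasticsearch.host=es-host
elasticsearch.port=9200
```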

[the official team has published a book that reads like a best-practices cookbook](https://trino.io/blog/2020/04/11/the-definitive-guide.html)
PS: the connectors supported by the current version, 0.258, are:

    Accumulo Connector // key-value store, similar to HBase
    BigQuery Connector 
    Black Hole Connector // similar to /dev/null
    Cassandra Connector
    Druid Connector
    Elasticsearch connector // key use
    Hive connector // key use
    Iceberg Connector
    JMX Connector // seems to expose only this cluster's own JMX information
    Kafka Connector 
    Kudu Connector
    Local File Connector // could it serve as a local database?
    Memory Connector // not persistent
    MongoDB Connector
    MySQL Connector
    Oracle Connector
    Apache Pinot Connector
    PostgreSQL Connector
    Redis Connector
    Redshift Connector
    SQL Server Connector
    System connector // information about this cluster 
    Thrift Connector
    TPCDS Connector // quickly generates datasets for testing
    TPCH Connector

About Presto and Trino

Trino is a fork started by the original founding team after disagreements with Facebook over project governance. The details of the community split are unclear; in any case, both sides keep releasing and both communities are very active. Two years on, the codebases have diverged so much that merging them is essentially impossible.

I hesitated for a long time between Trino and Presto. After all, Trino is backed by the founding team, and it seems to support more connectors now. I downloaded and browsed both codebases; Trino's code organization and structure are much better. I did not measure the performance difference, but judging from the roadmaps and blogs on both GitHub projects, Presto seems to have the stronger R&D force. (Just my impression; discussion welcome.)

PS: take coordinator HA as an example.
The Trino community proposed it back in 2019, and there is still no release that ships it.

Presto has already implemented it, and also has RaptorX, a hierarchical cache (which brings a solid performance boost under a storage-compute-separation architecture). In addition, Presto plans to add coordinator load balancing, which is not quite the same thing as HA and is in strong demand in large clusters.

[introduction to the hierarchical cache, RaptorX](https://prestodb.io/blog/2021/02/04/raptorx)

Here is an execution diagram of Presto (this maps almost one-to-one onto SparkSQL's execution model):

(figure: Presto execution diagram)

Presto main concepts


Coordinator

The management node of the cluster; a discovery service is built in to track worker-node status.
Externally, it manages connections between the cluster and clients and receives client query requests.
It performs SQL parsing, query-plan generation and optimization, and query task scheduling.
Deployment: generally deployed as a separate node in the cluster.
Communication: uses a RESTful interface to interact with clients and workers.


Worker

The working nodes of the cluster; they execute the decomposed query tasks and process the data.
Deployment: generally, multiple worker nodes are deployed in the cluster.
Communication: uses a RESTful interface to interact with the coordinator and other workers.

Data source connector

Presto accesses many different data sources through connectors.
A connector is the equivalent of a driver for accessing a database.
Each connector implements Presto's SPI interface to provide standardized access to its data source.


Catalog

A catalog contains multiple schemas and accesses a given data source through the specified connector; for example, configure a Hive catalog to access Hive data sources.


Schema

Used to organize tables, similar to a database in MySQL. A catalog plus a schema uniquely determines the set of tables that can be queried.


Table

Similar to a table in a traditional relational database; the mapping from the data source to the table is specified by the connector.
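Putting catalog, schema, and table together: a fully qualified name takes the form catalog.schema.table, so a single statement can even join across data sources. The names below are made up for illustration only:

```sql
-- "hive" and "es" are catalog names from etc/catalog/*.properties;
-- "default" is a schema; "orders" / "user_profile" are hypothetical tables.
SELECT o.user_id, count(*) AS order_cnt
FROM hive.default.orders AS o
JOIN es.default.user_profile AS u
  ON o.user_id = u.user_id
GROUP BY o.user_id;
```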

Metadata structure diagram


Presto query execution model


Statement

Presto supports ANSI-standard SQL statements, which are composed of clauses, expressions, and predicates.
Why does Presto distinguish between a statement and a query? In Presto, a statement simply refers to the textual representation of the SQL entered by the user. When a statement is executed, Presto creates a query along with a query plan, which is then distributed across a series of Presto workers.


Query

When Presto receives a SQL statement, it converts it into a query and creates a query plan: a series of interconnected stages running on Presto workers.
The difference between a statement and a query: the statement is the SQL text given to Presto, while the query instantiates the configuration and components needed to execute that statement. A query encompasses stages, tasks, splits, connectors, and the corresponding data sources.


Stage

When Presto executes a query, it splits the query into multiple stages with a hierarchical relationship. For example, when Presto queries and aggregates 100 million records from Hive, it creates a series of stages to execute the distributed query, plus a root stage that aggregates the output of those stages and delivers the result to the coordinator and on to the user.

The stages of a query form a tree. Every query has a root stage that aggregates the output of all the other stages. A stage is only a logical concept the coordinator uses to model the distributed query plan; stages themselves are not executed on Presto workers.
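This tree hierarchy can be mimicked with a toy Python model (not Presto internals): leaf stages produce partial aggregates over their splits, and the root stage merges them into the final answer.

```python
# Toy model of the stage tree for an aggregation query: leaf stages
# aggregate their own splits; the root stage merges the partial results.

def leaf_stage(split):
    # Partial aggregate computed close to the data.
    return {"count": len(split), "sum": sum(split)}

def root_stage(partials):
    # Root stage: merge partials and produce the final answer.
    total = sum(p["sum"] for p in partials)
    count = sum(p["count"] for p in partials)
    return {"count": count, "avg": total / count}

splits = [[1, 2, 3], [4, 5], [6, 7, 8, 9]]
result = root_stage([leaf_stage(s) for s in splits])  # {'count': 9, 'avg': 5.0}
```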


Task

Presto actually runs work as tasks. A distributed query plan is broken down into a series of stages, and each stage is decomposed into a series of tasks executed in parallel. Each task is further decomposed into one or more parallel drivers, each acting on a series of splits. Every task has corresponding inputs and outputs.


Driver

A task contains one or more drivers. Drivers process data, which the task then aggregates and passes to a task in the downstream stage. A driver is a sequence of operators acting on one split; it is the lowest-level unit of parallelism in the Presto architecture. Every driver has exactly one input and one output.


Operator

An operator represents one operation on a split. An operator reads the data in its split in turn, applies the computation or transformation it represents, and produces output; an example is the operator that reads table files (a table scan).
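The driver/operator model above can be sketched with Python generators (a conceptual toy, not Presto's actual Java operator API): each operator is a generator, and a driver is one chain of operators over a single split.

```python
# Toy driver: a chain of operators (generators) over a single split,
# with one input (the split) and one output (the final iterator).

def table_scan(split):
    # Leaf operator: emit rows from the split.
    yield from split

def filter_op(rows, predicate):
    # Keep only rows matching the predicate.
    for row in rows:
        if predicate(row):
            yield row

def project_op(rows, fn):
    # Transform each surviving row.
    for row in rows:
        yield fn(row)

def run_driver(split):
    pipeline = project_op(
        filter_op(table_scan(split), lambda r: r % 2 == 0),
        lambda r: r * 10,
    )
    return list(pipeline)

run_driver([1, 2, 3, 4, 5])  # -> [20, 40]
```

Because the operators are generators, rows stream through the chain one at a time, which is exactly the iterator-mode pipelining contrasted with batch execution earlier.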


Split

A split is a small slice of the whole data set. Stages at the lowest level of the distributed query plan obtain splits from the data source, while the intermediate, higher-level stages fetch data from other stages.


Exchange

Exchanges transfer data between Presto nodes. A task puts the data it produces into an output buffer, and consumes data from other tasks through an exchange client.
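A minimal sketch of that producer/consumer shape in Python (conceptual only; real exchanges move serialized pages between nodes over HTTP):

```python
# Toy exchange: an upstream task appends pages of rows to its output
# buffer; a downstream task's exchange client drains them.
from collections import deque

class OutputBuffer:
    def __init__(self):
        self._pages = deque()

    def put(self, page):
        self._pages.append(page)

    def get(self):
        # Return the next page, or None when the buffer is drained.
        return self._pages.popleft() if self._pages else None

buf = OutputBuffer()
for page in ([1, 2], [3, 4], [5]):   # upstream task producing pages
    buf.put(page)

consumed = []                        # downstream exchange client
while (page := buf.get()) is not None:
    consumed.extend(page)
# consumed == [1, 2, 3, 4, 5]
```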