Interpretation of Presto / Trino authoritative guidelines and official design documents

Date: 2022-01-07

Official documentation: https://trino.io/docs/current/
Reference: the book Trino: The Definitive Guide
Open-source community blog: https://blog.starburstdata.com/
Trino blog: https://trino.io/blog/
The book's authors are the three founders of Presto, so these two sources, plus the 2019 Presto paper (see my other post), are the most authoritative Presto technical documents. After reading them, however, I found them not detailed enough in places; more technical detail appears in the community blog posts, so I have added that information in the corresponding sections to give readers the most up-to-date understanding.
Trino is Presto under another name; the historical reasons are not covered here. For convenience of narration, it is referred to as Presto throughout.
This article was written against Trino version 356, on May 3, 2021.
Presto's basic terms, concepts, installation and usage are not repeated here; there are plenty of documents, tutorials and posts on the Chinese-language internet to consult step by step. The technical depth of this article sits between user and contributor: it does not go into the source code, only into the ideas and trade-offs of the software design. If the opportunity arises later, I will read the source code and update this post.

Chapter 4 Presto's architecture

Presto uses a typical master-slave (coordinator-worker) architecture. Its overall composition is shown below.

(figure: overview of Presto's coordinator-worker architecture)

4.3 node discovery service

When a worker node starts, it registers with the discovery service on the coordinator, and then sends periodic heartbeats to the coordinator to prove it is alive.
The node discovery service is dedicated to tracking live nodes and maintaining the list of worker nodes. It is embedded in Presto itself, so worker nodes can communicate directly with the coordinator's HTTP service.

4.4 connector based architecture

Connectors connect Presto to external data sources. As long as the external data can be represented as tables, rows and columns with data types Presto supports, a connector can be written for that data source.
Presto provides an SPI (service provider interface) for implementing connectors; any connector that implements the SPI can bridge the Presto engine and the external data. The SPI covers three parts:

  • Operations that fetch metadata for tables / views / schemas;
  • Operations that produce the logical units of data partitioning (splits), on the basis of which Presto can read and write in parallel;
  • Operations that convert between the source data format and Presto's in-memory data format.
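As a rough illustration (hypothetical Python names; Trino's real SPI is a set of Java interfaces), the three parts might look like:

```python
from abc import ABC, abstractmethod

# Hypothetical Python rendering of the three SPI parts described above.
# Trino's real SPI is a set of Java interfaces with different names.

class ConnectorMetadata(ABC):
    """Part 1: expose metadata for tables / views / schemas."""
    @abstractmethod
    def list_tables(self, schema): ...
    @abstractmethod
    def get_columns(self, table): ...

class SplitManager(ABC):
    """Part 2: produce the logical units (splits) read in parallel."""
    @abstractmethod
    def get_splits(self, table): ...

class PageSource(ABC):
    """Part 3: convert source data into the engine's in-memory pages."""
    @abstractmethod
    def next_page(self): ...

class CsvMetadata(ConnectorMetadata):
    """Toy connector: a fixed, single-table 'catalog'."""
    def list_tables(self, schema):
        return ["orders"]
    def get_columns(self, table):
        return {"orderkey": "bigint", "totalprice": "double"}
```

A real connector implements all three parts; the engine calls them at the corresponding phases of planning and execution.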

Coordinator and worker need different interfaces provided by SPI, as shown in the following figure:

(figure: SPI interfaces used by the coordinator and the workers)

Presto loads these connectors dynamically as plug-ins. The plug-in architecture is used widely in Presto, for example for event listeners, data types, functions, and access control.

4.7 query execution model

(figure: query execution model)

Before the physical plan is generated, several SPI interfaces assist in formulating the plan:

  • The metadata SPI performs semantic validation, type checking, security checks, etc. during parsing;
  • The data-statistics SPI provides table statistics while the plan is being formulated, on which cost-based query optimization is based;
  • The data-location SPI provides data locations when the physical plan is made, so the scheduler can assign splits.

The data source produces data in the form of pages: batches of rows stored in columnar format.
A task is, in effect, a stage running against specific splits on a specific worker node. When a task is created, a driver is initialized for each split; each driver is an instance of the operator pipeline and is responsible for processing the data of its split, so one task may create multiple drivers.
An operator is the basic unit of the pipeline, representing one step of computation; common operators include table scan, filter, join, and aggregation.
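The page/driver/operator relationship can be sketched as follows; this is a toy model, not Trino's actual API. A page is a batch of rows stored column-wise, and a driver runs one pipeline instance over one split:

```python
# Minimal sketch (not Trino's API) of an operator pipeline pulling pages:
# a page is a batch of rows stored column-wise (dict of column -> values).

def table_scan(pages):
    for page in pages:          # source operator: emit pages as-is
        yield page

def filter_op(pages, col, pred):
    for page in pages:          # keep only rows whose `col` passes `pred`
        keep = [i for i, v in enumerate(page[col]) if pred(v)]
        yield {c: [vals[i] for i in keep] for c, vals in page.items()}

def driver(pipeline):
    return list(pipeline)       # one driver drains one pipeline instance

split = [{"id": [1, 2, 3], "price": [5, 50, 500]}]
out = driver(filter_op(table_scan(split), "price", lambda v: v >= 50))
```

Each operator consumes and produces whole pages, which is what makes columnar, vectorized execution possible.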

(figure: stages, tasks, drivers and operators)

First, the coordinator creates the list of splits using the metadata SPI and assigns tasks based on that list. During query execution the coordinator tracks the progress of every task and split; as workers complete tasks and produce more splits for downstream processing, the coordinator continues to assign tasks.

4.8 query optimization

Here we explain query optimization with a concrete SQL example.
Shown in the figure below is a SQL statement that computes sales totals by region and nation, using the nation, region, orders and customer tables. It is a TPC-H-style query, which we take as the running example.

(figure: the example TPC-H SQL statement)

4.8.2 initial query plan

Reading the SQL literally, we can write down a query plan directly and obtain the tree below. Such a plan does produce correct results, but even a rough estimate shows that the two cross joins make its complexity astronomical: a plan like this could not finish within a human lifetime. The task of the query optimizer is to transform this initial plan into an equivalent plan that completes far more efficiently. In theory the number of equivalent query plans is exponential; the optimizer must select the best plan it can within a bounded time.

(figure: the initial query plan)

4.9-4.10 optimization rules / heuristics

The optimizer improves the initial query plan by applying a set of optimization rules, which we may call heuristics.

4.9.1 predicate push down

Predicate pushdown is easy to understand: push each predicate as close to the leaf nodes as possible, so that rows which cannot appear in the result are not processed further during execution.
In the example above, we can merge the cross-join operator with part of the filter operator, turning it into an inner join and pushing the predicate down.
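As a toy illustration (hypothetical node shapes, not Trino's planner IR), the Filter-over-CrossJoin rewrite might look like:

```python
# Toy plan rewrite with hypothetical node shapes (not Trino's planner IR):
# a Filter over a CrossJoin becomes an InnerJoin on the equi-predicates,
# and the remaining single-table predicates are pushed toward the leaves.

def push_down(plan):
    if plan["op"] == "filter" and plan["input"]["op"] == "cross_join":
        join = plan["input"]
        equi = [p for p in plan["preds"] if p["kind"] == "equi"]
        rest = [p for p in plan["preds"] if p["kind"] != "equi"]
        node = {"op": "inner_join", "on": equi,
                "left": join["left"], "right": join["right"]}
        left_table = join["left"]["table"]
        for p in rest:  # push each single-table predicate below the join
            side = "left" if p["table"] == left_table else "right"
            node[side] = {"op": "filter", "preds": [p], "input": node[side]}
        return node
    return plan

plan = {"op": "filter",
        "preds": [{"kind": "equi", "left": "o.custkey", "right": "c.custkey"},
                  {"kind": "range", "table": "orders", "col": "o_orderdate"}],
        "input": {"op": "cross_join",
                  "left": {"op": "scan", "table": "orders"},
                  "right": {"op": "scan", "table": "customer"}}}
optimized = push_down(plan)
```

After the rewrite, the range predicate on orders sits directly above its scan, and the equi-predicate has become the inner-join condition.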

(figure: the plan after predicate pushdown)

After predicate pushdown, the complexity of the query plan drops substantially.

4.9.2 cross join elimination

A cross join generally has no business meaning; it usually appears only in hastily written SQL. Cross joins are extremely expensive and often infeasible to complete, so they are eliminated specifically.
Elimination works mainly by reordering the joined tables using the filter conditions, trying to turn every cross join into a regular join.
In the example, rearranging the table order lets both predicates be pushed down, eliminating the cross joins.

(figure: the plan after cross join elimination)

After cross join elimination, the complexity of the query plan is no longer dominated by the joins.

4.9.3 TopN

The combination ORDER BY + LIMIT is very common in SQL. The naive approach sorts everything first and then takes the first n rows; the full sort is heavy, so the complexity is high. Distributed TopN algorithms based on heaps are by now mature, so ORDER BY + LIMIT is generally rewritten into a TopN operator.
After the rewrite to TopN, sorting is no longer the rate-limiting step.

Incidentally, in SQL an ORDER BY is only useful when followed by LIMIT or FETCH FIRST; in other cases Presto simply ignores the ORDER BY.
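A minimal sketch of the TopN idea using Python's heapq (illustrative only; Trino's operator is implemented in Java): a bounded heap of size n replaces the full sort, turning O(m log m) into roughly O(m log n).

```python
import heapq

# ORDER BY sales DESC LIMIT 2 rewritten as a TopN operator:
# heapq.nsmallest keeps a bounded heap instead of sorting everything.

def top_n(rows, n, key):
    return heapq.nsmallest(n, rows, key=key)

rows = [{"region": r, "sales": s} for r, s in
        [("asia", 90), ("europe", 70), ("africa", 95), ("america", 60)]]
best = top_n(rows, 2, key=lambda r: -r["sales"])  # top 2 by sales, descending
```

In the distributed setting each worker computes its local TopN first, and only the partial results are merged.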

4.9.4 local (partial) aggregation

Generally speaking, tables participating in a join are not needed at full detail. Especially when a GROUP BY aggregation is present, usually only aggregate information over a limited number of columns of the joined table is used. We can therefore pre-aggregate the relevant columns before the join, and join afterwards, reducing the data transferred.
To preserve parallelism, pre-aggregation is implemented as local aggregation. When it significantly reduces the amount of data transferred it significantly improves performance; when it does not, it can have a negative effect. (Presto therefore disables this rule by default; it can be enabled manually.)
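A sketch of the partial-aggregation idea (toy model): each worker collapses its rows per group key locally, so the shuffle carries one partial sum per key per worker instead of every input row.

```python
from collections import defaultdict

# Local (partial) aggregation before the shuffle, then a final merge.

def partial_aggregate(rows):
    acc = defaultdict(int)
    for key, value in rows:     # collapse this worker's rows per key
        acc[key] += value
    return dict(acc)

def final_aggregate(partials):
    total = defaultdict(int)
    for p in partials:          # merge the per-worker partial sums
        for key, value in p.items():
            total[key] += value
    return dict(total)

worker1 = [("cn", 1), ("cn", 2), ("us", 5)]
worker2 = [("cn", 10), ("us", 1)]
partials = [partial_aggregate(worker1), partial_aggregate(worker2)]
result = final_aggregate(partials)
```

Here 5 input rows shrink to 4 partial rows before the network; with many duplicate keys per worker the reduction is far larger, which is exactly when the rule pays off.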

(figure: the plan with local pre-aggregation)

4.10.1 lateral join decorrelation

A lateral join corresponds to a correlated subquery in the SELECT position. Presto's parser first rewrites this shorthand into the standard left join form, as shown in the following figure:

(figure: the rewrite into standard left join form)

On its own this rewrite is not semantically equivalent, particularly with respect to NULL values and duplicates. To preserve the semantics, the query plan assigns unique IDs to the rows of the original query to distinguish them, and checks the join result for duplicates; if a duplicate appears, an error is raised.

(figure: decorrelation with unique IDs and a duplicate check)

4.10.2 semi join (IN) decorrelation

This refers to a correlated IN subquery: inside the IN subquery, columns of the outer table are used, which creates the correlation.

(figure: a correlated IN subquery)

Presto removes this correlation as follows. First the subquery is stripped of the outer correlation condition (in this example, WHERE p.partkey = l.partkey) and evaluated once; the result is then inner-joined with the outer table, filtered on the IN key, and finally deduplicated explicitly to obtain the distinct result.

Incidentally, Presto implements the resulting uncorrelated semi join with a hash table.

4.11 cost based optimizer

An important feature of the cost-based optimizer (CBO) is that it considers not only the "shape" of the query statement but also the characteristics of the input data.
Presto's cost model has three components: CPU time, memory demand, and network bandwidth usage.

4.11.2 cost of join

Presto performs distributed joins with an extended hash join algorithm. The main idea: designate one side of the join pair the build side and the other the probe side; build a hash table on the build side's join key, then probe it with the probe side and emit the matches. Concretely, execution is divided into three stages, forming three layers of hashing to support parallelism:

  1. First, rows of both the build-side and probe-side tables are sent to worker nodes according to a hash of the join key: the first-layer hash f(x);
  2. Within each worker node, build-side rows are hashed again on the join key to distribute them across threads: the second-layer hash g(x);
  3. Within each thread, a hash table is built over the build side's join keys: the third-layer hash h(x). Once built, the per-thread tables are handed to the worker node to be merged; after the merge, probe-side rows hash their join key with h(x), look it up in the local table, and return the matches. In this probing phase the probe-side data can simply be divided evenly among the threads.

You may wonder why the third-layer tables need to be merged at all: the query could proceed without merging, routing probe-side rows to threads by g(x) and probing each thread's table by h(x).
That is indeed an option, but under it the probe side must first be repartitioned locally by g(x), an extra burden, whereas Presto's scheme divides the probe-side data evenly with no repartitioning. Since the probe side is generally the larger table, Presto's choice is better.
In addition, for convenience of merging one can simply take g(x) = h(x), making the merge an O(1) operation.
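The three hash layers can be modeled in a few lines; this is a toy sketch (here g and h are both Python's built-in hash, so "merging" the thread tables amounts to keeping them side by side):

```python
# Toy model of the three-layer distributed hash join described above:
# f(x) routes rows to workers, g(x) routes build rows to threads within
# a worker, h(x) is the per-thread hash table (here g == h).

WORKERS, THREADS = 2, 2
f = lambda key: hash(key) % WORKERS
g = lambda key: hash(key) % THREADS

def distributed_hash_join(build, probe):
    # layer 1: shuffle both sides to workers by f(join key)
    build_w = [[r for r in build if f(r[0]) == w] for w in range(WORKERS)]
    probe_w = [[r for r in probe if f(r[0]) == w] for w in range(WORKERS)]
    out = []
    for w in range(WORKERS):
        # layers 2+3: per-thread hash tables over the build side
        tables = [{} for _ in range(THREADS)]
        for key, val in build_w[w]:
            tables[g(key)].setdefault(key, []).append(val)
        # probe rows look up the thread table chosen by the same hash
        for key, val in probe_w[w]:
            for bval in tables[g(key)].get(key, []):
                out.append((key, bval, val))
    return sorted(out)

build = [("a", 1), ("b", 2)]
probe = [("a", "x"), ("a", "y"), ("c", "z")]
result = distributed_hash_join(build, probe)
```

Because the same f(x) shuffles both sides, matching keys always land on the same worker, so each worker's join is fully independent.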

Next, the cost of a distributed hash join. All build-side data must fit in memory for fast building and probing, a large memory overhead; and both sides of the join must be shuffled over the network, which is also very expensive.
To control the memory overhead, the CBO selects the smaller table as the build side, which depends on table statistics.

Join optimization: dynamic filtering

Generally speaking, the join-key range of the small table is narrower than that of the large table. For large-table rows outside that range: 1) their splits/partitions need not be read into Presto at all; 2) the data need not be shuffled over the network. While building its hash table, the build side records the range of key values at almost zero cost; it then only has to inform the upstream and downstream operators (generally on the same node, within a single driver) before the large table's data is distributed.
With such a filter condition available, ORC and Parquet can exploit dynamic filtering especially well. Currently this optimization exists only for broadcast joins; for distributed joins, summarizing the data range is troublesome, because this node does not control the reading of the large table.
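The build-side range collection and probe-side filtering can be sketched as a toy model:

```python
# Sketch of dynamic filtering: while building the hash table on the
# small (build) side, record min/max of the join key and hand that
# range to the large-table scan, so out-of-range rows are never read
# or shuffled.

def build_with_range(build_rows):
    keys = [k for k, _ in build_rows]
    table = {}
    for k, v in build_rows:
        table.setdefault(k, []).append(v)
    return table, (min(keys), max(keys))

def filtered_scan(probe_rows, key_range):
    lo, hi = key_range
    return [r for r in probe_rows if lo <= r[0] <= hi]

dim = [(20200101, "holiday"), (20200102, "holiday")]
fact = [(20190101, 5), (20200101, 7), (20200102, 9), (20210101, 3)]
table, key_range = build_with_range(dim)
survivors = filtered_scan(fact, key_range)
```

Here the fact-table scan drops half its rows before the join even starts; with a selective dimension filter the savings can be dramatic.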

(figures: dynamic filtering)
Join optimization: dynamic partition pruning

In a data warehouse (such as Hive), consider joining a fact table (store_sales) with a dimension table (date_dim), as in the following SQL.

SELECT COUNT(*)
FROM store_sales JOIN date_dim
  ON store_sales.ss_sold_date_sk = date_dim.d_date_sk
WHERE d_following_holiday = 'Y' AND d_year = 2000;

The dimension-table predicate is pushed down into the Hive ORC reader for partition pruning, lazy reads and other optimizations (covered later under the Hive connector), extracting a small table very efficiently; no problem there. But the huge fact table would be read in full, which is terrible.
You might recall the dynamic filtering optimization just described; it does apply, but only to broadcast joins over ORC/Parquet, and even then the pruning decision is made while scanning, partition by partition and file by file. We would rather prune directly at the partition level.
The implementation extends dynamic filtering: first collect the small table's join-key range and pass it to the coordinator; the coordinator then prunes partitions using the HMS information, and only afterwards assigns splits. Done this way, there is no restriction on the join mode or storage format.

4.11.3 table statistics

The statistics SPI provides the following information:

  1. Row count;
  2. Number of distinct values per column (as in COUNT(DISTINCT));
  3. Fraction of NULL values per column;
  4. Minimum and maximum values per column;
  5. Average data size per column.

The CBO can use these statistics for cost estimation.
For instance, for the join just discussed, the CBO can use the row counts and average column sizes to judge table sizes, and hence determine the join order.
The following join order, for example, is chosen purely on memory grounds: start from the largest table, lineitem, and use successively smaller tables as the build side:

(figure: join order chosen purely for memory)

Considering total cost instead, if the join-reordering parameter is enabled, the optimizer joins the small tables first and the large table last, avoiding repeated network shuffles of the large table:

(figure: join order after cost-based reordering)

Different tables and clusters yield different plans; this is precisely the value of the CBO.

4.11.4 filtering statistics

The two statistics used so far are not enough to reorder joins well. For example, if the largest table, lineitem, carries an equality predicate, predicate pushdown immediately turns it from the largest input into the smallest, and the join order changes completely, as shown in the following figure. (Editor's note: the join condition in the figure appears to be written incorrectly; the figure is reproduced from the book.)

(figure: join order after applying filter statistics)

Using the NULL fraction, the min/max values and the number of distinct values of lineitem.partkey, we can estimate the average number of rows per value (assuming a uniform distribution) as row count × (1 − NULL fraction) / number of distinct values:

(figure: the per-value row count estimate)

Of course, when the data is clearly skewed this estimate causes problems; if the data source can provide histograms, the CBO can be more accurate and avoid skew issues.
For external sources with partitioned tables and per-partition statistics, the CBO also uses the partition statistics directly.
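The uniform-distribution estimate can be computed directly (the numbers below are illustrative, not real TPC-H statistics):

```python
# Estimate, under a uniform-distribution assumption, how many rows an
# equality predicate on a column keeps: non-null rows divided by the
# number of distinct values.

def estimate_equality_rows(row_count, null_fraction, distinct_values):
    return row_count * (1 - null_fraction) / distinct_values

# Illustrative numbers (not real TPC-H statistics):
rows = estimate_equality_rows(row_count=6_000_000,
                              null_fraction=0.0,
                              distinct_values=200_000)
```

An equality predicate on such a column is estimated to keep about 30 rows, which is what lets the CBO demote a huge table to the smallest join input.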

4.11.6 join sequence enumeration

  • Order enumeration: when joining many tables the order matters greatly, and N tables admit on the order of N! join orders. By default Presto reorders joins over at most 9 tables at a time; beyond that the search cost is too high.
  • Dynamic programming: the join-order enumeration problem has optimal substructure, and Presto solves it with top-down (recursive, memoized) dynamic programming.
  • Ignoring cross joins: almost all optimizers skip cross joins, Presto included, so tables that would require a cross join do not participate in the reordering here.
  • Two search spaces: typical optimizers consider only left-deep join trees. Presto found that bushy-tree join plans can be better in distributed settings, and enumerating them costs little, so Presto searches both spaces.

    (figure: left-deep trees vs bushy trees)

After understanding the above basic ideas, let’s describe the enumeration process:

  1. Put the nodes that can be reordered, together with the predicate relationships between them, into a multi-join node. Within a multi-join node the tables may be freely reordered and any two of them may be joined (cross joins excluded). Note that not every join can be freely reordered, and those that cannot are left outside the multi-join node. Note also that the join predicate between any pair of tables is actually derived by a predicate-equality inference algorithm, not detailed here.

    (figure: the multi-join node)
  2. Next, each multi-join node is processed by memoized divide and conquer: the tables and predicates in the node are partitioned into two parts, and the result of each sub-problem is written into a memo for later reuse. In principle there are C(n,1) + C(n,2) + … + C(n, n/2) ways to make the split, on the order of 2^(n-1). But since only two tree shapes are searched, only the ends (C(n,1), the left-deep splits) and the middle (C(n, n/2), the balanced splits) of the corresponding row of Pascal's triangle are explored. The middle term is the largest, so the count remains exponential, but pruning away the transition region keeps the constant factor small and the cost acceptable; on top of that, the dynamic-programming memo prunes a great deal more.
    (figure: partitioning with the memo)
  3. For each split, the two parts are merged by a join, which has two physical execution modes, distributed and broadcast (described in the next section); this choice is made here as well, and the result is recorded in the memo;
    (figure: choosing broadcast vs distributed for each join)
  4. Once the required entries of the memo are filled in, the best plan can be chosen and executed; during execution the memo is consulted repeatedly.

    (figure: the final plan extracted from the memo)
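The memoized enumeration can be sketched as follows. The cost model here is invented purely for illustration, and the two search spaces are mimicked by splitting off either one table (left-deep) or half the tables (bushy):

```python
from functools import lru_cache
from itertools import combinations

# Toy memoized join-order enumeration. The memo is lru_cache; the cost
# model ("build the smaller side") and table sizes are invented.

SIZES = {"lineitem": 6_000_000, "orders": 1_500_000,
         "customer": 150_000, "nation": 25}

@lru_cache(maxsize=None)
def best_plan(tables):  # tables: sorted tuple of table names
    if len(tables) == 1:
        return SIZES[tables[0]], tables[0]
    best = None
    for k in {1, len(tables) // 2}:          # the two tree shapes only
        for left in combinations(tables, k):
            right = tuple(t for t in tables if t not in left)
            lcost, lplan = best_plan(tuple(sorted(left)))
            rcost, rplan = best_plan(right)
            cost = lcost + rcost + min(lcost, rcost)  # toy join cost
            if best is None or cost < best[0]:
                best = (cost, (lplan, rplan))
    return best

cost, plan = best_plan(tuple(sorted(SIZES)))
```

The `lru_cache` plays the role of the memo: every subset of tables is costed at most once, no matter how many splits revisit it.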

4.11.7 broadcast join and distributed join strategies

Here are two specific strategies of hash join:

  • Broadcast: every worker node holds a complete copy of the build side and a complete hash table. Each worker reads probe-side data directly from the data source for its join, so the probe side is never shuffled over the network.

    (figure: broadcast join)
  • Distributed: exactly the hash-join process described earlier; both tables are first partitioned across worker nodes by f(x), and each worker then executes its share of the join fully independently and in parallel.

A comparison of the trade-offs: when the build side is very small or the probe side is very large, broadcast is generally better, because it avoids shuffling the entire probe side over the network; but if both tables are large and must be partitioned to fit in memory, a distributed join is required.
Clearly this choice requires the CBO, with filter information taken into account.
Presto can collect statistics by analyzing external data sources; for data written through Presto ETL, statistics can be collected directly at write time, which is more convenient.

Chapters 6 – 7: connectors

6.2 RDBMS-PostgreSQL Connector

Generally, a task is assigned to a worker node, which connects to the PostgreSQL server over JDBC, with one JDBC connection per table. SQL involving multiple tables may therefore involve multiple worker nodes and multiple JDBC connections, and this is all the parallelism the RDBMS connector provides: whether or not the underlying RDBMS is distributed, there is no parallel reading of a single table.

6.2.1 query pushdown

Presto can push a whole SQL statement, or part of one, down to the data source, letting the source process that part of the query first; this reduces the data transmitted to Presto and lowers the overhead.
For example, with Presto connected to PostgreSQL, suppose Presto receives this SQL:

(figure: the SQL received by Presto)

Presto's worker node will ask PostgreSQL to execute the following SQL and return the results:

(figure: the SQL pushed down to PostgreSQL)

As you can see, two parts are pushed down: the projected columns and the filter condition.
In newer versions of Trino, aggregations, joins, LIMIT and TopN can also be pushed down.
Whether a given connector supports this is stated in its documentation.

6.4 DFS-Hive Connector

  • The Hive connector can be used not only for Hive but for various distributed storage platforms. Non-Hadoop storage platforms must find a substitute for the Hive Metastore (HMS); without one, the Hive connector cannot be used.
  • Presto only reads the information in the Hive Metastore; it never uses the Hive engine. There is therefore no query pushdown to Hive, only data reading.
  • Presto can take advantage of Hive's partitioning and of partition columns in SQL, Hive's most important feature.
  • Presto has optimized readers for several Hive file formats (ORC, Parquet, RCFile binary and text) to ensure reading efficiency.

(figures: the Hive connector)

Although the Hive engine takes no part in Presto's SQL execution, the specific file formats used by Hive do, because popular formats such as ORC, RCFile and Parquet expose a great deal of statistics.
Take ORC, the most commonly used. ORC provides a vectorized, columnar read interface, exactly what Presto needs; Presto's ORC reader also adds two important optimizations: predicate pushdown and lazy reads.

  • Predicate pushdown: ORC files store min/max values at three levels of granularity, from the whole file down to every 10,000 rows; predicate pushdown lets the ORC reader skip many irrelevant data segments;
  • Lazy reads: for predicate columns with highly random values, pushdown helps little. To keep reads efficient, the ORC reader first reads only the predicate columns and evaluates the predicate; it then locates the surviving segments and fills in the other columns at the corresponding positions.

    (figure: ORC predicate pushdown and lazy reads)

    Together, these two techniques improved Presto's performance on the Hive connector by 3 – 8x.

  • Predicate filtering order: predicates differ in selectivity. Applying the more selective ones first yields better performance, so the ORC reader also has to estimate the selectivity of each predicate. In addition, whether to cache the filtered data must be considered.

    (figure: predicate filtering order)
  • Complex-type pushdown: usually only some subfields of a complex-typed column are needed, so this projection can be pushed down into the table-scan operator; predicates over complex types can be pushed down in the same way.
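The stripe-level min/max skipping behind ORC predicate pushdown can be sketched as follows (hypothetical layout; real ORC keeps statistics at file, stripe and row-group level):

```python
# Sketch of ORC-style predicate pushdown using per-stripe min/max
# statistics: stripes whose [min, max] range cannot satisfy the
# predicate are skipped without reading any of their rows.

def scan_with_pushdown(stripes, lo, hi):
    out, stripes_read = [], 0
    for stats, rows in stripes:
        if stats["max"] < lo or stats["min"] > hi:
            continue                     # skip: no row here can match
        stripes_read += 1
        out.extend(v for v in rows if lo <= v <= hi)
    return out, stripes_read

stripes = [({"min": 1, "max": 99}, list(range(1, 100))),
           ({"min": 100, "max": 199}, list(range(100, 200))),
           ({"min": 200, "max": 299}, list(range(200, 300)))]
values, stripes_read = scan_with_pushdown(stripes, 150, 160)
```

Only the middle stripe is actually decoded; the other two are eliminated by their statistics alone, which is where the 3 – 8x gains come from on selective queries.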

7.2 Nosql-Accumulo

Accumulo is practically identical to HBase. However, HBase already has Phoenix, so Presto can simply connect to HBase through Phoenix; Accumulo therefore stands in here to illustrate the issues.

  • Presto splits the NoSQL rowkey so that one key-value record is flattened into multiple relational fields, mapping it onto the relational model.
  • For NoSQL sources, the coordinator first connects to the database to obtain metadata, then determines the regions covered by the query range; different region servers can then be read in parallel.
    (figure: parallel reads across region servers)
  • Presto can use the NoSQL store's secondary indexes for query pushdown.

7.6 federated queries

Presto can query tables from multiple data sources simultaneously in one SQL, which is very transparent to users.

(figure: a federated query)

Pushdown also works in federated queries, by the same rules as before: the data-warehouse side cannot push down; NoSQL pushes down only projection and filtering; the RDBMS side can push down joins and aggregations.

(figure: pushdown in a federated query)

7.7 ETL

Presto supports ETL too, though not as well as some commercial tools dedicated to ETL. Still, Presto has its own advantages: first, with federated queries you may not need ETL at all; second, Presto's ETL is plain SQL, which makes switching between data sources very convenient.

Chapter 12 Presto in production environment

12.2 Presto SQL query tuning

Here are some key points for tuning:

  • EXPLAIN the join order
  • Pre-aggregation before joins
  • Pre-aggregation on GROUP BY keys
  • Join hash-table size

12.3 memory management

Presto's memory management is built on the worker node's JVM.
User memory: the memory consumed by operations in the client's query, such as aggregation and sorting;
System memory: the memory the query engine itself uses, such as input/output buffers and table-scan buffers.
Presto has an initial-hash-partitions setting. If it is set to 8 and query.max-memory = 50GB, the average user-memory occupancy per worker node is 50GB / 8 = 6.25GB; if query.max-memory-per-node is set to 13GB, this configuration tolerates data skew of slightly more than twice the average.
All of Presto's worker nodes should be configured identically; otherwise memory cannot be managed properly.
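The sizing arithmetic above, as a quick check:

```python
# The sizing rule from the paragraph above: average user memory per
# worker is query.max-memory divided by the initial hash partitions;
# the per-node cap divided by that average is the skew it tolerates.

def per_node_average_gb(query_max_memory_gb, initial_hash_partitions):
    return query_max_memory_gb / initial_hash_partitions

avg = per_node_average_gb(50, 8)   # average user memory per worker, GB
skew_allowance = 13 / avg          # skew tolerated by a 13 GB node cap
```

So a 13GB per-node limit allows one node to hold a bit more than twice the 6.25GB average before the query is killed.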

12.4 task concurrency

  • Task worker threads: by default twice the number of CPU cores; the count can be raised manually, but weigh whether the gain covers the cost of the extra context switching;
  • Operator concurrency: this really means local parallelism. Operators such as join and aggregation have a local degree of parallelism; the higher it is, the higher the context-switching cost.

12.5 work node scheduling

  • The cap on splits processed concurrently by each worker node, and the cap on each task's split queue, can be raised moderately to increase single-node load;
  • With a data cache (e.g. Rubix), or when storage and compute nodes are not fully separated, the Presto scheduler's queue can reserve more room for node-local splits to reduce network traffic; this mode is usually enabled.

12.6 network data exchange

  • More threads pulling data from upstream stages may increase throughput, but watch the memory usage;
  • The exchange input/output buffer defaults to 32MB, which is really too small; enlarge it so that back-pressure does not kick in and reduce query efficiency.

Performance evaluation (benchmark)

Evaluation of CBO performance improvement:

The experiments and data come from the Starburst blog: https://blog.starburstdata.com/technical-blog/presto-cost-based-optimizer-rocks-the-tpc-benchmarks
In the experiment, Presto workers are deployed on 8 HDFS data nodes, with the coordinator, NameNode and HMS sharing one further node.
Each node has 24 cores and 256GB of memory, with a 160GB heap; data is stored as ORC with zlib compression.
More than 80% of the queries (TPC-DS) run over a 1TB data set, and some runs use 10TB.

(figures: TPC benchmark results with and without the CBO)
