The secret of Greenplum distributed database kernel (Part 1)

Time:2021-3-4

Greenplum is the most mature open source distributed analytical database (Greenplum 6, released in August 2019, greatly improves OLTP performance, making it a true HTAP database; evaluation data will be published soon). Gartner's 2019 evaluation ranks Greenplum third worldwide for classic data analytics and fourth for real-time data analytics, and it is the only open source database product in the top ten of both fields. In other words, if you want a product based on open source, Greenplum is the only top-ten choice. See the Gartner report for details.

So how was Greenplum, a distributed database, built? Greenplum is known to be based on PostgreSQL, the most advanced single-node database, for which many kernel documents and papers exist. However, there is relatively little material on how to turn single-node PostgreSQL into a distributed database. This article introduces, in six parts, the main work needed to develop the single-node PostgreSQL database into a distributed MPP database. Of course, this is only a minimalist overview: reaching enterprise-grade production quality costs hundreds of millions of dollars and represents more than ten years of R&D investment by a top database team of roughly one hundred people. Although not required, a basic knowledge of the PostgreSQL kernel helps in understanding some of the details in this article; Bruce Momjian's slides are an excellent introduction

(https://momjian.us/main/presentations/internals.html).

1. Overview of the Greenplum cluster

PostgreSQL is the most advanced single-node open source database in the world. Greenplum is based on PostgreSQL and is the most advanced open source MPP database in the world (for more information about Greenplum, visit the Greenplum Chinese community). From the user's point of view, Greenplum is a complete relational database management system (RDBMS). At the physical level, it consists of multiple PostgreSQL instances that can be accessed individually. To make these independent PostgreSQL instances divide the work and cooperate, presenting a single logical database to users, Greenplum implements distributed cluster handling at several levels: data storage, computation, communication, and management. Although Greenplum is a cluster, it encapsulates all the distributed details and presents users with a single logical database. This encapsulation greatly frees developers and operators.

Transforming single-node PostgreSQL into a cluster involves many aspects. This article focuses on six of them: data distribution, query plan parallelization, execution parallelization, distributed transactions, data shuffling, and management parallelization.

Greenplum adds many other features on top of PostgreSQL, such as append-optimized tables, column-oriented tables, external tables, multi-level partitioned tables, a fine-grained resource manager, the ORCA query optimizer, backup and recovery, high availability, fault detection and recovery, cluster data migration, cluster expansion, the MADlib machine learning library, containerized UDF execution, the PostGIS extension, the GPText suite, monitoring and management tools, Kubernetes integration, and more.

The figure below shows a bird's-eye view of a Greenplum cluster: one master node and two segment nodes, with four segment instances deployed on each segment node to improve resource utilization. Each instance, whether master or segment, is a physically independent PostgreSQL database.
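A quick way to see this topology from psql is to query the gp_segment_configuration catalog, which lists every instance in the cluster (the column subset below is a minimal sketch; the output obviously depends on your deployment):

-- List all instances: content = -1 is the master (and standby, if any),
-- content >= 0 are the segment instances.
SELECT dbid, content, role, port, hostname
FROM gp_segment_configuration
ORDER BY content, role;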


2. Distributed data storage

Distributed data storage is the first problem a distributed database must solve. The basic principles of distributed data storage are relatively simple and easy to implement; many database middleware products can also provide basic distributed storage. Greenplum goes beyond the basics and offers more advanced and flexible features, such as multi-level partitioning and polymorphic storage. Greenplum 6 further enhances this area with consistent hashing and replicated tables, and allows users to influence the data distribution method according to their application. As shown in the figure below, users see a single logical database, and each database has system tables (for example, pg_class under pg_catalog) and user tables (the sales and customers tables in the example below). At the physical level there are many independent databases, each with its own system tables and user tables. The master database contains only metadata and does not store user data; the user data tables do still exist on the master, but they are empty, and the optimizer needs these empty tables for query optimization and plan generation. Most system tables (with a few exceptions, such as statistics-related tables) in a segment database are identical to those in the master database. Each segment stores a portion of each user data table.


In Greenplum, user data is distributed across the segment instances of the different nodes according to a chosen policy. Each instance has its own independent data directory and stores its share of the user data as disk files. With standard INSERT statements, data is automatically routed to the appropriate segments according to the table's distribution policy; however, INSERT is slow and suitable only for small amounts of data. Greenplum provides dedicated parallel data loading tools for efficient data import; see the official documentation for gpfdist and gpload. Greenplum also supports parallel COPY, which is the fastest loading method if the data is already present on each segment host. The figure below illustrates how the rows of a user's sales table are distributed across different segment instances.
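As a hedged sketch of gpfdist-based parallel loading (host name, port, and file names are made up for illustration, and the sales table is assumed to match the definition used in the partitioning example below), each segment pulls its share of the data directly from the gpfdist server:

-- External table pointing at files served by gpfdist.
CREATE EXTERNAL TABLE ext_sales (id int, date date, amt decimal(10,2))
LOCATION ('gpfdist://etl-host:8081/sales*.csv')
FORMAT 'CSV' (HEADER);

-- Parallel load: every segment reads from gpfdist, and the rows are routed
-- to segments according to the target table's distribution key.
INSERT INTO sales SELECT * FROM ext_sales;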


In addition to distributing data horizontally across nodes, Greenplum also supports partitioning the data on each node according to various criteria, including multi-level partitioning. Greenplum supports the following partitioning methods:

  • Range partitioning: partitions data by a time or numeric range of a column. For example, the following SQL creates a table partitioned by day; one year of data, from January 1, 2016 to January 1, 2017, is split into 366 daily partitions:
CREATE TABLE sales (id int, date date, amt decimal(10,2))  
DISTRIBUTED BY (id)  
PARTITION BY RANGE (date)  
( START (date '2016-01-01') INCLUSIVE  
 END (date '2017-01-01') EXCLUSIVE  
 EVERY (INTERVAL '1 day') );  
  
  • List partitioning: divides data into partitions according to a list of values of a column. For example, the following SQL creates a table partitioned by gender, with three partitions: one for female rows, one for male rows, and a default partition for other values such as NULL.
CREATE TABLE rank (id int, rank int, year int, gender char(1), count int )  
DISTRIBUTED BY (id)  
PARTITION BY LIST (gender)  
( PARTITION girls VALUES ('F'),  
PARTITION boys VALUES ('M'),  
DEFAULT PARTITION other );  

The figure below shows that the user's sales table is first distributed across the two nodes and then partitioned on each node according to some criterion. The main purpose of partitioning is partition pruning, which improves performance by reducing data access: the optimizer automatically skips partitions that cannot satisfy the query conditions, reducing the amount of data scanned during execution. PostgreSQL supports static partition pruning based on query conditions, and Greenplum's ORCA optimizer adds dynamic partition pruning, which can improve performance by ten to hundreds of times.
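As a small illustration of pruning on the sales table defined above (the exact plan output varies by Greenplum version and optimizer), a predicate on the partition column lets the planner touch only one daily partition:

-- Only the partition covering 2016-03-02 needs to be scanned;
-- the other 365 daily partitions are pruned away.
EXPLAIN SELECT sum(amt) FROM sales WHERE date = DATE '2016-03-02';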


Greenplum supports polymorphic storage: within a single user table, different partitions can be stored with different storage methods according to their access patterns. Data of different ages typically has different access patterns, and different access patterns have different optimal storage schemes. Polymorphic storage transparently provides the best-fitting storage for each class of data. Greenplum provides the following storage methods:

  • Heap table: the default storage format of Greenplum and PostgreSQL. It supports efficient UPDATE and DELETE operations and fast access to many columns at once, and is typically used for OLTP-style queries.
  • Append-optimized (AO) table: a storage format optimized for appending, usually used for fact tables in a data warehouse. It is not suitable for frequent UPDATE operations.
  • AOCO (append-optimized, column-oriented) table: column-oriented storage with good compression ratios; it supports different compression algorithms and suits queries that access only a few columns.
  • External table: the data of an external table is stored outside Greenplum (and is not managed by Greenplum); only the external table's metadata lives in Greenplum. Greenplum supports many external data sources, such as S3, HDFS, files, GemFire, and various relational databases, and many data formats, such as TEXT, CSV, Avro, and Parquet.

As shown in the figure below, suppose the sales table mentioned above is partitioned by month; different storage strategies can then be used for data of different ages. For example, the most recent three months are stored as heap partitions, older data as column-oriented partitions, and data older than one year in S3 or HDFS via external table partitions.
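The DDL below is a minimal sketch of this idea (partition names and the zlib compression choice are illustrative, not taken from the figure): per-partition WITH clauses let one partition use append-optimized, column-oriented storage while another stays heap.

CREATE TABLE sales_poly (id int, date date, amt decimal(10,2))
DISTRIBUTED BY (id)
PARTITION BY RANGE (date)
( PARTITION older  START (date '2019-11-01') INCLUSIVE
                   END (date '2019-12-01') EXCLUSIVE
                   WITH (appendonly=true, orientation=column, compresstype=zlib),
  PARTITION recent START (date '2019-12-01') INCLUSIVE
                   END (date '2020-01-01') EXCLUSIVE  -- default heap storage
);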


Data distribution is the foundation of any MPP database and one of the keys to its efficiency. Spreading massive amounts of data across multiple nodes greatly reduces the amount of data each node has to process and lays the foundation for parallel processing; together these can dramatically improve the performance of the whole system. For example, in a cluster of 100 nodes, each node stores only one percent of the total data, and if all 100 nodes process in parallel, performance can be dozens of times higher than that of a single, more powerful machine. If the data distribution is uneven and data skew occurs, the whole system is only as fast as its slowest node because of the short-board effect. Therefore, whether the data is distributed sensibly has a great impact on Greenplum's overall performance.
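A common way to check for skew is simply to count rows per segment using the hidden gp_segment_id column (a quick sketch on the sales table used above):

-- Rows per segment; large differences between segments indicate data skew
-- on the chosen distribution key.
SELECT gp_segment_id, count(*) AS rows_on_segment
FROM sales
GROUP BY gp_segment_id
ORDER BY gp_segment_id;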

Greenplum 6 provides the following data distribution strategies:

  • Hash distribution
  • Random distribution
  • Replicated table

Hash distribution

Hash distribution is the most commonly used data distribution method in Greenplum. A hash value is computed from the predefined distribution key of each row and mapped to a segment. The distribution key can contain multiple columns. Choosing an appropriate distribution key is one of the main factors in Greenplum performance: a good distribution key distributes data evenly across the segments and avoids data skew.
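A distribution key may span several columns, as in the sketch below (table and column names are illustrative): rows with the same (customer_id, order_date) pair always hash to the same segment, so joins or aggregations on exactly these columns can stay local.

CREATE TABLE orders (
    customer_id int,
    order_date  date,
    amount      decimal(10,2)
) DISTRIBUTED BY (customer_id, order_date);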

The code that computes the hash value of the distribution key is in cdbhash.c. The CdbHash structure is the main data structure for distribution-key hashing. The logic for computing the hash value of a distribution key is as follows:

  • Use makeCdbHash(int segnum) to create a CdbHash structure
  • Then, for each tuple, perform the following operations to compute its hash value and determine which segment the tuple should be distributed to:

    • cdbhashinit(): perform initialization
    • cdbhash(): this function calls hashDatum() to preprocess values of different types; finally, addToCdbHash() adds the processed column value to the hash computation
    • cdbhashreduce(): maps the hash value to a segment

The CdbHash structure:

typedef struct CdbHash  
{  
    uint32         hash;       /* hash result value */  
    int            numsegs;    /* number of segments */  
    CdbHashReduce  reducealg;  /* algorithm for reducing buckets */  
    uint32         rrindex;    /* round-robin index */  
} CdbHash;

Main functions

  • makeCdbHash(int numsegs): creates a CdbHash structure that maintains the following information:

    • Number of segments
    • Reduction method: if the number of segments is a power of 2, REDUCE_BITMASK is used, otherwise REDUCE_LAZYMOD.
  • The hash value in the structure is initialized for each tuple, which happens in cdbhashinit():
  • void cdbhashinit(CdbHash *h)

    h->hash = FNV1_32_INIT; resets the hash value to the initial offset basis

  • void cdbhash(CdbHash *h, Datum datum, Oid type): adds one attribute to the CdbHash computation, i.e. one more column is taken into account when computing the hash. This function passes in the function pointer addToCdbHash.
  • void addToCdbHash(void *cdbHash, void *buf, size_t len): implements the datumHashFunction

h->hash = fnv1_32_buf(buf, len, h->hash); /* execute a 32-bit FNV-1 hash over the buffer */

Generally, the call path is evalHashKey -> cdbhash -> hashDatum -> addToCdbHash. unsigned int cdbhashreduce(CdbHash *h) maps a hash value to a segment; the main logic is a modulo operation, as follows:

switch (h->reducealg)  
{  
 case REDUCE_BITMASK:  
    result = FASTMOD(h->hash, (uint32) h->numsegs);       /* fast mod (bitmask) */  
    break;  
  
 case REDUCE_LAZYMOD:  
    result = (h->hash) % (h->numsegs); /* simple mod */  
    break;  
}  

For each tuple, execute the following flow:

  • void cdbhashinit(CdbHash *h)
  • void cdbhash(CdbHash *h, Datum datum, Oid type)
  • void addToCdbHash(void *cdbHash, void *buf, size_t len)
  • unsigned int cdbhashreduce(CdbHash *h)

Random distribution

If no reasonable hash distribution key can be determined for a table, or no distribution key avoids data skew, random distribution can be used. Random distribution stores the rows of one insert operation across the nodes in a round-robin fashion. The round-robin is only valid within a single SQL statement and does not carry over across statements. For example, if rows are inserted into a randomly distributed table one at a time, they can all end up on the same segment, as the example below shows.

test=# create table t1 (id int) DISTRIBUTED RANDOMLY;  
CREATE TABLE  
test=# INSERT INTO t1 VALUES (1);  
INSERT 0 1  
test=# INSERT INTO t1 VALUES (2);  
INSERT 0 1  
test=# INSERT INTO t1 VALUES (3);  
INSERT 0 1  
test=# SELECT gp_segment_id, * from t1;  
gp_segment_id | id  
---------------+----  
            1 |  1  
            1 |  2  
            1 |  3  

Some tools use random distribution to implement data management. For example, gpexpand needs to redistribute data after adding nodes; during initialization, the old gpexpand marked all tables as randomly distributed and then performed the redistribution, so that the redistribution did not disturb normal business operation. (Greenplum 6 redesigned gpexpand, which no longer needs to change the distribution policy to random.)

Replicated table

Greenplum 6 supports a new distribution strategy: replicated tables, where the whole table has a complete copy on every node.

test=# CREATE TABLE t2 (id int) DISTRIBUTED REPLICATED;  
CREATE TABLE  
test=# INSERT INTO t2 VALUES (1), (2), (3);  
INSERT 0 3  
test=# SELECT * FROM t2;  
id  
----  
 1  
 2  
 3  
(3 rows)  
  
test=# SELECT gp_segment_id, * from t2;  
gp_segment_id | id  
---------------+----  
            0 |  1  
            0 |  2  
            0 |  3  

Replicated tables solve two problems:

  • UDFs cannot access tables on segments. Because of the MPP architecture, any single segment contains only part of the data, so a UDF executed on a segment cannot access any table; otherwise the result of the computation would be wrong.
yydzero=# CREATE FUNCTION c() RETURNS bigint AS $$  
yydzero$#  SELECT count(*) from t1 AS result;  
yydzero$# $$ LANGUAGE SQL;  
CREATE FUNCTION  
yydzero=# SELECT c();  
c  
---  
6  
(1 row)  
yydzero=# select c() from t2;  
ERROR:  function cannot execute on a QE slice because it accesses relation "public.t1"  (seg0 slice1 192.168.1.107:25435 pid=76589)  

If t1 above is changed to a replicated table, this problem goes away.

Replicated tables have many application scenarios. For example, spatial_ref_sys in PostGIS (a large number of PostGIS UDFs access this table) and plr_modules in PL/R can both be stored as replicated tables. Before this feature was supported, Greenplum could only support spatial_ref_sys and the like through a few tricks.

  • Avoiding distributed query plans: if a table's data is replicated on every segment, a local join plan can be generated, avoiding moving data between nodes of the cluster. Using replicated tables for tables with small amounts of data (for example, a few thousand rows) brings a significant performance gain; replicated tables are not suitable for tables with large amounts of data. A sketch of such a join is shown below.
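A minimal sketch of the second point (table names are illustrative; the exact plan depends on version and optimizer): joining a hash-distributed fact table to a small replicated dimension table needs no Redistribute or Broadcast Motion below the join, because every segment already holds the whole dimension table.

-- Small dimension table replicated to every segment.
CREATE TABLE dim_region (region_id int, region_name text)
DISTRIBUTED REPLICATED;

-- Fact table hash-distributed on its own key.
CREATE TABLE fact_sales (sale_id int, region_id int, amt decimal(10,2))
DISTRIBUTED BY (sale_id);

-- The join runs segment-locally; only a final Gather Motion
-- collects the result rows on the master.
EXPLAIN SELECT f.sale_id, r.region_name
FROM fact_sales f JOIN dim_region r ON f.region_id = r.region_id;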

3. Query plan parallelization

The query plan generated by PostgreSQL can only be executed on a single node. Greenplum needs to parallelize the query plan to take full advantage of the cluster.

Greenplum introduces the Motion operator to parallelize query plans. A Motion operator transfers data between nodes and hides the difference between the MPP architecture and a single machine from the other operators, so that most operators do not care whether they run on a cluster or on a single node. Each Motion operator has a sender side and a receiver side. In addition, Greenplum also optimizes certain operators, for example aggregation. (For background on the PostgreSQL optimizer, see src/backend/optimizer/README.)

Optimization examples

Before we go into the technical details, let’s look at a few examples.

In the following example, two tables t1 and t2 are created; both have two columns c1 and c2, with c1 as the distribution key.

CREATE table t1 AS SELECT g c1, g + 1 as c2 FROM generate_series(1, 10) g DISTRIBUTED BY (c1);  
CREATE table t2 AS SELECT g c1, g + 1 as c2 FROM generate_series(5, 15) g DISTRIBUTED BY (c1);  
  
SQL1:  
  
SELECT * from t1, t2 where t1.c1 = t2.c1;  
c1 | c2 | c1 | c2  
----+----+----+----  
 5 |  6 |  5 |  6  
 6 |  7 |  6 |  7  
 7 |  8 |  7 |  8  
 8 |  9 |  8 |  9  
 9 | 10 |  9 | 10  
10 | 11 | 10 | 11  
(6 rows)  

The query plan of SQL1 is shown below. Because the join key is the distribution key of both tables, the join can be executed locally: the subtree under the HashJoin operator needs no data movement, and a final Gather Motion collects the results on the master.

QUERY PLAN  
-----------------------------------------------------------------  
Gather Motion 3:1  (slice1; segments: 3)  (cost=3.23..6.48 rows=10 width=16)  
  ->  Hash Join  (cost=3.23..6.48 rows=4 width=16)  
        Hash Cond: t2.c1 = t1.c1  
        ->  Seq Scan on t2  (cost=0.00..3.11 rows=4 width=8)  
        ->  Hash  (cost=3.10..3.10 rows=4 width=8)  
              ->  Seq Scan on t1  (cost=0.00..3.10 rows=4 width=8)  
Optimizer: legacy query optimizer  


SQL2:

SELECT * from t1, t2 where t1.c1 = t2.c2;  
c1 | c2 | c1 | c2  
----+----+----+----  
 9 | 10 |  8 |  9  
10 | 11 |  9 | 10  
 8 |  9 |  7 |  8  
 6 |  7 |  5 |  6  
 7 |  8 |  6 |  7  
(5 rows)  

The query plan of SQL2 is shown below. The join key c1 of table t1 is also its distribution key, while the join key c2 of table t2 is not, so t2's data must be redistributed by t2.c2 so that all rows satisfying t1.c1 = t2.c2 meet on the same segment.

QUERY PLAN  
-----------------------------------------------------------------  
Gather Motion 3:1  (slice2; segments: 3)  (cost=3.23..6.70 rows=10 width=16)  
  ->  Hash Join  (cost=3.23..6.70 rows=4 width=16)  
        Hash Cond: t2.c2 = t1.c1  
        ->  Redistribute Motion 3:3  (slice1; segments: 3)  (cost=0.00..3.33 rows=4 width=8)  
              Hash Key: t2.c2  
              ->  Seq Scan on t2  (cost=0.00..3.11 rows=4 width=8)  
        ->  Hash  (cost=3.10..3.10 rows=4 width=8)  
              ->  Seq Scan on t1  (cost=0.00..3.10 rows=4 width=8)  
Optimizer: legacy query optimizer  


SQL3:

SELECT * from t1, t2 where t1.c2 = t2.c2;  
c1 | c2 | c1 | c2  
----+----+----+----  
 8 |  9 |  8 |  9  
 9 | 10 |  9 | 10  
10 | 11 | 10 | 11  
 5 |  6 |  5 |  6  
 6 |  7 |  6 |  7  
 7 |  8 |  7 |  8  
(6 rows)

The query plan of SQL3 is shown below. Neither t1's join key c2 nor t2's join key c2 is a distribution key, so a Broadcast Motion broadcasts the data of one table to all nodes to guarantee a correct join. The plan generated by the latest master branch for this query chooses to redistribute both tables instead; why that is also correct is left as an exercise for the reader :)

QUERY PLAN  
-----------------------------------------------------------------  
Gather Motion 3:1  (slice2; segments: 3)  (cost=3.25..6.96 rows=10 width=16)  
  ->  Hash Join  (cost=3.25..6.96 rows=4 width=16)  
        Hash Cond: t1.c2 = t2.c2  
        ->  Broadcast Motion 3:3  (slice1; segments: 3)  (cost=0.00..3.50 rows=10 width=8)  
              ->  Seq Scan on t1  (cost=0.00..3.10 rows=4 width=8)  
        ->  Hash  (cost=3.11..3.11 rows=4 width=8)  
              ->  Seq Scan on t2  (cost=0.00..3.11 rows=4 width=8)  
Optimizer: legacy query optimizer  


SQL4:

SELECT * from t1 LEFT JOIN t2 on t1.c2 = t2.c2 ;  
c1 | c2 | c1 | c2  
----+----+----+----  
 1 |  2 |    |  
 2 |  3 |    |  
 3 |  4 |    |  
 4 |  5 |    |  
 5 |  6 |  5 |  6  
 6 |  7 |  6 |  7  
 7 |  8 |  7 |  8  
 8 |  9 |  8 |  9  
 9 | 10 |  9 | 10  
10 | 11 | 10 | 11  
(10 rows)  

The query plan of SQL4 is shown below. Although the join key is the same as in SQL3, broadcasting t1 cannot be used because this is a LEFT JOIN; otherwise the result would contain duplicate rows. The plan therefore redistributes both tables. Depending on path costs, the optimizer may also choose to broadcast t2 instead. (With equal data volumes, broadcasting a single table costs more than redistributing both: redistributing both transmits each tuple of each table once, i.e. the equivalent of sending one table's tuples twice, whereas broadcasting transmits each tuple of one table nsegments times.)

QUERY PLAN  
-----------------------------------------------------------------  
Gather Motion 3:1  (slice3; segments: 3)  (cost=3.47..6.91 rows=10 width=16)  
  ->  Hash Left Join  (cost=3.47..6.91 rows=4 width=16)  
        Hash Cond: t1.c2 = t2.c2  
        ->  Redistribute Motion 3:3  (slice1; segments: 3)  (cost=0.00..3.30 rows=4 width=8)  
              Hash Key: t1.c2  
              ->  Seq Scan on t1  (cost=0.00..3.10 rows=4 width=8)  
        ->  Hash  (cost=3.33..3.33 rows=4 width=8)  
              ->  Redistribute Motion 3:3  (slice2; segments: 3)  (cost=0.00..3.33 ...  
                    Hash Key: t2.c2  
                    ->  Seq Scan on t2  (cost=0.00..3.11 rows=4 width=8)  
Optimizer: legacy query optimizer  


SQL5:

SELECT  c2, count(1) from t1 group by c2;  
c2 | count  
----+-------  
 5 |     1  
 6 |     1  
 7 |     1  
 4 |     1  
 3 |     1  
10 |     1  
11 |     1  
 8 |     1  
 9 |     1  
 2 |     1  
(10 rows)  

The four SQL statements above showed how different joins influence the type of data movement (Motion). SQL5 demonstrates Greenplum's optimization of aggregation: two-stage aggregation. The first stage aggregates the local data on each segment; the partial results are then redistributed and the second aggregation stage runs on each segment; finally, the master collects the results through a Gather Motion. Greenplum also uses three-stage aggregation for some queries, such as DISTINCT combined with GROUP BY.

QUERY PLAN  
-----------------------------------------------------------------  
Gather Motion 3:1  (slice2; segments: 3)  (cost=3.55..3.70 rows=10 width=12)  
  ->  HashAggregate  (cost=3.55..3.70 rows=4 width=12)  
        Group Key: t1.c2  
        ->  Redistribute Motion 3:3  (slice1; segments: 3)  (cost=3.17..3.38 rows=4 width=12)  
              Hash Key: t1.c2  
              ->  HashAggregate  (cost=3.17..3.17 rows=4 width=12)  
                    Group Key: t1.c2  
                    ->  Seq Scan on t1  (cost=0.00..3.10 rows=4 width=4)  
Optimizer: legacy query optimizer  
(9 rows)  
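A hedged way to see the three-stage variant on the example table t1 (the exact plan shape depends on version, optimizer, and statistics) is a DISTINCT aggregate grouped by a non-distribution-key column:

-- Depending on the planner, this can produce a multi-stage aggregate with
-- Redistribute Motions between the stages.
EXPLAIN SELECT c2, count(DISTINCT c1) FROM t1 GROUP BY c2;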


New data structures and concepts introduced by Greenplum for query optimization

The intuitive examples above showed the different distributed query plans Greenplum generates for different SQL statements. The main internal mechanisms are introduced below.

To turn a single-node query plan into a parallel plan, Greenplum introduces several new concepts, enhancing PostgreSQL's Node, Path, and Plan structures:

  • A new node type: Flow
  • A new path type: CdbMotionPath
  • A new query plan operator: Motion (the first field of Motion is Plan, whose first field is of type NodeTag; Flow's first field is also a NodeTag, so Flow is a node at the same level as RangeVar, IntoClause, Expr, and RangeTblRef)
  • A new field CdbPathLocus locus in the Path structure, representing the redistribution strategy of the result tuples of that path
  • A new Flow field in the Plan structure, representing the flow direction of the operator's result tuples

New node type: Flow

The new node type Flow describes the flow direction of tuples in a parallel plan. Each query plan node (Plan structure) has a flow field that represents the flow direction of the node's output tuples. Flow is a new node type but not a query plan node. The Flow structure also contains some member fields used for plan parallelization.

Flow has three main fields:

  • Flow type (flotype), indicating the type of the flow:
    • Undefined: undefined flow
    • Singleton: corresponds to a Gather Motion
    • Replicated: corresponds to a Broadcast Motion
    • Partitioned: corresponds to a Redistribute Motion
  • Movement, which determines what kind of Motion should be applied to the output of the current plan node; it is mainly used to adapt subquery plans to the distributed environment:
    • None: no Motion required
    • Focus: focus onto a single segment, equivalent to a Gather Motion
    • Broadcast: Broadcast Motion
    • Repartition: hash redistribution
    • Explicit: move tuples directly to the segments marked by their segid field
  • CdbLocusType locustype: the locus type. The optimizer uses this information to choose the most appropriate nodes for processing the data flow and to determine the required Motion:
    • CdbLocusType_Null: no locus
    • CdbLocusType_Entry: a single backend process on the entry db (i.e. the master); it can be the QD (query dispatcher) or a QE (query executor) on the entry db
    • CdbLocusType_SingleQE: a single backend process on any node; it can be the QD or any QE process
    • CdbLocusType_General: compatible with any locus
    • CdbLocusType_Replicated: replicated on all QEs
    • CdbLocusType_Hashed: hash-distributed across all QEs
    • CdbLocusType_Strewn: the data is distributed, but the distribution key is unknown

New path type: CdbMotionPath

Path represents one possible way of computing a result (for example a sequential scan or a hash join); more complex paths inherit from the Path structure and record additional information for optimization. Greenplum adds a new field CdbPathLocus locus to the Path structure, which represents the redistribution and execution strategy for the result tuples of that path.

The distribution key of a table in Greenplum determines how tuples are distributed when they are stored, i.e. which segment's disk a tuple lands on. CdbPathLocus determines how tuples are redistributed between different processes (the QEs of different segments) during execution, i.e. which process should handle a given tuple; the tuples may come from a table or from a function.

Greenplum also introduces a new path type, CdbMotionPath, which describes how the results of a sub-path are transmitted from the sender processes to the receiver processes.

New plan operator: Motion

As mentioned above, Motion is a query plan tree node that shuffles data so that its parent operator can obtain the data it needs from its child operator. There are three types of Motion:

  • MOTIONTYPE_HASH: redistributes data using a hash over the redistribution key; each tuple passing through the operator is sent to the target segment determined by the hash value of its redistribution key.
  • MOTIONTYPE_FIXED: sends tuples to a fixed set of segments; this can be a Broadcast Motion (send to all segments) or a Gather Motion (send to one fixed segment).
  • MOTIONTYPE_EXPLICIT: sends each tuple to the segment specified in its segid field, corresponding to an Explicit Redistribute Motion. The difference from MOTIONTYPE_HASH is that no hash value needs to be computed.

As mentioned earlier, Greenplum adds a Flow *flow field to the Plan structure to represent the flow direction of its result tuples. Several other fields related to optimization and execution are also added to the Plan structure, such as the dispatch field (DispatchMethod), which indicates whether MPP dispatching is required; directDispatch, which indicates whether the plan can be dispatched directly to a single segment (typically for primary-key lookups); the sliceTable, which helps the executor run the distributed plan; and motionNode, which records the parent Motion node of the current plan node.

Generating a distributed query plan

The figure below shows the optimization flow of the traditional (planner-based) optimizer in Greenplum (the ORCA optimizer works differently). This section emphasizes the parts that differ from PostgreSQL's single-node optimizer.

standard_planner() is PostgreSQL's default planner entry point; it mainly calls subquery_planner() and set_plan_references(). In Greenplum, after set_plan_references(), cdbparallelize() is called to parallelize the query plan tree.

As its name suggests, subquery_planner() optimizes one subquery and generates a query plan tree for it. It has two main phases:

  • Optimization of basic query features (also known as SPJ: Select/Projection/Join), implemented by query_planner()
  • Optimization of advanced query features (non-SPJ) such as aggregation, implemented by grouping_planner(); grouping_planner() calls query_planner() to optimize the basic features first and then optimizes the advanced features on top.

Greenplum's distributed processing of the single-node plan mainly happens in two places:

  • Within a single subquery: the subquery plan tree returned by Greenplum's subquery_planner() is already distributed, e.g. Motion operators have been added for hash joins, aggregations have become two-stage, and so on.
  • Across multiple subqueries: Greenplum needs to set up appropriate data flows between subqueries so that the result of one subquery can be consumed by the upper-level query tree. This is implemented by the function cdbparallelize().


Parallelization of a single subquery

Greenplum's optimization of a single subquery is similar to PostgreSQL's, with the following main differences:

  • Joins: based on the data distribution of the left and right subtrees of the join operator, decide whether a Motion node needs to be added, and of what type.
  • Optimization of advanced operations such as aggregation, e.g. the two-stage aggregation described above.

The main flow, briefly:

First, build_simple_rel() builds the basic relation information, such as how many tuples the table has and how many pages it occupies. The most important piece of information is the data distribution: GpPolicy describes the distribution type and distribution key of the base table.

Then set_base_rel_pathlists() builds the access paths for the base relations; it calls different functions depending on the table type:

  • RTE_FUNCTION: create_functionscan_path()
  • RTE_RELATION: create_external_path()/create_aocs_path()/create_seqscan_path()/create_index_paths()
  • RTE_VALUES: create_valuesscan_path

These functions determine the locus type of the path node, which characterizes how the data is distributed and should be processed. This information is essential for subquery parallelization: when a path is later converted into a plan, its locus determines the plan's Flow type, and the Flow in turn determines which type of Gang the executor uses to run it.

How to determine locus?

For an ordinary heap table, create_seqscan_path() determines the locus of the sequential-scan path as follows:

  • If the table is hash-distributed, the locus type is CdbLocusType_Hashed
  • If the table is randomly distributed, the locus type is CdbLocusType_Strewn
  • If it is a system table, the locus type is CdbLocusType_Entry

For functions, create_functionscan_path() determines the locus of the path as follows:

  • If the function is immutable, use CdbLocusType_General
  • If the function is mutable, use CdbLocusType_Entry
  • If the function must be executed on the master, use CdbLocusType_Entry
  • If the function must be executed on all segments, use CdbLocusType_Strewn

If the SQL statement contains joins, make_rel_from_joinlist() generates access paths for the join tree; the corresponding functions are create_nestloop_path() / create_mergejoin_path() / create_hashjoin_path(). The most important decision in this process is whether a Motion node needs to be added and of what type. For example, the join key of SQL1 above is the distribution key of both t1 and t2, so no Motion is needed; SQL2 must redistribute t2 so that, for any t1 tuple, all t2 tuples satisfying the join condition (t1.c1 = t2.c2) are on the same segment.

If the SQL contains advanced features such as aggregation or window functions, cdb_grouping_planner() is called to optimize them, for example by converting an aggregation into a two-stage or three-stage aggregation.

The final step is to choose the cheapest of all candidate paths and call create_plan() to convert the optimal path tree into the optimal query plan tree.

At this stage, the locus of a path determines the Flow type of the generated plan. Flow relates to Gangs in the executor (covered in the executor section): thanks to Flow, the executor does not care about the form of data distribution or the distribution key, only about whether the data lives on multiple segments or on a single one. The correspondence between locus and Flow is:

  • FLOW_SINGLETON: Locus_Entry/Locus_SingleQE/Locus_General
  • FLOW_PARTITIONED: Locus_Hash/Locus_Strewn/Locus_Replicated

Parallelization of multiple subqueries

The main purpose of cdbparallelize() is to resolve the data flow between multiple subqueries and generate the final parallel query plan. It consists of two main steps: prescan and apply_motion.

  • prescan has two purposes. One is to mark certain types of plan nodes (e.g. via Flow) for later processing by apply_motion; the other is to mark or transform SubPlans. A SubPlan is in fact not a query plan node but an expression node: it contains a plan node and its range table, corresponds to a SubLink (an SQL subquery expression) in the query tree, and may appear inside expressions. prescan processes the plan tree contained in a SubPlan as follows:
  • If the SubPlan is an InitPlan, mark the root node of the query tree so that apply_motion later adds a Motion node on top.
  • If the SubPlan is an uncorrelated multi-row subquery, a Gather or Broadcast operation is applied to the subquery according to the Flow information in its plan node, and a Material node is added on top to prevent rescanning the SubPlan; avoiding re-executing the subquery each time improves efficiency.
  • If the SubPlan is a correlated subquery, it is converted into an executable form: recursively descend until a leaf Scan node is reached, then replace that Scan node with the form below. After this transformation the query tree can be executed in parallel, because the correlated subquery has become part of a Result node and lives in the same slice as the outer query node.
Result  
          \  
           \_Material  
             \  
              \_Broadcast (or Gather)  
                \  
                 \_SeqScan  
  • apply_motion: adds Motion nodes to the top-level query tree according to the Flow nodes in the plan; different Motion nodes are added for different kinds of SubPlan (InitPlan, uncorrelated multi-row subquery, correlated subquery).

For example, for SELECT * FROM tbl WHERE id = 1, prescan() marks the root node when it traverses to the root of the query tree, and apply_motion() then adds a Gather Motion on top of the root node.
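A hedged way to observe the result on the example table t1 from earlier (plan costs omitted; only the shape matters):

-- The single-node Seq Scan plan gets a Gather Motion added on top by
-- apply_motion(). Because c1 is t1's distribution key, the equality
-- predicate also allows direct dispatch to the one segment that can
-- contain matching rows.
EXPLAIN SELECT * FROM t1 WHERE c1 = 1;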

This article has covered the Greenplum cluster overview, distributed data storage, and distributed query optimization. The next article will continue with distributed query execution, distributed transactions, data shuffling, and cluster management.
