Secrets of the Greenplum Distributed Database Kernel (Part 2)


See also: Secrets of the Greenplum Distributed Database Kernel (Part 1)

1. Distributed executor

Now that we have distributed data storage and a distributed query plan, the next step is to execute that plan across the cluster and return the result to the user.

Core concepts of the Greenplum executor

Let’s look at an example SQL statement and its plan:

test=# CREATE TABLE students (id int, name text) DISTRIBUTED BY (id);  
test=# CREATE TABLE classes (id int, classname text, student_id int) DISTRIBUTED BY (id);  
test=# INSERT INTO students VALUES (1, 'steven'), (2, 'changchang'), (3, 'guoguo');  
test=# INSERT INTO classes VALUES (1, 'math', 1), (2, 'math', 2), (3, 'physics', 3);  
test=# EXPLAIN SELECT s.name AS student_name, c.classname  
test-# FROM students s, classes c  
test-# WHERE s.id = c.student_id;  
Gather Motion 2:1  (slice2; segments: 2)  (cost=2.07..4.21 rows=4 width=14)  
  ->  Hash Join  (cost=2.07..4.21 rows=2 width=14)  
        Hash Cond: c.student_id = s.id  
        ->  Redistribute Motion 2:2  (slice1; segments: 2)  (cost=0.00..2.09 rows=2 width=10)  
              Hash Key: c.student_id  
              ->  Seq Scan on classes c  (cost=0.00..2.03 rows=2 width=10)  
        ->  Hash  (cost=2.03..2.03 rows=2 width=12)  
              ->  Seq Scan on students s  (cost=0.00..2.03 rows=2 width=12)  
Optimizer status: legacy query optimizer  


This figure is a schematic diagram of how the SQL above executes in a Greenplum cluster with two segments.

QD (query dispatcher): the process on the master node responsible for handling user query requests (the backend process, in PostgreSQL terms). After QD receives a SQL request from a user, it parses, rewrites and optimizes it, dispatches the optimized parallel plan to the segments for execution, and returns the final result to the user. It is also responsible for communication control and coordination among all the QE processes involved in the statement. For example, if an error occurs in one QE during execution, QD collects the error details and cancels all the other QEs; if a LIMIT N clause has already been satisfied, QD terminates the execution of all QEs; and so on. The entry point of QD is exec_simple_query().

QE (query executor): the process on a segment responsible for executing query tasks dispatched by QD. Each segment instance also runs PostgreSQL, so from a QE’s point of view QD is a PostgreSQL client, and the two communicate through PostgreSQL’s standard libpq protocol; from QD’s point of view, a QE is the PostgreSQL backend process executing its query request. In general, a QE executes only a part of the whole query (called a slice). The entry point of QE is exec_mpp_query().

Slice: to improve the parallelism and efficiency of query execution, Greenplum divides a complete distributed query plan into multiple slices from bottom to top, each responsible for a part of the plan. The slice boundaries are Motion nodes: each Motion is cut into a sender and a receiver, yielding two subtrees. Each slice is processed by one QE process. There are three slices in the example above.

Gang: the QE processes that execute the same slice on different segments are collectively called a gang. In the example above there are two gangs. The first gang scans the table classes on each of the two segments and redistributes the resulting rows to the second gang; the second gang scans the table students on each segment, builds a hash table, joins it with the classes rows sent to that segment by the first gang, and sends the final result to the master.
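To make the slice and gang division concrete, here is a small Python sketch (illustrative only, not Greenplum code; the hash function and row layout are simplifications) of how the example plan runs on two segments: the lower gang scans `classes` and redistributes rows by `student_id`, the upper gang joins them with local `students` rows and gathers the results.

```python
# Simplified model of the example plan on a 2-segment cluster.
NUM_SEGMENTS = 2

def seg_for(key):
    # Stand-in for Greenplum's distribution hash; any deterministic hash works here.
    return hash(key) % NUM_SEGMENTS

# Rows as stored per segment, distributed by each table's DISTRIBUTED BY (id).
students = [(1, 'steven'), (2, 'changchang'), (3, 'guoguo')]
classes  = [(1, 'math', 1), (2, 'math', 2), (3, 'physics', 3)]
seg_students = {s: [r for r in students if seg_for(r[0]) == s] for s in range(NUM_SEGMENTS)}
seg_classes  = {s: [r for r in classes  if seg_for(r[0]) == s] for s in range(NUM_SEGMENTS)}

# Slice 1 (lower gang): each segment scans classes and redistributes by student_id.
inboxes = {s: [] for s in range(NUM_SEGMENTS)}
for s in range(NUM_SEGMENTS):
    for row in seg_classes[s]:
        inboxes[seg_for(row[2])].append(row)       # Redistribute Motion 2:2

# Slice 2 (upper gang): each segment hash-joins redistributed rows with local students.
results = []
for s in range(NUM_SEGMENTS):
    by_id = {sid: name for sid, name in seg_students[s]}    # Hash on students
    for _, classname, student_id in inboxes[s]:
        if student_id in by_id:                             # Hash Join
            results.append((by_id[student_id], classname))  # via Gather Motion 2:1

print(sorted(results))
```

Each loop over `range(NUM_SEGMENTS)` models one gang of QE processes; in a real cluster those iterations run concurrently on separate segments.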

Parallel execution process

The following figure shows the parallel execution of a query in a Greenplum cluster. It assumes two segments and two slices in the query plan, so there are four QEs in total, communicating with each other over the network.


QD and QE are both PostgreSQL backend processes, and their execution logic is very similar. For data manipulation (DML) statements (data definition statements have simpler execution logic), the core logic is implemented by ExecutorStart, ExecutorRun and ExecutorEnd.


  • ExecutorStart initializes and starts the executor. Greenplum sends the complete query plan to every QE process in every gang through CdbDispatchPlan. Greenplum has two ways of sending plans to QEs: 1) asynchronous, using libpq’s asynchronous API to send query plans to QEs in a non-blocking way; 2) synchronous multi-threaded, using libpq’s synchronous API with multiple threads sending query plans to QEs at the same time. The GUC gp_connections_per_thread controls the number of threads used; its default value is 0, which selects the asynchronous mode. Greenplum removed the asynchronous mode in 6.0.
  • ExecutorRun runs the executor, executing each operator in the query tree and returning result tuples to the client in volcano-model style. On QD, ExecutorRun calls ExecutePlan to process the query tree, whose root node is a Motion operator; its execution function ExecMotion waits for results from the QEs. QD obtains tuples from the QEs, performs any remaining operations (such as sorting) and returns them to the end user.
  • ExecutorEnd cleans up the executor, including checking the results and shutting down the interconnect connections.

The ExecutorStart / ExecutorRun / ExecutorEnd functions on a QE are similar to the single-node PostgreSQL code. The main difference is that a QE executes one slice of the Greenplum distributed plan, so the root node of its query tree must be a Motion node. The corresponding execution function, ExecMotion, fetches tuples from the subtree below it and sends them to the appropriate receivers according to the Motion type. The QEs of a lower gang send the result tuples of their Motion node to the QEs of the upper gang, and the Motion of the top gang’s QEs sends its result tuples to QD. The flow type of the Motion determines how data is transmitted; there are two kinds, broadcast and redistribution. In broadcast mode a tuple is sent to every QE of the upper gang; in redistribution mode the QE that should process a tuple is computed from the redistribution key, and the tuple is sent only to that QE.

There are two types of network connections between QD and the QEs:

  • libpq: QD exchanges control information with each QE through libpq, including dispatching query plans, collecting error information and handling cancellation. libpq is the standard protocol of PostgreSQL; Greenplum has extended it, e.g. by adding the “m” message type, which QD uses to send query plans to QEs. libpq runs over TCP.
  • Interconnect: the transmission of table tuples between QD and QEs, and among QEs, is handled by the interconnect. Greenplum has two interconnect implementations, one based on TCP and one based on UDP; the default is the UDP interconnect.

Direct dispatch optimization

There is a special class of SQL statements that only need a single segment to execute, e.g. a primary-key lookup: SELECT * FROM tbl WHERE id = 1;

To improve resource utilization and efficiency, Greenplum applies an optimization called direct dispatch to such SQL: when generating the query plan, the optimizer determines from the distribution key and the WHERE clause whether the query qualifies for direct dispatch; in the execution phase, if it does, QD sends the plan only to the single segment that needs to execute it, rather than to all segments.
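The decision can be sketched as follows (helper names are invented for this example; the real optimizer works on plan and predicate structures). The idea is that an equality predicate covering the full distribution key pins the query to exactly one segment:

```python
# Toy model of the direct dispatch decision.
NUM_SEGMENTS = 4

def target_segments(dist_key, where_equalities):
    """where_equalities maps column -> constant, from 'col = const' predicates."""
    if all(col in where_equalities for col in dist_key):
        key = tuple(where_equalities[col] for col in dist_key)
        return [hash(key) % NUM_SEGMENTS]   # direct dispatch: a single segment
    return list(range(NUM_SEGMENTS))        # otherwise: dispatch to all segments

# SELECT * FROM tbl WHERE id = 1;   -- tbl DISTRIBUTED BY (id)
print(target_segments(('id',), {'id': 1}))      # one segment only
print(target_segments(('id',), {'name': 'x'}))  # predicate misses the key: all segments
```

Only equality on the whole distribution key qualifies; a range predicate or a predicate on a non-key column cannot identify a single segment.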

2. Distributed transactions

Greenplum uses the two-phase commit (2PC) protocol to implement distributed transactions. 2PC is a classic database algorithm and will not be described here. This section outlines two implementation details of Greenplum’s distributed transactions:

  • Distributed transaction snapshots: achieving consistency between the master and the segments
  • Shared local snapshots: achieving consistency among the QEs within a segment
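As a refresher on the protocol itself, here is a toy 2PC coordinator in Python (class and method names are invented; the real QD drives prepare/commit on the segments over libpq):

```python
# Minimal two-phase commit sketch: commit only if every participant prepares.
class Segment:
    def __init__(self, name, fail_prepare=False):
        self.name, self.fail_prepare = name, fail_prepare
        self.state = 'idle'
    def prepare(self, gxid):
        if self.fail_prepare:
            self.state = 'aborted'
            return False
        self.state = 'prepared'   # work durably recorded, but not yet visible
        return True
    def commit_prepared(self, gxid):
        self.state = 'committed'
    def abort_prepared(self, gxid):
        self.state = 'aborted'

def two_phase_commit(gxid, segments):
    # Phase 1: every participant must vote yes.
    if all(seg.prepare(gxid) for seg in segments):
        # Phase 2: the decision is commit; deliver it to all participants.
        for seg in segments:
            seg.commit_prepared(gxid)
        return 'commit'
    # Any failure: roll back whoever already prepared.
    for seg in segments:
        if seg.state == 'prepared':
            seg.abort_prepared(gxid)
    return 'abort'

print(two_phase_commit(42, [Segment('seg0'), Segment('seg1')]))                          # commit
print(two_phase_commit(43, [Segment('seg0'), Segment('seg1', fail_prepare=True)]))       # abort
```

The key property the sketch shows: once every segment has prepared, the coordinator’s decision alone determines the outcome, so a segment never commits while another aborts.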

Distributed snapshot

In a distributed environment, the execution order of SQL statements may differ across nodes. In the following example, segment 1 executes SQL1 before SQL2, so the newly inserted data is not visible to SQL1; segment 2 executes SQL2 before SQL1, so SQL1 does see the newly inserted data. The result is inconsistency.


Greenplum uses distributed snapshots, plus a mapping from local to distributed transactions, to achieve cross-node data consistency. The QD process takes the role of distributed transaction manager. When QD starts a new transaction (StartTransaction), it creates a new distributed transaction ID and sets the timestamp and corresponding state information; when taking a snapshot, QD creates a distributed snapshot and saves it in the current snapshot. Like a single-node snapshot, a distributed snapshot records xmin / xmax / xip and related information; its structure is as follows:

typedef struct DistributedSnapshot  
{  
    DistributedTransactionId xminAllDistributedSnapshots;  
    DistributedSnapshotId distribSnapshotId;  
    DistributedTransactionId xmin;  /* XID < xmin are visible */  
    DistributedTransactionId xmax;  /* XID >= xmax are invisible */  
    int32 count;                    /* number of in-progress distributed transactions in inProgressXidArray */  
    /* array of in-progress distributed transactions */  
    DistributedTransactionId *inProgressXidArray;  
} DistributedSnapshot;

When executing a query, QD serializes the distributed transaction and snapshot information and sends it to the QEs over the libpq protocol. Each QE deserializes it to obtain QD’s distributed transaction and snapshot information, and uses it to determine tuple visibility (HeapTupleSatisfiesMVCC). Because all QEs participating in the query judge tuple visibility with the same distributed transaction and snapshot information sent by QD, consistency across the whole cluster is guaranteed, avoiding the inconsistency in the previous example.

A QE judges the visibility of a tuple against a snapshot as follows:

  • If the transaction that created the tuple (the xmin field in the tuple header) has not been committed, the distributed transaction and snapshot information are not needed;
  • Otherwise, judge whether the creating transaction’s XID is visible to the snapshot:

    • First use the distributed snapshot information: look up the distributed transaction distribXid corresponding to the creating XID in the distributed transaction commit log, then judge whether distribXid is visible to the distributed snapshot:

      • If distribXid < distribSnapshot->xmin, the tuple is visible
      • If distribXid >= distribSnapshot->xmax, the tuple is not visible
      • If distribSnapshot->inProgressXidArray contains distribXid, the tuple is not visible
      • Otherwise, the tuple is visible
    • If visibility cannot be judged from the distributed snapshot, or need not be, the local snapshot information is used; this logic is the same as in PostgreSQL.
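The distributed part of these rules can be transcribed almost literally into Python (the field names here are assumed to mirror the DistributedSnapshot struct; the commit-log lookup is omitted, so we start directly from a distributed XID):

```python
# Distributed snapshot visibility rules, transcribed from the text above.
from dataclasses import dataclass, field

@dataclass
class DistributedSnapshot:
    xmin: int
    xmax: int
    in_progress: set = field(default_factory=set)  # models inProgressXidArray

def distrib_xid_visible(distrib_xid, snap):
    if distrib_xid < snap.xmin:
        return True    # committed before every transaction still in progress
    if distrib_xid >= snap.xmax:
        return False   # started after the snapshot was taken
    return distrib_xid not in snap.in_progress

snap = DistributedSnapshot(xmin=100, xmax=110, in_progress={103, 107})
print([x for x in (95, 103, 105, 112) if distrib_xid_visible(x, snap)])  # [95, 105]
```

Note that this is exactly the single-node PostgreSQL snapshot test, applied to distributed XIDs: the distribution changes where the snapshot comes from, not the visibility arithmetic.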

Similar to PostgreSQL’s commit log (clog), Greenplum must store the commit log of global transactions to determine whether a distributed transaction has committed. This information is kept in shared memory and persisted in the distributedlog directory. To make visibility checks on local XIDs efficient and avoid consulting the global transaction commit log every time, Greenplum introduces a cache of local-to-distributed transaction commit status, shown in the figure below. Each QE maintains such a cache, through which it can quickly find the distributed transaction distribXid corresponding to a local XID and then judge visibility against the distributed snapshot, avoiding frequent access to shared memory or disk.


Shared local snapshot

A SQL query plan in Greenplum may contain multiple slices, each corresponding to a QE process. On any one segment, the different QEs of the same session (processing the same user SQL) must have identical visibility. However, each QE process is an independent PostgreSQL backend process, unaware of the others, so their transaction and snapshot information would otherwise differ, as shown in the figure below.


To ensure consistent visibility across slices, Greenplum introduces the concept of the “shared local snapshot”. The QEs executing the same SQL statement on a segment share session and transaction information through a shared-memory data structure, SharedSnapshotSlot. These processes are called a SegMate process group.

Greenplum divides the QEs of a SegMate process group into a QE writer and QE readers. There is exactly one QE writer and zero or more QE readers. The QE writer can modify the state of the database; QE readers cannot, and must use the same snapshot information as the QE writer to keep their visibility consistent with it. As shown in the figure below.


“Shared” means the snapshot is shared between the QE writer and the readers; “local” means it is a segment-local snapshot, so the same user session can have different snapshots on different segments. An area in the segment’s shared memory stores the shared snapshots and is divided into many slots. A SegMate process group maps to one slot, identified by a unique session ID. A segment may host multiple SegMate process groups, each corresponding to one user session, as shown in the figure below.


After the QE writer creates a local transaction, it obtains a shared local snapshot slot in shared memory and copies its local transaction and snapshot information into it; the other QE readers in the SegMate process group obtain the transaction and snapshot information from that shared-memory slot. QE readers wait until the writer has filled in the shared local snapshot information. Only the QE writer participates in the global transaction, and only it needs to process transaction commands such as commit and abort.
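A toy model of the slot mechanism (names invented; real Greenplum uses a shared-memory slot array with proper synchronization, not a Python dict, and readers block rather than fail):

```python
# Per-session shared snapshot slots: writer publishes, readers consume.
shared_slots = {}   # session_id -> snapshot, standing in for shared memory

def writer_publish(session_id, snapshot):
    # The QE writer copies its local transaction/snapshot info into the slot.
    shared_slots[session_id] = snapshot

def reader_get(session_id):
    # A real QE reader waits until the writer has filled the slot;
    # here we simply require that it already has.
    if session_id not in shared_slots:
        raise RuntimeError("QE reader must wait for the QE writer")
    return shared_slots[session_id]

writer_publish(session_id=7, snapshot={'xmin': 100, 'xmax': 110, 'xip': [103]})
assert reader_get(7) == {'xmin': 100, 'xmax': 110, 'xip': [103]}
```

The point of the design is that every reader sees byte-identical snapshot data, so all slices of one session on one segment judge visibility identically.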

3. Data shuffling

The data transmission between adjacent gangs is called data shuffling. Data shuffling follows the hierarchy of slices: data flows through the network from bottom to top, never across layers. Depending on the Motion type there are different implementations, such as broadcast and redistribution.

Greenplum’s data-shuffling component is called the interconnect. It provides high-speed parallel data transmission between QEs without any disk I/O, and is one of the key technologies behind Greenplum’s high-performance query execution. The interconnect is used only to transmit data (table tuples); scheduling, control and error-handling information travels over the libpq connections between QD and the QEs.

There are two interconnect implementations, TCP and UDP. The TCP interconnect consumes a large number of port resources in large clusters, so its scalability is limited; Greenplum therefore uses UDP by default. The UDP interconnect supports flow control, packet retransmission and acknowledgement.
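The following toy (not the real interconnect protocol; the loss rate and retry count are arbitrary) illustrates why acknowledgement plus retransmission yields reliable delivery over a lossy transport like UDP:

```python
# Stop-and-wait style reliability sketch: resend until acknowledged.
import random
random.seed(1)  # deterministic "packet loss" for the example

def unreliable_send(packet, loss_rate=0.5):
    # Models a UDP datagram: either lost (None) or delivered.
    return None if random.random() < loss_rate else packet

def send_reliably(data, max_tries=10):
    for seq, chunk in enumerate(data):
        for attempt in range(max_tries):
            delivered = unreliable_send((seq, chunk))
            if delivered is not None:    # receiver got the packet and ACKs it
                break
        else:
            raise TimeoutError(f"packet {seq} never acknowledged")
    return len(data)

print(send_reliably(['t1', 't2', 't3']))
```

The real interconnect is far more sophisticated (windowed flow control, out-of-order handling), but the basic contract is the same: sequence numbers plus ACKs turn a lossy datagram service into a reliable tuple stream.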

4. Distributed cluster management

A distributed cluster contains multiple physical nodes, from four or five up to hundreds. Managing that many machines is far more complex than managing a single PostgreSQL database. To simplify cluster management, Greenplum provides many tools; some common ones are listed below. For more information about them, refer to the official Greenplum database administrator documentation.

  • gpactivatestandby: activates the standby master as the new master of the Greenplum cluster.
  • gpaddmirrors: adds mirror nodes to a Greenplum cluster to improve availability.
  • gpcheckcat: checks the Greenplum system catalog tables to assist fault analysis.
  • gpcheckperf: checks system performance across the cluster, including disk, network and memory performance.
  • gpconfig: sets configuration parameters on all nodes of the cluster.
  • gpdeletesystem: deletes an entire Greenplum cluster.
  • gpexpand: adds new machines to a Greenplum cluster for capacity expansion.
  • gpfdist: Greenplum’s file distribution server and the most important tool for loading and unloading data. gpfdist makes full use of parallelism and has very high performance.
  • gpload: wraps gpfdist and external tables; with a YAML configuration file, data can easily be loaded into a Greenplum database. Insert, update and merge modes are supported.
  • gpinitstandby: initializes a standby master for a Greenplum cluster.
  • gpinitsystem: initializes a Greenplum cluster.
  • gppkg: Greenplum’s package management tool, which conveniently installs packages such as PostGIS and PL/R on all nodes.
  • gprecoverseg: recovers a failed primary or mirror segment node.
  • gpssh / gpscp: standard ssh/scp execute remote commands and copy files to a single target machine; gpssh executes the same command on a group of machines simultaneously, and gpscp copies a file or directory to multiple target machines at once. Many Greenplum command-line tools use these two to run commands in parallel across the cluster.
  • gpstart: starts a Greenplum cluster.
  • gpstop: stops a Greenplum cluster.
  • gpstate: displays the status of a Greenplum cluster.
  • gpcopy: migrates data from one Greenplum database to another.
  • gp_dump / gp_restore: Greenplum backup and restore tools. Starting with Greenplum 5.x, the new tools gpbackup / gprestore are recommended.
  • packcore: packages a core dump file together with all its dependencies so that it can be debugged on another machine; a very useful debugging tool.
  • A plan visualization tool that converts the text output of EXPLAIN into a picture; the plan-tree images used in this article were generated with it.

5. Hands-on practice

The above outlines the six aspects involved in turning a single-node PostgreSQL database into a distributed database. If you are interested in more details, the most effective way to learn is to modify the code and implement new features. The following projects can serve as starting points:

  • Data storage: implement partial tables, so that a table or database uses only a subset of the cluster’s nodes. For example, in a 200-node cluster, create a table or database that lives on only 10 nodes.
  • Resource management: currently a gang can only be reused within one session; implement cross-session sharing, or a shared gang pool.
  • Scheduling: currently the dispatcher sends the whole plan to every QE; change it to send each slice only to the QEs responsible for executing that slice.
  • Performance optimization: analyze the performance bottlenecks of Greenplum’s distributed execution and optimize further, especially for OLTP queries, to achieve higher TPS.
  • Executor optimization: Greenplum currently uses zstd to compress append-optimized (AO) data and temporary data. One problem zstd brings is high memory consumption; optimizing memory use when operating on a large number of compressed files is a very challenging topic.

You can refer to this discussion for more details (the last part includes a simple way to reproduce the problem).

Those interested in these projects can contact [email protected] for more advice or help. By implementing any of the features above, you can join the Greenplum kernel development team through a fast track, to take on the challenges and share the joy.

6. Postscript

I had originally wanted to write something here to attract more people into database kernel development, but what I wrote did not hold together, so I let it go.

Greenplum has a strong wine culture, so let me end with a story about wine.

Around 2013 and 2014 I was fortunate to have many dealings with the veteran database engineer Dan Holle (CTO of Teradata, employee No. 7), who had worked on MPP databases for more than 30 years. Whenever he drank, he would put at least two bottles in front of him and drink from both at the same time: that is parallel processing. If he drank a bit more from one bottle, he would pick up the other and take an extra sip to keep the remaining amounts equal: that is avoiding skew. Often the left bottle would get a sip ahead, so he picked up the right bottle to catch up, overshot, and picked up the left bottle again. Left and right thus balanced each other, and even without Shandong hosts urging him to drink, he quickly got into the spirit.

The old man’s love of MPP databases was woven into his life, which is admirable. It is craftsmen like him, devoted to the field for decades, who created the brilliant achievements of today’s database world. We look forward to more people joining us!
