What is the game PostgreSQL distributed architecture?


1、 What is distributed database
Description in the principle of distributed system database system (Third Edition): “we define distributed database as a group of logically interrelated databases distributed on computer network. Distributed database management system (distributed DBMS) It is a software system that supports the management of distributed database, which makes the distribution transparent to users. Sometimes, distributed database system (DDBS) is used to represent both distributed database and distributed DBMS. “

In the above expression, “a group of people are distributed on the network and logically related to each other” is its essence. Physically, a group of logically related databases can be distributed on one or more physical nodes. Of course, it is mainly applied to multiple physical nodes. On the one hand, this is related to the improvement of the cost performance of X86 server. On the other hand, the development of the Internet has brought the demand for high concurrency and massive data processing. The original single physical server node is not enough to meet this demand.

Distributed is not only reflected in the field of database, but also related to distributed storage, distributed middleware and distributed network. The ultimate goal is to better serve the change of business requirements. In the philosophical sense, it is an improvement of productivity.

2、 Theoretical basis of distributed database

1. Cap theory
Firstly, the technical theory of distributed database is based on the inheritance of the basic characteristics of single node relational database, which mainly involves the acid characteristics of transactions, disaster recovery of transaction logs and high availability of data redundancy.

Secondly, the design of distributed data should follow the cap theorem, that is, a distributed system cannot meet the three basic requirements of consistency, availability and partition tolerance at the same time, and can only meet two of them at most. Partition fault tolerance cannot be abandoned, Therefore, architects usually trade off between availability and consistency. The trade-off here is not simply to abandon it completely, but to consider the sacrifice made by the business situation, or to describe it with the term “degradation” of the Internet.

For cap theory, we consulted relevant foreign documents. Cap theory comes from the formal proof of brewer conjecture published by Seth Gilbert and Nancy Lynch of MIT in 2002.

The three characteristics of cap are described as follows:

Consistency: ensure that each node in the distributed cluster returns the same and most recently updated data. Consistency means that each client has the same data view. There are many types of consistency models. Consistency in cap refers to linearization or sequential consistency, which is strong consistency.

Availability: each non failed node returns responses to all read and write requests within a reasonable time. In order to be available, each node on both sides of the network partition must be able to respond in a reasonable time.

Partition tolerance: Despite the existence of network partitions, the system can continue to operate and ensure consistency. Network partitioning has become a reality. The distributed system with partition tolerance can be properly restored from the partition after partition repair.

The main point of the original text is to emphasize that cap theory can not be simply understood as two out of three.

Partition tolerance is necessary in distributed database management system. Network partitioning and dropped messages have become a fact and must be handled appropriately. Therefore, system designers must make a trade-off between consistency and availability. Simply put, network partitioning forces designers to choose perfect consistency or perfect availability. In a given situation, an excellent distributed system will provide the best answer according to the important level of business requirements for consistency and availability, but generally, the level of consistency requirements will be higher and the most challenging.

2. Base theory
Based on the trade-off of cap theorem, base theory is introduced. Base is the abbreviation of three phrases: basically available, soft state and eventually consistent. The core idea of base theory is that even if strong consistency cannot be achieved, each application can adopt appropriate methods to achieve the final consistency of the system according to its own business characteristics.

Ba: basically available. When the distributed system fails, it is allowed to lose some availability, that is, to ensure the core availability.

S: Soft state, which allows the system to have an intermediate state, which will not affect the overall availability of the system.

E: Consistency: after a certain period of time, all data copies in the system can finally reach a consistent state.

Base theory is essentially an extension of cap theory and a supplement to AP scheme in cap.

Here is a supplementary explanation of what is strong consistency:

Strict consistency, also known as www.cungun.comatomic consistency or linear consistency, must meet the following two requirements:

1. The last written data of a data can be read in any reading.

2. The operation sequence of all processes in the system is consistent with that under the global clock.

For relational databases, it is required that the updated data can be seen by subsequent access, which is a strong consistency. In short, at any time, the data in all nodes is the same.

The final consistency of base theory belongs to weak consistency.

Next, we introduce another important concept of distributed database: distributed transaction. After browsing several articles on distributed transactions, I found that there will be different descriptions, but the general meaning is the same. Distributed transactions are first transactions, which need to meet the acid characteristics of transactions. It is mainly considered that the data processed by business access is scattered on multiple nodes between networks. For the distributed database system, under the requirements of ensuring data consistency, it can distribute transactions and cooperate with multiple nodes to complete business requests.

Whether multi nodes can work together normally and smoothly to complete transactions is the key, which directly determines the consistency of access data and the timeliness of response to requests. Therefore, it needs scientific and effective consistency algorithm to support.

3. Consistency algorithm
At present, the main consistency algorithms include 2pc, 3pc, Paxos and raft.

2pc: two phase commit is also considered as a consistency protocol to ensure the consistency of distributed system data. Most relational databases use two-phase commit protocol to complete distributed transaction processing.

It mainly includes the following two stages:

Stage 1: submit transaction request (voting stage)

Phase II: execution transaction submission (execution phase)

Advantages: simple principle and convenient implementation

Disadvantages: synchronization blocking, single point problem, inconsistent data, too conservative

3pc: Three – phase commit (three-phase commit) includes three phases: cancommit, precommit and docommit.

In order to avoid that when all participants are notified to commit a transaction, when one of the participants crashes inconsistently, a three-stage commit method appears.

The three-phase submission adds a precommit process to the two-phase submission. When all participants receive the precommit, it will not be executedgameAction until a commit is received or after a certain period of time.

Advantages: reduce the blocking range of participants and continue to reach an agreement after a single point of failure. Disadvantages: the precommit stage is introduced. In this stage, if a network partition occurs, the coordinator cannot communicate with the participants normally, and the participants will still commit transactions, resulting in data inconsistency.

2pc / 3pc protocol is used to ensure the atomicity of operations on multiple data slices.

These data fragments may be distributed on different servers. 2pc / 3pc protocol ensures that the operations on multiple servers either succeed or fail.

Paxos, raft and Zab algorithms are used to ensure data consistency among multiple copies of the same data slice. The following is an overview of the three algorithms.

Paxos algorithm mainly solves the single point problem of data fragmentation. The purpose is to make the nodes of the whole cluster agree on the change of a value. Paxos (strong consistency) belongs to the majority algorithm. Any point can propose a proposal to modify some data. Whether the proposal is passed depends on whether more than half of the nodes in the cluster agree. Therefore, Paxos algorithm requires that the nodes in the cluster are singular.

Raft algorithm is a simplified version of Paxos. Raft is divided into three sub problems: one is leader election; Second, log replication; Third, safety. Raft defines three roles: leader, follower and candidate. At first, everyone is a follower. When the follower cannot listen to the leader, he can become a candidate, initiate a vote and elect a new leader.

It has two basic processes:

① Leader election: every C andidate will put forward an election plan after a certain period of time. The one who gets the most votes in the recent stage is selected as l eader.

② Synchronize log: l eader will find the latest log (occurrence record of various events) in the system and force all follow to refresh to this record.

The raft consistency algorithm simplifies the management of log copies by selecting a leader. For example, log entries are only allowed to flow from leader to follower. Zab is basically the same as raft.

3、 PostgreSQL Distributed Architecture Overview (1) what is PostgreSQL XL
Postgres XL is an open source PG clustering software. XL stands for extensible lattice, which means extended PG “lattice”, hereinafter referred to as pgxl.

Officials say it is not only suitable for OLTP applications with high write operation pressure, but also suitable for big data applications dominated by read operation. Its predecessor is Postgres XC (pgxc for short). Pgxc adds the cluster function on the basis of PG, which is mainly applicable to OLTP applications. Pgxl is an upgraded product based on pgxc, adding some features suitable for OLAP applications, such as massively parallel processing (MPP).

Generally speaking, pgxl code contains PG code. Installing PG cluster with pgxl does not require installing PG separately. One problem caused by this is that you can’t choose any version of PG at will. Fortunately, pgxl follows up PG in a timely manner. At present, the latest version of Postgres XL 10r1 is based on PG 10.

Community development history:

From 2004 to 2008, NTT data built the model Rita dB

In 2009, NTT data cooperated with enterprisedb for community development

In 2012, Postgres XC 1.0 was officially released

In 2012, stormdb added MPP function based on XC

In 2013, XC 1.1 was released; Trans lattice acquires stormdb

In 2014, XC 1.2 was released; Stormdb is open source Postgres XL

In 2015, the two communities merged into postgres-x2

February 2016, Postgres XL 9.5 R1

July 2017, Postgres XL 9.5 r1.6

October 2018, Postgres XL 10r1

In February 2019, Postgres XL 10r1. 1 was announced

As can be seen from the above figure, the coordinator and datanode nodes can be configured as multiple nodes and can be located on different hosts. Only the coordinator node directly provides application services. The coordinator node distributes and stores data on multiple data nodes.

The main components of Postgres XC are GTM (global transaction manager)_ standby , gtm_ Proxy, coordinator and datanode.

Global transaction node (GTM), the core component of Postgres Xc, is used for global transaction control and tuple visibility control. GTM is a module for allocating gxid and managing pgxc mvcc. There can only be one master GTM in a cluster. gtm_ Standby is the standby of GTM.

Main functions:

– generate a globally unique transaction ID

– status of global transactions

– global information such as sequence

gtm_ Proxy was born to reduce the pressure of GTM. It is used to group the tasks submitted by the coordinator node. There can be multiple GTMS in the machine_ proxy。

The coordinator is the interface between the data node and the application. It is responsible for receiving user requests, generating and executing distributed queries, and sending SQL statements to the corresponding data nodes.

The coordinator node does not physically store the table data. The table data is distributed in the form of fragmentation or replication, and the table data is stored on the data node. When an application initiates SQL, it will first reach the coordinator node, and then the coordinator node will distribute the SQL to each data node to summarize the data. This system process is controlled through gxid and global snapshot.

Data nodes physically store table data. Table data storage methods are divided into distributed and replicated. Data nodes store only local data.

data distribution

• replicated table

– tables replicate across multiple nodes

• distributed table

– Hash

– Round robin

Note: Round Robin rotation is the simplest division method: that is, each tuple will be placed on the next node in turn, as shown in the figure below, so as to cycle.

(3) Main site

2. Extend distributed solution citus
(1) What is citus
Citus is an open-source distributed database based on PostgreSQL, which automatically inherits the powerful SQL support capability and application ecology of PostgreSQL (not only the compatibility of client protocols, but also the full compatibility of server-side expansion and management tools). Citus is not a fork of PostgreSQL. It adopts the shared nothing architecture. There is no shared data between nodes. A database cluster is composed of coordinator nodes and work nodes. Focus on high-performance HTAP distributed database.

Compared with stand-alone PostgreSQL, citus can use more CPU cores, more memory and save more data. By adding nodes to the cluster, you can easily expand the database.

Compared with other similar PostgreSQL based distributed solutions, such as Greenplum and PostgreSQL XL, citus is a PostgreSQL extension rather than an independent code branch. Citus can keep up with the evolution of PostgreSQL version with little cost and faster speed; At the same time, it can ensure the stability and compatibility of the database to the greatest extent.

Citus supports the features of the new version of PostgreSQL and remains compatible with existing tools. Citus uses sharding and replication to scale out PostgreSQL across multiple machines. Its query engine will execute SQL on these servers for parallel queries to achieve real-time (less than a second) responses on large data sets.

Citus is currently divided into the following versions:

Citus Community Edition

Citus commercial edition

Cloud [AWS,citus cloud]

(2) Technical architecture

This screenshot refers to the practice sharing of Chen Huajun citus in Suning in March 2020

Citus cluster consists of a central coordination node (CN) and several work nodes (workers).

Cn only stores metadata related to data distribution. The actual table data is divided into M pieces and scattered to n workers. Such a table is called a “shard table”, which can create multiple copies for each shard of the “shard table” to achieve high availability and load balancing.

The official documents of citus recommend using PostgreSQL native stream replication as ha. Multi copy HA may only be applicable to the fragment of append only.

The application sends the query to the coordinator node, and the coordinator processes it and sends it to the work node. For each query, the coordinator routes it to a single work node or parallelizes it, depending on whether the data is on a single node or multiple nodes. Citus MX mode allows direct access to work nodes for faster read and write speeds.

The partitioned table mainly solves the problem of horizontal capacity expansion of large tables. For the dimension tables that do not have a large amount of data and often need to join with the partitioned table, a special partitioning strategy can be adopted. Only one slice is divided and one copy is deployed on each worker. Such tables are called “reference tables”.

In addition to the fragmented table and reference table, there is still a PostgreSQL native table without fragmentation, which is called “local table”. “Local table” is applicable to some special scenarios, such as highly concurrent small table queries.

Client applications only interact with CN nodes when accessing data. After receiving the SQL request, CN generates a distributed execution plan and distributes each subtask to the corresponding worker node. Then, it collects the worker’s results and returns the final results to the client after processing.

Recommended Today

Seven solutions for distributed transactions

1、 What is distributed transaction Distributed transaction means that transaction participants, transaction supporting servers, resource servers and transaction managers are located on different nodes of different distributed systems. A large operation is completed by more than n small operations. These small operations are distributed on different services. For these operations, either all of them are […]