Only 4 steps can detect and eliminate distributed deadlock through SQL

Time:2021-3-3

abstractThis paper mainly introduces how to detect and recover distributed deadlock by SQL statement in gaussdb (DWS).

In the application scenario of distributed data warehouse, we often encounter the problem of hang in the database system. The so-called hang means that although the database system is still running, part or all of the business cannot be executed normally. There are many reasons for hang problem, among which distributed deadlock is the most common. This time, we mainly share how to quickly solve the deadlock problem when encountering deadlock.

As a distributed data warehouse, gaussdb (DWS) implements concurrency control through lock mechanism, so it is possible to generate distributed deadlock. Although distributed deadlock can not be avoided, fortunately, it provides a variety of system views to ensure that the deadlock can be located quickly after the distributed deadlock occurs.

This paper mainly introduces how to detect and recover distributed deadlock by SQL statement in gaussdb (DWS). The method is divided into four steps

1. Collect the lock information of each node.

2. Build a waiting relationship.

3. Detection cycle waiting.

4. Abort the transaction to eliminate the deadlock.

The method introduced in this paper is simple to use and low threshold, which can ensure that after the distributed deadlock occurs, it can quickly solve the problem and recover the business.

Detection and elimination of distributed deadlock through SQL statements

Comparison of distributed deadlock and single node deadlock

Single node deadlock

Single node deadlock means that all the lock waiting information in a deadlock comes from the same node

–Transaction transaction1 — node: cN1

BEGIN;

TRUNCATE t1;
EXECUTE DIRECT ON(DN1) ‘SELECT * FROM t2’;

Commit; — transaction2 — node: cN1

BEGIN;

TRUNCATE t2;
EXECUTE DIRECT ON(DN2) ‘SELECT * FROM t1’;

COMMIT;

Suppose that the execution order of the above two transactions is as follows:

  1. [transaction1] TRUNCATE t1
  2. [transaction2] TRUNCATE t2
  3. [transaction1] EXECUTE DIRECT ON(DN1) ‘SELECT * FROM t2’
  4. [transaction2] EXECUTE DIRECT ON(DN2) ‘SELECT * FROM t1’

The execution sequence will lead to deadlock. Since transactions transaction1 and transaction2 are executed on cN1 and all the lock waiting information in the deadlock is on cN1, the deadlock is a single node deadlock.

Only 4 steps can detect and eliminate distributed deadlock through SQL

Gaussdb (DWS) supports automatic processing of single node deadlock. When multiple transactions on a node fall into the cycle waiting, the database system will automatically abort one of them, so as to eliminate the deadlock.

Only 4 steps can detect and eliminate distributed deadlock through SQL

distributed deadlock

Distributed deadlock means that the lock waiting information in deadlock comes from different nodes. For example:

–Transaction transaction1 — node: cN1

BEGIN;

TRUNCATE t1;
EXECUTE DIRECT ON(DN1) ‘SELECT * FROM t2’;

Commit; — transaction transaction2 — node: cn2

BEGIN;

TRUNCATE t2;
EXECUTE DIRECT ON(DN2) ‘SELECT * FROM t1’;

COMMIT;

Compared with the example in the previous section, only the node of transaction transaction2 is changed from cN1 to cn2.

Suppose that the execution order of two transactions is the same as that in the previous section, or deadlock will occur. The lock waiting information in deadlock is as follows:

Only 4 steps can detect and eliminate distributed deadlock through SQL

This is a typical distributed deadlock. If you look at the lock waiting information on cN1 or cn2 alone, you can’t see any deadlock. But if you look at the lock waiting information of multiple nodes together, you can find the phenomenon of cyclic waiting.

When a distributed deadlock occurs, all the transactions trapped in the deadlock cannot continue to execute. Only when one of the transaction locks has timed out can the remaining transactions continue to execute. By default, the lock wait timeout is 20 minutes.

Only 4 steps can detect and eliminate distributed deadlock through SQL

Detection and elimination of distributed deadlock

When we observe the Hang Problem in the database system, we need to detect the distributed deadlock through SQL statements. If we find that there is a distributed deadlock, we also need to eliminate the deadlock. Next, taking the previous distributed deadlock as an example, this paper introduces the detection and elimination methods of distributed deadlock.

Collect the lock information of each node

In order to detect distributed deadlock, we need to obtain the lock information of each node. In gaussdb (DWS), PG can be used to_ The locks view queries the lock information of the current node, so you can query PG in all nodes through the execute direct statement_ Locks view and collect it into the current node.

Notice one detail here, pg_ In locks view, a lot of information is given in oid type, such as a lock on a table, pg_ The locks view gives the oid of the table. Since the oid of the same table in each node is not necessarily the same, it is impossible to identify a table by oid. When collecting lock information, you need to first convert the oid of the table to schema name plus table name. Other oid information, such as partition oid, also needs to be converted to corresponding names.

Execute the sample code pgxc in the attachment_ locks.sql Then the lock information of each node can be collected

locktype | nodename | datname | usename | nspname | relname | partname | page | tuple | virtualxid | transactionid | virtualtransaction | mode | granted | client_addr | application_name | pid | xact_start | query_start | state | query_id | query —————+————–+———-+———+———+———+———-+——+——-+————+—————+——————–+———————+———+————-+——————+—————–+—————————-+—————————-+———————+——————-+—————————————————– virtualxid | cn_5002 | postgres | tyx_1 | | | | | | 12/94 | | 12/94 | ExclusiveLock | t | | gsql | 140110481323776 | 2020-12-25 17:18:54.238933 | 2020-12-25 17:19:37.715447 | active | 0 | EXECUTE DIRECT ON(dn_6003_6004) ‘SELECT * FROM t1’;
virtualxid | cn_5002 | postgres | tyx_1 | | | | | | 9/298 | | 9/298 | ExclusiveLock | t | ::1/128 | cn_5001 | 140110672164608 | 2020-12-25 17:18:40.478704 | 2020-12-25 17:18:40.479682 | idle in transaction | 0 | TRUNCATE t1;
virtualxid | cn_5002 | postgres | tyx_1 | | | | | | 6/161 | | 6/161 | ExclusiveLock | t | | WLMArbiter | 140110762325760 | 2020-12-25 17:20:18.613815 | 2020-12-25 16:53:35.027585 | active | 0 | WLM arbiter sync info by CCN and CNs
virtualxid | cn_5002 | postgres | tyx_1 | | | | | | 5/162 | | 5/162 | ExclusiveLock | t | | WorkloadMonitor | 140110779119360 | 2020-12-25 17:20:27.16458 | 2020-12-25 16:53:35.027217 | active | 0 | WLM monitor update and verify local info
virtualxid | cn_5002 | postgres | tyx_1 | | | | | | 3/325 | | 3/325 | ExclusiveLock | t | | workload | 140110846744320 | 2020-12-25 17:20:25.372654 | 2020-12-25 16:53:35.02741 | active | 72339069014641297 | WLM fetch collect info from data nodes
advisory | cn_5002 | postgres | tyx_1 | | | | | | | | 12/94 | ShareLock | t | | gsql | 140110481323776 | 2020-12-25 17:18:54.238933 | 2020-12-25 17:19:37.715447 | active | 0 | EXECUTE DIRECT ON(dn_6003_6004) ‘SELECT * FROM t1’;
relation | cn_5002 | postgres | tyx_1 | public | t1 | | | | | | 9/298 | AccessExclusiveLock | t | ::1/128 | cn_5001 | 140110672164608 | 2020-12-25 17:18:40.478704 | 2020-12-25 17:18:40.479682 | idle in transaction | 0 | TRUNCATE t1;
relation | cn_5002 | postgres | tyx_1 | public | t1 | | | | | | 12/94 | AccessShareLock | f | | gsql | 140110481323776 | 2020-12-25 17:18:54.238933 | 2020-12-25 17:19:37.715447 | active | 0 | EXECUTE DIRECT ON(dn_6003_6004) ‘SELECT * FROM t1’;
transactionid | cn_5002 | postgres | tyx_1 | | | | | | | 10269 | 12/94 | ExclusiveLock | t | | gsql | 140110481323776 | 2020-12-25 17:18:54.238933 | 2020-12-25 17:19:37.715447 | active | 0 | EXECUTE DIRECT ON(dn_6003_6004) ‘SELECT * FROM t1’;
transactionid | cn_5002 | postgres | tyx_1 | | | | | | | 10266 | 9/298 | ExclusiveLock | t | ::1/128 | cn_5001 | 140110672164608 | 2020-12-25 17:18:40.478704 | 2020-12-25 17:18:40.479682 | idle in transaction | 0 | TRUNCATE t1;
relation | cn_5002 | postgres | tyx_1 | public | t2 | | | | | | 12/94 | AccessExclusiveLock | t | | gsql | 140110481323776 | 2020-12-25 17:18:54.238933 | 2020-12-25 17:19:37.715447 | active | 0 | EXECUTE DIRECT ON(dn_6003_6004) ‘SELECT * FROM t1’;
virtualxid | dn_6001_6002 | postgres | tyx_1 | | | | | | 17/433 | | 17/433 | ExclusiveLock | t | ::1/128 | cn_5001 | 140552375822080 | 2020-12-25 17:18:40.478704 | 2020-12-25 17:18:50.513948 | idle in transaction | 0 | TRUNCATE t1;
virtualxid | dn_6001_6002 | postgres | tyx_1 | | | | | | 23/692 | | 23/692 | ExclusiveLock | t | ::1/128 | cn_5002 | 140552359040768 | 2020-12-25 17:18:54.238933 | 2020-12-25 17:18:56.830053 | idle in transaction | 0 | TRUNCATE t2;
virtualxid | dn_6001_6002 | postgres | tyx_1 | | | | | | 2/1607 | | 2/1607 | ExclusiveLock | t | | workload | 140552945264384 | | 2020-12-25 16:53:35.041283 | active | 0 | WLM fetch collect info from data nodes
transactionid | dn_6001_6002 | postgres | tyx_1 | | | | | | | 10266 | 17/433 | ExclusiveLock | t | ::1/128 | cn_5001 | 140552375822080 | 2020-12-25 17:18:40.478704 | 2020-12-25 17:18:50.513948 | idle in transaction | 0 | TRUNCATE t1;
relation | dn_6001_6002 | postgres | tyx_1 | | | | | | | | 23/692 | AccessExclusiveLock | t | ::1/128 | cn_5002 | 140552359040768 | 2020-12-25 17:18:54.238933 | 2020-12-25 17:18:56.830053 | idle in transaction | 0 | TRUNCATE t2;
relation | dn_6001_6002 | postgres | tyx_1 | | | | | | | | 17/433 | AccessExclusiveLock | t | ::1/128 | cn_5001 | 140552375822080 | 2020-12-25 17:18:40.478704 | 2020-12-25 17:18:50.513948 | idle in transaction | 0 | TRUNCATE t1;
relation | dn_6001_6002 | postgres | tyx_1 | public | t2 | | | | | | 23/692 | ShareLock | t | ::1/128 | cn_5002 | 140552359040768 | 2020-12-25 17:18:54.238933 | 2020-12-25 17:18:56.830053 | idle in transaction | 0 | TRUNCATE t2;
relation | dn_6001_6002 | postgres | tyx_1 | public | t2 | | | | | | 23/692 | AccessExclusiveLock | t | ::1/128 | cn_5002 | 140552359040768 | 2020-12-25 17:18:54.238933 | 2020-12-25 17:18:56.830053 | idle in transaction | 0 | TRUNCATE t2;
Omit several lines
(55 rows)

Building a waiting relationship

After the lock information of each node is collected, the waiting relationship can be constructed.

There are three conditions for transaction a to wait for transaction B:

  1. The resources locked by two transactions are the same (the same table, the same partition, the same page or the same tuple, etc.). In particular, if transaction a locks the T1 table of DN 1 and transaction B locks the T1 table of DN 2, we think that the resources they lock are different. Only the same resource on the same node is considered to be the same resource.
  2. Transaction B already holds the lock, while transaction a does not.
  3. The levels of locks requested by transaction a and transaction B are mutually exclusive.

By processing the lock information collected in the previous step, the transaction waiting relationship can be constructed.

Execute the sample code pgxc in the attachment_ locks_ wait.sql Then the waiting relationship can be obtained

locktype | nodename | datname | acquire_lock_pid | hold_lock_pid | acquire_lock_event | hold_lock_event ———-+———-+———-+——————+—————–+————————————————————————-+——————————————————– relation | cn_5001 | postgres | 140508814374656 | 140508792350464 | usename : tyx_1 +| usename : tyx_1 +

      |          |          |                  |                 | nspname           : public                                             +| nspname           : public                            +
      |          |          |                  |                 | relname           : t2                                                 +| relname           : t2                                +
      |          |          |                  |                 | partname          :                                                    +| partname          :                                   +
      |          |          |                  |                 | page              :                                                    +| page              :                                   +
      |          |          |                  |                 | tuple             :                                                    +| tuple             :                                   +
      |          |          |                  |                 | virtualxid        :                                                    +| virtualxid        :                                   +
      |          |          |                  |                 | transactionid     :                                                    +| transactionid     :                                   +
      |          |          |                  |                 | virtualtransaction: 11/13                                              +| virtualtransaction: 12/1323                           +
      |          |          |                  |                 | mode              : AccessShareLock                                    +| mode              : AccessExclusiveLock               +
      |          |          |                  |                 | client_addr       :                                                    +| client_addr       : ::1/128                           +
      |          |          |                  |                 | application_name  : gsql                                               +| application_name  : cn_5002                           +
      |          |          |                  |                 | xact_start        : 2020-12-25 17:18:40.478704                         +| xact_start        : 2020-12-25 17:18:54.238933        +
      |          |          |                  |                 | query_start       : 2020-12-25 17:19:23.0923                           +| query_start       : 2020-12-25 17:18:54.239319        +
      |          |          |                  |                 | state             : active                                             +| state             : idle in transaction               +
      |          |          |                  |                 | query_id          : 0                                                  +| query_id          : 0                                 +
      |          |          |                  |                 | query             : EXECUTE DIRECT ON(dn_6001_6002) 'SELECT * FROM t2';+| query             : TRUNCATE t2;                      +
      |          |          |                  |                 | ------------------------------------------------------                  | ------------------------------------------------------ relation | cn_5002  | postgres |  140110481323776 | 140110672164608 | usename           : tyx_1                                              +| usename           : tyx_1                             +
      |          |          |                  |                 | nspname           : public                                             +| nspname           : public                            +
      |          |          |                  |                 | relname           : t1                                                 +| relname           : t1                                +
      |          |          |                  |                 | partname          :                                                    +| partname          :                                   +
      |          |          |                  |                 | page              :                                                    +| page              :                                   +
      |          |          |                  |                 | tuple             :                                                    +| tuple             :                                   +
      |          |          |                  |                 | virtualxid        :                                                    +| virtualxid        :                                   +
      |          |          |                  |                 | transactionid     :                                                    +| transactionid     :                                   +
      |          |          |                  |                 | virtualtransaction: 12/94                                              +| virtualtransaction: 9/298                             +
      |          |          |                  |                 | mode              : AccessShareLock                                    +| mode              : AccessExclusiveLock               +
      |          |          |                  |                 | client_addr       :                                                    +| client_addr       : ::1/128                           +
      |          |          |                  |                 | application_name  : gsql                                               +| application_name  : cn_5001                           +
      |          |          |                  |                 | xact_start        : 2020-12-25 17:18:54.238933                         +| xact_start        : 2020-12-25 17:18:40.478704        +
      |          |          |                  |                 | query_start       : 2020-12-25 17:19:37.715447                         +| query_start       : 2020-12-25 17:18:40.479682        +
      |          |          |                  |                 | state             : active                                             +| state             : idle in transaction               +
      |          |          |                  |                 | query_id          : 0                                                  +| query_id          : 0                                 +
      |          |          |                  |                 | query             : EXECUTE DIRECT ON(dn_6003_6004) 'SELECT * FROM t1';+| query             : TRUNCATE t1;                      +
      |          |          |                  |                 | ------------------------------------------------------                  | ------------------------------------------------------ (2 rows)

Waiting relation judgment

After the transaction waiting relationship is constructed, we can judge whether there is a distributed deadlock by checking whether the waiting relationship is looped.

In general, there are not too many waiting relationships. We can judge whether there is a distributed deadlock by observing. By observing the waiting information built in the previous section, it is easy to judge that transaction 1 and transaction 2 have a cyclic waiting, that is, deadlock.

deadlock resolution

In the previous step, one or more rings in the waiting relationship may be found. For each ring, a transaction in the ring needs to be aborted to eliminate the deadlock. As for which transaction in the ring should be chosen to abort, we need to consider the importance of the transaction, the executed time and other aspects, and finally choose a transaction that has the least impact on the business to abort.

summary

Through SQL statements, we can easily handle distributed deadlock. When we encounter Hang Problem in the actual business, we can check whether the hang problem is caused by distributed deadlock with the help of the method provided in this paper. If the problem is really caused by distributed deadlock, we can also quickly recover the business by suspending a deadlock event.

Attachment Download:lock.zip 2.29KB

This article is from Huawei cloud community “how to detect and eliminate distributed deadlock through SQL”, original author: tyxxjtu.

Click follow to learn about Huawei’s new cloud technology for the first time~