TiKV Source Code Analysis Series (12): Distributed Transactions


Author: Zhou Zhenjing

The previous articles introduced TiKV's service layer and storage layer, so you should already know that TiKV's transaction-related code lives in the storage layer. This article explains the theory and implementation details of TiKV's transaction algorithm in more depth.


TiKV adopts the transaction model described in Google's Percolator paper, which we explained in "Overview of the TiKV Transaction Model" and "Deep Dive TiKV – Percolator". To better understand what follows, we suggest reading those materials first.

In Percolator's design, the distributed transaction algorithm lives entirely in the client code, which accesses BigTable directly. TiKV's design is similar in this respect. TiKV accepts read and write requests on a per-region basis; any logic that needs to span regions lives in TiKV's client, such as TiDB. The client code splits a request and sends the pieces to the corresponding regions. In other words, a correct transaction requires close cooperation between the client and TiKV. To explain the complete transaction process, this article also mentions the TiKV client code in TiDB (located in the store/tikv directory); you can also refer to the introduction to the TiKV client in articles 18 and 19 of the TiDB source code reading series. We also have standalone client libraries in several languages, all of which are still under development.

TiKV's transactions are optimistic: a transaction goes through the two-phase commit process only when it is finally committed. Support for pessimistic transactions is currently being improved, and a separate article will introduce their implementation.

Transaction process

Because the optimistic transaction model is adopted, writes are cached in a buffer and are not written to TiKV until the final commit. A transaction must also be able to read its own writes, so a read inside a transaction first tries its own buffer and reads from TiKV only if the key is not there. When we start a transaction, perform a series of read and write operations, and finally commit, the corresponding events in TiKV and its client are shown in the following table:

[Table: events in TiKV and its client during a transaction]
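
The buffering behaviour described above can be sketched as follows. This is only a minimal illustration under stated assumptions: the `Txn` type and its methods are hypothetical names, and a plain `HashMap` stands in for TiKV. It is not the TiDB client code.

```rust
use std::collections::HashMap;

/// Hypothetical client-side transaction; `store` stands in for TiKV.
pub struct Txn<'a> {
    store: &'a HashMap<Vec<u8>, Vec<u8>>,
    // `None` marks a key deleted inside this transaction.
    buffer: HashMap<Vec<u8>, Option<Vec<u8>>>,
}

impl<'a> Txn<'a> {
    pub fn begin(store: &'a HashMap<Vec<u8>, Vec<u8>>) -> Self {
        Txn { store, buffer: HashMap::new() }
    }

    /// Writes only touch the local buffer until commit.
    pub fn put(&mut self, key: &[u8], value: &[u8]) {
        self.buffer.insert(key.to_vec(), Some(value.to_vec()));
    }

    pub fn delete(&mut self, key: &[u8]) {
        self.buffer.insert(key.to_vec(), None);
    }

    /// Reads check the buffer first, then fall back to the store.
    pub fn get(&self, key: &[u8]) -> Option<Vec<u8>> {
        match self.buffer.get(key) {
            Some(buffered) => buffered.clone(),
            None => self.store.get(key).cloned(),
        }
    }
}
```

A transaction built this way sees its own `put` and `delete` calls immediately, while the backing store stays untouched until a commit step (not shown) flushes the buffer.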


Transaction commit is a two-phase commit process. The first step is prewrite: every key involved in the transaction is locked and its value is written. On the client side, the keys to be written are grouped by region, and the requests for the regions are sent in parallel. Each request carries the transaction's start_ts and the selected primary key. TiKV's kv_prewrite interface is called to handle the request. The request is then handed to Storage::async_prewrite, which in turn gives the task to the Scheduler.
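
As a rough sketch of the client-side grouping step, the function below assigns keys to regions by start key. The function name, the region map layout (start key to region id) and the use of `BTreeMap` are assumptions made for illustration; the real client also handles region cache misses, retries and parallel sending.

```rust
use std::collections::BTreeMap;

/// `regions` maps each region's start key to a hypothetical region id.
/// A region is assumed to cover [its start key, the next region's start key).
pub fn group_by_region(
    regions: &BTreeMap<Vec<u8>, u64>,
    keys: &[Vec<u8>],
) -> BTreeMap<u64, Vec<Vec<u8>>> {
    let mut batches: BTreeMap<u64, Vec<Vec<u8>>> = BTreeMap::new();
    for key in keys {
        // The owning region is the one with the greatest start key <= key.
        if let Some((_, region_id)) = regions.range(..=key.clone()).next_back() {
            batches.entry(*region_id).or_default().push(key.clone());
        }
    }
    batches
}
```

Each resulting batch would become one prewrite request, all of them carrying the same start_ts and primary key.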

The Scheduler is responsible for scheduling the read and write requests that TiKV receives, performing flow control, obtaining a snapshot from the engine (for reading data), and finally executing the task. Prewrite is ultimately carried out in process_write_impl.

For now we ignore for_update_ts, which is used for pessimistic transactions. We will explain pessimistic transactions in future articles. The logic can therefore be simplified as follows:

let mut txn = MvccTxn::new(snapshot, start_ts, !ctx.get_not_fill_cache())?;
for m in mutations {
    txn.prewrite(m, &primary, &options);
}
let modifies = txn.into_modifies();
// Then, back in process_write:
engine.async_write(&ctx, to_be_write, callback);

In prewrite, we use Mutation to represent the write to each key. Mutation comes in four types: Put, Delete, Lock, and Insert. Put writes a value to the key, and Delete deletes the key. Insert differs from Put in that it checks whether the key exists when it executes, and the write succeeds only if the key does not exist. Lock is a special kind of write and is not the Lock of the Percolator model: when a transaction reads some keys and writes some other keys, and it needs to guarantee that the keys it read have not changed when the transaction commits successfully, it should write a Lock-type Mutation for those read keys. For example, executing SELECT ... FOR UPDATE in TiDB produces this Lock-type Mutation.
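
The four mutation kinds can be modelled with a small enum. The sketch below is an illustrative stand-in: the variant payloads and the helper methods `key` and `should_not_exist` are assumptions, and the real TiKV definition may differ in fields and naming.

```rust
/// Illustrative stand-in for the four mutation kinds described above.
#[derive(Debug, Clone, PartialEq)]
pub enum Mutation {
    /// Write a value to the key.
    Put(Vec<u8>, Vec<u8>),
    /// Delete the key.
    Delete(Vec<u8>),
    /// Lock a key that was only read, so it cannot change before commit.
    Lock(Vec<u8>),
    /// Like Put, but succeeds only if the key does not exist yet.
    Insert(Vec<u8>, Vec<u8>),
}

impl Mutation {
    pub fn key(&self) -> &[u8] {
        match self {
            Mutation::Put(k, _) | Mutation::Insert(k, _) => k,
            Mutation::Delete(k) | Mutation::Lock(k) => k,
        }
    }

    /// Only Insert requires the existence check during prewrite.
    pub fn should_not_exist(&self) -> bool {
        matches!(self, Mutation::Insert(..))
    }
}
```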

Next, we create an MvccTxn and call MvccTxn::prewrite for each Mutation. MvccTxn encapsulates our transaction algorithm. Calling its prewrite method does not write directly to the underlying storage engine; instead, it caches the required writes in memory, and calling the into_modifies method yields the final set of writes. The next step is to call engine.async_write to write this data to the underlying storage engine. The engine guarantees that these changes are written atomically in one go. In production, the engine here is RaftKV, which writes data changes to disk after they have been replicated through Raft.

Let's look at the logic in MvccTxn::prewrite. It can be understood by referring to the Prewrite pseudocode in the Percolator paper:

bool Prewrite(Write w, Write primary) {
   Column c = w.col;
   bigtable::Txn T = bigtable::StartRowTransaction(w.row);
   // Abort on writes after our start timestamp ...
   if (T.Read(w.row, c+"write", [start_ts_ , ∞])) return false;
   // ... or locks at any timestamp.
   if (T.Read(w.row, c+"lock", [0, ∞])) return false;
   T.Write(w.row, c+"data", start_ts_, w.value);
   T.Write(w.row, c+"lock", start_ts_,
       {primary.row, primary.col}); // The primary’s location.
   return T.Commit();
}

The first step of TiKV's prewrite is the constraint check:

if !options.skip_constraint_check {
    if let Some((commit_ts, write)) = self.reader.seek_write(&key, u64::max_value())? {
        if commit_ts >= self.start_ts {
            return Err(Error::WriteConflict {...});
        }
        self.check_data_constraint(should_not_exist, &write, commit_ts, &key)?;
    }
}

This corresponds to the following step in the Percolator paper:

if (T.Read(w.row, c+"write", [start_ts_, ∞])) return false;

As you can see, options contains a skip_constraint_check field. It may be set in scenarios such as data import, where it is known that no conflict can occur, so that the checks are skipped to improve performance. seek_write finds the latest write record in CF_WRITE whose commit_ts is less than or equal to the specified ts and returns its commit_ts together with the record. Here it is used to find the latest write record and compare its commit_ts with the current transaction's start_ts to determine whether there is a conflict. check_data_constraint handles Insert: when the Mutation type is Insert, we set should_not_exist to true, and the function checks whether the key exists (that is, whether its latest version is a Put). If it exists, the check fails.
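
To make the semantics of seek_write and the constraint check concrete, here is a toy model in which CF_WRITE is a `BTreeMap` keyed by (user key, commit_ts). The types `WriteCf` and `PrewriteError` and the function `constraint_check` are hypothetical names chosen for this sketch; it follows the description above rather than the actual TiKV reader code.

```rust
use std::collections::BTreeMap;

#[derive(Debug, Clone, Copy, PartialEq)]
pub enum WriteType { Put, Delete, Lock, Rollback }

#[derive(Debug, PartialEq)]
pub enum PrewriteError {
    WriteConflict { conflict_commit_ts: u64 },
    AlreadyExist,
}

/// Toy CF_WRITE: (user key, commit_ts) -> write type.
pub type WriteCf = BTreeMap<(Vec<u8>, u64), WriteType>;

/// Latest write record of `key` whose commit_ts is <= `ts`.
pub fn seek_write(cf: &WriteCf, key: &[u8], ts: u64) -> Option<(u64, WriteType)> {
    cf.range((key.to_vec(), 0)..=(key.to_vec(), ts))
        .next_back()
        .map(|((_, commit_ts), write)| (*commit_ts, *write))
}

/// First prewrite step: write-conflict check, then the Insert existence check.
pub fn constraint_check(
    cf: &WriteCf,
    key: &[u8],
    start_ts: u64,
    should_not_exist: bool,
) -> Result<(), PrewriteError> {
    if let Some((commit_ts, write)) = seek_write(cf, key, u64::MAX) {
        if commit_ts >= start_ts {
            return Err(PrewriteError::WriteConflict { conflict_commit_ts: commit_ts });
        }
        // Per the text: the key "exists" when its latest version is a Put.
        if should_not_exist && write == WriteType::Put {
            return Err(PrewriteError::AlreadyExist);
        }
    }
    Ok(())
}
```

With a Put committed at ts 10, a transaction starting at ts 5 hits a write conflict, while an Insert from a transaction starting at ts 11 fails the existence check instead.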

The second step of TiKV's prewrite is to check whether the key has been locked by another transaction:

if let Some(lock) = self.reader.load_lock(&key)? {
    if lock.ts != self.start_ts {
        return Err(Error::KeyIsLocked(...));
    }
    return Ok(());
}

This corresponds to the following step in the Percolator paper:

if (T.Read(w.row, c+"lock", [0, ∞])) return false;

In TiKV's code, if the key is found to be locked by the same transaction (i.e. lock.ts == self.start_ts), success is returned directly, because the prewrite operation must be idempotent: receiving the same request more than once has to be allowed.

The last step is to write the lock and the data. The write operations are cached in the writes field.
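
The lock check, the idempotent return and the buffered writes can be combined in a small model. Everything here (`ToyTxn`, `Modify`, `KeyIsLocked`, the `HashMap` standing in for a CF_LOCK snapshot) is an assumption made for illustration and leaves out the constraint check, short values and TTLs.

```rust
use std::collections::HashMap;

#[derive(Debug, Clone, PartialEq)]
pub struct Lock { pub ts: u64, pub primary: Vec<u8> }

/// Buffered writes that would later be handed to the engine in one batch.
#[derive(Debug, PartialEq)]
pub enum Modify {
    PutLock { key: Vec<u8>, lock: Lock },
    PutData { key: Vec<u8>, start_ts: u64, value: Vec<u8> },
}

#[derive(Debug, PartialEq)]
pub struct KeyIsLocked { pub lock_ts: u64 }

pub struct ToyTxn<'a> {
    locks: &'a HashMap<Vec<u8>, Lock>, // snapshot of CF_LOCK
    start_ts: u64,
    writes: Vec<Modify>,
}

impl<'a> ToyTxn<'a> {
    pub fn new(locks: &'a HashMap<Vec<u8>, Lock>, start_ts: u64) -> Self {
        ToyTxn { locks, start_ts, writes: Vec::new() }
    }

    pub fn prewrite(&mut self, key: &[u8], value: &[u8], primary: &[u8]) -> Result<(), KeyIsLocked> {
        if let Some(lock) = self.locks.get(key) {
            if lock.ts != self.start_ts {
                return Err(KeyIsLocked { lock_ts: lock.ts });
            }
            // Locked by this same transaction: a repeated request, so succeed
            // without buffering anything again.
            return Ok(());
        }
        let lock = Lock { ts: self.start_ts, primary: primary.to_vec() };
        self.writes.push(Modify::PutLock { key: key.to_vec(), lock });
        self.writes.push(Modify::PutData {
            key: key.to_vec(),
            start_ts: self.start_ts,
            value: value.to_vec(),
        });
        Ok(())
    }

    /// Hand the buffered writes over, to be written atomically.
    pub fn into_modifies(self) -> Vec<Modify> {
        self.writes
    }
}
```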


When prewrite is complete, the client obtains a commit_ts and then continues with the second phase of the two-phase commit. Note that because the commit status of the primary key determines whether the whole transaction has committed, the client must first commit the primary key on its own and only then commit the remaining keys.

The commit request is handled by kv_commit and, following the same path, is eventually executed in the Commit branch of process_write_impl:

let mut txn = MvccTxn::new(snapshot, lock_ts, !ctx.get_not_fill_cache())?; // lock_ts is start_ts
let rows = keys.len();
for k in keys {
    txn.commit(k, commit_ts)?;
}

What MvccTxn::commit needs to do is simple: delete the lock in CF_LOCK and write a commit record for the transaction into CF_WRITE at commit_ts. However, for various reasons, our actual implementation performs many additional checks.

The MvccTxn::commit function serves both optimistic and pessimistic transactions. With the logic related to pessimistic transactions removed, the simplified logic is as follows:

pub fn commit(&mut self, key: Key, commit_ts: u64) -> Result<()> {
    let (lock_type, short_value) = match self.reader.load_lock(&key)? {
        Some(ref mut lock) if lock.ts == self.start_ts => { // ①
            (lock.lock_type, lock.short_value.take())
        }
        _ => {
            return match self.reader.get_txn_commit_info(&key, self.start_ts)? {
                Some((_, WriteType::Rollback)) | None => {  // ②
                    Err(Error::TxnLockNotFound {...})
                }
                Some((_, WriteType::Put))
                | Some((_, WriteType::Delete))
                | Some((_, WriteType::Lock)) => {           // ③
                    Ok(()) // already committed: succeed without doing anything
                }
            };
        }
    };
    let write = Write::new(...); // built from lock_type, start_ts and short_value
    self.put_write(key.clone(), commit_ts, write.to_bytes());
    // ... delete the lock in CF_LOCK and return Ok(())
}

Normally, the key should hold a lock belonging to the same transaction. If so (branch ① in the code above), the function goes on to write the commit record. Otherwise, it calls get_txn_commit_info to find the commit record whose start_ts equals the current transaction's start_ts. There are several possibilities:

  1. The key has already been committed successfully. This may happen, for example, when a client fails to receive the success response because of network problems and retries. It can also happen when the lock was committed by another transaction that ran into it (see the section “handling residual locks” below). In this case, execution reaches branch ③ of the code above and returns success without doing anything (for idempotency).
  2. The transaction was rolled back. For example, if a transaction cannot be committed in time because of network problems, another transaction may roll it back once the TTL of its lock has expired. This case leads to branch ② of the code above.
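
The three branches can be condensed into a pure decision function, which makes the cases easy to test in isolation. The names `decide_commit`, `CommitOutcome` and `TxnLockNotFound` (as a unit struct) are hypothetical and exist only for this sketch; the inputs stand for the results of load_lock and get_txn_commit_info.

```rust
#[derive(Debug, Clone, Copy, PartialEq)]
pub enum WriteType { Put, Delete, Lock, Rollback }

#[derive(Debug, PartialEq)]
pub enum CommitOutcome {
    /// Branch ①: our lock is present, write the commit record.
    WriteCommitRecord,
    /// Branch ③: already committed, return success without doing anything.
    AlreadyCommitted,
}

#[derive(Debug, PartialEq)]
pub struct TxnLockNotFound;

/// `lock_ts` is the ts of the lock currently on the key, if any;
/// `commit_info` is the write record found for this transaction's start_ts, if any.
pub fn decide_commit(
    lock_ts: Option<u64>,
    commit_info: Option<WriteType>,
    start_ts: u64,
) -> Result<CommitOutcome, TxnLockNotFound> {
    match lock_ts {
        Some(ts) if ts == start_ts => Ok(CommitOutcome::WriteCommitRecord),
        _ => match commit_info {
            // Branch ②: rolled back, or no record at all.
            Some(WriteType::Rollback) | None => Err(TxnLockNotFound),
            Some(WriteType::Put) | Some(WriteType::Delete) | Some(WriteType::Lock) => {
                Ok(CommitOutcome::AlreadyCommitted)
            }
        },
    }
}
```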


In some cases, TiKV may still receive a write request for a transaction after that transaction has been rolled back. For example, a request may be delayed in the network for a long time; or, because prewrite requests are sent in parallel, one client thread may receive a conflict response, cancel the other threads' pending sends, and call rollback, just as another thread's prewrite request has already gone out.

In short, when a key receives a prewrite from the same transaction after being rolled back, we must not let it succeed. Otherwise, the key would be locked, and other reads and writes of the key would be blocked until the lock's TTL expires. As the code above shows, one type of our Write record is Rollback. This record marks a rolled-back transaction, and its commit_ts is set to the same value as start_ts. This approach is not mentioned in the Percolator paper. With it, if a prewrite of the same transaction arrives after the rollback, an error is returned directly by this part of the prewrite code:

if let Some((commit_ts, write)) = self.reader.seek_write(&key, u64::max_value())? {
    if commit_ts >= self.start_ts {
        return Err(Error::WriteConflict {...});
    }
    // ...
}
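
The effect of the Rollback record can be reproduced in a toy CF_WRITE. The functions `rollback` and `has_write_conflict` are hypothetical helpers for this sketch; the conflict test is the same commit_ts >= start_ts comparison, expressed as a range query.

```rust
use std::collections::BTreeMap;

#[derive(Debug, Clone, Copy, PartialEq)]
pub enum WriteType { Put, Rollback }

/// Toy CF_WRITE: (user key, commit_ts) -> write type.
pub type WriteCf = BTreeMap<(Vec<u8>, u64), WriteType>;

/// Mark the transaction as rolled back on `key`:
/// a Rollback record whose commit_ts equals start_ts.
pub fn rollback(cf: &mut WriteCf, key: &[u8], start_ts: u64) {
    cf.insert((key.to_vec(), start_ts), WriteType::Rollback);
}

/// True if some write record on `key` has commit_ts >= start_ts,
/// which is equivalent to the latest record satisfying that condition.
pub fn has_write_conflict(cf: &WriteCf, key: &[u8], start_ts: u64) -> bool {
    cf.range((key.to_vec(), start_ts)..=(key.to_vec(), u64::MAX))
        .next()
        .is_some()
}
```

After `rollback` at ts 10, a late prewrite with start_ts 10 is rejected by the conflict check, while a new transaction with start_ts 11 is unaffected.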

Handling residual locks

If the client crashes in the middle of a transaction, or the transaction cannot be fully committed because of network or other problems, residual locks may be left behind in TiKV.

On the TiKV side, when a transaction (reading or writing) encounters a lock left by another transaction, it returns the encountered lock to the client, as in the prewrite process above. If the client finds that the lock has not expired, it backs off for a while and retries; if the lock has expired, the client calls ResolveLocks.

When resolving locks, the client first obtains the current state of the transaction that the lock belongs to. It calls the kv_cleanup interface on the lock's primary (the primary is stored in the lock). Cleanup essentially calls MvccTxn::rollback. If rollback is called on a committed transaction, it returns a Committed error, and the error carries the commit_ts of the committed transaction. Cleanup returns this commit_ts. The purpose of calling cleanup here is to check whether the primary has been committed: if not, it is rolled back; if it has, its commit_ts is obtained and used to commit the other keys of the transaction. Next, the other locks encountered by the current transaction can be handled based on the information obtained from cleanup: the client calls TiKV's kv_resolve_lock interface to clear these locks, and whether they are committed or rolled back depends on the result of the earlier cleanup.
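
The client-side decision just described can be sketched as a small function. `TxnStatus`, `LockAction` and `handle_lock` are hypothetical names, and the `cleanup_primary` closure stands in for the kv_cleanup call on the lock's primary; TTL arithmetic and backoff timing are left out.

```rust
/// Result of cleaning up the primary key of the lock's transaction.
#[derive(Debug, Clone, Copy, PartialEq)]
pub enum TxnStatus {
    RolledBack,
    Committed { commit_ts: u64 },
}

/// What the client does with the lock it ran into.
#[derive(Debug, PartialEq)]
pub enum LockAction {
    BackoffAndRetry,
    Commit { commit_ts: u64 },
    Rollback,
}

pub fn handle_lock<F>(lock_expired: bool, cleanup_primary: F) -> LockAction
where
    F: FnOnce() -> TxnStatus,
{
    if !lock_expired {
        // The owner may still be alive: wait and try again later.
        return LockAction::BackoffAndRetry;
    }
    // Only the primary decides the fate of the whole transaction.
    match cleanup_primary() {
        TxnStatus::Committed { commit_ts } => LockAction::Commit { commit_ts },
        TxnStatus::RolledBack => LockAction::Rollback,
    }
}
```

The returned action corresponds to what is then passed along when resolving the remaining locks: commit them at the primary's commit_ts, or roll them back.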

The kv_resolve_lock interface has two execution modes. If keys are specified in the parameters, TiKV actually executes ResolveLockLite, which clears only the locks on the specified keys. Otherwise, TiKV scans the current region for all locks whose start_ts matches the specified ts and clears them all. In the latter mode, after scanning a certain number of locks, TiKV clears them first, then goes on to scan and clear the next batch, looping until the whole region has been scanned. This helps avoid generating an overly large WriteBatch.

As you can see in process.rs, the ResolveLock command is treated as a read task or a write task depending on whether it carries scanned locks. It first goes through process_read; if locks are scanned, it returns NextCommand, indicating that a next command is required to continue processing. The next command enters process_write and calls commit or rollback to handle the locks. If the current region has not been fully scanned, it returns NextCommand again, and the next step re-enters process_read to continue scanning, and so on. The scan_key field records the current scan progress.
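
The alternation between scanning and clearing can be modelled with a simple loop. In this sketch, `LockCf`, `scan_locks` and `resolve_all` are hypothetical names, a `BTreeMap` stands in for the locks of one region, and the loop variable plays the role of scan_key; the real command passes through the scheduler between phases rather than looping in place.

```rust
use std::collections::BTreeMap;

/// Toy CF_LOCK for one region: key -> start_ts of the owning transaction.
pub type LockCf = BTreeMap<Vec<u8>, u64>;

/// Read phase: collect up to `limit` locks belonging to `ts`, starting at `scan_key`.
/// Returns the matching keys and where to resume (None once the region is done).
pub fn scan_locks(
    cf: &LockCf,
    scan_key: &[u8],
    ts: u64,
    limit: usize,
) -> (Vec<Vec<u8>>, Option<Vec<u8>>) {
    let mut found = Vec::new();
    for (key, lock_ts) in cf.range(scan_key.to_vec()..) {
        if found.len() == limit {
            return (found, Some(key.clone()));
        }
        if *lock_ts == ts {
            found.push(key.clone());
        }
    }
    (found, None)
}

/// Alternate read and write phases until the region is exhausted.
/// Returns the number of write batches issued.
pub fn resolve_all(cf: &mut LockCf, ts: u64, limit: usize) -> usize {
    assert!(limit > 0, "a batch must be able to hold at least one lock");
    let mut scan_key = Some(Vec::new());
    let mut batches = 0;
    while let Some(start) = scan_key {
        let (keys, next) = scan_locks(cf, &start, ts, limit);
        if !keys.is_empty() {
            // Write phase: clear this batch before scanning further.
            for key in &keys {
                cf.remove(key);
            }
            batches += 1;
        }
        scan_key = next;
    }
    batches
}
```

Bounding each batch by `limit` is what keeps any single write small, at the cost of several read and write rounds per region.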

Scheduler and latch

As we know, Percolator's transaction algorithm relies on BigTable's support for single-row transactions. In TiKV, every write (WriteBatch) sent to the engine is written atomically. However, both the prewrite and commit operations described above need to read before they write, so atomic writes alone are clearly not enough. Otherwise, the following situation could occur:

  1. Transaction A tries to write key1 and, after reading, finds that there is no lock.
  2. Transaction B tries to write key1 and, after reading, also finds that there is no lock.
  3. Transaction A writes its prewrite.
  4. Transaction B writes its prewrite.

In this case, the lock written by transaction A is overwritten, yet A believes it wrote successfully. If transaction A then commits, data consistency is broken because one of its locks has been lost.

The Scheduler avoids this by scheduling transactions. The Scheduler has a module called Latches, which contains many slots. Before a write task starts, it hashes each key involved, so that each key falls into one slot of the Latches, and then tries to lock those slots. Only after the slots are locked successfully can the task go on to take a snapshot and perform its reads and writes. This way, if two tasks need to write the same key, they must lock the same slot in Latches and are therefore mutually exclusive.
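
A much simplified model of this idea is sketched below. It is an assumption-laden illustration, not the real Latches module: slots hold at most one owner, acquisition is all-or-nothing, and there are no wait queues. The method names `slots_for`, `try_acquire` and `release` and the command id `cid` are hypothetical.

```rust
use std::collections::hash_map::DefaultHasher;
use std::hash::{Hash, Hasher};

/// Each slot records the id of the command currently holding it, if any.
pub struct Latches {
    slots: Vec<Option<u64>>,
}

impl Latches {
    pub fn new(size: usize) -> Self {
        assert!(size > 0);
        Latches { slots: vec![None; size] }
    }

    /// Sorted, deduplicated slot indexes for the keys of one command.
    pub fn slots_for(&self, keys: &[&[u8]]) -> Vec<usize> {
        let mut slots: Vec<usize> = keys
            .iter()
            .map(|key| {
                let mut hasher = DefaultHasher::new();
                key.hash(&mut hasher);
                (hasher.finish() as usize) % self.slots.len()
            })
            .collect();
        slots.sort_unstable();
        slots.dedup();
        slots
    }

    /// Take every slot, or none if any is held by another command.
    pub fn try_acquire(&mut self, slots: &[usize], cid: u64) -> bool {
        let busy = slots
            .iter()
            .any(|&s| matches!(self.slots[s], Some(owner) if owner != cid));
        if busy {
            return false;
        }
        for &s in slots {
            self.slots[s] = Some(cid);
        }
        true
    }

    pub fn release(&mut self, slots: &[usize], cid: u64) {
        for &s in slots {
            if self.slots[s] == Some(cid) {
                self.slots[s] = None;
            }
        }
    }
}
```

Because the same key always hashes to the same slot, two commands touching key1 can never both hold their latches at once; unrelated keys may occasionally collide on a slot, which only costs some parallelism.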


That concludes the code analysis of TiKV's distributed transaction module, which focused on the code for write transactions. The next articles will continue with how TiKV reads MVCC data and with the code related to pessimistic transactions. The logic of TiKV's transactions is quite complex; I hope these articles help you understand it and contribute to it.

Original article: https://pingcap.com/blog-cn/tikv-source-code-reading-12/

More TiKV source code reading: https://pingcap.com/blog-cn/#TiKV-%E6%BA%90%E7%A0%81%E8%A7%A3%E6%9E%90
