[TcaplusDB Knowledge Base] Data Recovery in TcaplusDB


TcaplusDB is a NoSQL database developed by Tencent and tailored to the needs of game development. It offers high performance, low cost, high availability, and strong flexibility. This article focuses on the data recovery function of TcaplusDB.

TcaplusDB keeps a redundant copy of the data by pairing a master with a slave. The two nodes are peers: the master's data is replicated asynchronously to the slave, and when the master fails, the slave is promoted to master and takes over service.

Data recovery has two parts, backup and recovery, which are covered in turn below.


The database needs to be backed up regularly. In the TcaplusDB architecture, each master/slave pair regularly backs up to the cold standby center from the slave machine. When a machine is brought into service, a cold standby script is deployed on it, and the backup is then carried out through the interaction between this script and the cold standby center.

Role of the cold standby script

  • Dump the full data set of the current machine and submit it to the cold standby center, which then schedules a task to pull the cold standby file.
  • Submit the incremental binlog files of the current machine to the cold standby center, which then schedules a task to pull them.
  • Regularly clean up full cold standby files that have already been uploaded.
  • Regularly clean up incremental binlog files that have already been uploaded.

Backup strategy

  • At 1:05 a.m. every day, a full backup is taken from the tcapsvr slave into the BAK directory of the machine, where it waits for the cold standby center to pull it.
  • The full backup of each machine must be uploaded to the cold standby center within 10 hours.
  • Incremental logs are submitted to the cold standby center every five minutes.
  • The incremental logs of each machine must be pulled by the cold standby center within one hour of submission.
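
The timing rules above can be read as a schedule plus two deadline checks. The following is a minimal Python sketch of that reading, not the actual cold standby script: the function names are hypothetical, and it assumes the 10-hour window is measured from the start of the full backup, which the text does not specify. Only the numbers (1:05 a.m., 10 hours, five minutes, one hour) come from the strategy itself.

```python
from datetime import datetime, timedelta

# Values taken from the backup strategy above.
FULL_BACKUP_START = (1, 5)                       # 1:05 a.m. every day
FULL_UPLOAD_DEADLINE = timedelta(hours=10)       # full backup uploaded within 10 hours
INCREMENTAL_INTERVAL = timedelta(minutes=5)      # incremental logs submitted every 5 minutes
INCREMENTAL_PULL_DEADLINE = timedelta(hours=1)   # pulled within 1 hour of submission


def next_full_backup(now: datetime) -> datetime:
    """Return the next 1:05 a.m. start time strictly after `now`."""
    hour, minute = FULL_BACKUP_START
    candidate = now.replace(hour=hour, minute=minute, second=0, microsecond=0)
    if candidate <= now:
        candidate += timedelta(days=1)
    return candidate


def full_backup_late(started: datetime, uploaded: datetime) -> bool:
    """True if the upload finished after the 10-hour window.

    Assumption: the window starts when the full backup starts.
    """
    return uploaded - started > FULL_UPLOAD_DEADLINE


def incremental_late(submitted: datetime, pulled: datetime) -> bool:
    """True if an incremental log was pulled more than one hour after submission."""
    return pulled - submitted > INCREMENTAL_PULL_DEADLINE
```

A monitoring job could call `full_backup_late` and `incremental_late` for each machine and raise an alert whenever either returns `True`.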

Cold standby center

  • The cold standby center used by Tcaplus is maintained by the Technology Operations Department, and it stores the backups of most Interactive Entertainment businesses.
  • About 1,400 Tcaplus slave machines interact with the cold standby center every day.
  • The full cold standby of a single machine is at most 1.1 TB, and the incremental data is about 500 GB.
  • After the cold standby center initiates a submit task for an IP, it starts an rsync thread against the IP of the target machine to pull the files to the cold standby center.
  • Machines that upload cold standby files are classified by Gigabit or 10 Gigabit bandwidth. Upload and download speeds are higher on 10 Gigabit links.
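
To show why the bandwidth class matters, here is a rough back-of-the-envelope calculation in Python for the largest full cold standby mentioned above (1.1 TB). It is only a sketch under idealized assumptions that are not in the source: decimal units (1 TB = 10^12 bytes), the link running at full line rate, and no protocol overhead or contention. Real transfers will be slower.

```python
def transfer_hours(size_bytes: float, link_bits_per_sec: float) -> float:
    """Ideal transfer time in hours: bits to send divided by link rate."""
    return size_bytes * 8 / link_bits_per_sec / 3600


TB = 10**12                      # assumption: decimal terabyte
full_backup_bytes = 1.1 * TB     # largest full cold standby from the text

gigabit_hours = transfer_hours(full_backup_bytes, 1e9)        # Gigabit link
ten_gigabit_hours = transfer_hours(full_backup_bytes, 1e10)   # 10 Gigabit link

print(f"Gigabit:    {gigabit_hours:.2f} h")
print(f"10 Gigabit: {ten_gigabit_hours:.2f} h")
```

Under these assumptions the transfer takes about 2.44 hours on a Gigabit link and about 0.24 hours (roughly 15 minutes) on a 10 Gigabit link.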


Failure recovery

The most important failure to recover from is a machine failure, which is either a slave failure or a master failure.

When a slave machine fails, the DBA initiates a slave rebuild transaction. The replacement machine downloads the cold standby of the failed slave, decompresses it locally, and then registers with the master.

When a master machine fails, the system automatically promotes the slave to master, and the previous master becomes the slave. The new slave then initiates a rebuild transaction and pulls the cold standby of the new master to recover its data.
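
The two recovery paths can be summarized in a small Python model. This is an illustrative sketch only: the `Node` class, the function names, and the step strings are hypothetical and simply record the order of steps described above; they are not the system's real interfaces.

```python
from dataclasses import dataclass, field


@dataclass
class Node:
    name: str
    role: str                                   # "master" or "slave"
    steps: list = field(default_factory=list)   # recovery steps performed, in order


def recover_slave_failure(master: Node, failed_slave: Node, replacement: Node):
    """Slave failure: a replacement machine is rebuilt from the failed slave's cold standby."""
    replacement.role = "slave"
    replacement.steps.append(f"download cold standby of {failed_slave.name}")
    replacement.steps.append("decompress cold standby locally")
    replacement.steps.append(f"register with master {master.name}")
    return master, replacement


def recover_master_failure(failed_master: Node, slave: Node):
    """Master failure: roles are swapped, then the new slave rebuilds from the new master."""
    slave.role = "master"
    failed_master.role = "slave"
    failed_master.steps.append(f"pull cold standby of new master {slave.name}")
    failed_master.steps.append("recover data from cold standby")
    return slave, failed_master


# Usage: master "A" fails while "B" is its slave.
new_master, new_slave = recover_master_failure(Node("A", "master"), Node("B", "slave"))
print(new_master.name, new_master.role)   # B master
print(new_slave.name, new_slave.role)     # A slave
```

In both paths the rebuilt node starts from a cold standby rather than from a live copy, which is why the backup deadlines matter for recovery time.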

Business recovery

The business needs to recover some keys to a specified point in time

This is handled in two steps:

① Construct the data as of the specified point in time. This includes pulling the backup from the cold standby center to the local machine and decompressing it into a specified directory.

② Find the corresponding keys in the constructed data and import them online. Before the import, the project team sometimes knows the exact keys and sometimes knows only a scope.

Keys known: traverse the constructed data to find the specified keys, and then initiate the import.

Keys unknown, only the scope is known: TcaplusDB traverses the data in the scope and provides it to the project team, which identifies the keys to be rolled back and imports them online.

Both cases are carried out as transactions.
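
The two steps can be sketched in Python as follows. This is a simplified illustration, not the real tooling. It assumes that "constructing the data" means replaying incremental binlog records, up to the target time, on top of a full backup, and it uses a hypothetical record format of `(timestamp, key, value)` where a value of `None` marks a deletion. All function names are made up for the example.

```python
def build_snapshot(full_backup: dict, binlog: list, target_time: int) -> dict:
    """Step 1: rebuild the data as of `target_time` from a full backup plus binlog."""
    snapshot = dict(full_backup)              # never mutate the backup itself
    for ts, key, value in sorted(binlog, key=lambda rec: rec[0]):
        if ts > target_time:
            break                             # ignore changes after the target time
        if value is None:
            snapshot.pop(key, None)           # hypothetical delete marker
        else:
            snapshot[key] = value
    return snapshot


def select_known_keys(snapshot: dict, keys: list) -> dict:
    """Step 2, keys known: pick exactly the requested keys that exist in the snapshot."""
    return {k: snapshot[k] for k in keys if k in snapshot}


def select_scope(snapshot: dict, in_scope) -> dict:
    """Step 2, only a scope known: return every record matching the scope predicate."""
    return {k: v for k, v in snapshot.items() if in_scope(k, v)}


# Usage with toy data: restore to time 20.
full = {"player:1": 100, "player:2": 200}
log = [(10, "player:1", 150), (20, "player:2", None), (30, "player:1", 999)]
snap = build_snapshot(full, log, target_time=20)
print(snap)                                            # {'player:1': 150}
print(select_known_keys(snap, ["player:1"]))           # {'player:1': 150}
print(select_scope(snap, lambda k, v: v >= 100))       # {'player:1': 150}
```

In the scope case, the result of `select_scope` corresponds to the data handed to the project team, who then decide which keys to import.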

Business testing needs production data copied to the test environment

The production and test environments are two separate clusters, so this process also has two steps: first, construct a copy of the production data; then copy it to the test environment and import it in full.

TcaplusDB is a distributed NoSQL database from Tencent, with fully self-developed storage and scheduling code. It features an integrated cache and persistent storage architecture, PB-level storage, millisecond latency, lossless horizontal scaling, and support for complex data structures. It also offers a rich ecosystem, easy migration, very low operation and maintenance costs, and five-nines (99.999%) availability. Its customers span gaming, Internet, government, finance, manufacturing, and the Internet of Things.