[TcaplusDB knowledge base] Data recovery in TcaplusDB


TcaplusDB is a NoSQL database developed by Tencent and tailored to the needs of game development. It offers high performance, low cost, high availability, and strong elasticity. This article focuses on data recovery, an essential part of TcaplusDB's high availability.

TcaplusDB keeps redundant copies of data on master/slave (active/standby) pairs. The two machines in a pair are peers: data written to the master is replicated asynchronously to the slave, and when the master fails, the slave takes over and serves requests in its place.

Data recovery has two parts, backup and recovery, which are covered in turn below.


The database must be backed up regularly. As the TcaplusDB architecture diagram shows, each master/slave pair is backed up regularly to the cold standby center through its slave (standby) machine. When a machine is put into service, a cold standby script is deployed on it, and backups are then carried out through the interaction between this script and the cold standby center.


Role of the cold standby script

  • Dump the full data set of the local machine and submit the dump to the cold standby center, which then schedules a task to pull the file.
  • Submit the local machine's incremental binlog files to the cold standby center, which then schedules a task to pull them.
  • Periodically clean up full cold standby files on the machine that have already been uploaded.
  • Periodically clean up incremental binlog files on the machine that have already been uploaded.

Backup strategy

  • At 1:05 a.m. every day, a full backup is taken from the tcapsvr slave into the machine's BAK directory, where it waits for the cold standby center to pull it.
  • Each machine's full backup must be uploaded to the cold standby center within 10 hours.
  • Incremental logs are submitted to the cold standby center every five minutes.
  • After a machine submits its incremental logs, the cold standby center must pull them within one hour.
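
The time limits in this strategy can be checked mechanically. The following is a minimal sketch, not the actual cold standby script: the constants come from the list above, while the function names and the idea of comparing timestamps in a monitor are assumptions made for illustration.

```python
from datetime import datetime, timedelta
from typing import Optional

# Values taken from the backup strategy above.
FULL_BACKUP_START = (1, 5)                 # 01:05, daily
FULL_UPLOAD_LIMIT = timedelta(hours=10)    # full backup must be uploaded within 10 hours
INCREMENT_INTERVAL = timedelta(minutes=5)  # incremental logs are submitted every 5 minutes
INCREMENT_PULL_LIMIT = timedelta(hours=1)  # each submission must be pulled within 1 hour


def full_backup_deadline(day: datetime) -> datetime:
    """Latest time by which the full backup started on `day` must be uploaded."""
    start = day.replace(hour=FULL_BACKUP_START[0], minute=FULL_BACKUP_START[1],
                        second=0, microsecond=0)
    return start + FULL_UPLOAD_LIMIT


def is_overdue(submitted_at: datetime, pulled_at: Optional[datetime],
               now: datetime, limit: timedelta) -> bool:
    """True if a submission was not pulled within `limit` (hypothetical monitor check)."""
    finished = pulled_at if pulled_at is not None else now
    return finished - submitted_at > limit
```

With these values, a full backup started at 01:05 has to reach the cold standby center by 11:05 the same day.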

Cold standby center

  • The cold standby center used by TcaplusDB is maintained by the Technical Operations Department. Most backups for Tencent's Interactive Entertainment businesses are stored there.
  • About 1,400 slave machines interact with the cold standby center every day.
  • A single machine produces at most 1.1 TB of full cold standby data, and its incremental data is about 500 GB.
  • After the cold standby center creates a task for a submitted IP, it starts an rsync thread against that machine's IP to pull the files to the cold standby center.
  • Machines that upload cold standby data are distinguished by whether they have Gigabit or 10 Gigabit bandwidth. Transfers over 10 Gigabit links are faster than transfers over Gigabit links.
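
To see why the bandwidth distinction matters, the ideal transfer time for the largest full backup can be estimated. This is back-of-the-envelope arithmetic under stated assumptions: TB is taken as a decimal terabyte, the link is assumed to run at full line rate, and rsync overhead and contention are ignored, so real transfers will be slower.

```python
def transfer_hours(size_bytes: float, link_bits_per_second: float) -> float:
    """Ideal transfer time in hours, ignoring protocol overhead and contention."""
    return size_bytes * 8 / link_bits_per_second / 3600


TB = 10**12                    # assumption: decimal terabyte (the source does not say TB vs TiB)
FULL_BACKUP_BYTES = 1.1 * TB   # largest full cold standby per machine, from the list above

gigabit = transfer_hours(FULL_BACKUP_BYTES, 1e9)
ten_gigabit = transfer_hours(FULL_BACKUP_BYTES, 10e9)

print(f"1 Gbit/s:  {gigabit:.2f} h")
print(f"10 Gbit/s: {ten_gigabit:.2f} h")
```

At line rate this is roughly 2.4 hours on a Gigabit link and about 15 minutes on a 10 Gigabit link.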


Fault recovery

Machine failures are the most important case for fault recovery. They fall into slave failures and master failures:

When a slave machine fails, the DBA initiates a slave rebuild transaction. The replacement machine downloads the cold standby of the failed slave, decompresses it locally, and registers with the master.

When a master machine fails, the system automatically promotes the slave to master, and the previous master becomes the slave. The new slave then initiates a rebuild transaction and pulls the cold standby of the new master to recover its data.
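
The two cases can be summarized as a small decision routine. This is only an illustrative sketch: the `Node` class, the `handle_failure` function, and the action strings are hypothetical and do not reflect TcaplusDB's internal interfaces.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Node:
    name: str
    role: str            # "master" or "slave"
    healthy: bool = True


def handle_failure(master: Node, slave: Node, replacement: str = "replacement") -> List[str]:
    """Return the recovery actions for one master/slave pair, updating roles in place."""
    actions: List[str] = []
    if not master.healthy:
        # Master failure: promote the slave, demote the old master,
        # then rebuild the new slave from the new master's cold standby.
        master.role, slave.role = "slave", "master"
        actions.append(f"promote {slave.name} to master")
        actions.append(f"rebuild {master.name} from cold standby of {slave.name}")
    elif not slave.healthy:
        # Slave failure: a replacement machine restores the failed slave's
        # cold standby and registers with the master.
        actions.append(f"rebuild {replacement} from cold standby of {slave.name}")
        actions.append(f"register {replacement} with {master.name}")
    return actions
```

In the slave case the rebuild is started by the DBA, whereas the master switch is automatic; the sketch only models the resulting steps, not who triggers them.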

Business recovery

The business needs to restore certain keys to a specified point in time

Such requests are handled in two steps:

① Construct the data as of the specified point in time. This includes pulling the backup from the cold standby center to the local machine and decompressing it into the specified directory.

② Find the required keys in the constructed data and import them online. The project team sometimes knows the exact keys and sometimes knows only a range, so there are two cases:

Keys known: traverse the constructed data, find the specified keys, and initiate the import.

Keys unknown, range known: TcaplusDB traverses the data within the range and hands it to the project team, which identifies the keys that need to be rolled back. Those keys are then imported online.

Both cases are carried out as transactions.
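
The two steps can be sketched in a few lines. This is a simplified illustration rather than TcaplusDB's implementation: it assumes the data is reconstructed by replaying incremental log entries on top of a full backup (the article does not spell out the mechanism), and the in-memory dictionary, the `(timestamp, key, value)` log format, and all function names are hypothetical.

```python
from typing import Dict, Iterable, Tuple

Snapshot = Dict[str, str]          # key -> value as of the full backup
LogEntry = Tuple[int, str, str]    # (timestamp, key, new value)


def build_at(snapshot: Snapshot, snapshot_ts: int,
             log: Iterable[LogEntry], target_ts: int) -> Snapshot:
    """Step 1: replay log entries in (snapshot_ts, target_ts] on top of a full backup."""
    data = dict(snapshot)
    for ts, key, value in sorted(log, key=lambda entry: entry[0]):
        if snapshot_ts < ts <= target_ts:
            data[key] = value
    return data


def select_known(data: Snapshot, keys: Iterable[str]) -> Snapshot:
    """Step 2, keys known: pick exactly the requested keys that exist in the data."""
    return {k: data[k] for k in keys if k in data}


def select_range(data: Snapshot, low: str, high: str) -> Snapshot:
    """Step 2, only a range known: return every record in [low, high] for review."""
    return {k: v for k, v in data.items() if low <= k <= high}
```

The result of `select_known` can be imported directly, while the result of `select_range` would first go to the project team, who decide which keys to roll back.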

The business needs to copy production data to the test environment for testing

The production and test environments belong to two separate clusters. This process also has two steps: first construct a copy of the production data, then copy it to the test environment and import it in full.

TcaplusDB is a distributed NoSQL database from Tencent, with storage and scheduling code developed entirely in-house. It features an integrated cache and persistent storage architecture, PB-level storage, millisecond latency, lossless horizontal scaling, and support for complex data structures. It also offers a rich ecosystem, easy migration, very low operation and maintenance costs, and five-nines high availability. Its customers span gaming, Internet, government, finance, manufacturing, the Internet of Things, and other fields.