In 2007, a group of young engineers on the billing platform team worked overtime to design an account and transaction system with bank-level high availability and zero errors. After months of brainstorming and validation, they proposed the “tboss 7 * 24” disaster recovery scheme. After more than a year of implementation and rollout, the project won the company-level technology breakthrough award in 2009 and earned Tony’s approval. That opened the way for the billing platform to build a financial-grade database.
The pursuit of better technology never ends. Our original goal of building a distributed database product with high consistency, high availability and high performance has never changed, and the system architecture has gone through three generations of optimization. The fourth-generation product, TDSQL, was approved as a project in 2012. It has since been polished for five years and refined through the real-world operation of a large number of businesses. In cooperation with Tencent Financial Cloud, it is now officially offered to external customers as the Tencent Cloud financial database. At present it serves more than 500 financial customers across banking, insurance, funds and securities, consumer finance, third-party payment, billing, Internet of Things, government and enterprise, and many other fields. In particular, its adoption in the core trading systems of several banks marks TDSQL's real breakthrough as a financial-grade database product. It is this breakthrough that won the project team the company-level technology breakthrough award again this year, which is also the company's recognition of the team's long-standing focus on low-level R&D for financial-grade databases.
Of course, the road ahead is long and hard. Hardware keeps evolving, and business workloads are becoming more diverse and complex, so there are still many new challenges in the database field for us to overcome. This article tries to record some of the team's thoughts and conclusions from ten years of learning the hard way. We hope the road ahead grows wider and wider!
In essence, the problem a database product has to solve is still a problem of user experience. More specifically, a database product has two main types of direct users: developers and operations staff. The core requirement is that both can use it well.
For example, for developers, their common concerns are:
1. Is the development interface standard? Are sound development guidelines provided?
2. Do developers need to worry about data loss when a failure or disaster occurs? Do they need to write complex disaster recovery switching logic?
3. Is the system performance good? Can it withstand sudden surges in business traffic?
4. Is the system open enough?
Operations staff (DBAs) often care about issues such as:
1. Are there standardized tools or pages for the routine operation of the system?
2. Is the system transparent enough to locate problems quickly when exceptions occur?
3. Is a complete system operation manual provided?
4. Are the supporting facilities complete, such as the monitoring system, release system, cold standby system and audit system?
There is also an indirect type of user: the real end users of the business systems that run on the database. Their concerns include:
1. Do their payments, transfers and other operations work correctly? Could they be overcharged, or could the money fail to arrive?
2. Can they initiate transactions anytime and anywhere?
In practice, with limited resources and time, the needs of these different users have to be prioritized, and sometimes they conflict. We have therefore held to one principle all along: first guarantee the users' basic demands (no data loss, no accounting errors), then keep improving the experience of developers and operations staff. For example, our first-generation product, “tboss 7 * 24”, gave priority to high availability and high consistency so that user data stayed correct and was never lost, but it greatly hurt the experience of developers and operations staff: business developers had to write a large amount of disaster recovery code, and operations staff had to maintain a large number of disaster recovery modes. In other words, much of the work was pushed onto business development and operations. In each later generation we kept sinking these functions down into the database, so that business development and operations became simpler while the database itself became more complex. Meeting the demands of all three types of users at the database level, as far as possible, has become the core challenge of TDSQL.
Core challenges of TDSQL
TDSQL has gone through numerous versions, large and small. The core challenges it has always faced are:
1. Data reliability. In any disaster, such as a host failure or a network failure, no data may be lost.
2. System availability. With multiple copies of the data, how can the system restore availability quickly under abnormal conditions and keep unavailable time to a minimum?
3. High performance. First, better single-node performance greatly reduces server costs for massive services. Second, performance is one of the indicators users feel most directly, so performance optimization has always been one of our highest-priority tasks.
4. Scalability. This is what is commonly called being distributed. In the financial industry there used to be little demand for distributed architectures, but the rapid growth of third-party payment, with events such as Double 11 and Spring Festival red envelopes, has put great pressure on some bank systems. The traditional IOE architecture struggles with such workloads, so these institutions also hope a distributed architecture can solve the problem. Beyond the shift in mindset that a distributed architecture demands of staff in the traditional financial IT industry, there are many technical challenges, and TDSQL has been optimizing and solving them step by step. The two most complex are distributed transactions and distributed join operations. At present, the distributed transaction feature has been released, and distributed join is still in internal testing.
5. Supporting tools. For database software to offer a good experience, shipping a few core packages is not enough. It also needs matching operation management tools, problem diagnosis tools, performance analysis tools and so on, together with open, standard interfaces. Only then can it be used well and integrated seamlessly.
The first two of the challenges above concern basic functionality. They were largely guaranteed in the first version, but is that enough? Far from it. A system that has travelled little road has met few obstacles, and it may still fall into the next new pit. Only with sufficiently rich scenarios and a large amount of real operating experience can a system be tempered enough to sustain enough nines of availability and data reliability. The remaining challenges also call for continuous optimization. TDSQL keeps looking for the most elegant ways to solve them, so let us look at how TDSQL addresses these problems.
Set mechanism. All of TDSQL's high availability mechanisms are implemented at the level of a set. Each set contains multiple data nodes (one primary and N standbys). Replication between primary and standbys can be either strongly synchronous, based on the Raft protocol, or asynchronous. When the primary fails, the scheduling module assists in promoting a selected standby to primary according to a defined procedure, so the business recovers quickly with no manual intervention at any point. For normal use we generally recommend one primary and two standbys with strong synchronization, which ensures automatic switchover on primary failure with an RTO of 40s and an RPO of 0.
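To make the "one primary, two standbys" recommendation concrete, here is a minimal sketch of a majority acknowledgement rule. It assumes the usual Raft-style majority over all nodes in the set; the class `ReplicaSet` and its methods are hypothetical names for illustration and are not TDSQL's actual interface.

```python
from dataclasses import dataclass


@dataclass
class ReplicaSet:
    """Hypothetical model of one set: one primary plus standby_count standbys."""
    standby_count: int

    def acks_required(self) -> int:
        # A majority of all nodes must hold the write. The primary already
        # holds it, so subtract one to get the number of standby acks needed.
        total_nodes = 1 + self.standby_count
        majority = total_nodes // 2 + 1
        return majority - 1

    def can_acknowledge(self, standby_acks: int) -> bool:
        # The client is told "committed" only once enough standbys confirm,
        # so a primary failure right after the reply loses no data (RPO of 0).
        return standby_acks >= self.acks_required()


recommended = ReplicaSet(standby_count=2)
print(recommended.acks_required())      # 1
print(recommended.can_acknowledge(0))   # False
print(recommended.can_acknowledge(1))   # True
```

Under this assumption, the recommended layout tolerates the loss of any single node: one standby can be down and writes still get the one acknowledgement they need.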
Horizontal expansion. The distributed version is presented as one complete logical instance, while the back-end data is actually distributed across several sets (independent physical nodes). The logical instance hides the storage rules of the physical layer: the business does not need to care how the data is stored, embed a sharding scheme in its code, or buy middleware. It simply uses the instance like a centralized (single-node) database. Real-time online capacity expansion is also supported. The expansion process is completely transparent to the business and requires no downtime. During expansion only some shards become read-only for a matter of seconds (the read-only window is used for data verification), and the rest of the cluster is unaffected.
Distributed transactions. TDSQL implements its distributed transaction mechanism on top of MySQL XA and has run extensive robustness tests on the various exception-handling paths. Compared with single-node transactions, the performance loss is only 30%.
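XA transactions follow a two-phase commit pattern: every participant first prepares (in MySQL, `XA PREPARE`), and only if all succeed does the coordinator commit (`XA COMMIT`); otherwise everything is rolled back (`XA ROLLBACK`). The sketch below simulates that control flow with in-memory objects. `Participant` and `two_phase_commit` are hypothetical names, and the sketch leaves out the failure recovery and logging that a real coordinator such as TDSQL's must handle.

```python
class Participant:
    """In-memory stand-in for one set taking part in a distributed transaction."""

    def __init__(self, name, fail_on_prepare=False):
        self.name = name
        self.fail_on_prepare = fail_on_prepare
        self.state = "active"

    def prepare(self) -> bool:
        # Corresponds to XA PREPARE: promise that a later commit will succeed.
        if self.fail_on_prepare:
            return False
        self.state = "prepared"
        return True

    def commit(self) -> None:
        # Corresponds to XA COMMIT.
        self.state = "committed"

    def rollback(self) -> None:
        # Corresponds to XA ROLLBACK.
        self.state = "rolled_back"


def two_phase_commit(participants) -> bool:
    """Return True if every participant committed, False if all rolled back."""
    # Phase 1: every participant must vote yes.
    for p in participants:
        if not p.prepare():
            # One refusal aborts the whole transaction on every participant.
            for q in participants:
                q.rollback()
            return False
    # Phase 2: all prepared, so the commit decision is final.
    for p in participants:
        p.commit()
    return True


ok = two_phase_commit([Participant("set-1"), Participant("set-2")])
print(ok)  # True
```

The extra round trip of the prepare phase is one reason a distributed transaction costs more than a single-node one.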
Two-level partitioning. The first level, usually called horizontal splitting, uses a hash algorithm to spread the data evenly across all back-end nodes. The second level uses a range algorithm so that related data falls into the same logical partition. For example, data can be partitioned by time (one partition per day, week or month) or by business characteristics (one partition per province or city). Two-level partitioning balances data distribution and access, provides the basis for fast one-click expansion, and also serves scenarios such as fast deletion of old data.
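The two levels can be illustrated with a small routing sketch: a hash of the shard key picks the back-end set, and a date range picks the partition within it. The function names, the choice of CRC32 as the hash, and the monthly partition naming are all assumptions made for this example, not TDSQL's actual rules.

```python
import zlib
from datetime import date


def shard_for(shard_key: str, shard_count: int) -> int:
    # Level 1 (hash): CRC32 is used here only because it is stable across
    # runs, unlike Python's built-in hash() for strings.
    return zlib.crc32(shard_key.encode("utf-8")) % shard_count


def partition_for(day: date) -> str:
    # Level 2 (range): one partition per calendar month.
    return f"p{day.year:04d}{day.month:02d}"


def route(shard_key: str, day: date, shard_count: int):
    """Return (shard index, partition name) for one row."""
    return shard_for(shard_key, shard_count), partition_for(day)


shard, partition = route("user-42", date(2017, 11, 11), shard_count=4)
print(partition)  # p201711
```

With this layout, deleting a month of old data means dropping one partition on each shard rather than deleting rows one by one.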
Read-write separation. Read-write separation is configured per database access account. Based on business requirements, the DBA sets the relevant parameters for an account so that stale data is never read and writes are not affected, and the business gets read-write separation without changing its code. This greatly reduces operating costs for the business.
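One way to picture account-level read-write separation is a routing rule driven by per-account settings. The sketch below assumes two hypothetical parameters, whether an account may read from standbys and the maximum replication lag it tolerates; TDSQL's real parameters may differ.

```python
from dataclasses import dataclass
from typing import Dict


@dataclass
class AccountPolicy:
    """Hypothetical per-account settings chosen by the DBA."""
    read_from_standby: bool
    max_lag_seconds: float


def choose_node(policy: AccountPolicy, is_write: bool,
                standby_lag: Dict[str, float]) -> str:
    """Pick the node that should serve one statement."""
    # Writes, and reads from accounts without the setting, go to the primary.
    if is_write or not policy.read_from_standby:
        return "primary"
    # Only standbys within the allowed lag are eligible, so stale data
    # beyond the account's tolerance is never returned.
    fresh = {name: lag for name, lag in standby_lag.items()
             if lag <= policy.max_lag_seconds}
    if not fresh:
        return "primary"
    # Among eligible standbys, prefer the one with the smallest lag.
    return min(fresh, key=fresh.get)


report_account = AccountPolicy(read_from_standby=True, max_lag_seconds=1.0)
lag = {"standby-1": 0.2, "standby-2": 5.0}
print(choose_node(report_account, is_write=False, standby_lag=lag))  # standby-1
print(choose_node(report_account, is_write=True, standby_lag=lag))   # primary
```

Because the decision depends only on the account and the statement type, the application code stays the same whichever node ends up serving the query.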
In addition, TDSQL offers many advanced features, such as globally unique numeric sequences, unified parameter management, MySQL function compatibility and hotspot update handling, to meet a wide range of business needs.
TDSQL's optimizations in the database kernel focus mainly on data replication, performance and security.
Strong synchronization mechanism. TDSQL's strong synchronization mechanism, designed for financial scenarios, solves the problems of MySQL's native semi-synchronous replication: degraded performance, and fallback to asynchronous replication on timeout. At present, system throughput (TPS / QPS) in TDSQL's strong synchronous mode shows no difference from asynchronous mode, so there is essentially no performance loss.
Performance optimization. Examples include:
1) Thread pool scheduling. We optimized the thread pool scheduling algorithm of MariaDB / Percona to handle extreme cases such as the uneven distribution of query and update requests among thread groups under heavy load. This makes better use of computing resources, reduces unnecessary thread switching, shortens the time requests wait in the queue, and lets requests be processed promptly.
2) Asynchronous group commit. After a worker thread places its session state on the group commit queue, it no longer blocks waiting for the leader thread to complete the group commit. Instead it returns immediately to process the next request.
3) InnoDB buffer pool usage. For example, a full table scan avoids filling the whole InnoDB buffer pool and uses only a small part of it.
4) Reduced mutex contention. During MySQL group commit, InnoDB avoids activities that contend for the same mutexes, such as InnoDB purge, to reduce conflicts and improve performance.
There are many similar optimizations. In some scenarios the effect of any single one may not be obvious, but together they yield good overall performance. In sysbench OLTP tests on the same hardware and in the same test environment, TDSQL performs 85% better than the native version.
Security enhancements. We have made many security optimizations and enhancements, including data file encryption, a SQL firewall, SSL access and security auditing.
In addition, we have long tracked the three main MySQL branches: MariaDB, Percona and MySQL Community. We also regularly merge new features from the community.
TDSQL's strong synchronization mechanism can support global deployment, but in practice most of our customers do not need it, whether for cost reasons or because of their business scenarios. A two-city, multi-center deployment generally meets their needs. Customers can choose among deployment schemes based on cost, the disaster recovery requirements of their business data, and the distribution of their data centers. TDSQL makes targeted trade-offs between data reliability and availability to allow flexible deployment. Two common deployment schemes are:
Two cities, three data centers
ZooKeeper (ZK) is deployed across the three data centers in the two cities.
1. If the primary IDC fails, no data is lost and service switches automatically to the standby IDC. The system then degrades to strong synchronization within a single IDC, which carries some risk.
2. If only the primary node fails, the system compares the two standby nodes and the one local watcher node and switches to the node with the latest data. It prefers the watcher node in the same IDC, to minimize cross-IDC switching.
3. If an IDC failure is detected, the election can proceed automatically with the help of the ZK in the other city:
a) If the standby IDC has really failed, the watcher node in the primary IDC is automatically promoted to standby, and the primary IDC continues to provide service.
b) If the network between the primary and standby IDCs is disconnected, the handling is the same as in a).
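The selection rule in item 2 above (latest data first, then the same IDC) can be sketched as follows. The `Candidate` fields, including a numeric `applied_pos` standing in for replication progress, and the function `pick_new_primary` are hypothetical and only illustrate the ordering of the two criteria.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Candidate:
    name: str
    idc: str
    applied_pos: int  # hypothetical measure of how much log the node has applied


def pick_new_primary(candidates: List[Candidate], failed_primary_idc: str) -> Candidate:
    """Choose the node to promote after the primary fails."""
    if not candidates:
        raise ValueError("no candidate nodes available")
    # Criterion 1: only nodes holding the latest data are eligible.
    latest = max(c.applied_pos for c in candidates)
    up_to_date = [c for c in candidates if c.applied_pos == latest]
    # Criterion 2: among those, prefer a node in the failed primary's IDC
    # so that traffic does not have to move across IDCs.
    same_idc = [c for c in up_to_date if c.idc == failed_primary_idc]
    return (same_idc or up_to_date)[0]


nodes = [
    Candidate("standby-remote", "idc-b", applied_pos=100),
    Candidate("watcher-local", "idc-a", applied_pos=100),
]
print(pick_new_primary(nodes, failed_primary_idc="idc-a").name)  # watcher-local
```

If the local node had fallen behind, the first criterion would win and the remote standby would be promoted instead, since data freshness takes priority over locality.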
Two cities, four data centers
This scheme is the most adaptable, but it also requires more data centers.
1. The cluster is deployed across three centers in the same city, with a simplified synchronization strategy, simple operation, and high data availability and consistency.
2. The failure of a single center does not affect the data service.
3. The three centers of the Shenzhen production cluster are all active (multi-active).
4. If an entire city fails, service can be switched over manually.
For the developers and operations DBAs who work with TDSQL, supporting facilities, maintainability and transparency are very important, because they determine whether problems can be found in time and handled quickly. After two years of work in production, TDSQL now provides a complete set of supporting systems, such as:
1) Cold standby system. Built on HDFS or another distributed file system, it provides automatic backup and one-click recovery.
2) Message queue. A binlog subscription service customized on top of Kafka. On this message queue, TDSQL also provides services such as SQL audit and multi-source synchronization (data from tables with the same structure is merged into one table).
3) Resource management. TDSQL instances are orchestrated using cgroups to improve the utilization of machine resources.
4) OSS. A unified HTTP interface is provided for all TDSQL operations, such as capacity expansion, backup, recovery, manual switchover, and applying for (modifying / deleting) instances. This enables automation and reduces the risk of manual operations.
5) Data collection. All of TDSQL's internal operating status and data can be collected in real time. The business can run customized analysis or build an operations monitoring platform on this data.
6) Monitoring platform. Using the data gathered by the data collection module, the business can connect its own monitoring system or use TDSQL's monitoring system directly (which must be deployed separately).
7) Management platform. Built on the modules above, TDSQL ships with its own operation management platform (internal code name: Red Rabbit). DBAs can carry out essentially all routine operations through this platform without logging in to the back-end machines.
8) Audit module. By collecting and analyzing logs of user access to the database, the audit module helps customers generate compliance reports and trace the source of incidents. It also strengthens the recording of internal and external database access behavior and improves the security of data assets.
The modules above have no strong dependencies on one another and can be combined freely. Operations staff can also connect them to their existing platforms (monitoring, alerting, audit and so on) through the interfaces TDSQL provides.
Closing thoughts
Along the way, TDSQL has reached some milestones in reliability, availability, performance, scalability and supporting facilities, but it is still far from the ultimate user experience. For example, TDSQL targets OLTP and suits high-concurrency, short-transaction scenarios, yet customers sometimes need to run OLAP operations on the same database. Can we support that, and how? Today's distributed systems still cannot truly be used like a single database. Can that be achieved as hardware evolves? There are many challenges like these. The road of database research and development is long and hard. Let us encourage one another along the way.