About the author: Liu Chunlei is a senior DBA at 58 Group, responsible for the operation and maintenance of MySQL and TiDB, and a TUG Ambassador.
58 Group runs many kinds of business, including 58.com, Ganji.com, Anjuke, 58 Finance, ChinaHR, and Jiaxiao Yidiantong. The databases we operate include MySQL, Redis, MongoDB, ES, and TiDB. We built our own “58 Cloud DB Platform”, which integrates the operation and maintenance of all these databases.
This article describes how TiDB is used at 58 Group and our follow-up plans, from an operations and maintenance perspective.
I. Overview of TiDB at 58 Group
TiDB currently runs on more than 50 servers, each with a 24-core CPU and 128 GB of memory, and uses Baocun flash cards for storage. We have deployed 88 TiKV instances across 7 clusters, one cluster per business, running several TiDB versions. Because a single cluster can host multiple databases, there are currently about 21 databases. The amount of data on disk is not yet large, about 10 TB. TiDB covers about 7 business lines: 58 recruitment, TEG, Anjuke, user growth, information security, the financial company, and the car business. More businesses will be onboarded in the future.
II. Business requirements and Solutions
<center>Figure 1 Business and Requirements</center>
There are currently four business requirements:
Large data volumes with long-term retention
MySQL currently stores data on a single machine, and a physical machine holds only about 3 TB. Because of this disk space bottleneck, scaling MySQL out is troublesome.
High availability for the business
Our MySQL setup uses master-slave replication plus MHA. One problem with this scheme is that when the master goes down, a master-slave switchover is required. Writes are blocked for some time during the switchover, which has a significant impact on the business.
Higher read and write performance
MySQL writes currently go to a single point, the master. Reads have to go through a domain name to the slaves, where read latency is relatively high, and growing read traffic makes the latency problem worse.
Painful sharding across databases and tables
When the data volume is large, the data has to be split across databases and tables, which is painful for everyone. Aggregation becomes difficult, and developers on the business side have to maintain the routing information for the sharded databases and tables themselves.
TiDB solves these points well. TiDB scales horizontally: if computing power is insufficient, we simply add nodes. It keeps multiple replicas of the data, which ensures data safety and high availability. TiDB Server is stateless and supports reads and writes on multiple nodes. TiDB also needs no sharding, so it is easy to operate and does not require regular data cleanup.
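To make the sharding pain concrete, the following is a minimal sketch of the kind of routing logic a business team ends up maintaining on sharded MySQL. The shard counts, the `order_db_N` / `orders_NN` naming, and the `route` function are hypothetical and not taken from any 58 Group system.

```python
import zlib
from typing import Tuple

# Hypothetical layout: 4 databases with 16 tables each.
N_DATABASES = 4
TABLES_PER_DB = 16


def route(sharding_key: str) -> Tuple[str, str]:
    """Map a sharding key to a (database, table) pair."""
    # crc32 is stable across processes, unlike Python's built-in hash().
    slot = zlib.crc32(sharding_key.encode("utf-8")) % (N_DATABASES * TABLES_PER_DB)
    db_index, table_index = divmod(slot, TABLES_PER_DB)
    return f"order_db_{db_index}", f"orders_{table_index:02d}"
```

Every query has to pass through a function like this, and any query that aggregates across sharding keys must fan out to all 64 tables. With TiDB the application addresses one logical table instead.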
III. TiDB Environment Construction
Building the TiDB environment involved developing tools to analyze slow SQL, improving the monitoring system, connecting TiDB to the “58 Cloud DB Platform”, collecting data, and producing visual reports.
<center>Figure 2 TiDB deployment architecture</center>
As shown above, the TiDB architecture at 58 Group consists of four main modules: the management machine, the cloud platform, monitoring, and the TiDB clusters.
- Management machine
Hosts environment deployment, the monitoring program, and the tools for topology query, SQL analysis, reporting, and TiDB cluster status checks.
- 58 Cloud DB Platform
The platform's main functions are meta-information maintenance, work order processing, detailed display of cluster information, a monitoring overview, and self-service queries, for example letting developers look up the TiDB clusters of their own business. It also provides operation reports, TiDB cluster applications, and other functions.
- Monitoring
Covers instance monitoring, server monitoring, and alerting.
- Specific TiDB Cluster
Access is split into a read-write DNS name and a read-only DNS name, which connect to a read-write TGW and a read-only TGW respectively (TGW is Tencent Gateway). Requests are then routed to the specific TiDB cluster through a read-write account or a read-only account.
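As a rough illustration of the read-write / read-only split, the sketch below shows how a client might choose which DNS name and account to connect with. The host names, account names, and the `choose_endpoint` helper are hypothetical. Port 4000 is TiDB's default SQL port, but the port actually exposed through TGW is an assumption.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Endpoint:
    host: str
    port: int
    user: str


# Hypothetical DNS names and accounts; the real ones are not given in the article.
READ_WRITE = Endpoint("tidb-rw.example.internal", 4000, "app_rw")
READ_ONLY = Endpoint("tidb-ro.example.internal", 4000, "app_ro")

# Deliberately conservative: anything not obviously read-only goes to read-write.
READ_ONLY_VERBS = {"SELECT", "SHOW"}


def choose_endpoint(sql: str) -> Endpoint:
    """Pick the read-only endpoint for plain reads, read-write for everything else."""
    words = sql.split(None, 1)
    verb = words[0].upper() if words else ""
    # SELECT ... FOR UPDATE takes locks, so treat it as a write.
    if verb in READ_ONLY_VERBS and "FOR UPDATE" not in sql.upper():
        return READ_ONLY
    return READ_WRITE
```

For example, `choose_endpoint("SELECT * FROM t")` returns the read-only endpoint, while an `UPDATE` or an empty statement falls back to the read-write one.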
2. TiDB ecosystem tools
We have recently developed the following operations and maintenance tools.
(1) Topology query tool:
Used to view the specific topology of a cluster.
(2) SQL analysis tool:
Collecting and analyzing slow SQL in TiDB 2.X is somewhat complex, and the pt-query-digest tool is not supported there (it is supported in the latest 2.1 and 3.0 releases). We therefore wrote our own SQL analysis tool, which parses the slow SQL log file directly and then summarizes and displays the results. This problem is well solved in TiDB 3.0, where the results can be extracted directly from the SLOW_QUERY table and displayed.
<center>Figure 3 Slow SQL Analysis Tool </center>
The slow SQL analysis tool for TiDB 2.X first determines the time window of the slow log to collect, then formats every SQL statement and normalizes it into its logical form. It collects the type and details of each kind of logical SQL, writes the concrete statements behind each logical SQL to a dedicated file, and finally displays the results, as shown in the following figure.
<center>Figure 4 Example of slow SQL analysis results</center>
The output mainly includes the rank, database name, account, average execution time, execution count, and the logical SQL itself.
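The normalize-and-aggregate step described above can be sketched roughly as follows. This is not the authors' tool: the `SlowEntry` record, the regular expressions, and the assumption that slow log entries have already been parsed into (database, account, seconds, SQL) tuples are all illustrative.

```python
import re
from collections import defaultdict
from typing import Dict, Iterable, List, NamedTuple, Tuple


class SlowEntry(NamedTuple):
    db: str
    user: str
    seconds: float
    sql: str


def normalize(sql: str) -> str:
    """Reduce a statement to its logical form by replacing literals with '?'."""
    sql = re.sub(r"'(?:[^'\\]|\\.)*'", "?", sql)             # quoted strings
    sql = re.sub(r"\b\d+(?:\.\d+)?\b", "?", sql)             # standalone numbers
    sql = re.sub(r"\(\s*\?(?:\s*,\s*\?)*\s*\)", "(?)", sql)  # collapse IN lists
    return re.sub(r"\s+", " ", sql).strip().lower()


def summarize(entries: Iterable[SlowEntry]) -> List[Tuple[str, str, str, float, int]]:
    """Return (db, user, logical_sql, avg_seconds, count), slowest average first."""
    groups: Dict[Tuple[str, str, str], List[float]] = defaultdict(list)
    for entry in entries:
        groups[(entry.db, entry.user, normalize(entry.sql))].append(entry.seconds)
    rows = [
        (db, user, logical, sum(times) / len(times), len(times))
        for (db, user, logical), times in groups.items()
    ]
    return sorted(rows, key=lambda row: row[3], reverse=True)
```

Grouping by the normalized text is what lets statements that differ only in their literal values be counted as one logical SQL with a single average execution time.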
(3) Status checking tool:
We sometimes need an ad hoc check of a cluster's status, for example whether any instance is down. This tool works much like monitoring and guards against misreported status when the cluster is busy. Our monitoring currently obtains its data through Prometheus, but Prometheus is a single point. If Prometheus goes down, or if the TiDB cluster is so busy that collecting data through Prometheus is delayed, monitoring may conclude that the TiDB cluster is down. In that case we use tidb_check to view the true state of the TiDB cluster.
<center>Figure 5 TiDB status checking tool</center>
The tool first generates a topology file for the instances from the meta-information. Once it knows the full topology of the cluster, it fetches data from Prometheus and aggregates it, then pushes the results to Zabbix for alerting and stores them for display. (We currently use Zabbix as our unified monitoring and alerting platform and do not yet use the officially recommended Alertmanager.) The false alarm problem can also be approached from another angle: each component exposes an interface from which the state of the cluster can be read directly, which prevents false alarms caused by the Prometheus single point or other problems. This function is currently under development.
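The idea of reading state directly from each component can be sketched as below. It is a minimal illustration under stated assumptions, not the tidb_check implementation: `http_probe` and `check_cluster` are hypothetical names, and the topology is assumed to map each role to a list of status URLs (for example a TiDB status port URL such as `http://host:10080/status`; the exact paths depend on the version in use).

```python
import http.client
import urllib.request
from typing import Callable, Dict, List


def http_probe(url: str, timeout: float = 2.0) -> bool:
    """Return True if the URL answers with HTTP 200 within the timeout."""
    try:
        with urllib.request.urlopen(url, timeout=timeout) as resp:
            return resp.status == 200
    except (OSError, http.client.HTTPException):
        # URLError, HTTPError and socket timeouts are all OSError subclasses.
        return False


def check_cluster(
    topology: Dict[str, List[str]],
    probe: Callable[[str], bool] = http_probe,
) -> Dict[str, List[str]]:
    """Return the unreachable status URLs per role; an empty dict means all are up."""
    down: Dict[str, List[str]] = {}
    for role, urls in topology.items():
        failed = [url for url in urls if not probe(url)]
        if failed:
            down[role] = failed
    return down
```

Because the probe is passed in as a parameter, the same function can be driven by a Prometheus-based check and by a direct HTTP check, and an alert raised only when both agree that an instance is down.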
(4) Report information collection tool:
This tool also obtains data through a Prometheus interface. It collects the current state of databases and tables, runs checks against specific clusters, and on TiDB 3.0 also queries the SLOW_QUERY table to summarize slow SQL.
(5) Monitoring automation tool:
For monitoring we use the tidb_monitor tool. It obtains the monitoring data of each node from Prometheus, processes it, and pushes it to Zabbix, our monitoring platform. Zabbix then handles trend charts and alerting.
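The Prometheus-to-Zabbix hand-off can be sketched as a pure conversion step. This is an assumption-laden illustration rather than the tidb_monitor code: it takes the JSON body of a Prometheus instant query (the `/api/v1/query` response format) and emits lines in the `<host> <key> <value>` form accepted by `zabbix_sender -i`. The `to_zabbix_lines` name, the Zabbix key, and the mapping from the `instance` label to a Zabbix host name are hypothetical.

```python
import json
from typing import List


def to_zabbix_lines(prom_response: str, zabbix_key: str) -> List[str]:
    """Turn a Prometheus instant-query response into zabbix_sender input lines."""
    payload = json.loads(prom_response)
    if payload.get("status") != "success":
        raise ValueError("Prometheus query did not succeed")
    lines = []
    for sample in payload["data"]["result"]:
        # Assumption: the Zabbix host is named after the 'instance' label,
        # with ':' replaced because Zabbix host names do not allow it.
        host = sample["metric"].get("instance", "unknown").replace(":", "_")
        _timestamp, value = sample["value"]
        lines.append(f"{host} {zabbix_key} {value}")
    return lines
```

The resulting lines could be written to a file and handed to `zabbix_sender`, leaving trend charts and alerting to Zabbix as described above.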
<center>Figure 6 Operations and Maintenance Management Platform Architecture </center>
On the platform side, we connected TiDB to the “58 Cloud DB Platform” and use the open-source Inception tool to process DDL/DML work orders. The platform has a management side and a user side. DBAs use the management side for meta-information maintenance, work order processing, operation reports, the monitoring overview, and so on. On the user side, business teams can apply for TiDB clusters, submit DDL/DML work orders, manage accounts, view cluster information and monitoring, and query the data in their databases themselves.
<center> Figure 7 Operation and Maintenance Management Platform Display (1/2)</center>
<center> Figure 8 Operation and Maintenance Management Platform Display (2/2)</center>
TiDB operations management on the platform mainly covers displaying cluster information, monitoring clusters, and adding TiDB/TiKV/PD nodes. We can also add instances in batches: select the machines, assign the roles, designate the responsible developer, and add them directly.
4. Visual Report
<center> Figure 9 Visual Report Category </center>
Visual reports put the server monitoring data from Prometheus or Zabbix onto the platform so that developers and DBAs can view it. At the server level, the main dimensions are load, CPU, memory, disk, network, and IO. At the cluster level, current usage and total capacity are obtained through the Prometheus interface. At the database and table level, data is collected regularly to track the growth of each database.
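As an example of what a growth report might compute from regularly collected size samples, here is a small sketch. The `monthly_growth_gb` function and the sample values used with it are illustrative, not data from the platform.

```python
from datetime import date
from typing import List, Tuple


def monthly_growth_gb(samples: List[Tuple[date, float]]) -> float:
    """Average growth in GB per 30 days between the earliest and latest sample."""
    if len(samples) < 2:
        raise ValueError("need at least two samples")
    ordered = sorted(samples)
    (start_day, start_gb), (end_day, end_gb) = ordered[0], ordered[-1]
    days = (end_day - start_day).days
    if days == 0:
        raise ValueError("samples must span at least one day")
    return (end_gb - start_gb) * 30 / days
```

Fed with one sample per collection run, a function like this yields the per-database growth rate that the report can plot next to the cluster's current usage and total capacity.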
IV. Business and TiDB usage
<center>Figure 10 Businesses currently using TiDB</center>
At present, the businesses using TiDB at 58 Group include TEG, Anjuke (logs), user growth (58 consulting, address book data storage), information security (verification center), the financial company (underlying storage for the financial real-time data warehouse), and the car business (used-car call record distribution). Among them, TEG is the heaviest user.
The TEG business mainly includes the management backends for WList and WTable, the search index, and so on. These are the management side of our self-developed databases. Write volume is relatively high, the data volume is about 6 TB, and data grows by about 500 GB per month. In the past six months, eight flash cards failed in the TEG business without affecting it, which let us fully appreciate TiDB's high availability.
<center> Figure 11 TiDB database total growth trend </center>
TiDB usage at 58 Group is growing rapidly. We have been adopting TiDB since mid-2018, and there are now 88 TiKV instances and 22 databases. Growth has been especially strong since the second quarter of this year.
V. Follow-up Plan
We plan to migrate 18 MySQL clusters to TiDB, with a total of 30 TB on disk and about 200 billion records. The most important is the PMC order flow database, which is sharded across eight MySQL clusters of 2 TB each on disk. Migrating it to TiDB will be a considerable challenge.
<center>Figure 12 Businesses planned to use TiDB</center>
On the operations side, we have started preparing version upgrades and may move all clusters to TiDB 3.0. One cluster has already been upgraded and is running very stably. For monitoring, as mentioned earlier, the monitoring tools will obtain data through the interfaces of multiple components to prevent single-point problems from causing false alarms. We are also continuing to develop and improve the reports, for example by optimizing slow SQL queries on version 3.0. Because several of our businesses are data warehouse workloads, we are considering TiSpark and TiFlash to improve system performance. Finally, we are developing automated deployment, scaling, and fault handling.
This article is based on Liu Chunlei's talk at TiDB Tech Day 2019 in Chengdu.
More case studies: https://pingcap.com/cases-cn/