TiCDC is an incremental data replication tool for TiDB, implemented by pulling change logs from TiKV. It can restore data to a state consistent with any upstream TSO. It also provides an open data protocol so that other systems can subscribe to data changes. TiCDC is stateless at runtime and achieves high availability with the help of the etcd embedded in PD. A TiCDC cluster supports multiple replication tasks, so data can be replicated to several different downstream systems.
Before 4.0, TiDB provided TiDB Binlog for near-real-time replication to downstream platforms. TiDB 4.0 introduced TiCDC as its change data capture framework. With the official release of TiDB 4.0.6, the first GA version of TiCDC is ready to run in production environments. Its main advantages are as follows:
- High data availability: TiCDC obtains change logs from TiKV, so as long as TiKV is highly available, the data remains highly available. Even in the extreme case where every TiCDC node shuts down abnormally, data can still be obtained normally after a restart.
- Horizontal scaling: multiple TiCDC nodes can form a cluster, and replication tasks are scheduled evenly across the nodes. When facing massive data volumes, you can relieve replication pressure by adding nodes.
- Automatic failover: when a TiCDC node in the cluster exits unexpectedly, the replication tasks on that node are automatically rescheduled to the remaining TiCDC nodes.
- Support for multiple downstream systems and output formats: at present, MySQL-compatible databases and the Kafka and Pulsar distributed stream processing systems are supported as downstreams, and the supported output formats are Apache Avro, Maxwell, and Canal.
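The even scheduling and automatic failover described above can be pictured with a small sketch. It is only an illustration of the idea, not TiCDC's actual scheduler: the function names, the node names, and the "least-loaded node first" policy are all assumptions made for the example.

```python
from collections import defaultdict


def schedule(tables, nodes):
    """Assign each table to the node that currently holds the fewest tables.

    Hypothetical policy for illustration only; ties are broken by node name.
    """
    assignment = defaultdict(list)
    for node in nodes:
        assignment[node]  # touch the key so idle nodes appear in the result
    for table in tables:
        target = min(nodes, key=lambda n: (len(assignment[n]), n))
        assignment[target].append(table)
    return dict(assignment)


def fail_over(assignment, failed_node):
    """Move the tables of a failed node onto the surviving nodes."""
    survivors = {n: list(t) for n, t in assignment.items() if n != failed_node}
    if not survivors:
        raise RuntimeError("no surviving node to take over the tables")
    for table in assignment.get(failed_node, []):
        target = min(survivors, key=lambda n: (len(survivors[n]), n))
        survivors[target].append(table)
    return survivors


# Six hypothetical tables spread over three hypothetical nodes.
plan = schedule(["t1", "t2", "t3", "t4", "t5", "t6"], ["cdc-a", "cdc-b", "cdc-c"])
after_failure = fail_over(plan, "cdc-c")
```

In this sketch, adding a node simply means calling `schedule` with a longer node list, which mirrors the "add nodes to relieve pressure" point in the list above.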
Primary and standby across two data centers
The database is the core of enterprise IT. Beyond stable operation, database disaster recovery has become a prerequisite for business continuity.
Weighing business criticality, cost, and other factors, some users only need primary-standby disaster recovery for their core database. TiCDC is an ideal choice for building a primary-standby disaster recovery solution for TiDB. Built on TiCDC's data replication, the solution tolerates long distances and high network latency between the two data centers, replicates data in one direction between the TiDB clusters of the two data centers, ensures eventual transactional consistency, and achieves an RPO measured in seconds.
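To make "an RPO measured in seconds" concrete, the following minimal sketch estimates replication lag from two TSO values. It relies on the TiDB TSO layout, in which the high bits hold a physical timestamp in milliseconds and the low 18 bits hold a logical counter. The function names and the sample timestamps are hypothetical, and treating the gap between the upstream TSO and the replication checkpoint as an RPO estimate is an assumption of this example.

```python
LOGICAL_BITS = 18  # low bits of a TSO hold the logical counter


def compose_tso(physical_ms, logical=0):
    """Build a TSO from a physical timestamp (ms) and a logical counter."""
    return (physical_ms << LOGICAL_BITS) | logical


def tso_physical_ms(tso):
    """Extract the physical timestamp in milliseconds from a TSO."""
    return tso >> LOGICAL_BITS


def replication_lag_seconds(upstream_tso, checkpoint_tso):
    """Approximate how far the downstream is behind the upstream, in seconds."""
    return (tso_physical_ms(upstream_tso) - tso_physical_ms(checkpoint_tso)) / 1000.0


# Hypothetical values: the checkpoint trails the upstream by 2500 ms.
upstream = compose_tso(1_600_000_000_000, logical=5)
checkpoint = compose_tso(1_599_999_997_500, logical=1)
lag = replication_lag_seconds(upstream, checkpoint)
```

If the primary site were lost at this moment, roughly the last `lag` seconds of writes would not yet be present on the standby site.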
Ring replication and multi-active deployment
With TiCDC, ring replication among three TiDB clusters forms a multi-data-center disaster recovery solution for TiDB. When the cabinet of one data center loses power unexpectedly, the business can switch to the TiDB cluster in another data center, with eventual transactional consistency and an RPO measured in seconds. To spread business access load, the application layer can switch routes at any time and direct traffic to a less loaded TiDB cluster, which provides load balancing and improves disaster recovery capability while keeping data highly available.
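A ring of clusters must stop a change from circulating forever. One common way to reason about this is to tag each change with the cluster where it was first written and to stop forwarding it once the next hop is that origin. The sketch below illustrates this idea only. It is not a description of how TiCDC implements ring replication, and the `Change` type, its `origin` field, and the cluster IDs are hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Change:
    origin: int  # hypothetical ID of the cluster where the write was first made
    key: str
    value: str


def forward(changes, next_cluster_id):
    """Keep only the changes that did not originate in the next cluster."""
    return [c for c in changes if c.origin != next_cluster_id]


def propagate(change, ring):
    """Return, in order, the clusters that apply a change as it travels the ring."""
    position = ring.index(change.origin)
    applied = []
    while True:
        next_cluster = ring[(position + 1) % len(ring)]
        if not forward([change], next_cluster):
            break  # the change is about to return to its origin, so drop it
        applied.append(next_cluster)
        position = (position + 1) % len(ring)
    return applied


ring = [1, 2, 3]  # replication direction: 1 -> 2 -> 3 -> 1
path = propagate(Change(origin=1, key="k", value="v"), ring)
```

Each write reaches the other two clusters exactly once, which is the property that lets any of the three clusters take over traffic.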
TiCDC provides real-time, high-throughput, and stable data subscription services for downstream data consumers. Through its open protocol, it integrates with heterogeneous ecosystems such as MySQL, Kafka, Pulsar, Flink, Canal, and Maxwell to meet users' needs for applying and analyzing various types of data in big data scenarios. It is broadly applicable to log collection, monitoring data aggregation, streaming data processing, online and offline analysis, and other scenarios.
User stories
Xiaohongshu is a lifestyle platform for young people. Users can record their lives in short videos, images, and text, share their lifestyles, and interact around shared interests. By October 2019, Xiaohongshu had more than 100 million monthly active users and was continuing to grow rapidly.
Xiaohongshu uses TiDB to carry core business in many scenarios, such as report analysis, real-time dashboards for major promotions, logistics and warehousing, the e-commerce data middle platform, and content security audit and analysis. In the content security audit and analysis scenario, the upstream TiDB holds the real-time security audit records, which online applications write directly so that real-time data can be monitored and analyzed.
In the audit data analysis workflow, TiCDC extracts the real-time change stream from TiDB and delivers it to downstream Flink for real-time computation and aggregation. The results are written back into TiDB for audit data analysis, staff efficiency analysis and management, and so on. Xiaohongshu calls TiCDC's internal API (Sink Interface) to implement a custom sink that sends data to Flink using the Canal protocol, which connects to existing business systems and significantly reduces the cost of adapting them. TiCDC's efficient data replication and support for heterogeneous data ecosystems have laid a solid foundation for real-time processing of Xiaohongshu's business data.
Autohome is the most visited auto website in the world. It is committed to empowering users and customers through product services, data technology, ecosystem rules, and resources, and to building four ecosystems: "car media, car e-commerce, car finance, and car life".
TiDB has been running stably at Autohome for more than two years, carrying important businesses such as forum replies, the resource pool, and friend relationships. During the 818 large-scale promotion in 2020, Autohome adopted a TiDB deployment across three data centers in two regions to provide comprehensive data protection for scenarios such as flash sales, red packets, lucky draws, and shake-to-win. TiCDC replicated the TiDB cluster data to a downstream MySQL database in real time as a backup for emergencies, improving the business's disaster recovery capability. TiCDC's replication latency is at the level of seconds, which meets the real-time requirements of online promotion services well.
Intelligent recommendation is an important business at Autohome, and the resource pool is its underlying storage. The resource pool receives and aggregates all kinds of information. After processing, the data is used for front-end recommendation and display in home page recommendation, product display, search, and other businesses. In the early stage, the resource pool used MySQL as the storage layer and used MySQL binlog to import data into Elasticsearch for retrieval scenarios. Because of MySQL's performance and capacity bottlenecks, Autohome switched to TiDB and now uses TiCDC for heterogeneous data replication, replacing the original MySQL binlog approach. TiCDC offers high availability and low latency and supports large-scale clusters, providing stable data support for the business. Based on TiCDC, Autohome has developed an interface that outputs log data to Kafka, enabling replication of massive heterogeneous data. It has been running stably in production for more than two months.
The Haier Smart Home app is the official interactive mobile portal released by Haier. It provides users worldwide with end-to-end smart home services, a whole-scenario smart home experience, and one-stop smart home customization.
Haier Smart Home's IT infrastructure is built on Alibaba Cloud. Its core business requires a database that supports the MySQL protocol, scales flexibly online while providing strongly consistent distributed transactions, and integrates closely with the various big data technology ecosystems. TiDB 4.0 has become an ideal choice for Haier Smart Home.
Haier Smart Home uses TiCDC, the TiDB incremental data replication tool, to replicate user information and home information to Elasticsearch, providing near-real-time search. At present, the user table holds nearly ten million rows, the data volume has reached 1.9 GB, and Kafka consumes about 3 million messages per day. In addition, TiCDC provides stable and efficient data replication for the smart recommendation big data service. Built on the unified TiCDC Open Protocol, a row-level data change notification protocol, it greatly simplifies meeting the data analysis needs of different departments. The smart recommendation feature is currently under development.
Zhihu is a comprehensive content platform on the Chinese Internet, with "let everyone get reliable answers efficiently" as its brand mission and guiding star.
Zhihu uses TiDB as the core database in scenarios such as personalized content recommendation on the home page and the read service, and outputs change logs to Kafka through TiCDC Open Protocol for massive message processing. As the business has grown, Zhihu has run into many problems caused by limitations of the Kafka architecture and the implementation of older versions. Considering that Pulsar's support for geo-replication fits better with the cloud-native direction of its future infrastructure, Zhihu has replaced Kafka with Pulsar in some businesses.
Zhihu has carried out a series of development work on the core modules of TiCDC (https://github.com/pingcap/ti…, https://github.com/pingcap/ti…), connecting the TiCDC sink to Pulsar so that TiCDC data is replicated to Pulsar. With Pulsar's geo-replication feature, TiCDC consumers can subscribe to change events independently of their geographic location. The Pulsar cluster's fast node scaling and fault recovery can also give consumers of TiCDC events better guarantees of data timeliness.
Judging from early business practice, the combination of Pulsar and TiCDC has achieved the desired results. Zhihu will push forward the full migration of its services from Kafka to Pulsar, and in the future Pulsar will also be applied to replicating TiDB data across clusters.
You can use TiUP (see the deployment document) to quickly deploy TiCDC, and use cdc cli to create a replication task that replicates real-time writes to a downstream TiDB, a downstream Pulsar, or a downstream Kafka (see the operation document).
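As a rough illustration of creating a replication task, the sketch below assembles a `cdc cli changefeed create` command line in Python. The `--pd` and `--sink-uri` options follow the TiCDC 4.0 documentation as far as this example assumes; the PD address, credentials, hosts, and topic name are placeholders, and the helper function name is hypothetical. Check the operation document for the exact options supported by your version.

```python
import shlex


def changefeed_create_command(pd_addr, sink_uri):
    """Build the argument list for creating a replication task.

    Hypothetical helper: it only assembles the command and does not run it.
    """
    return [
        "cdc", "cli", "changefeed", "create",
        f"--pd={pd_addr}",         # PD endpoint of the upstream TiDB cluster
        f"--sink-uri={sink_uri}",  # where the changes are written
    ]


PD_ADDR = "http://127.0.0.1:2379"  # placeholder address

# Placeholder sink URIs for the three kinds of downstream mentioned above.
to_tidb = changefeed_create_command(PD_ADDR, "mysql://root:secret@127.0.0.1:4000/")
to_kafka = changefeed_create_command(PD_ADDR, "kafka://127.0.0.1:9092/cdc-test")
to_pulsar = changefeed_create_command(PD_ADDR, "pulsar://127.0.0.1:6650/cdc-test")

# A shell-safe string that can be pasted into a terminal.
tidb_command_line = shlex.join(to_tidb)
```

A downstream TiDB is addressed with a `mysql://` URI because TiDB speaks the MySQL protocol. To run a command from Python instead of pasting it, the list can be passed to `subprocess.run(to_tidb, check=True)` on a machine where the `cdc` binary is installed.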
Users on versions earlier than 4.0.6 GA should refer to the upgrade document.
Thanks to all TiCDC contributors. TiCDC could not have reached GA without the efforts of every contributor!