Introduction: This article was compiled by community volunteer Chen Zhengyu. It is based on a talk on Flink CDC given by Xu Bangjiang (Xuejin), a senior development engineer at Alibaba, at the Flink Meetup in Beijing on July 10. The talk explains in depth the core features of the newly released Flink CDC 2.0.0, including major improvements such as concurrent reading of full data, checkpoint support and lock-free reading.
1、 CDC overview
CDC stands for Change Data Capture. In a broad sense, any technology that can capture data changes can be called CDC. In practice, the term usually refers to technology that captures data changes in a database. CDC has a wide range of application scenarios:
- Data synchronization: used for backup and disaster recovery;
- Data distribution: one data source is distributed to multiple downstream systems;
- Data collection: ETL data integration for data warehouses and data lakes, for which CDC is a very important data source.
There are many technical solutions for CDC. At present, the mainstream implementation mechanisms in the industry can be divided into two types:
Query based CDC:
- Offline scheduled query jobs with batch processing. A table is synchronized to other systems by querying the latest data in the table each time;
- Data consistency cannot be guaranteed, because the data may change many times while the query is running;
- Real-time performance cannot be guaranteed, because offline scheduling has an inherent delay.
Log based CDC:
- Logs are consumed in real time with stream processing. For example, the MySQL binlog records all changes in the database, so the binlog file can be used as the data source of a stream;
- Data consistency is guaranteed, because the binlog file contains the details of all historical changes;
- Real-time performance is guaranteed, because log files such as the binlog can be consumed as a stream and provide real-time data.
Comparing the common open source CDC solutions, we can find that:
In terms of incremental synchronization capability:
- The log based approach can achieve incremental synchronization;
- The query based approach has difficulty achieving incremental synchronization.
In terms of full synchronization capability, basically all query based and log based CDC solutions support it, except Canal.
In terms of full + incremental synchronization, only Flink CDC, Debezium and Oracle GoldenGate support it well.
From the perspective of architecture, the table divides the solutions into stand-alone and distributed architectures. Here, a distributed architecture refers not only to horizontal scaling of the data reading capability, but also to the ability to connect to distributed systems in big data scenarios. For example, when data from Flink CDC enters a data lake or data warehouse, the downstream is usually a distributed system such as Hive, HDFS, Iceberg or Hudi. In terms of the ability to connect to distributed systems, the architecture of Flink CDC fits such systems well.
In terms of data transformation and data cleaning capability: when data enters the CDC tool, is it convenient to filter, clean or even aggregate it?
- With Flink CDC this is quite simple, because the data can be processed with Flink SQL;
- DataX and Debezium, however, require scripts or templates, so the barrier for users is relatively high.
In terms of ecosystem, meaning support for downstream databases or data sources: Flink CDC has abundant downstream connectors, for example for writing to common systems such as TiDB, MySQL, PG, HBase, Kafka and ClickHouse, and custom connectors are also supported.
2、 Flink CDC project
At this point, let’s review the motivation of developing Flink CDC project.
1. Dynamic Table & ChangeLog Stream
As we all know, Flink has two basic concepts: dynamic table and changelog stream.
- A dynamic table is a table defined by Flink SQL whose content changes over time. Dynamic tables and streams are equivalent concepts. Referring to the figure above, a stream can be converted into a dynamic table, and a dynamic table can also be converted into a stream.
- In Flink SQL, data flows from one operator to another in the form of a changelog stream. At any time, the changelog stream can be interpreted as a table or as a stream.
If you think about tables and the binlog in MySQL, you will find that all changes to a table in a MySQL database are recorded in the binlog. If the table keeps being updated, the binlog stream keeps being appended to. The table in the database is the materialized result of the binlog stream at a certain point in time, and the binlog stream is the result of continuously capturing the changes to the table. This shows that a Flink SQL dynamic table can naturally represent a constantly changing MySQL table.
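The relationship between a table and its change log can be illustrated with a minimal Python sketch. It is not Flink or MySQL code: the event tuples `(operation, key, value)` and the function name `materialize` are assumptions made up for this illustration.

```python
def materialize(changelog):
    """Replay change events in order and return the resulting table as a dict."""
    table = {}
    for op, key, value in changelog:
        if op in ("insert", "update"):
            table[key] = value          # upsert the latest value for the key
        elif op == "delete":
            table.pop(key, None)        # remove the row if it exists
        else:
            raise ValueError(f"unknown operation: {op}")
    return table


# Hypothetical change log of a small table (primary key -> name).
binlog = [
    ("insert", 1, "apple"),
    ("insert", 2, "pear"),
    ("update", 1, "banana"),
    ("delete", 2, None),
]

table_after_two_events = materialize(binlog[:2])  # table at an earlier point in time
table_now = materialize(binlog)                   # table after all changes
```

Replaying a prefix of the log gives the table at that point in time, which is the sense in which the table is a materialization of the stream.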
On this basis, we investigated several CDC technologies and finally chose Debezium as the underlying capture tool of Flink CDC. Debezium supports full synchronization, incremental synchronization, and full + incremental synchronization, so it is very flexible. At the same time, log based CDC makes it possible to provide exactly-once semantics.
Comparing RowData, the internal data structure of Flink SQL, with the data structure of Debezium, we can find that they are very similar.
- Each RowData carries a RowKind metadata field with four types: insert (INSERT), the image before an update (UPDATE_BEFORE), the image after an update (UPDATE_AFTER), and delete (DELETE). These four types are consistent with the binlog concepts in the database.
- Debezium’s data structure has a similar metadata field, op, with four possible values: c, u, d and r, corresponding to create, update, delete and read. For u, which represents an update, the data part contains both the before and the after image.
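As a rough illustration of how close the two structures are, the following Python sketch maps a simplified Debezium-style event to RowKind-tagged rows. The dict layout (only `op`, `before`, `after`) and the function name `to_row_kinds` are assumptions for this example, and mapping both c and r to INSERT is one plausible choice rather than a statement about the connector's actual code.

```python
def to_row_kinds(event):
    """Translate a simplified Debezium-style event into (row_kind, row) pairs."""
    op = event["op"]
    if op in ("c", "r"):
        # create, or read during the snapshot: a new row appears
        return [("INSERT", event["after"])]
    if op == "u":
        # an update carries both images, so it becomes two changelog rows
        return [
            ("UPDATE_BEFORE", event["before"]),
            ("UPDATE_AFTER", event["after"]),
        ]
    if op == "d":
        # a delete only has the old image
        return [("DELETE", event["before"])]
    raise ValueError(f"unknown op: {op}")


# Hypothetical update event: the value of row 1 changes from 100 to 108.
update_event = {
    "op": "u",
    "before": {"id": 1, "value": 100},
    "after": {"id": 1, "value": 108},
}
rows = to_row_kinds(update_event)
```

An update is the only case that produces two rows, because it is the only Debezium operation whose payload contains both images.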
By analyzing the two data structures, we can see that the underlying data of Flink and Debezium can be connected easily, and that Flink is technically very well suited to CDC.
2. Traditional CDC ETL analysis
Let’s take a look at the ETL analysis link of the traditional CDC, as shown in the following figure:
In traditional CDC based ETL analysis, a data capture tool is required. Users outside China often use Debezium, while users in China often use Canal, which was open-sourced by Alibaba. The capture tool collects incremental data from the database, and some capture tools also support synchronizing full data. The captured data is generally written to message middleware such as Kafka, consumed from there by the Flink computing engine, and then written to the destination. The destination can be various databases, data lakes, real-time data warehouses or offline data warehouses.
Note that Flink provides the changelog-json format, which can write changelog data to offline data warehouses such as Hive / HDFS. For real-time data warehouses, Flink supports writing the changelog directly to Kafka through the upsert-kafka connector.
We have been thinking about whether Flink CDC could replace the capture component and the message queue in the dotted box in the figure above, so as to simplify the analysis pipeline and reduce the maintenance cost. Fewer components also mean that data timeliness can be improved further. The answer is yes, which leads to our ETL analysis process based on Flink CDC.
3. ETL analysis based on Flink CDC
With Flink CDC, besides having fewer components that are easier to maintain, another advantage is that Flink SQL greatly lowers the barrier for users. See the following example:
In this example, database data is synchronized through Flink CDC and written to TiDB. The user uses Flink SQL directly to create mysql-cdc tables for products and orders, joins the two data streams, and writes the result directly to the downstream database. A single Flink SQL job completes the analysis, processing and synchronization of the CDC data.
You will find that this is a pure SQL job, which means that any BI developer who knows SQL can do this kind of work. Users can also use the rich syntax of Flink SQL for data cleaning, analysis and aggregation.
With the existing CDC solutions, this kind of data cleaning, analysis and aggregation is very difficult.
In addition, data widening and all kinds of business logic can be implemented easily with Flink SQL two-stream joins, dimension table joins and UDTF syntax.
4. Flink CDC project development
- In July 2020, Yunxie submitted the first commit. The project was incubated out of personal interest;
- MySQL CDC was supported in mid-July 2020;
- Postgres CDC was supported at the end of July 2020;
- Within one year, the project received more than 800 stars on GitHub.
3、 Flink CDC 2.0 details
1. Flink CDC pain points
The MySQL CDC connector is the most used and most important connector in Flink CDC. In the following sections, "Flink CDC connector" refers to the MySQL CDC connector.
With the development of Flink CDC project, we have received feedback from many users in the community, which can be summarized into three main aspects:
- Full + incremental reading must guarantee the consistency of all data, which is achieved by locking, but locking is a very high-risk operation at the database level. To guarantee data consistency, the underlying Debezium needs to lock the database or table being read. A global lock may hang the database, and a table-level lock locks reads of the table. DBAs generally do not grant lock permissions.
- Horizontal scaling is not supported. Because Flink CDC is built on Debezium, whose architecture is single-node, Flink CDC only supports a parallelism of one. In the full read phase, if the table is very large (on the order of 100 million rows), reading takes hours or even days, and users cannot add resources to speed up the job.
- Checkpoints are not supported in the full read phase. CDC reading has two phases, full reading and incremental reading, and currently no checkpoint can be taken during the full read phase. This causes a problem: suppose a full synchronization takes 5 hours. If the job fails after 4 hours, it has to be restarted and read for another 5 hours.
2. Debezium lock analysis
Flink CDC wraps Debezium as its underlying layer. Debezium synchronizes a table in two phases:
- Full phase: query all records currently in the table;
- Incremental phase: consume change data from the binlog.
Most users use full + incremental synchronization. Locking happens in the full phase. Its purpose is to determine the starting position of the full phase and to ensure that full + incremental reading produces not one record more and not one record less, so that data consistency is guaranteed. From the following figure, we can analyze the locking process of global locks and table locks. The red line on the left is the life cycle of the lock, and the right side is the life cycle of the MySQL repeatable read transaction.
Take the global lock as an example. First the lock is acquired, and then a repeatable read transaction is opened. While the lock is held, the starting position of the binlog and the schema of the current table are read. The purpose is to ensure that the binlog starting position corresponds to the schema that was read, because the table schema can change, for example when columns are dropped or added. After these two pieces of information have been read, the SnapshotReader reads the full data inside the repeatable read transaction. When the full data has been read, the BinlogReader starts incremental reading from the recorded binlog starting position, which ensures a seamless handover between full data and incremental data.
The table lock is a degraded version of the global lock. Because the global lock requires high privileges, in some scenarios users only have table lock privileges. A table lock is held for longer because of one characteristic: if the lock is released early, the repeatable read transaction is committed by default. Therefore the lock can only be released after the full data has been read.
After the above analysis, let’s see what serious consequences these locks will cause:
Flink CDC 1.x can run without locking, which satisfies most scenarios but sacrifices some data accuracy. By default, Flink CDC 1.x takes a global lock. This guarantees data consistency, but it carries the risk of hanging the database mentioned above.
3. Flink CDC 2.0 design (taking MySQL as an example)
Through the above analysis, we can know that the core of the design scheme of 2.0 is to solve the above three problems, that is, to support no lock, horizontal expansion and checkpoint.
The lock-free algorithm described in the DBLog paper is shown in the following figure:
On the left is the chunk splitting algorithm. It is similar in principle to database and table sharding in many databases: the data in the table is split by the primary key of the table. Assuming a step size of 10 per chunk, the table is split according to this rule. The chunks only need to be left-open, right-closed intervals or left-closed, right-open intervals, so that the union of the intervals equals the primary key range of the table.
On the right is the lock-free read algorithm for each chunk. The core idea is that, after the table has been split into chunks, the full read and the incremental read of each chunk are merged consistently without locking. The splitting into chunks is shown in the following figure:
Because each chunk is only responsible for the data within its own primary key range, it follows that if the read consistency of each chunk is guaranteed, the read consistency of the whole table is guaranteed. This is the basic principle of the lock-free algorithm.
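The splitting rule can be sketched in a few lines of Python. This is an illustration only: it assumes an integer primary key, and the function name `split_chunks` is made up here. It is not the connector's implementation.

```python
def split_chunks(min_key, max_key, step):
    """Split the key range [min_key, max_key] into left-closed, right-open chunks."""
    if step <= 0:
        raise ValueError("step must be positive")
    chunks = []
    start = min_key
    while start <= max_key:
        # The last chunk is cut at max_key + 1 so that max_key itself is included.
        end = min(start + step, max_key + 1)
        chunks.append((start, end))
        start = end
    return chunks


# Hypothetical table whose primary keys range from 1 to 35, step size 10.
chunks = split_chunks(1, 35, 10)
```

Because each interval starts exactly where the previous one ends, every primary key falls into exactly one chunk, which is what lets per-chunk consistency add up to consistency for the whole table.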
In Netflix’s DBLog paper, the chunk reading algorithm maintains a signal table in the database and uses it to write marks into the binlog: a low point is recorded before the chunk is read and a high point after it is read, and the full data of the chunk is queried between the two. After the chunk data has been read, the incremental binlog data between the two points is merged into the full data of the chunk, which yields the full data of the chunk as of the high point.
Flink CDC adapts the chunk reading algorithm to its own situation by removing the signal table, so no extra table has to be maintained. Instead of writing marks into the binlog, it reads the binlog position directly. The overall chunk reading algorithm is described as follows:
For example, when reading chunk-1, the chunk range is [K1, K10]. First, the data in this range is selected and stored in a buffer. A binlog position (the low point) is recorded before the select, and another binlog position (the high point) is recorded after it. Then the incremental part starts, consuming the binlog from the low point to the high point.
- The record -(k2, 100) +(k2, 108) in the figure indicates that the value of K2 is updated from 100 to 108;
- The second record deletes K3;
- The third record updates K2 to 119;
- The fourth record updates K5 from 77 to 100.
Looking at the final output in the lower right corner of the figure, we can see that the keys that appear while consuming the binlog of this chunk are K2, K3 and K5. We mark these keys in the buffer.
- K1, K4, K6 and K7 did not change up to the high point, so these records can be output directly;
- For changed records, the incremental data must be merged into the full data, and only the final merged result is kept. For example, the final value of K2 is 119, so only +(k2, 119) needs to be output, and the intermediate changes are not needed.
In this way, the final output of the chunk is the latest data in the chunk as of the high point.
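The merge step above can be sketched in Python as follows. This is a simplified illustration rather than the connector's code: the function name `read_chunk` and the event tuples are assumptions, the binlog events are assumed to be already filtered to the chunk's key range, and the values for keys other than K2 and K5 are hypothetical.

```python
def read_chunk(snapshot_rows, binlog_events):
    """Merge the binlog events between the low and high point into the snapshot."""
    buffer = dict(snapshot_rows)  # copy, so the caller's snapshot stays untouched
    for op, key, value in binlog_events:
        if op == "delete":
            buffer.pop(key, None)      # the row no longer exists at the high point
        elif op in ("insert", "update"):
            buffer[key] = value        # keep only the latest value per key
        else:
            raise ValueError(f"unknown operation: {op}")
    return buffer


# The four binlog records described above, between the low and the high point.
binlog_events = [
    ("update", "k2", 108),
    ("delete", "k3", None),
    ("update", "k2", 119),
    ("update", "k5", 100),
]

# Hypothetical snapshots: the select may see the data before or after the changes.
early_snapshot = {"k1": 11, "k2": 100, "k3": 102, "k4": 44, "k5": 77, "k6": 66, "k7": 70}
late_snapshot = {"k1": 11, "k2": 119, "k4": 44, "k5": 100, "k6": 66, "k7": 70}

result_early = read_chunk(early_snapshot, binlog_events)
result_late = read_chunk(late_snapshot, binlog_events)
```

Both snapshots lead to the same result, because replaying the events as upserts and deletes is idempotent. This is why it does not matter exactly when, between the two points, the select observed the data.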
The figure above describes the consistent read of a single chunk. But if multiple tables are split into many chunks and these chunks are distributed to different tasks, how are the chunks distributed and how is a globally consistent read guaranteed?
This is implemented elegantly on top of FLIP-27. In the figure below you can see the SourceEnumerator component, which is mainly responsible for splitting chunks. The chunks are handed to the downstream SourceReaders to read. Distributing chunks to different SourceReaders makes it possible to read snapshot chunks concurrently, and FLIP-27 also makes chunk-granularity checkpoints easy to implement.
After a snapshot chunk has been read, its completion must be reported: as the orange report messages in the figure below show, the completion of the snapshot chunk is reported to the SourceEnumerator.
The main purpose of the report is the later distribution of the binlog chunk (as shown in the figure below). Because Flink CDC supports full + incremental synchronization, the incremental binlog must be consumed after all snapshot chunks have been read. This is done by sending one binlog chunk to any one SourceReader, which reads it with a parallelism of one.
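The coordination described above can be simulated with a small Python class. This is a plain-Python sketch and does not use the FLIP-27 interfaces: the class `ChunkEnumerator`, its methods and the state layout are all assumptions made for illustration.

```python
from collections import deque


class ChunkEnumerator:
    """Toy enumerator: hands out snapshot chunks, then a single binlog chunk."""

    def __init__(self, snapshot_chunks):
        self.pending = deque(snapshot_chunks)  # chunks not yet handed to a reader
        self.assigned = {}                     # chunk -> id of the reader working on it
        self.finished = set()                  # chunks reported as completely read
        self.total = len(snapshot_chunks)
        self.binlog_assigned = False

    def next_split(self, reader_id):
        """Return the next piece of work for a reader, or None if it has to wait."""
        if self.pending:
            chunk = self.pending.popleft()
            self.assigned[chunk] = reader_id
            return ("snapshot", chunk)
        if len(self.finished) == self.total and not self.binlog_assigned:
            self.binlog_assigned = True        # exactly one reader gets the binlog chunk
            return ("binlog", None)
        return None

    def report_finished(self, chunk):
        """Called when a reader has finished reading a snapshot chunk."""
        self.assigned.pop(chunk)
        self.finished.add(chunk)

    def snapshot_state(self):
        """Chunk-granularity checkpoint: only unfinished chunks must be read again."""
        remaining = sorted(list(self.pending) + list(self.assigned))
        return {"remaining": remaining, "finished": sorted(self.finished)}


enumerator = ChunkEnumerator([(1, 11), (11, 21), (21, 31)])
first = enumerator.next_split(reader_id=0)
second = enumerator.next_split(reader_id=1)
enumerator.report_finished(first[1])
state = enumerator.snapshot_state()
third = enumerator.next_split(reader_id=0)
waiting = enumerator.next_split(reader_id=1)   # snapshot chunks still in progress
enumerator.report_finished(second[1])
enumerator.report_finished(third[1])
binlog_split = enumerator.next_split(reader_id=1)
nothing_left = enumerator.next_split(reader_id=0)
```

In this sketch, a job restored from `state` would only have to re-read the chunks listed under `remaining`, instead of the whole table.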
Most users do not need to pay much attention to the details of the lock-free algorithm and chunk splitting. Understanding the overall process is enough.
The overall process can be summarized as follows: first, the table is split into snapshot chunks by primary key, and the snapshot chunks are distributed to multiple SourceReaders. Each snapshot chunk is read with an algorithm that achieves a consistent read without locking, and the SourceReaders support chunk-granularity checkpoints while reading. After all snapshot chunks have been read, a binlog chunk is sent out to read the incremental binlog. This is the overall process of Flink CDC 2.0, as shown in the following figure:
Flink CDC is a completely open source project. All of its design and source code have been contributed to the open source community, and Flink CDC 2.0 has been officially released. The core improvements and enhancements in this release include:
MySQL CDC 2.0, whose core features include:
- Concurrent reading: the read performance for full data can be scaled horizontally;
- Lock-free throughout: there is no locking risk for online business;
- Resumable reading: checkpoints are supported in the full phase.
A documentation website with multi-version documentation support and keyword search.
The author tested with the customer table of the TPC-DS data set. The Flink version is 1.13.1, the customer table contains 65 million rows, and the source parallelism is 8. In the full read phase:
- MySQL CDC 2.0 takes 13 minutes;
- MySQL CDC 1.4 takes 89 minutes;
- Read performance is improved by 6.8x.
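The 6.8x figure follows directly from the two durations. The short Python calculation below reproduces it and derives the approximate read throughput from the numbers quoted above. The variable names are made up for this example, and the throughput values are derived here, not measured.

```python
rows = 65_000_000      # rows in the customer table
minutes_cdc_20 = 13    # full read time with MySQL CDC 2.0
minutes_cdc_14 = 89    # full read time with MySQL CDC 1.4

speedup = minutes_cdc_14 / minutes_cdc_20

# Average rows read per second over the whole full read phase.
rows_per_second_20 = rows / (minutes_cdc_20 * 60)
rows_per_second_14 = rows / (minutes_cdc_14 * 60)
```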
In order to provide better document support, Flink CDC community has built a document website, which supports document version management:
The document website supports keyword search function, which is very practical:
4、 Future planning
For the future of the CDC project, we hope to focus on three aspects: stability, advanced features and ecosystem integration.
- Improve the maturity of Flink CDC through the community, by attracting more developers and through the open source efforts of companies;
- Support lazy assigning. The idea of lazy assigning is to split chunks batch by batch instead of all at once. Currently, all chunks are split in one go before the SourceReaders read them. With lazy assigning, if there are 10,000 chunks, for example, 1,000 chunks can be split first, and after the SourceReaders have read these 1,000 chunks, splitting continues. This saves time spent on splitting chunks;
- Support schema evolution. The scenario is: while a database is being synchronized, a column is suddenly added to a table, and users want this column to be added automatically in the downstream system as synchronization continues;
- Support watermark pushdown. Heartbeat information obtained from the CDC binlog can be used as a watermark, which tells you how far the consumption of the stream has progressed;
- Support metadata. In scenarios with sharded databases and tables, metadata may be needed to know which database and table a record comes from, which allows more flexible operations downstream when writing to data lakes and data warehouses;
- Whole database synchronization: users only need one line of SQL to synchronize a whole database, instead of defining a DDL statement and a query for each table;
- Integrate more upstream databases, such as Oracle and MS SQL Server. Cloudera is currently actively contributing an Oracle CDC connector;
- For writing to data lakes, there is room for optimization with Hudi and Iceberg. For example, when writing to a lake at high QPS, the data distribution has a relatively large impact on performance, which can be optimized continuously through integration with the ecosystem.
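The lazy assigning idea from the list above can be illustrated with Python generators. This is only a sketch of the concept under the assumption of an integer primary key. The function names `chunk_ranges` and `lazy_batches` are invented for this example and say nothing about how the feature will actually be implemented.

```python
from itertools import islice


def chunk_ranges(min_key, max_key, step):
    """Lazily yield left-closed, right-open chunks; nothing is computed up front."""
    start = min_key
    while start <= max_key:
        end = min(start + step, max_key + 1)
        yield (start, end)
        start = end


def lazy_batches(chunks, batch_size):
    """Yield chunks in batches, splitting the next batch only when it is requested."""
    iterator = iter(chunks)
    while True:
        batch = list(islice(iterator, batch_size))
        if not batch:
            return
        yield batch


# 100,000 keys with step 10 give 10,000 chunks, handed out 1,000 at a time.
batches = lazy_batches(chunk_ranges(1, 100_000, 10), 1000)
first_batch = next(batches)  # only the first 1,000 chunks have been split so far
```

Readers can start working on `first_batch` immediately, while the remaining 9,000 chunks are split later, one batch at a time.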
This article is original content of Alibaba Cloud and may not be reproduced without permission.