Traditional data synchronization solution and Flink SQL CDC solution
Business systems often need to write the same data to multiple storage systems. For example, an order system may start out writing only to its database. One day the BI team wants full-text search over the data, so the system must also write to Elasticsearch; after another round of evolution, it must additionally write to a Redis cache.
Obviously, this pattern is unsustainable. Dual-writing to every storage system makes the code hard to maintain and extend and creates data-consistency problems, which in turn require distributed transactions, raising both cost and complexity. Instead, we can decouple the systems with a CDC (Change Data Capture) tool and synchronize changes to each downstream store that needs them. This improves the robustness of the system and simplifies subsequent maintenance.
Flink SQL CDC data synchronization and principle analysis
CDC stands for Change Data Capture, a fairly broad concept: any technique that captures changed data can be called CDC. In industry there are two main approaches, query-based CDC and log-based CDC, whose capabilities can be compared as follows.
Through this comparison, we can see that log-based CDC has the following advantages:
· It captures all data changes with complete change records, which is why it is widely used in remote disaster recovery, data backup, and similar scenarios. Query-based CDC, by contrast, can miss data that changes between two consecutive queries.
· Every DML operation is recorded, so there is no need for the full-table scans and filtering that query-based CDC requires. It is more efficient, offers lower latency, and adds no load to the database.
· It does not intrude on the business: it decouples CDC from the application and requires no change to the business model.
· Query-based CDC, polling periodically, cannot detect whether data was deleted in between queries.
Introduction to the log-based CDC solution
From an ETL perspective, we generally collect data from the business database. Here we use MySQL as the source: Debezium captures the MySQL binlog and sends it to a Kafka message queue, a real-time compute engine or application consumes it, and the data is finally loaded into an OLAP system or other storage media.
Flink aims to connect to more data sources and bring its full compute capability to bear. Production data mainly comes from business logs and database logs. Flink's support for business logs is very mature, but before Flink 1.11 its support for database logs was blank, which is one of the reasons CDC integration was needed.
Flink SQL internally supports a complete changelog mechanism, so to integrate CDC data Flink only needs to convert it into the changelog records it already understands. The TableSource interface was therefore refactored in Flink 1.11 to better support and integrate CDC.
The refactored TableSource emits the RowData structure, which represents one row of data. Each RowData carries a piece of metadata called RowKind, with four kinds: INSERT, UPDATE_BEFORE, UPDATE_AFTER, and DELETE, which closely mirror the change concepts in a database binlog. A change event in Debezium's JSON format contains the old row, the new row, and source metadata; an `op` value of `u` marks an update operation, and `ts_ms` is the timestamp of the synchronization. Integrating Debezium JSON therefore amounts to converting that JSON into the RowData that Flink recognizes.
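As an illustration, a Debezium update event for an orders table looks roughly like the following (the field values are made up and the envelope is abbreviated; the exact shape depends on the Debezium connector configuration):

```json
{
  "before": { "order_id": 1001, "order_status": false },
  "after":  { "order_id": 1001, "order_status": true },
  "source": { "db": "mydb", "table": "orders" },
  "op": "u",
  "ts_ms": 1589362330904
}
```

The `before`/`after` pair maps naturally onto Flink's UPDATE_BEFORE and UPDATE_AFTER RowKind values.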
Choosing Flink as the ETL tool
When Flink is chosen as the ETL tool, the synchronization architecture in a data synchronization scenario looks like this:
Debezium subscribes to the MySQL binlog and ships it to Kafka. Flink creates a Kafka table whose format is declared as debezium-json, then runs computations on it or inserts it directly into other external storage systems, such as the Elasticsearch and PostgreSQL shown in the figure.
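A minimal sketch of such a Kafka source table, assuming a topic named `products_binlog` and a local broker (the table and topic names are illustrative):

```sql
-- Read Debezium change events from Kafka as a changelog source.
CREATE TABLE topic_products (
  id BIGINT,
  name STRING,
  description STRING,
  weight DECIMAL(10, 2)
) WITH (
  'connector' = 'kafka',
  'topic' = 'products_binlog',
  'properties.bootstrap.servers' = 'localhost:9092',
  'format' = 'debezium-json'  -- interpret each record as a Debezium envelope
);
```

With `'format' = 'debezium-json'`, Flink turns each Debezium envelope into the corresponding INSERT / UPDATE_BEFORE / UPDATE_AFTER / DELETE changelog rows automatically.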
However, this architecture has a drawback: there are many components on the collection side, which makes maintenance complicated. This raises the question: can Flink SQL consume the MySQL binlog directly? Is there an alternative solution?
The answer is yes! The improved architecture is as follows:
The community has developed the flink-cdc-connectors component, a source connector that can read full snapshot data and incremental change data directly from MySQL, PostgreSQL, and other databases. It is open source on GitHub.
flink-cdc-connectors can replace the Debezium + Kafka acquisition layer, so that Flink SQL alone performs acquisition + computation + transport (ETL). This brings the following advantages:
· Out of the box and easy to use
· Fewer components to maintain, simpler real-time pipelines, lower deployment cost
· Lower end-to-end latency
· Flink supports exactly-once reading and computation
· No intermediate data landing, reducing storage cost
· Supports reading the full snapshot followed by the incremental stream
· The binlog reading position can be rewound for replay
Practice of data synchronization scheme based on Flink SQL CDC
Here are three cases of using Flink SQL + CDC in real scenarios. To run the experiments you need Docker, MySQL, Elasticsearch, and other components; see each case's reference documentation for details.
Case 1: Flink SQL CDC + JDBC connector
This case subscribes to an orders (fact) table, sends the MySQL binlog to Kafka via Debezium, joins and transforms it, and writes the result to a downstream PostgreSQL database. For hands-on steps, refer to the Flink official-account article "Flink JDBC Connector: Flink and database integration best practices".
Case 2: CDC streaming ETL
This case simulates an e-commerce company's order and logistics tables: the order data must be statistically analyzed, different pieces of information must be joined, and the resulting wide order table is handed to downstream business teams to analyze in Elasticsearch. It demonstrates how to rely on Flink alone, without any other components: using Flink's compute capability, the binlog data streams are joined in real time and synchronized to Elasticsearch.
For example, the following Flink SQL code synchronizes the full snapshot plus the incremental data of the orders table in MySQL in real time:
```sql
CREATE TABLE orders (
  order_id INT,
  order_date TIMESTAMP(0),
  customer_name STRING,
  price DECIMAL(10, 5),
  product_id INT,
  order_status BOOLEAN
) WITH (
  'connector' = 'mysql-cdc',
  'hostname' = 'localhost',
  'port' = '3306',
  'username' = 'root',
  'password' = '123456',
  'database-name' = 'mydb',
  'table-name' = 'orders'
);

SELECT * FROM orders;
```
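To complete the streaming ETL described above, one possible sketch (the shipments schema, field names, and connector options are illustrative, not taken from the original case) joins the orders stream with a logistics CDC table and writes the result to an Elasticsearch sink:

```sql
-- Logistics table, also read through the mysql-cdc connector.
CREATE TABLE shipments (
  shipment_id INT,
  order_id INT,
  origin STRING,
  destination STRING
) WITH (
  'connector' = 'mysql-cdc',
  'hostname' = 'localhost',
  'port' = '3306',
  'username' = 'root',
  'password' = '123456',
  'database-name' = 'mydb',
  'table-name' = 'shipments'
);

-- Wide table written to Elasticsearch for downstream analysis.
CREATE TABLE enriched_orders (
  order_id INT,
  customer_name STRING,
  price DECIMAL(10, 5),
  origin STRING,
  destination STRING,
  PRIMARY KEY (order_id) NOT ENFORCED
) WITH (
  'connector' = 'elasticsearch-7',
  'hosts' = 'http://localhost:9200',
  'index' = 'enriched_orders'
);

-- Join the two changelog streams; updates and deletes in MySQL
-- are propagated to the Elasticsearch index automatically.
INSERT INTO enriched_orders
SELECT o.order_id, o.customer_name, o.price, s.origin, s.destination
FROM orders AS o
LEFT JOIN shipments AS s ON o.order_id = s.order_id;
```

Because both sources are changelog streams, the join result is itself a changelog, which the upsert-capable Elasticsearch sink applies via the declared primary key.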
To help readers follow along, a docker-compose test environment is also provided; see the video and document links below for a more detailed tutorial.
Case 3: streaming changes to Kafka
This case computes day-level, site-wide GMV statistics. Orders are inserted, updated, and deleted, but only paid orders count toward GMV, so we can observe how the GMV value changes as order statuses change.
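Building on the orders table defined earlier, a sketch of the aggregation might look like this (assuming, for illustration, that `order_status = TRUE` means the order is paid):

```sql
-- Day-level GMV over paid orders only. The result is a changelog
-- stream whose values update as orders are paid, modified, or deleted.
SELECT
  DATE_FORMAT(order_date, 'yyyy-MM-dd') AS dt,
  SUM(price) AS gmv
FROM orders
WHERE order_status = true
GROUP BY DATE_FORMAT(order_date, 'yyyy-MM-dd');
```

Writing this continuously updating result to Kafka requires a format that can encode updates and deletes, not just inserts, which is why changelog-capable formats matter for this case.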
More application scenarios of Flink SQL CDC
Flink SQL CDC is not only flexible in real-time data synchronization scenarios; it also unlocks more scenarios for users to choose from.
Flexible positioning of Flink in data synchronization scenarios
· If you already have a Debezium/Canal + Kafka acquisition layer (E), Flink can serve as the compute layer (T) and transport layer (L)
· Flink can also replace Debezium/Canal, synchronizing the change data directly into Kafka, with Flink unifying the ETL process
· If Kafka is not needed as a data buffer, Flink can synchronize the data straight to the destination, again with Flink unifying the ETL process
Flink SQL CDC: unlocking more scenarios
· Real-time data synchronization, data backup, data migration, data warehouse construction
Advantages: rich upstream and downstream connectors (E & L), powerful computation (T), easy-to-use API (SQL), low streaming latency
· Real-time materialized views and streaming data analysis on top of the database
· Index construction and real-time maintenance
· Business cache refreshing
· Microservice decoupling and read/write separation
· CDC-based dimension-table association
Let us explain why CDC-based dimension-table association is faster than query-based dimension-table lookup.
Query-based dimension-table association
Currently, dimension tables are mostly queried through lookup joins: when a record arrives from the message queue, an I/O request is sent to the database, the database returns the result, the two are merged, and the output goes downstream. This process inevitably incurs I/O and network-communication cost, which caps throughput. Even with a caching layer, cache updates may lag, so accuracy can suffer.
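A typical query-based lookup join in Flink SQL looks like the following sketch (connector options are illustrative, and the orders table is assumed to declare a processing-time column such as `proc_time AS PROCTIME()`):

```sql
-- Dimension table backed by the remote database, queried via JDBC.
CREATE TABLE products (
  product_id INT,
  product_name STRING
) WITH (
  'connector' = 'jdbc',
  'url' = 'jdbc:mysql://localhost:3306/mydb',
  'table-name' = 'products',
  'username' = 'root',
  'password' = '123456'
);

-- Lookup join: every incoming order issues an I/O request
-- against the remote products table.
SELECT o.order_id, o.price, p.product_name
FROM orders AS o
JOIN products FOR SYSTEM_TIME AS OF o.proc_time AS p
  ON o.product_id = p.product_id;
```

The per-record round trip to the database is exactly the I/O cost described above.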
CDC-based dimension-table association
With CDC, we can import the dimension table's data into the state of the dimension-table join operator. Because this state is distributed and kept on local disk, it holds a real-time mirror of the database's dimension table. When a record arrives from the message queue, there is no need to query the remote database again: the join reads the local state directly, avoiding the I/O operation, which means lower latency, higher throughput, and better accuracy.
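Under the FLIP-132 proposal, such a join might be expressed as an event-time temporal join against a CDC-maintained versioned table. This is a sketch only: the feature was still being planned at the time of writing, the exact syntax is subject to change, and it assumes the products table is CDC-backed with a primary key and watermark defined:

```sql
-- Temporal join against a versioned (CDC-maintained) dimension table;
-- each order is joined with the product version valid at its order time,
-- read from local state rather than the remote database.
SELECT o.order_id, o.price, p.product_name
FROM orders AS o
JOIN products FOR SYSTEM_TIME AS OF o.order_date AS p
  ON o.product_id = p.product_id;
```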
Tips: at present this feature is planned for version 1.12; follow FLIP-132 for progress. The community's future plans also include:
· FLIP-132: temporal table DDL (CDC-based dimension-table association)
· Upsert data output to Kafka
· Support for more CDC formats (Debezium Avro, OGG, Maxwell)
· CDC data processing in batch mode
· flink-cdc-connectors support for more databases
By comparing the traditional data synchronization scheme with the Flink SQL CDC scheme, this article shared the advantages of Flink CDC and introduced the implementation principles of the two types of CDC. The cases demonstrated the scenario of subscribing to the MySQL binlog with Debezium, and how flink-cdc-connectors can replace that subscription component. It also explained Flink CDC scenarios such as data synchronization, materialized views, and multi-datacenter backup, with emphasis on the advantages of CDC-based dimension-table association over the traditional approach and on the community's future plans for the CDC components.
We hope this sharing gives you a new understanding of Flink SQL CDC, and that in future production development Flink CDC brings more convenient development and richer usage scenarios.
Original article: https://developer.aliyun.com/… (reproduction without permission is prohibited).