Real-time incremental data synchronization based on canal and Flink (2)


This article describes how to land DB data in the Hive data warehouse accurately and efficiently, covering two aspects: real-time collection of binlog, and offline processing of binlog to restore the business data.


In data warehouse modeling, the raw business-layer data, taken in without any processing, is called ODS (operational data store) data. In Internet companies, the most common ODS data are business log data (log) and business DB data (DB). For business DB data, collecting data from relational databases such as MySQL and importing it into Hive is an important part of data warehouse production. How can MySQL data be synchronized to Hive accurately and efficiently? The common solution is batch select and load: connect directly to MySQL to select the data in the table, save it to a local file as intermediate storage, and finally load the file into the Hive table. The advantage of this scheme is that it is easy to implement, but as the business grows, its disadvantages gradually surface:

  • Performance bottleneck: as the business scales up, the select-from-MySQL, save-to-local-file, load-into-Hive pipeline takes longer and longer, and can no longer meet the latency requirements of downstream data warehouse production.
  • Selecting a large amount of data directly from MySQL puts heavy load on MySQL, easily causes slow queries, and affects normal services on the business side.
  • Because Hive's own syntax does not support SQL primitives such as UPDATE and DELETE (later Hive versions do, but require bucketing plus the ORC storage format), it cannot handle MySQL tables with updates and deletes well.

To solve these problems completely, we gradually turned to a CDC (change data capture) + merge solution, that is, real-time binlog collection plus offline processing of binlog to restore the business data. Binlog is MySQL's binary log; it records all data changes in MySQL, and MySQL's own master-slave replication is based on binlog.

Implementation approach

First, Flink is used to pull binlog data from Kafka to HDFS.

Then, for each ODS table, a one-time snapshot of the stock data in MySQL must be taken and read into Hive. This step uses a direct connection to MySQL to select the data; Sqoop can be used to import the full data in one pass.

Finally, for each ODS table, a daily merge is performed on the stock data and the binlog increments generated that day, thus restoring the business data.

Binlog is generated as a stream. By collecting binlog in real time, part of the data processing load is spread from a once-a-day batch job onto a real-time stream, which brings a significant improvement both in performance and in the access pressure on MySQL. Binlog itself records the type of each data change (insert / update / delete), so with some semantic processing it can restore the data precisely.
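As a minimal, dependency-free sketch of this restore semantics (plain Java, no Flink or canal; the Event shape with an id and a single city column is hypothetical and chosen only for illustration), replaying binlog change types in log order against a keyed map yields the final business state:

```java
import java.util.LinkedHashMap;
import java.util.Map;

public class BinlogReplay {

    /** A simplified binlog event: operation type plus the row's key and value. */
    static final class Event {
        final String type; // INSERT / UPDATE / DELETE, as recorded in the binlog
        final long id;
        final String city;
        Event(String type, long id, String city) {
            this.type = type;
            this.id = id;
            this.city = city;
        }
    }

    /** Replays events in log order; the last state per key wins. */
    static Map<Long, String> replay(Event[] events) {
        Map<Long, String> state = new LinkedHashMap<>();
        for (Event e : events) {
            if ("DELETE".equals(e.type)) {
                state.remove(e.id);      // a delete removes the row entirely
            } else {
                state.put(e.id, e.city); // insert and update both set the latest value
            }
        }
        return state;
    }

    public static void main(String[] args) {
        Event[] log = {
                new Event("INSERT", 1L, "Beijing"),
                new Event("INSERT", 2L, "Shanghai"),
                new Event("UPDATE", 1L, "Guangzhou"),
                new Event("DELETE", 2L, "Shanghai"),
        };
        System.out.println(replay(log)); // prints {1=Guangzhou}
    }
}
```

The production pipeline achieves the same effect at scale with the Hive merge described later, but the ordering principle is identical: changes are applied per primary key in event-time order.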

Implementation plan

Flink processes the binlog from Kafka

Use a Kafka source, parse the consumed data as JSON, and concatenate the parsed fields into strings that match Hive's schema format. The specific code is as follows:

package com.etl.kafka2hdfs;

import com.alibaba.fastjson.JSON;
import com.alibaba.fastjson.JSONArray;
import com.alibaba.fastjson.JSONObject;
import com.alibaba.fastjson.parser.Feature;
import org.apache.flink.api.common.functions.FilterFunction;
import org.apache.flink.api.common.functions.MapFunction;
import org.apache.flink.api.common.serialization.SimpleStringEncoder;
import org.apache.flink.api.common.serialization.SimpleStringSchema;
import org.apache.flink.core.fs.Path;
import org.apache.flink.runtime.state.StateBackend;
import org.apache.flink.runtime.state.filesystem.FsStateBackend;
import org.apache.flink.streaming.api.datastream.DataStream;
import org.apache.flink.streaming.api.datastream.SingleOutputStreamOperator;
import org.apache.flink.streaming.api.environment.CheckpointConfig;
import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;
import org.apache.flink.streaming.api.functions.sink.filesystem.RollingPolicy;
import org.apache.flink.streaming.api.functions.sink.filesystem.StreamingFileSink;
import org.apache.flink.streaming.api.functions.sink.filesystem.rollingpolicies.DefaultRollingPolicy;
import org.apache.flink.streaming.connectors.kafka.FlinkKafkaConsumer;

import java.util.Map;
import java.util.Properties;

/**
 *  @Created with IntelliJ IDEA.
 *  @author : jmx
 *  @Date: 2020/3/27
 *  @Time: 12:52
 */
public class HdfsSink {
    public static void main(String[] args) throws Exception {
        final String fieldDelimiter = ",";
        StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();

        // checkpoint; the StreamingFileSink relies on checkpoints to commit finished files
        //env.setStateBackend((StateBackend) new FsStateBackend("file:///E://checkpoint"));
        env.setStateBackend((StateBackend) new FsStateBackend("hdfs://kms-1:8020/checkpoint"));
        env.enableCheckpointing(60 * 1000L);
        CheckpointConfig config = env.getCheckpointConfig();
        config.enableExternalizedCheckpoints(CheckpointConfig.ExternalizedCheckpointCleanup.RETAIN_ON_CANCELLATION);

        // source
        Properties props = new Properties();
        props.setProperty("bootstrap.servers", "kms-2:9092,kms-3:9092,kms-4:9092");
        // only required for Kafka 0.8
        props.setProperty("zookeeper.connect", "kms-2:2181,kms-3:2181,kms-4:2181");
        props.setProperty("group.id", "test123");
        FlinkKafkaConsumer<String> consumer = new FlinkKafkaConsumer<>(
                "qfbap_ods.code_city", new SimpleStringSchema(), props);
        DataStream<String> stream = env.addSource(consumer);

        // transform
        SingleOutputStreamOperator<String> cityDS = stream
                .filter(new FilterFunction<String>() {
                    // filter out DDL operations
                    @Override
                    public boolean filter(String jsonVal) throws Exception {
                        JSONObject record = JSON.parseObject(jsonVal, Feature.OrderedField);
                        return record.getString("isDdl").equals("false");
                    }
                })
                .map(new MapFunction<String, String>() {
                    @Override
                    public String map(String value) throws Exception {
                        StringBuilder fieldsBuilder = new StringBuilder();
                        // parse the JSON data
                        JSONObject record = JSON.parseObject(value, Feature.OrderedField);
                        // get the latest field values
                        JSONArray data = record.getJSONArray("data");
                        // iterate over the JSON array; it contains only one element
                        for (int i = 0; i < data.size(); i++) {
                            // get the i-th element of the JSON array
                            JSONObject obj = data.getJSONObject(i);
                            if (obj != null) {
                                fieldsBuilder.append(record.getLong("id"));     // serial number id
                                fieldsBuilder.append(fieldDelimiter);           // field separator
                                fieldsBuilder.append(record.getLong("es"));     // business timestamp
                                fieldsBuilder.append(fieldDelimiter);
                                fieldsBuilder.append(record.getLong("ts"));     // log timestamp
                                fieldsBuilder.append(fieldDelimiter);
                                fieldsBuilder.append(record.getString("type")); // operation type
                                for (Map.Entry<String, Object> entry : obj.entrySet()) {
                                    fieldsBuilder.append(fieldDelimiter);
                                    fieldsBuilder.append(entry.getValue());     // table field data
                                }
                            }
                        }
                        return fieldsBuilder.toString();
                    }
                });

        // sink
        // a new part file is rolled when any of the following conditions is met
        RollingPolicy<String, String> rollingPolicy = DefaultRollingPolicy.create()
                .withRolloverInterval(60L * 1000L)   // roll to a new file after this interval; the default is 60 s, adjust as needed
                .withMaxPartSize(1024 * 1024 * 128L) // maximum size of each part file; set here to 128 MB, which is also the default
                .withInactivityInterval(60L * 1000L) // roll the file if no data has been written for this long; the default is 60 s
                .build();
        StreamingFileSink<String> sink = StreamingFileSink
                //.forRowFormat(new Path("file:///E://binlog_db/city"), new SimpleStringEncoder<String>())
                .forRowFormat(new Path("hdfs://kms-1:8020/binlog_db/code_city_delta"), new SimpleStringEncoder<String>())
                .withBucketAssigner(new EventTimeBucketAssigner())
                .withRollingPolicy(rollingPolicy)
                .withBucketCheckInterval(1000L)      // bucket check interval, set here to 1 s
                .build();
        cityDS.addSink(sink);

        env.execute("kafka2hdfs");
    }
}

For the Flink sink to HDFS, StreamingFileSink replaces the earlier BucketingSink for storing upstream data in different HDFS directories. Its core logic is bucketing: the default bucket assigner, DateTimeBucketAssigner, buckets by processing time. Processing time is the time at which a message arrives at the Flink program, which does not match our requirements. Therefore, we need to write our own code to parse the event time out of the message body and generate the bucket name according to our rules. The specific code is as follows:

package com.etl.kafka2hdfs;

import org.apache.flink.core.io.SimpleVersionedSerializer;
import org.apache.flink.streaming.api.functions.sink.filesystem.BucketAssigner;
import org.apache.flink.streaming.api.functions.sink.filesystem.bucketassigners.SimpleVersionedStringSerializer;

import java.text.SimpleDateFormat;
import java.util.Date;

/**
 *  @Created with IntelliJ IDEA.
 *  @author : jmx
 *  @Date: 2020/3/27
 *  @Time: 12:49
 */
public class EventTimeBucketAssigner implements BucketAssigner<String, String> {

    @Override
    public String getBucketId(String element, Context context) {
        String partitionValue;
        try {
            partitionValue = getPartitionValue(element);
        } catch (Exception e) {
            partitionValue = "00000000";
        }
        return "dt=" + partitionValue; // partition directory name
    }

    @Override
    public SimpleVersionedSerializer<String> getSerializer() {
        return SimpleVersionedStringSerializer.INSTANCE;
    }

    private String getPartitionValue(String element) throws Exception {
        // take the "es" field (the second field of the concatenated string), i.e. the business time
        long eventTime = Long.parseLong(element.split(",")[1]);
        Date eventDate = new Date(eventTime);
        return new SimpleDateFormat("yyyyMMdd").format(eventDate);
    }
}

Offline restoration of MySQL data

After the above steps, the binlog logs are written to the corresponding HDFS partitions. Next, the latest data must be restored from the incremental data and the stock data. Hive tables are stored on HDFS, and that file system does not support in-place modification, so some extra work is needed to apply data changes. Common approaches include: join, Hive transactions, or switching to HBase or Kudu.

For example, if yesterday's stock data is code_city and today's incremental data is code_city_delta, a FULL OUTER JOIN can merge the stock and incremental data into the latest data table, which then serves as tomorrow's stock data:

SELECT
        COALESCE( t2.id, t1.id ) AS id,
        COALESCE( t2.city, t1.city ) AS city,
        COALESCE( t2.province, t1.province ) AS province,
        COALESCE( t2.event_time, t1.event_time ) AS event_time
FROM
        code_city t1
        FULL OUTER JOIN (
        SELECT
                id,
                city,
                province,
                event_time
        FROM
                ( -- take the last status of each row
                SELECT
                        id,
                        city,
                        province,
                        event_time,
                        row_number ( ) OVER ( PARTITION BY id ORDER BY event_time DESC ) AS rank
                FROM
                        code_city_delta
                WHERE
                        dt = '20200324' -- the day's partition of incremental data
                ) temp
        WHERE
                rank = 1
        ) t2 ON t1.id = t2.id;
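The semantics of this merge can be mimicked in plain Java to make them concrete (a sketch only; the Row type with id, city, and event_time columns is hypothetical and mirrors the query's columns, with no Hive involved): among the delta rows, only the latest row per id is kept (rank = 1), and that row overrides the stock row, while ids present on only one side pass through, just like a FULL OUTER JOIN with COALESCE:

```java
import java.util.HashMap;
import java.util.Map;

public class FullOuterMerge {

    /** A simplified row: primary key, payload, and business timestamp. */
    static final class Row {
        final long id;
        final String city;
        final long eventTime;
        Row(long id, String city, long eventTime) {
            this.id = id;
            this.city = city;
            this.eventTime = eventTime;
        }
    }

    /** Merges stock and delta: the latest delta row per id wins over the stock row. */
    static Map<Long, Row> merge(Row[] stock, Row[] delta) {
        // rank = 1: keep only the latest delta row per id
        Map<Long, Row> latestDelta = new HashMap<>();
        for (Row r : delta) {
            Row cur = latestDelta.get(r.id);
            if (cur == null || r.eventTime > cur.eventTime) {
                latestDelta.put(r.id, r);
            }
        }
        // FULL OUTER JOIN + COALESCE: the delta row, if present, replaces the stock row
        Map<Long, Row> result = new HashMap<>();
        for (Row r : stock) {
            result.put(r.id, r);
        }
        result.putAll(latestDelta);
        return result;
    }

    public static void main(String[] args) {
        Row[] stock = { new Row(1L, "Beijing", 100L), new Row(2L, "Shanghai", 100L) };
        Row[] delta = { new Row(2L, "Hangzhou", 300L), new Row(3L, "Chengdu", 250L) };
        Map<Long, Row> merged = merge(stock, delta);
        System.out.println(merged.get(2L).city); // prints Hangzhou: the delta row replaced the stock row
    }
}
```

In the Hive version the same effect comes from COALESCE preferring t2 (delta) columns over t1 (stock) columns, with row_number() pre-filtering the delta side.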


This article described the implementation of real-time ETL with Flink from two aspects: streaming collection of binlog and restoring ODS data from binlog. In addition, binlog can be written into transactional NoSQL stores such as Kudu and HBase, in which case the table-restoration step can be omitted. This article is the second part of "Real-time incremental data synchronization based on canal and Flink". For the steps of parsing binlog with canal and writing it to Kafka, please refer to part one.


