Flink source code | custom format to consume Maxwell CDC data

Time: 2020-11-18

Hive streaming, one of the most important features of Flink 1.11, has already been covered here before. Today I'd like to talk about another particularly important feature: CDC.

CDC overview

What is CDC? Change Data Capture: recording the insert, update, and delete operations performed against a database. Long ago this was done with triggers; nowadays it is implemented with binlog parsing plus a synchronization middleware. There are many commonly used binlog synchronization middlewares, such as Alibaba's open-source Canal [1], Red Hat's open-source Debezium [2], Zendesk's open-source Maxwell [3], and so on.

These middlewares are responsible for parsing the binlog and pushing the changes into a message queue; we only need to consume the corresponding topic.
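What lands in the topic is plain JSON. For example, a Maxwell insert event looks roughly like the following (the ts and xid values here are made up for illustration; the field layout matches the real update event shown later in this article):

```json
{"database":"test","table":"product","type":"insert","ts":1596684904,"xid":7201,"commit":true,"data":{"id":110,"name":"jacket","description":"water resistent white wind breaker","weight":0.2}}
```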

Back to Flink: at first glance CDC doesn't seem to have much to do with us. In fact it does. Let's look at the world a bit more abstractly.

When we use Flink to consume data from a source such as Kafka, we are effectively reading a table. What kind of table? A table into which records are continuously inserted; we take each inserted record and run our logic over it.


As long as every inserted record is correct, everything is fine: join, aggregate, and output.

But when we discover that a record that has already been processed is wrong, we have a real problem. Directly patching the final output value is useless: the next time data triggers the computation, the result will be overwritten by the wrong value again, because the intermediate state was never corrected. What can we do? A retract stream looks like it could solve this, and it is indeed one means of doing so. But how does a retract stream know that a record it reads should be retracted? And what triggers the retraction?

CDC answers these questions: after deserializing the data from the message queue, we can tell from its type whether it is an insert or a delete. Moreover, if you read the Flink source code, you will notice that the deserialized data type has changed: Row has been upgraded to RowData, and RowData carries a kind that marks a record as inserted or retracted, so every operator can decide whether a record should be emitted or retracted.
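For intuition, Flink models this kind with the RowKind enum (org.apache.flink.types.RowKind), which distinguishes exactly four change types. Below is a simplified, dependency-free sketch of the idea (the real enum also carries a byte value for serialization):

```java
// Simplified sketch of Flink's RowKind enum (org.apache.flink.types.RowKind)
public enum RowKind {
    INSERT("+I"),        // insertion of a new row
    UPDATE_BEFORE("-U"), // retraction of the previous value of an updated row
    UPDATE_AFTER("+U"),  // new value of an updated row
    DELETE("-D");        // deletion of a row

    private final String shortString;

    RowKind(String shortString) {
        this.shortString = shortString;
    }

    public String shortString() {
        return shortString;
    }

    public static void main(String[] args) {
        for (RowKind k : values()) {
            System.out.println(k + " -> " + k.shortString());
        }
    }
}
```

An operator receiving a RowData can thus subtract the "-U"/"-D" records from its state instead of only ever accumulating.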

So much for the importance of CDC. If there is a chance, I will put together a video on real-time DQC to show how helpful CDC is for it. Let's get back to the point.

Since there are so many CDC synchronization middlewares, the message queue inevitably contains a variety of formats, and we have to parse them. Flink 1.11 ships canal-json and debezium-json, but what if we use Maxwell? Should we wait for an official implementation, or for someone to contribute one to the community? And what if we use a self-developed synchronization middleware?

So here is today's topic: how to implement a custom Maxwell format. You can implement other CDC formats the same way, such as Ogg, or the data format produced by a self-developed CDC tool.

How to implement it

After we submit a job, Flink loads, via the SPI mechanism, all the factory classes registered on the classpath, including DynamicTableFactory, DeserializationFormatFactory, and so on. Which deserialization format factory is used depends on the 'format' option in the DDL statement: its value is matched against the return value of each factory's factoryIdentifier() method.

The deserializer is then handed to the DynamicTableSource through the DeserializationFormatFactory's createDecodingFormat(...) method.
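The discovery-and-matching step can be simulated without Flink at all. In the sketch below, the Factory interface is a minimal stand-in for org.apache.flink.table.factories.Factory, and the list of loaded factories stands in for what java.util.ServiceLoader actually gathers from the classpath:

```java
import java.util.List;
import java.util.Optional;

public class FactoryDiscovery {
    // Minimal stand-in for org.apache.flink.table.factories.Factory
    interface Factory {
        String factoryIdentifier();
    }

    // Pick the factory whose identifier equals the 'format' value from the DDL
    static Optional<Factory> discover(List<Factory> loaded, String format) {
        return loaded.stream()
            .filter(f -> f.factoryIdentifier().equals(format))
            .findFirst();
    }

    public static void main(String[] args) {
        Factory maxwell = () -> "maxwell-json";
        Factory canal = () -> "canal-json";
        // 'format' = 'maxwell-json' in the DDL selects the Maxwell factory
        System.out.println(
            discover(List.of(maxwell, canal), "maxwell-json").get().factoryIdentifier());
    }
}
```

This is why the service registration file described below matters: a factory that is not registered is never loaded, so no 'format' value can ever match it.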

Let's sort out the whole process (looking only at how the data is deserialized and consumed).


Implementing a CDC format that parses the output of a given CDC tool is actually quite simple. There are three core components:

  • Factory (the deserialization format factory): at compile time, creates the corresponding deserializer according to 'format' = 'maxwell-json' in the DDL; here, MaxwellJsonFormatFactory.
  • Deserialization class (the deserialization schema): at runtime, parses the CDC data according to its fixed format and converts it into the insert / delete / update messages that the Flink system understands, i.e. RowData; here, MaxwellJsonDeserializationSchema.
  • Service registration file: a service file META-INF/services/org.apache.flink.table.factories.Factory must be added, containing one line with the class path of the MaxwellJsonFormatFactory we implemented.
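Assuming the factory is placed in a package such as org.apache.flink.formats.json.maxwell (the package name is illustrative; use your own), the registration file would contain:

```
# META-INF/services/org.apache.flink.table.factories.Factory
org.apache.flink.formats.json.maxwell.MaxwellJsonFormatFactory
```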

Now let's look at the details of deserialization through the code:

public void deserialize(byte[] message, Collector<RowData> out) throws IOException {
    try {
        RowData row = jsonDeserializer.deserialize(message);
        String type = row.getString(2).toString(); // "type" field
        if (OP_INSERT.equals(type)) {
            RowData insert = row.getRow(0, fieldCount); // "data" field
            insert.setRowKind(RowKind.INSERT);
            out.collect(insert);
        } else if (OP_UPDATE.equals(type)) {
            GenericRowData after = (GenericRowData) row.getRow(0, fieldCount);  // "data" field
            GenericRowData before = (GenericRowData) row.getRow(1, fieldCount); // "old" field
            for (int f = 0; f < fieldCount; f++) {
                if (before.isNullAt(f)) {
                    // fields not contained in "old" were not updated; copy them from "data"
                    before.setField(f, after.getField(f));
                }
            }
            before.setRowKind(RowKind.UPDATE_BEFORE);
            after.setRowKind(RowKind.UPDATE_AFTER);
            out.collect(before);
            out.collect(after);
        } else if (OP_DELETE.equals(type)) {
            RowData delete = row.getRow(0, fieldCount); // "data" field
            delete.setRowKind(RowKind.DELETE);
            out.collect(delete);
        } else {
            if (!ignoreParseErrors) {
                throw new IOException(format(
                    "Unknown \"type\" value \"%s\". The Maxwell JSON message is '%s'", type, new String(message)));
            }
        }
    } catch (Throwable t) {
        if (!ignoreParseErrors) {
            throw new IOException(format(
                "Corrupt Maxwell JSON message '%s'.", new String(message)), t);
        }
    }
}

It is actually not complicated: first, the byte array is deserialized by jsonDeserializer into a RowData with the schema [data: Row, old: Row, type: String]; then the value of the "type" column tells us whether the record is an insert, update, or delete; finally, depending on that type, the rows in the "data" and/or "old" fields are extracted, assembled into the insert / delete / update records that Flink understands, and emitted.
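To make the branching concrete without any Flink dependency, here is a toy sketch: it naively pulls the top-level "type" field out of a raw message with a regex (the real code, as shown above, deserializes the whole message as JSON) and maps it to the row kinds that would be emitted:

```java
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class MaxwellTypeDispatch {
    // Naive extraction of the top-level "type" field; for illustration only.
    // The real implementation deserializes the whole message as JSON.
    static String extractType(String json) {
        Matcher m = Pattern.compile("\"type\":\"(\\w+)\"").matcher(json);
        return m.find() ? m.group(1) : null;
    }

    // Which row kinds a message of the given type expands to
    static List<String> emittedKinds(String type) {
        switch (type) {
            case "insert": return List.of("INSERT");
            case "update": return List.of("UPDATE_BEFORE", "UPDATE_AFTER");
            case "delete": return List.of("DELETE");
            default: throw new IllegalArgumentException("Unknown type: " + type);
        }
    }

    public static void main(String[] args) {
        String msg = "{\"database\":\"test\",\"table\":\"product\",\"type\":\"update\","
            + "\"data\":{\"id\":102,\"weight\":5.17},\"old\":{\"weight\":8.1}}";
        System.out.println(emittedKinds(extractType(msg)));
    }
}
```

Note that one update message fans out into two records, which is exactly what lets downstream operators retract the old value before applying the new one.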

jsonDeserializer can deserialize a message into a RowData of whatever row type we specify. In our scenario, we need to read the "data", "old", and "type" sections of the Maxwell record, which looks like this:

{"database":"test","table":"product","type":"update","ts":1596684928,"xid":7291,"commit":true,"data":{"id":102,"name":"car battery","description":"12V car battery","weight":5.17},"old":{"weight":8.1}}

Therefore, the JSON row type defined in MaxwellJsonDeserializationSchema is as follows:

private RowType createJsonRowType(DataType databaseSchema) {
    // Maxwell JSON contains other information, e.g. "database", "ts",
    // but we don't need it
    return (RowType) DataTypes.ROW(
        DataTypes.FIELD("data", databaseSchema),
        DataTypes.FIELD("old", databaseSchema),
        DataTypes.FIELD("type", DataTypes.STRING())).getLogicalType();
}

Here databaseSchema is the schema the user defines in the DDL, which also corresponds to the schema of the table in the database. Combining the JSON above with this code, we can see that jsonDeserializer only extracts the values of the three fields data, old, and type from the byte[]. data and old are nested JSON objects whose schema matches databaseSchema. Since Maxwell does not include unchanged fields when synchronizing data, after jsonDeserializer returns we complete the missing fields of the "old" row using the values from the "data" row.

Once we have the RowData, we extract the type field; depending on its value there are three branches:

  • insert: take the row in data, i.e. the values of the fields defined in the DDL, mark it as RowKind.INSERT, and emit it.
  • update: take the rows in data and old, then loop over every field of old; a null field means it was not modified, so replace it with the value at the same position in data. After that, mark old as RowKind.UPDATE_BEFORE, meaning the Flink engine must retract the previous value, mark data as RowKind.UPDATE_AFTER as the new value, and emit both.
  • delete: take the row in data and mark it as RowKind.DELETE, meaning it needs to be retracted, and emit it.
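The field-completion step in the update branch can be reproduced with plain arrays. In this sketch, null stands in for a field that Maxwell omitted from "old", and the sample values come from the update message shown earlier:

```java
import java.util.Arrays;

public class OldFieldCompletion {
    // Copy unchanged fields from the "data" row into the "old" row,
    // mirroring the loop in the deserialize() method above
    static Object[] complete(Object[] before, Object[] after) {
        Object[] completed = before.clone();
        for (int f = 0; f < completed.length; f++) {
            if (completed[f] == null) {
                completed[f] = after[f];
            }
        }
        return completed;
    }

    public static void main(String[] args) {
        // schema: id, name, description, weight (from the sample update message)
        Object[] after = {102, "car battery", "12V car battery", 5.17};
        // Maxwell only ships the changed fields in "old": here only weight
        Object[] before = {null, null, null, 8.1};
        System.out.println(Arrays.toString(complete(before, after)));
        // -> [102, car battery, 12V car battery, 8.1]
    }
}
```

After completion, the "before" row is a full record of the pre-update state, so a retraction can match the record emitted earlier.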

If an exception is thrown during processing, the value of the maxwell-json.ignore-parse-errors option determines whether to skip this record and continue with the next one, or to fail the job.
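Assuming the option follows the same naming pattern as the built-in canal-json and debezium-json formats, it is set in the DDL like this:

```sql
CREATE TABLE topic_products (
  ...
) WITH (
  'connector' = 'kafka',
  'format' = 'maxwell-json',
  -- skip messages that cannot be parsed instead of failing the job
  'maxwell-json.ignore-parse-errors' = 'true'
);
```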

On top of the Maxwell JSON deserialization, the author also implemented serialization, i.e. writing the changelog produced by Flink out to external systems in Maxwell JSON format. The idea is simply the reverse of the deserializer; for more details, see the implementation in the pull request.

PR implementation details link:
https://github.com/apache/fli…

Function demonstration

Let's demonstrate how to read the Maxwell JSON data pushed to Kafka by Maxwell, write the aggregated result back to Kafka, and then read it out again to verify that the data is correct.

  • Kafka data source table
CREATE TABLE topic_products (
 -- schema is totally the same to the MySQL "products" table
 id BIGINT,
 name STRING,
 description STRING,
 weight DECIMAL(10, 2)
) WITH (
'connector' = 'kafka',
'topic' = 'maxwell',
'properties.bootstrap.servers' = 'localhost:9092',
'properties.group.id' = 'testGroup',
'format' = 'maxwell-json');
  • Kafka data result table & data source table
CREATE TABLE topic_sink (
 name STRING,
 sum_weight DECIMAL(10, 2)
) WITH (
'connector' = 'kafka',
'topic' = 'maxwell-sink',
'properties.bootstrap.servers' = 'localhost:9092',
'properties.group.id' = 'testGroup',
'format' = 'maxwell-json'
);
  • MySQL table
-- Note: this SQL is executed in MySQL; these are not Flink tables
CREATE TABLE product (
id INTEGER NOT NULL AUTO_INCREMENT PRIMARY KEY,
name VARCHAR(255),
description VARCHAR(512),
weight FLOAT
);
TRUNCATE product;
ALTER TABLE product AUTO_INCREMENT = 101;
INSERT INTO product
VALUES (default,"scooter","Small 2-wheel scooter",3.14),
      (default,"car battery","12V car battery",8.1),
      (default,"12-pack drill bits","12-pack of drill bits with sizes ranging from #40 to #3",0.8),
      (default,"hammer","12oz carpenter's hammer",0.75),
      (default,"hammer","14oz carpenter's hammer",0.875),
      (default,"hammer","16oz carpenter's hammer",1.0),
      (default,"rocks","box of assorted rocks",5.3),
      (default,"jacket","water resistent black wind breaker",0.1),
      (default,"spare tire","24 inch spare tire",22.2);
UPDATE product SET description='18oz carpenter hammer' WHERE id=106;
UPDATE product SET weight='5.1' WHERE id=107;
INSERT INTO product VALUES (default,"jacket","water resistent white wind breaker",0.2);
INSERT INTO product VALUES (default,"scooter","Big 2-wheel scooter ",5.18);
UPDATE product SET description='new water resistent white wind breaker', weight='0.5' WHERE id=110;
UPDATE product SET weight='5.17' WHERE id=111;
DELETE FROM product WHERE id=111;
UPDATE product SET weight='5.17' WHERE id=102 or id = 101;
DELETE FROM product WHERE id=102 or id = 103;

Let's first see whether we can read the Maxwell JSON data from Kafka correctly:

select * from topic_products;


As you can see, every field shows the value after the updates, and the deleted rows no longer appear.

Then let’s write the aggregate data to Kafka.

insert into topic_sink select name,sum(weight) as sum_weight from topic_products group by name;
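As a sanity check, replaying the MySQL statements above by hand yields the rows that survive and the per-name sums that topic_sink should eventually converge to. This little sketch computes them (the final row values were transcribed manually from the DML above, so treat them as an illustration):

```java
import java.util.LinkedHashMap;
import java.util.Map;

public class ExpectedAggregates {
    // name/weight pairs of the rows that survive all INSERT/UPDATE/DELETE
    // statements above (ids 102, 103 and 111 were deleted)
    static final Object[][] FINAL_ROWS = {
        {"scooter", 5.17},     // id 101, weight updated from 3.14
        {"hammer", 0.75},      // id 104
        {"hammer", 0.875},     // id 105
        {"hammer", 1.0},       // id 106
        {"rocks", 5.1},        // id 107, weight updated from 5.3
        {"jacket", 0.1},       // id 108
        {"spare tire", 22.2},  // id 109
        {"jacket", 0.5},       // id 110, updated twice after insert
    };

    // what "select name, sum(weight) ... group by name" should converge to
    static Map<String, Double> sums() {
        Map<String, Double> sums = new LinkedHashMap<>();
        for (Object[] row : FINAL_ROWS) {
            sums.merge((String) row[0], (Double) row[1], Double::sum);
        }
        return sums;
    }

    public static void main(String[] args) {
        sums().forEach((name, w) -> System.out.println(name + " -> " + w));
        // e.g. hammer -> 2.625
    }
}
```

If the deserializer had dropped the UPDATE_BEFORE or DELETE records, the sums in topic_sink would drift above these values, which makes this a useful end-to-end check.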

We can also see on the Flink cluster's web UI that the job was submitted successfully. Next, let's query the aggregated data.

select * from topic_sink;


Finally, let's query the table in MySQL to verify that the data is consistent. Because in Flink we defined the weight field as DECIMAL(10, 2), we need to cast the weight column when querying MySQL.


No problem, our Maxwell JSON parsing was successful.

A few words at the end

Judging from the author's experience implementing the Maxwell JSON format, Flink's interface definitions and division of module responsibilities are very clear, so implementing a custom CDC format is quite simple (the core code is only a little over 200 lines). If you use Ogg or a self-developed synchronization middleware, you can quickly implement a CDC format following the ideas in this article and set your CDC data free!

Reference link:

[1]https://github.com/alibaba/canal
[2]https://debezium.io/
[3]https://maxwells-daemon.io/
