How does the real-time data synchronization service (Canal + Kafka) ensure the order of messages?

Time: 2021-3-3

 

The previous article introduced the overall architecture of real-time data synchronization on the data migration platform;
this article focuses on how the platform guarantees the order of messages during real-time data synchronization.


1、 What is message ordering?

  1. The message producer sends messages to the same partition of the same MQ server, in sending order;

  2. Consumers consume the messages in the same order in which they were sent.

2、 Why guarantee the order of messages?

In some business scenarios, the order in which messages are sent and the order in which they are received must be consistent; otherwise the data cannot be used correctly.

Scenarios where messages need to be ordered

In the Yishan real-time data synchronization service, the canal component subscribes to the MySQL database binlog and delivers it to Kafka (introduced in the previous article);
Kafka consumers then process the data according to the specific usage scenario (writing it to HBase or MySQL, or using it directly for real-time analysis);
since the binlog itself is ordered, that order must also be guaranteed after the data is written to MQ.

  1. Suppose we create a real-time synchronization task that subscribes to the order table of a business database;

  2. On the upstream side, an order is inserted into the order table and then updated; the binlog records both the insert event and the update event, and the canal server delivers both to the Kafka broker;

  3. If the Kafka consumer consumes the update event before the insert event, the corresponding row is still missing from the target table, so applying the update raises an exception.

3、 How the Yishan real-time synchronization service ensures message ordering

The overall flow of message processing in the real-time synchronization service is shown in the flow diagram (omitted).

We guarantee message ordering mainly through the following two mechanisms.

1. Send messages whose order must be preserved to the same partition

1.1 Messages within the same Kafka partition are ordered
  • Each Kafka partition is organized as a write-ahead log, an ordered append-only queue, so FIFO order is guaranteed;

  • Therefore, if the producer sends messages in a certain order, the broker writes them to the partition in that order, and the consumer reads them back in the same order;

  • Within a consumer group, a Kafka partition is never consumed by two consumer instances at the same time, so the consumption order of each partition is preserved (see the sketch after this list).
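As an illustration, the following minimal sketch pins a consumer to a single partition with assign(); within that partition, records always come back in offset (append) order. The topic name, server address, and group id are placeholders, not from the platform:

import java.time.Duration;
import java.util.Collections;
import java.util.Properties;

import org.apache.kafka.clients.consumer.ConsumerRecord;
import org.apache.kafka.clients.consumer.ConsumerRecords;
import org.apache.kafka.clients.consumer.KafkaConsumer;
import org.apache.kafka.common.TopicPartition;

public class SinglePartitionReader {
    public static void main(String[] args) {
        Properties props = new Properties();
        props.put("bootstrap.servers", "localhost:9092"); // placeholder address
        props.put("group.id", "order-demo");              // placeholder group
        props.put("auto.offset.reset", "earliest");       // read from the beginning
        props.put("key.deserializer", "org.apache.kafka.common.serialization.StringDeserializer");
        props.put("value.deserializer", "org.apache.kafka.common.serialization.StringDeserializer");

        try (KafkaConsumer<String, String> consumer = new KafkaConsumer<>(props)) {
            // assign() pins this consumer to partition 0 of the topic; within
            // one partition, poll() returns records in strictly increasing offset order
            consumer.assign(Collections.singletonList(new TopicPartition("demo-topic", 0)));
            ConsumerRecords<String, String> records = consumer.poll(Duration.ofSeconds(5));
            for (ConsumerRecord<String, String> record : records) {
                System.out.printf("offset=%d key=%s value=%s%n",
                        record.offset(), record.key(), record.value());
            }
        }
    }
}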

1.2 Route the same key to the same partition

To guarantee that multiple modifications of the same order arrive in Kafka without being reordered, the producer can route records by key when writing to Kafka (for example, by hashing the order id), so that all changes to the same order land in the same partition.
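For example, a producer can use the order id as the message key; Kafka's default partitioner hashes the key, so every change to the same order is appended to the same partition in send order. A minimal sketch, with placeholder topic name and server address:

import java.util.Properties;

import org.apache.kafka.clients.producer.KafkaProducer;
import org.apache.kafka.clients.producer.ProducerRecord;

public class KeyedOrderProducer {
    public static void main(String[] args) {
        Properties props = new Properties();
        props.put("bootstrap.servers", "localhost:9092"); // placeholder address
        props.put("acks", "all");                         // same acks level as the canal config below
        props.put("key.serializer", "org.apache.kafka.common.serialization.StringSerializer");
        props.put("value.serializer", "org.apache.kafka.common.serialization.StringSerializer");

        try (KafkaProducer<String, String> producer = new KafkaProducer<>(props)) {
            String orderId = "13"; // all binlog events of order 13 share this key
            // the default partitioner hashes the key, so both records below
            // land in the same partition and keep their send order
            producer.send(new ProducerRecord<>("order-binlog", orderId, "insert event"));
            producer.send(new ProducerRecord<>("order-binlog", orderId, "update event"));
        }
    }
}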

1.3 Canal configuration

The MQ systems currently supported by canal are kafka and rocketmq; both are based on local log files and therefore support partition-level ordered messages. We only need to enable the following configuration when setting up an instance:

1> canal.properties

# the leader waits for the full set of in-sync replicas to acknowledge the record
canal.mq.acks = all

remarks:

  • With this setting, the record is not lost as long as at least one in-sync replica remains alive.

2> instance.properties

# number of partitions used by hash mode
canal.mq.partitionsNum=2
# hash rule: database name.table name:unique primary key; separate multiple tables with commas
canal.mq.partitionHash=test.lyf_canal_test:id

remarks:

  • The binlog events generated by inserting, updating, and deleting the same row are written to the same partition;

  • To view the messages in a specified partition of a specified topic, use the following command:

    bin/kafka-console-consumer.sh --bootstrap-server serverlist --topic topicname --from-beginning --partition 0
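Conceptually, the partitionHash rule above routes each binlog event by the hash of its primary key value modulo canal.mq.partitionsNum. The sketch below illustrates the idea only; canal's actual hashing code may differ in detail:

public class PartitionHashSketch {
    // illustrative only: the same primary key value always maps to the same partition index
    static int partitionFor(String primaryKeyValue, int partitionsNum) {
        // mask the sign bit so the index is always non-negative
        return (primaryKeyValue.hashCode() & 0x7fffffff) % partitionsNum;
    }

    public static void main(String[] args) {
        int partitionsNum = 2; // canal.mq.partitionsNum=2
        // every binlog event for the row with id=13 is routed identically
        System.out.println(partitionFor("13", partitionsNum));
        System.out.println(partitionFor("13", partitionsNum)); // same partition again
    }
}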

2. Resolving out-of-order messages with the log timestamp and log offset

Sending the data of the same order to the same partition by specifying a key resolves data disorder in most cases.

2.1 A special scenario

Consider two ordered messages A and B: normally A should be sent first and B second. Under abnormal conditions, however:

  • A fails to send while B succeeds, and, because of the retry mechanism, A is retried successfully after B has been sent;

  • the order of A and B then becomes B, A.

The Yishan real-time synchronization service therefore adds a layer of out-of-order handling before storing the subscribed data into HBase.

2.2 Two important fields in the binlog

Use mysqlbinlog to view the binlog:

/usr/bin/mysqlbinlog --base64-output=decode-rows -v /var/lib/mysql/mysql-bin.000001

Execution time and offset (sample binlog output omitted):

remarks:

  1. Each binlog event carries two important pieces of information: the execution time and the offset. The verification logic described below is based on these two values;

  2. The executed SQL statements are stored in the binlog base64-encoded; to view them, add the --base64-output=decode-rows -v parameters to decode;

  3. Offset:

    • Position indicates the offset up to which the binlog has been written, i.e. the current size of the binlog file;

    • in other words, data written later always has a larger position than data written earlier, so the order of messages can be determined by comparing position values (see the sketch below).
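To make the comparison rule concrete, here is a minimal sketch of the ordering decision (method and variable names are illustrative, not from the platform's source): a later execution time wins, and when the execution times are equal the binlog position breaks the tie:

public class BinlogOrderCheck {
    /**
     * Returns true when the incoming binlog event should overwrite the
     * state already stored for the same row key.
     */
    static boolean shouldApply(long newExecuteTime, long newPosition,
                               long storedExecuteTime, long storedPosition) {
        if (newExecuteTime != storedExecuteTime) {
            return newExecuteTime > storedExecuteTime; // later event wins
        }
        // equal execution times: the position strictly grows within a binlog
        // file, so a larger (or equal) position means a newer (or duplicate,
        // hence harmless to re-apply) event
        return newPosition >= storedPosition;
    }

    public static void main(String[] args) {
        // a newer event arriving after an older one was stored: apply it
        System.out.println(shouldApply(1614700000L, 2500L, 1614699000L, 1200L)); // true
        // a stale event arriving after a newer one was stored: discard it
        System.out.println(shouldApply(1614699500L, 1800L, 1614700000L, 2500L)); // false
    }
}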

3. Demonstration of out-of-order message handling

3.1 Insert a row into the subscribed table, then update it twice
MariaDB [test]> insert into lyf_canal_test (name,status,content) values('demo1',1,'demo1 test');
Query OK, 1 row affected (0.00 sec)
 
MariaDB [test]> update lyf_canal_test set name = 'demo update' where id = 13;
Query OK, 1 row affected (0.00 sec)
Rows matched: 1 Changed: 1 Warnings: 0
 
MariaDB [test]> update lyf_canal_test set name = 'demo update2',content='second update',status=2 where id = 13;
Query OK, 1 row affected (0.00 sec)
3.2 The three messages whose order must be preserved

The three binlog messages generated by the insert, the first update, and the second update, pushed to Kafka by the canal server, are referred to as Message A, Message B, and Message C respectively (message screenshots omitted).

3.3 Message disorder caused by the network

Suppose that, for some unknown network reason:

  • the Kafka broker receives the three messages in the order: Message A, Message C, Message B;

  • the Kafka consumer therefore consumes the three messages in the order: Message A, Message C, Message B;

  • the messages are now out of order, so the subscribed data must go through out-of-order handling before being stored into the target table.

3.4 message out of order processing logic

We take advantage of HBase's characteristics and use the data's primary key as the rowkey of the target table. When the Kafka consumer consumes data, the main out-of-order handling flow (excerpted from the technical white paper of the Xiyun Digital core big data platform) is as follows:

The three demo messages are processed as follows:
1> Message A's primary key id, used as the rowkey, does not yet exist in the HBase target table, so Message A's data is inserted into HBase directly.

2> Message C's primary key id, used as the rowkey, already exists in the target table, so the execution time in Message C must be compared with the execution time stored in the table:

  • If the execution time in Message C is less than the execution time stored in the table, Message C is a duplicate or out-of-order message and is discarded directly;

  • If the execution time in Message C is greater than the execution time stored in the table, the table data is updated directly (this demo matches this case);

  • If the execution time in Message C is equal to the execution time stored in the table, the offset of Message C must be compared with the offset stored in the table:

    • If the offset in Message C is less than the offset stored in the table, Message C is a duplicate message and is discarded directly;

    • If the offset in Message C is greater than or equal to the offset stored in the table, the table data is updated directly.

3> Message B's primary key id, used as the rowkey, also already exists in the target table, so its execution time must be compared with the execution time stored in the table:

  • Since the execution time in Message B is less than the one stored in the table (that is, Message C's execution time), Message B is discarded directly.

3.5 Main code

The Kafka consumer parses and assembles the consumed messages, and uses the HBase client API to operate on the HBase table.

1> Use Put to assemble the row data

/**
 * org.apache.hadoop.hbase.client.Put is used to build a single-row write.
 * hbaseData holds the data subscribed from the binlog; we loop over it to
 * add the rowkey, the column family and the columns for the target HBase table.
 */
Put put = new Put(Bytes.toBytes(hbaseData.get("id")));
// loop over the fields subscribed from the binlog and add each one as a
// column under the "info" column family of the target HBase table
for (String mapKey : hbaseData.keySet()) {
    put.addColumn(Bytes.toBytes("info"), Bytes.toBytes(mapKey), Bytes.toBytes(hbaseData.get(mapKey)));
}

 

2> Use checkAndMutate to update the HBase table data

The put is committed to the server only when the column value stored under the corresponding rowkey on the server meets the expected condition (greater than, less than, or equal to).

// if update_info (column family) : execute_time (column) does not exist yet
// for this rowkey, insert the data; if it already exists, skip the put and
// return false
boolean res1 = table.checkAndMutate(Bytes.toBytes(hbaseData.get("id")), Bytes.toBytes("update_info"))
        .qualifier(Bytes.toBytes("execute_time")).ifNotExists().thenPut(put);

// the row already exists: compare the execution times
if (!res1) {
    // if the execution time carried by this message is greater than the one
    // stored in HBase, apply the put
    boolean res2 = table.checkAndPut(Bytes.toBytes(hbaseData.get("id")), Bytes.toBytes("update_info"),
            Bytes.toBytes("execute_time"), CompareFilter.CompareOp.GREATER,
            Bytes.toBytes(hbaseData.get("execute_time")), put);

    // when the execution times are equal, compare the offsets: if the offset
    // carried by this message is greater than the one stored in HBase, apply the put
    if (!res2) {
        boolean res3 = table.checkAndPut(Bytes.toBytes(hbaseData.get("id")),
                Bytes.toBytes("update_info"), Bytes.toBytes("execute_position"),
                CompareFilter.CompareOp.GREATER, Bytes.toBytes(hbaseData.get("execute_position")), put);
    }
}
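For reference, on HBase 2.x the deprecated checkAndPut(...) calls above can be expressed with the same checkAndMutate builder already used for res1. This is a direct mapping of the execute_time comparison, sketched under the same column layout (CompareOperator is org.apache.hadoop.hbase.CompareOperator):

// builder-form equivalent of the checkAndPut call for execute_time:
// the put is applied under the same GREATER comparison as above
boolean applied = table.checkAndMutate(Bytes.toBytes(hbaseData.get("id")), Bytes.toBytes("update_info"))
        .qualifier(Bytes.toBytes("execute_time"))
        .ifMatches(CompareOperator.GREATER, Bytes.toBytes(hbaseData.get("execute_time")))
        .thenPut(put);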

 

4、 Summary

  1. At present, the Kafka consumer uses a single thread to consume data;

  2. If a future version upgrade switches the consumer side to multithreaded consumption, we must take into account that multithreading can disrupt the order of otherwise ordered messages (see the sketch below).
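One common way to keep ordering under multithreaded consumption (a sketch of the general pattern, not the platform's implementation) is to dispatch all messages with the same key to the same single-threaded worker, so each order is still processed serially:

import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;

public class KeyedDispatcher {
    private final ExecutorService[] workers;

    KeyedDispatcher(int threads) {
        workers = new ExecutorService[threads];
        for (int i = 0; i < threads; i++) {
            // one thread per worker keeps per-key processing serial
            workers[i] = Executors.newSingleThreadExecutor();
        }
    }

    void dispatch(String key, Runnable task) {
        // same key -> same worker -> original order preserved for that key
        int idx = (key.hashCode() & 0x7fffffff) % workers.length;
        workers[idx].submit(task);
    }

    void shutdown() {
        for (ExecutorService w : workers) {
            w.shutdown();
        }
    }

    public static void main(String[] args) {
        KeyedDispatcher dispatcher = new KeyedDispatcher(4);
        // both events of order 13 run on the same worker, in submit order
        dispatcher.dispatch("13", () -> System.out.println("insert of order 13"));
        dispatcher.dispatch("13", () -> System.out.println("update of order 13"));
        dispatcher.shutdown();
    }
}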

