Kafka in practice: how to cut Kafka message latency tenfold



The customer is a large domestic tax system that is migrating its business applications to a distributed cloud architecture.

Business challenges

The figure above shows a concurrent access model that simulates the customer's business web pages. A user clicks on a page, which generates an HTTP request. When the request reaches the business producer process, a delivery thread is started. It calls the Kafka SDK and sends three messages, each about 3 KB in size, to DMS (Distributed Message Service). The response to the request is not returned until all three messages have been processed. When a message reaches DMS, the business consumer process calls the Kafka consumer interface to fetch it and hands each message to a response thread for processing. When the response thread finishes, it notifies the delivery thread through an HTTP request, and the delivery thread returns the response to the user once it has received the notifications.
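The request flow described above can be sketched in a few lines. This is a minimal, in-process simulation and not the customer's program: `queue.Queue` stands in for the DMS Kafka queue, a direct method call stands in for the HTTP notification, and names such as `PendingRequest` and `handle_http_request` are hypothetical.

```python
import queue
import threading
from concurrent.futures import ThreadPoolExecutor

MESSAGES_PER_REQUEST = 3      # from the article: three messages per HTTP request
MESSAGE_SIZE = 3 * 1024       # from the article: about 3 KB per message

broker = queue.Queue()        # stand-in for the DMS Kafka queue (assumption)


class PendingRequest:
    """Tracks how many of a request's messages are still unprocessed."""

    def __init__(self, count):
        self._remaining = count
        self._lock = threading.Lock()
        self.done = threading.Event()

    def ack(self):
        # Called by a response thread; stands in for the HTTP notification.
        with self._lock:
            self._remaining -= 1
            if self._remaining == 0:
                self.done.set()


def handle_http_request(timeout=5.0):
    """Delivery thread: send three messages, then block until all are processed."""
    pending = PendingRequest(MESSAGES_PER_REQUEST)
    for _ in range(MESSAGES_PER_REQUEST):
        broker.put((pending, b"x" * MESSAGE_SIZE))
    return pending.done.wait(timeout)   # True once all three acks have arrived


def consume_forever(pool):
    """Consumer process: fetch messages and hand each one to a response thread."""
    while True:
        item = broker.get()
        if item is None:                # shutdown sentinel
            break
        pending, _payload = item
        pool.submit(pending.ack)


def run_demo(concurrency=100):
    with ThreadPoolExecutor(max_workers=8) as responders, \
         ThreadPoolExecutor(max_workers=concurrency) as clients:
        consumer = threading.Thread(target=consume_forever, args=(responders,))
        consumer.start()
        futures = [clients.submit(handle_http_request) for _ in range(concurrency)]
        results = [f.result() for f in futures]
        broker.put(None)
        consumer.join()
    return results
```

Calling `run_demo(concurrency=100)` mimics 100 concurrent visits. Because each request blocks until all three of its messages are acknowledged, the slowest of the three determines the response time, which is why per-message latency matters so much in this model.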

With 100 concurrent requests the latency is 500 ms, which fails to meet the customer's business requirements

The customer set out clear requirements: each 2-core ECS instance should support 100 concurrent requests, and the end-to-end latency of each message, measured from the moment the producer starts sending until it receives the consumer's response, should be in the tens of milliseconds. With the DMS Kafka queue, the latency reaches about 500 ms at 100 concurrent requests and sometimes climbs to the level of seconds, which is far from what the customer needs. By comparison, when the customer runs its own native Kafka in the pod area, the measured latency at 100 concurrent requests is only about 10 to 20 ms. So the question is: under the same concurrent load, why does the latency differ so much between the Kafka queue in DMS and the native Kafka queue in the pod area? Mr. Peng, our DMS architect, analyzed the latency step by step and solved the customer's problem. Let's follow his line of reasoning.

Analysis of difficult problems

Following the customer's simulated business model, Mr. Peng built a test program in the Huawei Cloud production-like environment and likewise simulated 100 concurrent requests. The stress test there showed an average latency of about 60 ms. Why does the latency measured in the production-like environment differ so much from what the customer measured in the real production environment? The problem had become more complicated.

Mr. Peng quickly decided to run the same test program on the Huawei Cloud live network to find the cause. The program was deployed on the customer's ECS server and again simulated 100 concurrent requests. The latency results are compared in the following table:

Table 1: Latency comparison between the Huawei Cloud live network and the production-like environment

From the comparison table, Mr. Peng saw that even under the same concurrent load, latency on the Huawei Cloud live network is much worse than in the production-like environment. He realized that there were two questions to answer. Why is latency on the live network worse than in the production-like environment? And how can the latency of the DMS Kafka queue be brought in line with that of the native Kafka queue? His analysis follows.

Delay analysis

Returning to the essence of the problem: how is the latency of a DMS Kafka queue generated, and which parts of the end-to-end delay can be controlled? Mr. Peng gave the following formula:

Total delay = queuing delay + send delay + write delay + replication delay + pull delay

Queuing delay: the time from when a message enters the Kafka SDK and is placed in the queue of its target partition until it has been batched and is sent.

Send delay: the time taken for a message to travel from the producer to the server.

Write delay: the time taken to write a message to the Kafka leader.

Replication delay: consumers can only consume messages below the high watermark (that is, messages that have been saved by multiple replicas). The replication delay is therefore the time from when a message is written to the Kafka leader until all replicas have written it and the high watermark has advanced past it.

Pull delay: the time taken by the consumer to fetch the data in pull mode.
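To make the formula concrete, here is a small sketch that sums the five components and reports which one dominates. The class name `DelayBreakdown` and the sample values are hypothetical and serve only as illustration. They are not measurements from this case.

```python
from dataclasses import dataclass, fields


@dataclass
class DelayBreakdown:
    """Per-message delay components from the formula, in milliseconds."""
    queuing_ms: float
    send_ms: float
    write_ms: float
    replication_ms: float
    pull_ms: float

    def total_ms(self):
        # Total delay = queuing + send + write + replication + pull
        return sum(getattr(self, f.name) for f in fields(self))

    def dominant(self):
        """Name of the component that contributes the most to the total."""
        return max(fields(self), key=lambda f: getattr(self, f.name)).name


# Hypothetical values, for illustration only.
sample = DelayBreakdown(queuing_ms=30.0, send_ms=5.0, write_ms=1.0,
                        replication_ms=10.0, pull_ms=4.0)
print(sample.total_ms())   # 50.0
print(sample.dominant())   # queuing_ms
```

Breaking the total down this way is what allows each component to be measured and tuned separately, as the analysis below does.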

(1) Queuing delay

Which component contributes the most delay on the live network? Our test program shows that the time messages spend waiting in the queue is very large, as shown in the following figure:

In other words, messages are waiting in the producer-side queue and cannot be sent in time!
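A toy model helps show how messages can pile up on the producer side. The sketch below is a deliberate simplification and does not reproduce the Kafka SDK's internals: it assumes one partition, one request in flight at a time, and a fixed round-trip time per request. The function name and all parameters are hypothetical.

```python
def average_queue_wait(arrival_interval_ms, round_trip_ms, max_batch, n_messages):
    """Average time a message waits in the producer queue before being sent.

    Simplified model (assumption): one partition, one request in flight at a
    time, and each request carries at most `max_batch` messages.
    """
    arrivals = [i * arrival_interval_ms for i in range(n_messages)]
    sender_free_at = 0.0
    total_wait = 0.0
    i = 0
    while i < n_messages:
        # The sender starts as soon as it is free and a message is waiting.
        send_time = max(sender_free_at, arrivals[i])
        batch_end = i
        # Take every message that has already arrived, up to the batch limit.
        while (batch_end < n_messages and batch_end - i < max_batch
               and arrivals[batch_end] <= send_time):
            batch_end += 1
        total_wait += sum(send_time - arrivals[k] for k in range(i, batch_end))
        sender_free_at = send_time + round_trip_ms
        i = batch_end
    return total_wait / n_messages


# Requests complete faster than messages arrive: nothing waits.
fast = average_queue_wait(arrival_interval_ms=10, round_trip_ms=1,
                          max_batch=100, n_messages=21)
# Requests complete slower than messages arrive: messages queue up.
slow = average_queue_wait(arrival_interval_ms=1, round_trip_ms=10,
                          max_batch=100, n_messages=21)
print(fast, slow)
```

In this model, once a request takes longer than the interval between arriving messages, every later message has to wait for the sender to become free. The wait grows further if the batch limit is too small to drain the backlog.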

Now let's look at the other delays. Because they cannot be measured on the live network, we applied the same load in the production-like environment. The results are as follows:

(2) Replication delay

The following is from test 1 in the production-like environment:

In the logs, replication delay is included in remoteTime. This value also includes any slowness in the producer's write, so it is not purely replication delay. Even so, replication delay is to some extent a factor worth optimizing to reduce latency.
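The reason replication shows up in end-to-end latency follows from the high watermark rule in the definition of replication delay. A minimal sketch of that rule, with hypothetical replica names and offsets, might look like this:

```python
def high_watermark(log_end_offsets):
    """High watermark = smallest log end offset among the in-sync replicas.

    `log_end_offsets` maps a replica name to the offset of the next message
    it will write (hypothetical names and values, for illustration).
    """
    return min(log_end_offsets.values())


def visible_to_consumers(offset, log_end_offsets):
    """A message is consumable only once its offset is below the high watermark."""
    return offset < high_watermark(log_end_offsets)


replicas = {"leader": 105, "follower-1": 103, "follower-2": 100}
print(high_watermark(replicas))              # 100
print(visible_to_consumers(99, replicas))    # True
print(visible_to_consumers(102, replicas))   # False: a follower has not caught up
```

In this example the leader already holds the message at offset 102, but consumers cannot read it until the slowest in-sync follower has copied it, so a slow follower delays every consumer.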

(3) Write delay

Because the customer uses high-throughput queues that write to disk asynchronously, the logs show that the write delay (localTime) is very low, so we can conclude that it is not a bottleneck.

Send delay and pull delay both depend on network transmission. They are optimized mainly by tuning TCP parameters.
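The article does not say which TCP parameters were tuned. As an assumption, two common candidates are Nagle's algorithm and the socket buffer sizes. In Kafka the buffer sizes are exposed through settings such as `send.buffer.bytes` and `receive.buffer.bytes` on clients, and `socket.send.buffer.bytes` and `socket.receive.buffer.bytes` on brokers. The sketch below shows the underlying socket options on a plain TCP socket. The buffer sizes are placeholders, not recommended values.

```python
import socket


def make_low_latency_socket(send_buffer=128 * 1024, recv_buffer=128 * 1024):
    """Create a TCP socket with options that commonly affect latency.

    The buffer sizes are illustrative placeholders, not values from the article.
    """
    sock = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    # Disable Nagle's algorithm so small writes are sent without coalescing delay.
    sock.setsockopt(socket.IPPROTO_TCP, socket.TCP_NODELAY, 1)
    # Request larger kernel buffers; the OS may round or cap these values.
    sock.setsockopt(socket.SOL_SOCKET, socket.SO_SNDBUF, send_buffer)
    sock.setsockopt(socket.SOL_SOCKET, socket.SO_RCVBUF, recv_buffer)
    return sock


sock = make_low_latency_socket()
print(sock.getsockopt(socket.IPPROTO_TCP, socket.TCP_NODELAY) != 0)  # True
print(sock.getsockopt(socket.SOL_SOCKET, socket.SO_SNDBUF))          # OS-dependent
sock.close()
```

Whether larger buffers help depends on the bandwidth and round-trip time of the link, so any change should be verified under the same 100-concurrency load used in the tests above.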