How to deal with Kafka cluster message backlog in big data development


Enterprises usually have their Kafka producers write data to the cluster in a round-robin or random fashion, so that data is distributed as evenly as possible across Kafka partitions.
Given evenly distributed partition data, and a partition count sized reasonably for the topic's data volume and other factors, real-time jobs such as Spark Streaming / Structured Streaming or Flink applications integrated with Kafka will keep consuming continuously as long as the consumer does not "hang" for a long time, so a Kafka data backlog generally does not occur.
However, these are preconditions. When something unexpected happens, or the partition count is set unreasonably, a backlog is inevitable.
Typical scenarios of Kafka message backlog:
1. The real-time / consumer task hangs up
For example, a real-time application we wrote dies for some reason, the task is not covered by monitoring (or the problem is found but the person in charge is not notified), and no script exists to automatically restart it.
Until we restart the application and resume consumption, the messages produced during that period are delayed. If the data volume is large, simply restarting the application and consuming directly will not solve the problem.
2. The number of Kafka partitions is unreasonable (too few) and the "consumption capacity" of consumers is insufficient
A single Kafka partition can usually sustain a very high produce QPS. If consumers take longer to process each message for some reason (for example, complex business logic), consumption will lag behind production.
Moreover, the partition count is the smallest unit of Kafka parallelism tuning. If it is set too low, it caps the throughput that Kafka consumers can achieve.
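To see why the partition count caps parallelism, note that within one consumer group each partition is assigned to exactly one consumer, so consumers beyond the partition count sit idle. The sketch below is a simplified, hypothetical round-robin assignment (the class and method names are illustrative, not Kafka's actual assignor API):

```java
import java.util.*;

public class PartitionAssignment {
    // Simplified round-robin assignment of partitions to consumers in one
    // group; each partition goes to exactly one consumer, so with more
    // consumers than partitions some consumers receive nothing.
    static Map<String, List<Integer>> assign(int numPartitions, List<String> consumers) {
        Map<String, List<Integer>> result = new LinkedHashMap<>();
        for (String c : consumers) result.put(c, new ArrayList<>());
        for (int p = 0; p < numPartitions; p++) {
            result.get(consumers.get(p % consumers.size())).add(p);
        }
        return result;
    }

    public static void main(String[] args) {
        // 3 partitions, 5 consumers: two consumers are assigned no partition at all.
        Map<String, List<Integer>> a = assign(3, Arrays.asList("c1", "c2", "c3", "c4", "c5"));
        long idle = a.values().stream().filter(List::isEmpty).count();
        System.out.println("idle consumers: " + idle);
    }
}
```

Adding consumers beyond the partition count therefore buys nothing; the partition count itself must grow.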
3. Message keys are uneven, resulting in unbalanced data between partitions
When producing messages to Kafka, you can specify a key for each message, but the keys should be uniformly distributed; otherwise data will be unbalanced across partitions.
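The skew is easy to demonstrate. Kafka's default partitioner hashes the serialized key bytes (with murmur2) modulo the partition count; the sketch below uses Java's `String.hashCode()` as a simplified stand-in, and the class name and key values are made up for illustration:

```java
import java.util.*;

public class KeySkew {
    // Simplified stand-in for Kafka's default partitioner: Kafka actually
    // applies murmur2 to the serialized key bytes; hashCode() is used here
    // only to illustrate the "hash(key) % numPartitions" idea.
    static int partitionFor(String key, int numPartitions) {
        return Math.floorMod(key.hashCode(), numPartitions);
    }

    static int[] distribution(List<String> keys, int numPartitions) {
        int[] counts = new int[numPartitions];
        for (String k : keys) counts[partitionFor(k, numPartitions)]++;
        return counts;
    }

    public static void main(String[] args) {
        // 90% of the messages share one hot key, so one partition
        // receives at least 90% of the data.
        List<String> keys = new ArrayList<>();
        for (int i = 0; i < 900; i++) keys.add("hot-user");
        for (int i = 0; i < 100; i++) keys.add("user-" + i);
        System.out.println(Arrays.toString(distribution(keys, 6)));
    }
}
```

Every message with the same key lands in the same partition, so one hot key concentrates the load on a single consumer no matter how many partitions exist.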
So, given the situations above, how can the data backlog be handled?
Targeted solutions generally include the following:
1. Consumption lag caused by the real-time / consumer task hanging up
a. After the task restarts, consume the latest messages directly, and backfill the "lagged" historical data with an offline job.
In addition, it is advisable to bring the task into the monitoring system so that the person in charge is notified promptly when something goes wrong. An automatic restart script is also necessary, and the real-time framework needs robust exception handling so that non-conforming data cannot leave the task unable to restart.
b. Resume consumption from the last committed offset
If the backlog is large, the task's processing capacity must be increased, for example by adding resources, so that it consumes and processes as fast as possible and catches up with the latest messages.
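As a minimal sketch of option (b): with a stable `group.id`, a restarted Kafka consumer resumes from the group's last committed offset by default, and `auto.offset.reset` only applies when no committed offset exists. The property names below are standard Kafka consumer configs, but the broker address, group id, and chosen values are assumptions for illustration:

```java
import java.util.Properties;

public class ConsumerConfigSketch {
    // With a stable group.id, a restarted consumer resumes from the
    // group's last committed offset; auto.offset.reset is consulted
    // only when no committed offset is found for the group.
    static Properties backlogCatchUpConfig(String bootstrap, String groupId) {
        Properties props = new Properties();
        props.put("bootstrap.servers", bootstrap);
        props.put("group.id", groupId);            // stable id -> committed offsets are reused
        props.put("enable.auto.commit", "false");  // commit manually after successful processing
        props.put("auto.offset.reset", "earliest");// fallback only, when no offset is committed
        props.put("max.poll.records", "1000");     // pull larger batches while catching up
        return props;
    }

    public static void main(String[] args) {
        Properties p = backlogCatchUpConfig("broker1:9092", "realtime-job");
        System.out.println(p.getProperty("group.id"));
    }
}
```

Committing offsets manually after processing avoids losing or double-counting messages when the task is killed and restarted mid-batch.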
2. Too few Kafka partitions
If the data volume is large, the key is to increase the number of Kafka partitions appropriately. When using Spark Streaming with the Kafka direct approach, the Kafka RDD can also be repartitioned to increase parallelism.
3. Unreasonable Kafka message keys causing unbalanced partition data
On the Kafka producer side, append a random suffix to the key so that data is balanced across partitions.
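Continuing the simplified `hash(key) % numPartitions` model from above (again using `hashCode()` as a stand-in for Kafka's murmur2, with illustrative names and counts), salting the hot key with a bounded random suffix fans its messages out over all partitions. Note the trade-off: messages sharing a logical key no longer land in one partition, so per-key ordering is lost and consumers must strip the suffix or re-aggregate downstream.

```java
import java.util.*;

public class KeySalting {
    static int partitionFor(String key, int numPartitions) {
        return Math.floorMod(key.hashCode(), numPartitions);
    }

    // Append a bounded random suffix so one hot key spreads across partitions.
    static String salt(String key, int buckets, Random rnd) {
        return key + "-" + rnd.nextInt(buckets);
    }

    public static void main(String[] args) {
        Random rnd = new Random(42);
        int[] counts = new int[6];
        // The same hot key, salted into 64 buckets, now hits every partition.
        for (int i = 0; i < 6000; i++) {
            counts[partitionFor(salt("hot-user", 64, rnd), 6)]++;
        }
        System.out.println(Arrays.toString(counts));
    }
}
```

The number of salt buckets is a tuning knob: it should be at least a small multiple of the partition count so that every partition receives a share of the hot key's traffic.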
