How to deal with Kafka cluster message backlog in big data development


Enterprises usually have their Kafka producers write data to the cluster in a round-robin or random fashion, so that data is distributed as evenly as possible across Kafka partitions.
Given evenly distributed partition data, and a partition count sized reasonably for the topic's data volume and other factors, real-time jobs such as Spark Streaming / Structured Streaming or Flink applications integrated with Kafka will keep consuming continuously as long as the consumer does not "hang" for a long time, so a Kafka data backlog generally does not occur.
However, these are preconditions. When something unexpected happens, or the partition count is set unreasonably, a backlog is inevitable.
Typical scenarios of Kafka message backlog:
1. The real-time / consumer task hangs up
For example, a real-time application we wrote dies for some reason, the task is not covered by monitoring (or the problem is found but the person in charge is not notified), and no script exists to automatically restart it.
Until we restart the application and resume consumption, the messages produced during that period are delayed. If the data volume is large, simply restarting the application and consuming directly will not solve the problem.
2. The number of Kafka partitions is unreasonable (too few) and the "consumption capacity" of consumers is insufficient
A single Kafka partition can usually sustain a very high produce QPS. If consumers take longer to process each message for some reason (for example, complex business logic), consumption will lag behind production.
Moreover, the partition count is the smallest unit of Kafka parallelism tuning. If it is set too low, it caps the throughput that Kafka consumers can achieve.
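To see why the partition count caps parallelism, note that within one consumer group each partition is assigned to exactly one consumer, so consumers beyond the partition count sit idle. The sketch below is a simplified, hypothetical round-robin assignment (the class and method names are illustrative, not Kafka's actual assignor API):

```java
import java.util.*;

public class PartitionAssignment {
    // Simplified round-robin assignment of partitions to consumers in one
    // group; each partition goes to exactly one consumer, so with more
    // consumers than partitions some consumers receive nothing.
    static Map<String, List<Integer>> assign(int numPartitions, List<String> consumers) {
        Map<String, List<Integer>> result = new LinkedHashMap<>();
        for (String c : consumers) result.put(c, new ArrayList<>());
        for (int p = 0; p < numPartitions; p++) {
            result.get(consumers.get(p % consumers.size())).add(p);
        }
        return result;
    }

    public static void main(String[] args) {
        // 3 partitions, 5 consumers: two consumers are assigned no partition at all.
        Map<String, List<Integer>> a = assign(3, Arrays.asList("c1", "c2", "c3", "c4", "c5"));
        long idle = a.values().stream().filter(List::isEmpty).count();
        System.out.println("idle consumers: " + idle);
    }
}
```

Adding consumers beyond the partition count therefore buys nothing; the partition count itself must grow.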
3. Message keys are uneven, resulting in unbalanced data between partitions
When producing messages to Kafka, you can specify a key for each message, but the keys should be uniformly distributed; otherwise data will be unbalanced across partitions.
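The skew is easy to demonstrate. Kafka's default partitioner hashes the serialized key bytes (with murmur2) modulo the partition count; the sketch below uses Java's `String.hashCode()` as a simplified stand-in, and the class name and key values are made up for illustration:

```java
import java.util.*;

public class KeySkew {
    // Simplified stand-in for Kafka's default partitioner: Kafka actually
    // applies murmur2 to the serialized key bytes; hashCode() is used here
    // only to illustrate the "hash(key) % numPartitions" idea.
    static int partitionFor(String key, int numPartitions) {
        return Math.floorMod(key.hashCode(), numPartitions);
    }

    static int[] distribution(List<String> keys, int numPartitions) {
        int[] counts = new int[numPartitions];
        for (String k : keys) counts[partitionFor(k, numPartitions)]++;
        return counts;
    }

    public static void main(String[] args) {
        // 90% of the messages share one hot key, so one partition
        // receives at least 90% of the data.
        List<String> keys = new ArrayList<>();
        for (int i = 0; i < 900; i++) keys.add("hot-user");
        for (int i = 0; i < 100; i++) keys.add("user-" + i);
        System.out.println(Arrays.toString(distribution(keys, 6)));
    }
}
```

Every message with the same key lands in the same partition, so one hot key concentrates the load on a single consumer no matter how many partitions exist.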
So, given the situations above, how can the data backlog be handled?
Targeted solutions generally include the following:
1. Consumption lag caused by the real-time / consumer task hanging up
a. After the task restarts, consume the latest messages directly, and backfill the "lagged" historical data with an offline job.
In addition, it is advisable to bring the task into the monitoring system so that the person in charge is notified promptly when something goes wrong. An automatic restart script is also necessary, and the real-time framework needs robust exception handling so that non-conforming data cannot leave the task unable to restart.
b. Resume consumption from the last committed offset
If the backlog is large, the task's processing capacity must be increased, for example by adding resources, so that it consumes and processes as fast as possible and catches up with the latest messages.
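As a minimal sketch of option (b): with a stable `group.id`, a restarted Kafka consumer resumes from the group's last committed offset by default, and `auto.offset.reset` only applies when no committed offset exists. The property names below are standard Kafka consumer configs, but the broker address, group id, and chosen values are assumptions for illustration:

```java
import java.util.Properties;

public class ConsumerConfigSketch {
    // With a stable group.id, a restarted consumer resumes from the
    // group's last committed offset; auto.offset.reset is consulted
    // only when no committed offset is found for the group.
    static Properties backlogCatchUpConfig(String bootstrap, String groupId) {
        Properties props = new Properties();
        props.put("bootstrap.servers", bootstrap);
        props.put("group.id", groupId);            // stable id -> committed offsets are reused
        props.put("enable.auto.commit", "false");  // commit manually after successful processing
        props.put("auto.offset.reset", "earliest");// fallback only, when no offset is committed
        props.put("max.poll.records", "1000");     // pull larger batches while catching up
        return props;
    }

    public static void main(String[] args) {
        Properties p = backlogCatchUpConfig("broker1:9092", "realtime-job");
        System.out.println(p.getProperty("group.id"));
    }
}
```

Committing offsets manually after processing avoids losing or double-counting messages when the task is killed and restarted mid-batch.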
2. Too few Kafka partitions
If the data volume is large, the key is to increase the number of Kafka partitions appropriately. When using Spark Streaming with the Kafka direct approach, the Kafka RDD can also be repartitioned to increase parallelism.
3. Unreasonable Kafka message keys causing unbalanced partition data
On the Kafka producer side, append a random suffix to the key so that data is balanced across partitions.
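Continuing the simplified `hash(key) % numPartitions` model from above (again using `hashCode()` as a stand-in for Kafka's murmur2, with illustrative names and counts), salting the hot key with a bounded random suffix fans its messages out over all partitions. Note the trade-off: messages sharing a logical key no longer land in one partition, so per-key ordering is lost and consumers must strip the suffix or re-aggregate downstream.

```java
import java.util.*;

public class KeySalting {
    static int partitionFor(String key, int numPartitions) {
        return Math.floorMod(key.hashCode(), numPartitions);
    }

    // Append a bounded random suffix so one hot key spreads across partitions.
    static String salt(String key, int buckets, Random rnd) {
        return key + "-" + rnd.nextInt(buckets);
    }

    public static void main(String[] args) {
        Random rnd = new Random(42);
        int[] counts = new int[6];
        // The same hot key, salted into 64 buckets, now hits every partition.
        for (int i = 0; i < 6000; i++) {
            counts[partitionFor(salt("hot-user", 64, rnd), 6)]++;
        }
        System.out.println(Arrays.toString(counts));
    }
}
```

The number of salt buckets is a tuning knob: it should be at least a small multiple of the partition count so that every partition receives a share of the hot key's traffic.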
