Kafka: the TA that real time can’t do without



1、 Preface

With advances in technology and growing market demand, real-time development has become an indispensable part of big data development. Across the whole real-time pipeline, collected data is written to Kafka, and data processing reads from Kafka as well. Today we give a brief introduction to Kafka, the current mainstream message middleware.

2、 Message queue: the destination of data flow

In real-time development, data from all kinds of behaviors and events flows into real-time tasks like a river and continuously produces results. In traditional heterogeneous data sources, data is stored in structured form in the corresponding database tables. Streaming data raises two questions. Apart from the business-time attribute carried by the data itself, how do we find a stable time dimension that describes the order of the data? And where should the streaming data be processed?

Message queues are designed for exactly these scenarios, where large volumes of data must be delivered and analyzed.

At present, there are two message queue models:

  • Point to point (queue): messages are kept in a queue. One or more consumers can read from the queue, but a given message can be consumed by at most one consumer. Once a consumer reads a message, it disappears from the queue.
  • Publish / subscribe (topic): a message can be consumed by all subscribers (groups). In a publish / subscribe system, a message producer is called a publisher and a message consumer is called a subscriber. Unlike in a point-to-point system, a consumer group can subscribe to one or more topics and consume all the messages in those topics. Likewise, every message published to a topic can be consumed by all subscribing groups, and a group may contain multiple subscribers.
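
The difference between the two models can be sketched in a few lines of Python. This is only an in-memory illustration, not how any real broker is implemented; the class names `PointToPointQueue` and `PubSubTopic` and the group names are made up for the example.

```python
from collections import deque


class PointToPointQueue:
    """Each message is delivered to at most one consumer, then removed."""

    def __init__(self):
        self._messages = deque()

    def send(self, message):
        self._messages.append(message)

    def receive(self):
        # Whoever calls receive() first takes the message out of the queue.
        return self._messages.popleft() if self._messages else None


class PubSubTopic:
    """Messages are retained; every subscribed group reads all of them."""

    def __init__(self):
        self._log = []        # retained messages, in publish order
        self._positions = {}  # group name -> index of the next message to read

    def publish(self, message):
        self._log.append(message)

    def subscribe(self, group):
        self._positions.setdefault(group, 0)

    def poll(self, group):
        position = self._positions[group]
        if position >= len(self._log):
            return None
        self._positions[group] = position + 1
        return self._log[position]


queue = PointToPointQueue()
queue.send("order-1")
first_reader = queue.receive()   # takes "order-1" out of the queue
second_reader = queue.receive()  # nothing is left for a second consumer

topic = PubSubTopic()
topic.subscribe("billing")
topic.subscribe("monitoring")
topic.publish("order-1")
billing_copy = topic.poll("billing")
monitoring_copy = topic.poll("monitoring")
```

In the queue, the second reader gets nothing because the message is gone after the first read. In the topic, each group keeps its own read position, so both groups see the same message.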

To better understand how message queues work, imagine the following scenario: data is a parcel, and the flow of data between development stages is the parcel's delivery process.
1. TV shopping: door-to-door delivery, customer signs for receipt

Ten years ago, when TV shopping was still popular, most goods were delivered by carriers such as the postal service. When the courier arrived, the customer had to sign the waybill to accept the parcel. Only after a parcel had been signed for would the courier start delivering the next one (this is an extreme case, used for illustration).

When a customer has several parcels arriving one after another, the same loop repeats: the courier delivers, waits for the signature, the customer signs, the courier returns to the depot to fetch the next parcel, and delivers again. If the customer is slow to respond or slow to sign, it takes even longer.

Data transmission in traditional data development follows the same rule. Passing data between upstream and downstream services is like delivering a parcel. If each transmission must wait for a receipt from the downstream service to confirm that the data was written, then the downstream service's processing and response speed directly limits the link and causes data delay. If the whole pipeline contains several such steps, the timeliness of the data cannot be guaranteed.
2. Express logistics: a unified express station

As online shopping has grown, parcel distribution has changed considerably to improve efficiency. Now the courier picks up parcels at the depot and delivers them to the express station for the corresponding area. The station signs for them once on behalf of the actual recipients, and at that point the delivery is considered complete. The courier can quickly return to the depot, and the station then notifies each recipient. A recipient with a parcel waiting comes to the station to collect it before a given deadline. The recipient only needs to keep watching the station's status (a subscription) and can pick up a parcel soon after it arrives.
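
The cost of the two delivery styles can be compared with some simple arithmetic. The sketch below uses hypothetical timings (1 second to hand over a message, 5 seconds for the downstream service to process it) and only measures how long the upstream sender is kept busy.

```python
def synchronous_total(transfer_seconds, downstream_seconds, n_messages):
    """Upstream waits for the downstream receipt before sending the next message."""
    return n_messages * (transfer_seconds + downstream_seconds)


def buffered_upstream_total(transfer_seconds, n_messages):
    """Upstream only waits for the hand-off to the queue (the 'express station')."""
    return n_messages * transfer_seconds


# Hypothetical numbers: 1 s per hand-off, 5 s of downstream processing per message.
sync_time = synchronous_total(1, 5, n_messages=10)
buffered_time = buffered_upstream_total(1, n_messages=10)
```

With these numbers the upstream side is occupied for 60 seconds in the synchronous case and 10 seconds when a queue sits in between. The downstream work still has to happen, but it no longer blocks the sender.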

Once we are familiar with how a parcel travels from the warehouse to the recipient, we can understand how message middleware works in real-time development. Among the many message middleware products, the most widely used is Apache Kafka.

3、 Kafka: Message Middleware

Apache Kafka is a distributed messaging system that supports partitions and replicas and has traditionally relied on ZooKeeper for coordination (newer releases can also run without ZooKeeper, in KRaft mode). It is used to process large amounts of data in real time and is common in big data, data mining, and similar scenarios.

Kafka often involves the following basic concepts:

  • ZooKeeper: coordinates independent brokers so that they form a Kafka cluster;
  • Broker: a Kafka cluster contains one or more servers, each of which is called a broker;
  • Topic: a message topic in Kafka, similar to the concept of a table, used to separate different kinds of messages;
  • Partition: a subdivision of a topic. Each topic can have multiple partitions, which make it easier to scale out and increase concurrency.

For easier understanding, Kafka can be loosely compared with the express delivery process described above.

1. Data writing

1) Determine topic and partition

A topic may have multiple partitions. When writing data to Kafka, you first need to determine the topic and the target partition.
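
One common way to pick the partition is to hash the record key, and to rotate through the partitions when there is no key. The sketch below illustrates that idea only. `choose_partition` is a hypothetical helper, and `zlib.crc32` is a stand-in hash chosen for the example; Kafka's own clients use a different hash and may treat unkeyed records differently.

```python
import zlib
from itertools import count


def choose_partition(key, num_partitions, round_robin):
    """Pick a partition index for a record (illustrative only)."""
    if key is None:
        # No key: spread records across the partitions in turn.
        return next(round_robin) % num_partitions
    # Keyed record: the same key always maps to the same partition.
    # zlib.crc32 is a stand-in hash for this sketch, not Kafka's own.
    return zlib.crc32(key.encode("utf-8")) % num_partitions


counter = count()
first = choose_partition("user-42", 3, counter)
second = choose_partition("user-42", 3, counter)
unkeyed = [choose_partition(None, 3, counter) for _ in range(3)]
```

Because records with the same key land in the same partition, their relative order is preserved within that partition, while unkeyed records are spread evenly.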

2) Find the partition address

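Because Kafka replicates partitions for high availability, once the target partition is chosen the producer must find that partition's leader and communicate with it. The leader information is part of the cluster metadata, which is maintained in ZK in ZooKeeper-based clusters; current clients fetch this metadata from the brokers rather than querying ZK directly.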

3) Data transmission

  • The leader receives the producer's message and writes it to its local log
  • The followers pull the message from the leader, write it to their local logs, and then send an ACK to the leader
  • Once the leader has received ACKs from all in-sync followers, it advances the HW (high watermark) and then sends an ACK to the producer
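
The three steps above can be modeled with a small in-memory sketch. It is a simplification: every follower is assumed to be in sync, the class names `Replica` and `Partition` are invented for the example, and the high watermark is taken as the lowest log end offset across all replicas.

```python
class Replica:
    def __init__(self, name):
        self.name = name
        self.log = []

    @property
    def log_end_offset(self):
        return len(self.log)


class Partition:
    def __init__(self, leader, followers):
        self.leader = leader
        self.followers = followers
        self.high_watermark = 0

    def append(self, record):
        """Step 1: the leader writes the record to its local log."""
        self.leader.log.append(record)

    def follower_fetch(self, follower):
        """Step 2: a follower pulls the records it is missing from the leader."""
        follower.log.extend(self.leader.log[follower.log_end_offset:])
        self._advance_high_watermark()

    def _advance_high_watermark(self):
        """Step 3: the HW only moves as far as every replica has caught up."""
        self.high_watermark = min(
            replica.log_end_offset for replica in [self.leader, *self.followers]
        )

    def is_acknowledged(self, offset):
        # A record can be acknowledged once its offset lies below the HW.
        return offset < self.high_watermark


leader = Replica("broker-1")
follower_a = Replica("broker-2")
follower_b = Replica("broker-3")
partition = Partition(leader, [follower_a, follower_b])

partition.append("m0")
hw_before = partition.high_watermark         # no follower has the record yet
partition.follower_fetch(follower_a)
hw_after_one = partition.high_watermark      # one follower is still behind
partition.follower_fetch(follower_b)
hw_after_all = partition.high_watermark      # every replica now holds "m0"
acked = partition.is_acknowledged(0)
```

The record at offset 0 only counts as acknowledged after the slowest replica has copied it, which is why a lagging follower delays the ACK to the producer in this model.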

2. Consumption pattern and distribution strategy

When consuming data, Kafka consumers interact with a topic as a consumer group and are assigned the corresponding partitions. Within a group, a message is not consumed twice, but different groups can each consume the same data, which is the defining characteristic of a publish / subscribe system.

Developers can use this feature to monitor business data in real time without affecting the main business process.

A group contains at least one consumer, and a topic contains at least one partition. Multiple consumers in a group can consume different partitions in parallel, which raises the parallelism of consumption and the speed of data processing. Depending on the number of partitions and the number of consumers, several situations can arise, and Kafka has an allocation strategy for each.
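
To make the different situations concrete, here is a minimal sketch of a range-style assignment, in which partitions are split into contiguous blocks across the sorted consumers of a group. It is modeled on the general idea only; `range_assign` is a hypothetical helper, and Kafka also ships other strategies, such as round-robin.

```python
def range_assign(num_partitions, consumers):
    """Split partitions 0..num_partitions-1 into contiguous ranges per consumer."""
    members = sorted(consumers)
    base, extra = divmod(num_partitions, len(members))
    assignment, start = {}, 0
    for index, member in enumerate(members):
        # The first `extra` consumers each take one additional partition.
        size = base + (1 if index < extra else 0)
        assignment[member] = list(range(start, start + size))
        start += size
    return assignment


fewer_consumers = range_assign(3, ["c1", "c2"])
equal_numbers = range_assign(3, ["c1", "c2", "c3"])
more_consumers = range_assign(2, ["c1", "c2", "c3"])
```

With fewer consumers than partitions, some consumers read several partitions. With equal numbers, each consumer reads one. With more consumers than partitions, the surplus consumers receive nothing and stay idle.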

4、 How to use Kafka in real time development

In production, real-time development likewise uses groups of producers and consumers to write and read the corresponding data in Kafka.


In a real-time collection task, data from the data source is collected into Kafka. By setting the write concurrency, multiple producers can write to the same topic at once, which improves concurrency and write efficiency. Similarly, when Kafka itself is the data source, setting the read concurrency places multiple consumers in one group so that they consume the topic's data at the same time.

In a real-time development task, the parallelism of a Kafka data source can also be set, so that it can be tuned to the actual business load and consumption requirements.
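
When choosing a read parallelism, it helps to remember that within one group a partition is read by at most one consumer, so readers beyond the partition count have nothing to do. A small sketch of that rule of thumb follows; `effective_read_parallelism` is a hypothetical helper, not a setting of any particular tool.

```python
def effective_read_parallelism(configured_readers, num_partitions):
    """Return (active, idle) reader counts for one consumer group."""
    active = min(configured_readers, num_partitions)
    idle = configured_readers - active
    return active, idle


under_provisioned = effective_read_parallelism(2, 6)
over_provisioned = effective_read_parallelism(8, 6)
```

In the first case both readers are busy, each covering several partitions. In the second, six readers are active and two sit idle, so raising parallelism further would require adding partitions to the topic.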

5、 Conclusion

In this introduction we have seen how Kafka, as a typical publish / subscribe message queue, temporarily stores streaming data for its users and supports concurrent reads and writes through consumer groups and partitions, which improves the efficiency of real-time development. We will continue to cover topics related to real-time development in future articles, so stay tuned.

We at DTStack have an interesting open source project on GitHub and Gitee: FlinkX. FlinkX is a data synchronization tool based on Flink that unifies batch and stream processing. It can collect both static data and data that changes in real time, and serves as a global, heterogeneous, batch-and-stream data synchronization engine. If you like it, please give us a star! star! star!

GitHub open source project:https://github.com/DTStack/fl…

Gitee open source project:https://gitee.com/dtstack_dev…