Kafka introduction

Time:2020-2-26

In the previous article, I introduced the use cases for message queuing. Now I will introduce Kafka.

Main features of Kafka:

  • High throughput for both publishing and subscribing. Kafka is reported to produce about 250,000 messages (50 MB) per second and to process about 550,000 messages (110 MB) per second.
  • Persistence. Messages are persisted to disk, so they can serve both bulk consumers such as ETL jobs and real-time applications. Persisting to disk, combined with replication, prevents data loss.
  • A distributed system that is easy to scale. Producers, brokers, and consumers can each have multiple instances, all distributed, and machines can be added without downtime.
  • Message-processing state is maintained on the consumer side, not the server side, and consumption is rebalanced automatically when a consumer fails.
  • Online and offline scenarios are supported.
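
As a quick sanity check on the throughput figures in the first bullet, both numbers imply the same average message size. The short calculation below assumes decimal megabytes (1 MB = 1,000,000 bytes), which the source of the figures does not state.

```python
MB = 1_000_000  # assumption: decimal megabytes

# Average message size implied by each reported figure.
produce_bytes_per_msg = 50 * MB / 250_000
consume_bytes_per_msg = 110 * MB / 550_000

print(produce_bytes_per_msg, consume_bytes_per_msg)
```

Both work out to about 200 bytes per message, so the figures describe small messages rather than large payloads.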

Kafka architecture

Kafka’s overall architecture is simple: it is an explicitly distributed system in which there can be multiple producers, brokers (Kafka servers), and consumers. Producers and consumers implement Kafka’s registration interface. Data is sent from a producer to a broker, which acts as an intermediate cache and distributes the data to the consumers registered in the system. In this sense the broker works like a cache between live data and offline processing systems. Clients and servers communicate over a simple, high-performance, language-independent protocol built on TCP.

Some basic concepts

  • Topic: a category of message feed handled by Kafka.
  • Partition: a physical subdivision of a topic. A topic can be split into multiple partitions, and each partition is an ordered queue. Every message in a partition is assigned a sequential ID, its offset.
  • Message: the basic unit of communication. A producer publishes messages to a topic.
  • Producer: a process that publishes messages to a Kafka topic.
  • Consumer: a process that subscribes to topics and processes the messages published to them.
  • Broker: a server in the Kafka cluster that acts as the caching agent. A cluster consists of one or more brokers.

Message sending process

1. The producer publishes each message to a partition of the specified topic, chosen by the configured partitioning method (round robin, hash, etc.).

2. After receiving a message from the producer, the Kafka cluster persists it to disk and retains it for a configurable period, regardless of whether it has been consumed.

3. The consumer pulls data from the Kafka cluster and controls the offset from which it reads messages.
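
Steps 2 and 3 can be illustrated with a small in-memory model. This is only a sketch of the idea, not Kafka's storage code: the `PartitionLog` class, its method names, and the 60-second retention value are all made up for this example, and a real broker writes to disk rather than to a Python list.

```python
import time
from dataclasses import dataclass, field


@dataclass
class PartitionLog:
    """Toy stand-in for one partition: an append-only list of (timestamp, message)."""
    retention_seconds: float
    base_offset: int = 0          # offset of the oldest record still retained
    records: list = field(default_factory=list)

    def append(self, message, now=None):
        """Step 2: store the message and return the offset assigned to it."""
        now = time.time() if now is None else now
        self.records.append((now, message))
        return self.base_offset + len(self.records) - 1

    def expire(self, now=None):
        """Drop records older than the retention period, consumed or not."""
        now = time.time() if now is None else now
        while self.records and now - self.records[0][0] > self.retention_seconds:
            self.records.pop(0)
            self.base_offset += 1

    def fetch(self, offset, max_records=10):
        """Step 3: the consumer pulls from an offset that it chooses itself."""
        start = max(offset, self.base_offset) - self.base_offset
        return [message for _, message in self.records[start:start + max_records]]


log = PartitionLog(retention_seconds=60)
for i, text in enumerate(["a", "b", "c"]):
    log.append(text, now=float(i))   # arrival times 0, 1, 2

position = 0                         # the consumer, not the log, tracks this
batch = log.fetch(position)
position += len(batch)               # the consumer advances its own offset

replay = log.fetch(0)                # re-reading works: consuming deletes nothing
log.expire(now=61.5)                 # "a" and "b" are now past retention
remaining = log.fetch(0)             # only "c" is left
```

The point of the sketch is that the log never learns what has been consumed. Messages disappear only because they age out, and a consumer can rewind simply by asking for an earlier offset.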

Message storage policy

Any discussion of Kafka storage has to start with partitions. When creating a topic, you can specify its number of partitions. More partitions mean higher throughput, but they also require more resources and increase the risk of unavailability. After receiving messages from producers, Kafka stores them in different partitions according to a balancing strategy.

Within each partition, messages are stored in order, so the most recently received messages are consumed last.

Interaction with producers

  • When a producer sends messages to the Kafka cluster, it can direct them to a specific partition by naming that partition explicitly.
  • Alternatively, it can spread messages across partitions by specifying a balancing strategy.
  • If neither is specified, the default random balancing strategy stores each message in a randomly chosen partition.
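
The three cases above can be sketched as a single function. This is a simplified illustration, not the partitioner shipped with any Kafka client: `choose_partition` and its parameters are hypothetical names, and CRC32 stands in for whatever hash a real client uses.

```python
import random
import zlib


def choose_partition(num_partitions, key=None, partition=None, rng=random):
    """Pick a partition index for one message.

    Precedence follows the three cases in the text:
    an explicit partition, then a key-based strategy, then random.
    """
    if partition is not None:
        if not 0 <= partition < num_partitions:
            raise ValueError("partition out of range")
        return partition
    if key is not None:
        # Stable hash: the same key always maps to the same partition.
        # CRC32 is an illustrative choice only.
        return zlib.crc32(key.encode("utf-8")) % num_partitions
    # Nothing specified: fall back to a random partition.
    return rng.randrange(num_partitions)


explicit = choose_partition(4, partition=2)
keyed_a = choose_partition(4, key="user-42")
keyed_b = choose_partition(4, key="user-42")

rng = random.Random(0)               # seeded so the run is repeatable
spread = [choose_partition(4, rng=rng) for _ in range(100)]
```

Hashing a key is a common balancing strategy because all messages with the same key land in the same partition, which keeps them in order relative to each other.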

Interaction with consumers

  • When consumers consume messages, Kafka uses an offset to record the current consumption position.
  • By design, multiple consumer groups can consume messages from the same topic at the same time. As shown in the figure, two different groups consume messages simultaneously, and their recorded offsets do not interfere with each other.
  • Within a group, the number of consumers should not exceed the number of partitions, because each partition can be bound to at most one consumer in the group. In other words, one consumer can consume multiple partitions, but one partition can be consumed by only one consumer.
  • If a group has more consumers than partitions, the surplus consumers will not receive any messages.
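
The group rules above can be sketched with a simple round-robin assignment. This is a minimal illustration under my own assumptions: `assign_partitions` is a hypothetical helper, and Kafka's actual assignment strategies may distribute partitions differently. It only demonstrates the constraint that each partition has a single owner per group.

```python
def assign_partitions(partitions, consumers):
    """Give each partition to exactly one consumer in the group (round robin)."""
    assignment = {consumer: [] for consumer in consumers}
    for i, partition in enumerate(partitions):
        owner = consumers[i % len(consumers)]
        assignment[owner].append(partition)
    return assignment


partitions = [0, 1, 2]

# Fewer consumers than partitions: one consumer handles several partitions.
small_group = assign_partitions(partitions, ["c1", "c2"])

# More consumers than partitions: the surplus consumer gets nothing.
large_group = assign_partitions(partitions, ["c1", "c2", "c3", "c4"])

# Each group keeps its own offset per partition, so groups do not interfere.
offsets = {
    "group-a": {p: 0 for p in partitions},
    "group-b": {p: 0 for p in partitions},
}
offsets["group-a"][0] += 5   # group-a reads five messages from partition 0

print(small_group)
print(large_group)
print(offsets["group-b"][0])
```

In the larger group, `c4` ends up with an empty list, which is the idle surplus consumer described in the last bullet. The offset held by `group-b` stays at 0 even after `group-a` advances.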