[Kafka] Kafka principle learning notes

Time: 2020-9-16

Two modes of message queue

Point to point mode

Advantage: the consumer client controls the rate at which it consumes messages
Disadvantage: the consumer client needs a thread that continuously polls the queue for new messages

Publish / subscribe mode

No polling thread is needed, because messages are pushed to the subscribers, but the push rate cannot take each client's consumption speed into account
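
The difference between the two modes can be sketched with a toy example. This is only an illustration using Python's standard `queue` module, not Kafka code; `pull_consumer` and `PushBroker` are hypothetical names made up for this sketch.

```python
import queue


def pull_consumer(q, batch_size):
    """Point to point: the consumer decides how many messages to take."""
    taken = []
    while len(taken) < batch_size:
        try:
            taken.append(q.get_nowait())
        except queue.Empty:
            break  # nothing left to pull right now
    return taken


class PushBroker:
    """Publish / subscribe: every message is pushed to every subscriber."""

    def __init__(self):
        self.subscribers = []

    def subscribe(self, callback):
        self.subscribers.append(callback)

    def publish(self, message):
        for callback in self.subscribers:
            # Delivered immediately, regardless of how fast the subscriber is.
            callback(message)


# Pull: five messages are waiting, the consumer chooses to take only two.
q = queue.Queue()
for i in range(5):
    q.put(f"m{i}")
first_batch = pull_consumer(q, batch_size=2)

# Push: the subscriber receives all five as soon as they are published.
received = []
broker = PushBroker()
broker.subscribe(received.append)
for i in range(5):
    broker.publish(f"m{i}")
```

In the pull case the remaining three messages stay in the queue until the consumer asks again; in the push case the subscriber has no say in the delivery rate.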

Why is a message queue needed?

1. Decoupling: A and B do not need a direct connection or synchronization; they communicate through middleware
2. Redundancy: the message queue buffers messages and can keep backup copies of the data
3. Scalability: for example, a Kafka cluster can be expanded
4. Flexibility and peak handling: capacity can be expanded flexibly, and the extra capacity absorbs traffic peaks
5. Recoverability: redundant backups make recovery possible after a failure
6. Ordering guarantee (within a partition)
7. Buffering: balances the difference between the sending rate and the consuming rate
8. Asynchronous communication

Kafka Architecture & Workflow

  1. A Kafka cluster has one or more brokers; each broker is one machine in the cluster
  2. A topic can have one or more partitions, and a partition can have multiple replicas. With multiple replicas there is one leader and the rest are followers. Only the leader serves clients; a follower only communicates with the leader to keep its replica in sync. The leader and a follower of the same partition cannot be on the same machine, so one machine failure does not take out every copy. Spreading partitions across brokers is mainly for load balancing
  3. Consumers belong to consumer groups. Within one group, a partition is consumed by at most one consumer, while different consumers (threads) in the group can read different partitions at the same time, which improves concurrency
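
The consumer group rule above (one partition goes to at most one consumer in the group) can be illustrated with a simple round-robin assignment. This is a sketch only: `assign_partitions` is a hypothetical helper, and Kafka's real assignment strategies are more involved.

```python
def assign_partitions(partitions, consumers):
    """Give each partition to exactly one consumer of the group, round-robin."""
    assignment = {consumer: [] for consumer in consumers}
    for i, partition in enumerate(sorted(partitions)):
        owner = consumers[i % len(consumers)]
        assignment[owner].append(partition)
    return assignment


# Four partitions shared by a group of three consumers.
assignment = assign_partitions([0, 1, 2, 3], ["c1", "c2", "c3"])

# More consumers than partitions: the extra consumer gets nothing to read.
idle_case = assign_partitions([0, 1], ["c1", "c2", "c3"])
```

No partition appears under two consumers, so adding consumers beyond the number of partitions does not add parallelism within a group.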

Production data flow

Writing mode

The producer publishes messages to the broker in push mode. Each message is appended to the corresponding partition, which is a sequential disk write (sequential disk writes can be faster than random memory writes, which keeps throughput high)

Partition

The messages in each partition are ordered: produced messages are continuously appended to the partition log, and each message has an offset value within its partition
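
A partition can be pictured as an append-only list in which the offset is simply the position of the message. The sketch below is an in-memory illustration; `PartitionLog` is a hypothetical class, not part of any Kafka client.

```python
class PartitionLog:
    """Append-only log: the offset of a message is its position in the log."""

    def __init__(self):
        self._records = []

    def append(self, message):
        offset = len(self._records)  # next free position
        self._records.append(message)
        return offset

    def read(self, offset, max_records=10):
        """Return up to max_records messages starting at the given offset."""
        return self._records[offset:offset + max_records]


log = PartitionLog()
offsets = [log.append(m) for m in ("a", "b", "c")]
from_offset_1 = log.read(1)
```

Because messages are only ever appended, reading from a saved offset always resumes in the original order.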

Why partition

As mentioned above, partitioning is for load balancing and higher concurrency

Partitioning rules

1. User-defined partitioning
2. If there is no user-defined partitioning but the message has a key, the partition is chosen from the hash of the key
3. If neither a partitioning method nor a key is specified, round-robin is used (round-robin is also for load balancing)
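
The three rules above can be sketched as one selection function. Assumptions in this sketch: user-defined partitioning is modeled as an explicitly supplied partition number, and `zlib.crc32` stands in for the hash function (Kafka's own client uses a different hash). `SimplePartitioner` is a hypothetical name.

```python
import itertools
import zlib


class SimplePartitioner:
    """Pick a partition: explicit choice first, then key hash, then round-robin."""

    def __init__(self, num_partitions):
        self.num_partitions = num_partitions
        self._round_robin = itertools.cycle(range(num_partitions))

    def choose(self, key=None, partition=None):
        if partition is not None:
            return partition  # rule 1: the caller decided
        if key is not None:
            # rule 2: the same key always hashes to the same partition
            return zlib.crc32(key.encode("utf-8")) % self.num_partitions
        return next(self._round_robin)  # rule 3: spread keyless messages evenly


partitioner = SimplePartitioner(num_partitions=3)
explicit = partitioner.choose(partition=2)
keyed_first = partitioner.choose(key="user-1")
keyed_second = partitioner.choose(key="user-1")
keyless = [partitioner.choose() for _ in range(4)]
```

Hashing by key keeps all messages with the same key in one partition, which is what preserves their relative order.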

Producer writing process

  1. After the producer determines the target partition according to the partitioning rules, it first obtains the leader of that partition from the broker list
  2. The producer sends the message to the leader
  3. The leader writes the message to its local log
    (if ack = all, steps 4 to 6 follow)
  4. Followers pull the message from the leader
  5. Each follower sends an ACK to the leader after writing the message to its local log
  6. After receiving the ACK from all followers, the leader sends an ACK to the producer
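
The write path above can be simulated in a few lines to show which confirmations the producer's response depends on for the ACK settings 0, 1 and all. This is a simplified, single-threaded sketch: `Replica` and `send` are hypothetical names, and in this toy version followers replicate only when ack = all, whereas real followers keep pulling in the background.

```python
class Replica:
    """One copy of a partition, holding its own local log."""

    def __init__(self, name):
        self.name = name
        self.log = []

    def write(self, message):
        self.log.append(message)
        return self.name  # stands in for the ACK sent after the local write


def send(message, leader, followers, acks):
    """Return the list of ACKs the producer's response depended on."""
    leader_ack = leader.write(message)  # step 3: leader writes its local log
    if acks == 0:
        return []  # producer did not wait for anything
    if acks == 1:
        return [leader_ack]  # producer waited for the leader only
    if acks == "all":
        # steps 4-5: each follower copies the latest message and acknowledges
        follower_acks = [f.write(leader.log[-1]) for f in followers]
        return [leader_ack] + follower_acks  # step 6
    raise ValueError(f"unsupported acks value: {acks!r}")


leader = Replica("leader")
followers = [Replica("follower-1"), Replica("follower-2")]
waited_0 = send("m0", leader, followers, acks=0)
waited_1 = send("m1", leader, followers, acks=1)
waited_all = send("m2", leader, followers, acks="all")
```

The more confirmations the producer waits for, the safer the message is against a broker failure, at the cost of higher latency per send.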

ACK confirmation mechanism:

ACK = 0: the producer does not wait for a confirmation from the leader after sending data
ACK = 1: the producer waits for the confirmation from the leader
ACK = all: the leader waits for confirmations from all followers (in practice, the in-sync replicas). After receiving all of them, the leader sends its confirmation to the producer, which has been waiting for it

Broker saves message

Storage mode

Physically, a topic is divided into one or more partitions. Each partition corresponds to a folder, which stores all the message files and index files of that partition
If ZooKeeper (ZK) is specified, consumer metadata (such as the currently read offset) is stored in ZK
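
As an illustration of the folder layout, the sketch below builds the names one would expect on disk, assuming the usual convention of a `<topic>-<partition>` folder whose segment files are named after their zero-padded base offset. The root path `/tmp/kafka-logs` and the topic `orders` are made-up values, and `partition_dir` and `segment_files` are hypothetical helpers that only build names and touch no files.

```python
from pathlib import PurePosixPath


def partition_dir(log_root, topic, partition):
    """Folder that holds the files of one partition of a topic."""
    return PurePosixPath(log_root) / f"{topic}-{partition}"


def segment_files(base_offset):
    """Message file and index file of the segment starting at base_offset."""
    name = f"{base_offset:020d}"  # base offset, zero-padded to 20 digits
    return [f"{name}.log", f"{name}.index"]


folder = partition_dir("/tmp/kafka-logs", "orders", 0)
first_segment = segment_files(0)
```

Naming a segment after its first offset lets a reader find the file that contains a given offset from the file names alone.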

Comparison and integration with Flume

Positioning

Flume:

1. Suitable when there are multiple producers
2. Suitable when there are not many downstream consumers (each consumer needs its own replicated channel to be configured)
3. Suitable when data safety requirements are not high (most enterprises use the memory channel for efficiency)
4. Suitable for integration with the Hadoop ecosystem

Kafka:

1. Suitable when there are many downstream consumers (changing the number of consumers does not require changing the Kafka cluster configuration; a new consumer is simply added when needed)
2. Suitable when data safety requirements are high, since it supports replication

A typical company pipeline is:
Flume multiple agents (collecting logs) => Flume single agent (aggregation) => Kafka
After Kafka there are two branches:

  1. Kafka => (Flume =>) HDFS: store the logs for offline analysis, such as results updated in the early morning
  2. Kafka => Spark Streaming: online business