Last week I gave an internal talk at my company introducing Kafka. Here is a summary.
What is Kafka?
An open-source message engine system and stream-processing platform. In everyday usage we mostly talk about it as a "message queue".
What is stream processing?
A stream is data; processing is action. Stream processing means continuously computing results over data as it arrives. Common scenarios include:
- Monitoring and alerting
- Log stream processing
- BI / model training
What is the MQ we often talk about?

MQ stands for message queue. A message is data; a queue is a container for messages, the familiar first-in-first-out (FIFO) data structure.

So what is its essence?

Sending, storing, and receiving messages.
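As a minimal illustration of that send/store/receive essence, here is a toy FIFO container in Python (a sketch of the concept, not Kafka code):

```python
from collections import deque

# A message queue is, at heart, a FIFO container:
# the producer appends ("send"), the queue holds ("store"),
# the consumer pops from the front ("receive").
queue = deque()

# Producer side.
for msg in ("m1", "m2", "m3"):
    queue.append(msg)

# Consumer side: messages come out in the order they went in.
received = [queue.popleft() for _ in range(len(queue))]
print(received)  # -> ['m1', 'm2', 'm3']
```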
What are the advantages and disadvantages of Kafka in MQ?
Quoted from *MQ comparison and selection* (see references):

| | ActiveMQ | RabbitMQ | RocketMQ | Kafka |
|---|---|---|---|---|
| Single-machine throughput | An order of magnitude lower than RocketMQ and Kafka | An order of magnitude lower than RocketMQ and Kafka | ~100,000 msg/s; RocketMQ also supports high throughput | ~100,000 msg/s; Kafka's biggest advantage is throughput, usually paired with big-data systems for real-time computation and log collection |
| Impact of topic count on throughput | – | – | Topics can reach the hundreds or thousands with only a slight drop in throughput; a major advantage of RocketMQ, supporting many topics on the same number of machines | With dozens to hundreds of topics, throughput drops significantly; on the same hardware keep the topic count modest, or add machines to support large-scale topics |
| Latency | ms level | Microsecond level, a major RabbitMQ feature; the lowest latency | ms level | Within ms level |
| Availability | High, based on a master-slave architecture | High, based on a master-slave architecture | Very high, distributed architecture | Very high; Kafka is distributed, with multiple replicas per partition, so a few machines going down causes neither data loss nor unavailability |
| Message reliability | Low probability of data loss | – | Zero loss achievable after parameter tuning | Zero loss achievable after parameter tuning |
| Feature completeness | Complete MQ feature set | Built on Erlang, so concurrency is strong, performance excellent, latency low | Complete MQ features and good distributed scalability | Relatively simple feature set, mainly basic MQ functions |
| Strengths | Very mature and battle-tested across many companies and projects | Erlang-based, excellent performance, low latency, ~10,000 msg/s throughput, complete MQ features, very good management UI, active community; popular among internet companies | Simple, easy-to-use interface; an Alibaba product with guaranteed throughput, convenient distributed scaling, active community; supports large numbers of topics and complex business scenarios; customizable from source | Extremely high throughput, ms-level latency, high availability and reliability, convenient distributed scaling |
| Weaknesses | Occasional low-probability message loss; community activity is low | Lower throughput; Erlang makes customization hard; dynamic cluster scaling is troublesome | Interface does not follow the standard JMS spec, so some migrations require large code changes; some risk of abandonment | Messages may be consumed repeatedly |
| Typical use | Mainly decoupling and async processing; rarely used in large-throughput scenarios | Widely used | Large-throughput and complex business scenarios | The de facto standard for big-data real-time computation and log collection |
Why did XXX choose Kafka as its unified queue? (omitted)
- Maintenance cost
- High availability
- Technology stack
What are the performance advantages of Kafka?
Zero copy (for reads)

What is zero copy?

Zero-copy means the data is never copied through user-space memory: the CPU does not move the payload at all, and the transfer is done by DMA. In Kafka's case this is the `sendfile` system call, used when serving reads to consumers.
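A small Linux-oriented sketch of the same idea using Python's `os.sendfile`; the temp file and socket pair stand in for a log segment and a consumer connection:

```python
import os
import socket
import tempfile

# A temp file stands in for a Kafka log segment,
# and a connected socket pair stands in for a consumer connection.
payload = b"hello zero copy " * 100

with tempfile.NamedTemporaryFile() as src:
    src.write(payload)
    src.flush()

    server, client = socket.socketpair()

    # os.sendfile asks the kernel to move bytes file -> socket
    # without copying them through a user-space buffer.
    with open(src.name, "rb") as f:
        sent = 0
        while sent < len(payload):
            sent += os.sendfile(server.fileno(), f.fileno(),
                                sent, len(payload) - sent)
    server.close()

    received = b""
    while chunk := client.recv(65536):
        received += chunk
    client.close()

print(len(received))  # all bytes arrived without a user-space copy
```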
For log scenarios, compression is worth considering; for other scenarios it is generally not recommended, because compression costs extra CPU.

If you send synchronously there is no batching. With batch sending, the whole batch is compressed together; with single sends, each message is compressed separately. When the payload is very small, gzip compression performs poorly and the output may even be larger than the input.
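The small-payload effect is easy to demonstrate with Python's gzip module (sizes are illustrative only):

```python
import gzip
import json

msg = json.dumps({"user_id": 42, "event": "click"}).encode()

single = gzip.compress(msg)        # one tiny message compressed alone
batch = gzip.compress(msg * 100)   # 100 messages compressed together

# A single small message can grow after gzip (header overhead,
# no redundancy to exploit), while the batch compresses dramatically.
print(len(msg), len(single), len(msg) * 100, len(batch))
```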
Sequential disk writes
Under sequential access, disk read/write throughput can approach that of memory.

A hard disk is a mechanical device: every read or write involves a seek before the transfer, and seeking is a "mechanical action", the most time-consuming part. Disks therefore hate random I/O and love sequential I/O. To maximize disk throughput, Kafka uses sequential I/O: it only ever appends to its log files.
Batch read / write
The Kafka consumer can pull multiple records in one fetch and commit the offset once for the whole batch.

The Kafka producer can likewise accumulate multiple messages and send them in one request; `batch.size` sets the batch size. Batching applies per partition, i.e. only messages bound for the same partition are batched together.

Both the producer and the consumer have two parameters controlling the batching strategy: one size-based and one time-based; a batch is dispatched as soon as either condition is met.

At present we mostly use batching on the consumer side.
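A toy model of the size-or-time flush rule (on the real producer the two parameters are `batch.size` and `linger.ms`; on the consumer, `fetch.min.bytes` and `fetch.max.wait.ms` play the analogous roles):

```python
import time

class Batcher:
    """Toy sketch of producer batching for one partition: flush when
    either the size threshold or the time threshold is reached."""

    def __init__(self, batch_size=16384, linger_ms=5):
        self.batch_size = batch_size
        self.linger_s = linger_ms / 1000.0
        self.buf = []
        self.nbytes = 0
        self.first_at = None
        self.flushed = []  # batches "sent" so far

    def append(self, record):
        if self.first_at is None:
            self.first_at = time.monotonic()
        self.buf.append(record)
        self.nbytes += len(record)
        self._maybe_flush()

    def _maybe_flush(self):
        full = self.nbytes >= self.batch_size
        lingered = (self.first_at is not None
                    and time.monotonic() - self.first_at >= self.linger_s)
        if self.buf and (full or lingered):
            self.flushed.append(self.buf)
            self.buf, self.nbytes, self.first_at = [], 0, None

# Three 4-byte records trip a 10-byte size threshold on the third append.
b = Batcher(batch_size=10, linger_ms=60_000)
for r in (b"aaaa", b"bbbb", b"cccc"):
    b.append(r)
print(len(b.flushed), len(b.flushed[0]))  # one flushed batch of 3 records
```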
Partitioning + segmentation + indexes

This involves Kafka's storage model.
Let's get hands-on with Kafka first:
Step 1: Download + start ZK + start server + create topic
Step 2: send several messages
Step 3: consume messages
Step 4: view the log file
Step 5: view the index file
Step 6: view the time index file
From this hands-on observation we find that:

- Each partition gets its own log directory.
- Each directory contains at least 3 files, collectively called a segment (a logical grouping):
- The .index file: the offset index, mapping message offsets to physical addresses so a message can be located quickly in the log file.
- The .log file: the message log itself.
- The .timeindex file: the time index, mapping timestamps to offsets, which are then resolved through the .index file.
- The offsets in the .index file and the timestamps in the .timeindex file are monotonically increasing. Why? Because Kafka uses a sparse index: it does not store a mapping for every offset, only one entry per interval. To find a message with a known offset, Kafka first locates the segment containing that offset from the segments' base offsets, then binary-searches the segment's index for the largest indexed offset not greater than the target, and finally scans the log file forward from that position to the message.
Why does the Kafka index use a sparse index?

To keep the index file small and fast to search.

Why does Kafka split the log into segments?

To keep any single log file from growing too large to search efficiently.

The above finds a message within a segment via the index. How do we know which segment a given offset falls in?

Also by binary search: each segment's file name encodes its base offset, so the right segment is the one with the largest base offset not greater than the target.
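The in-segment lookup described above can be sketched like this (a toy index; the entry spacing and byte positions are made up for illustration):

```python
import bisect

# A toy sparse offset index: one entry every 100 messages,
# mapping a message offset to a physical position in the .log file.
SPARSE_INDEX = [(0, 0), (100, 4096), (200, 8192), (300, 12288)]

def locate(target_offset):
    """Binary-search for the largest indexed offset <= target;
    the .log file is then scanned forward from that position."""
    offsets = [o for o, _ in SPARSE_INDEX]
    i = bisect.bisect_right(offsets, target_offset) - 1
    return SPARSE_INDEX[i]

print(locate(150))  # -> (100, 4096): start scanning the log at byte 4096
```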
Let's deepen the understanding of the above with a diagram:
The specific commands:

```shell
# Download and unpack Kafka
wget https://archive.apache.org/dist/kafka/1.0.0/kafka_2.11-1.0.0.tgz
tar -zxvf kafka_2.11-1.0.0.tgz
cd kafka_2.11-1.0.0
mkdir logs

# Modify the log directory: set log.dirs=logs
vim config/server.properties

# Start ZooKeeper
bin/zookeeper-server-start.sh -daemon config/zookeeper.properties

# Create a topic
bin/kafka-topics.sh --create --zookeeper localhost:2181 --replication-factor 1 --partitions 5 --topic liuli-test

# List topics
bin/kafka-topics.sh --list --zookeeper localhost:2181

# Start the Kafka server
./bin/kafka-server-start.sh config/server.properties

# Start a console producer
./bin/kafka-console-producer.sh --broker-list localhost:9092 --topic liuli-test

# Inspect the offset index file (lookup by offset)
./bin/kafka-run-class.sh kafka.tools.DumpLogSegments --files ./logs/liuli-test-4/00000000000000000000.index

# Inspect the log file
./bin/kafka-run-class.sh kafka.tools.DumpLogSegments --files ./logs/liuli-test-4/00000000000000000000.log --print-data-log

# Inspect the time index file (lookup by timestamp)
./bin/kafka-run-class.sh kafka.tools.DumpLogSegments --files ./logs/liuli-test-4/00000000000000000000.timeindex

# Consume messages from the beginning
./bin/kafka-console-consumer.sh --zookeeper localhost:2181 --topic liuli-test --from-beginning
```
A few classic Kafka interview questions
How does Kafka ensure sequencing?
How did this problem arise?
Kafka's default partitioning strategy is round-robin when neither a partition nor a key (hash) is specified. Messages are then spread across partitions by that default policy, so records with the same primary key may land in different partitions. A single partition is consumed serially, but different partitions have no ordering relative to each other; as a result, a message sent later can be consumed earlier.
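One common fix is to give ordering-sensitive messages the same key, so they hash to the same partition. A sketch (Kafka's default partitioner actually uses murmur2 on the key; CRC32 here is just for illustration):

```python
import zlib

NUM_PARTITIONS = 5

def partition_for(key: bytes) -> int:
    # Same key -> same partition -> per-key ordering is preserved.
    return zlib.crc32(key) % NUM_PARTITIONS

p1 = partition_for(b"order-42")
p2 = partition_for(b"order-42")
print(p1 == p2)  # True: both messages land in the same partition
```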
How does Kafka guarantee not to lose data?
Replication, synchronous sends with acknowledgements (`acks`), and manual offset commits on the consumer.
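A sketch of the configuration knobs usually combined for a "no loss" setup (values are illustrative, not a recommendation for every workload):

```properties
# Producer: wait for all in-sync replicas to acknowledge, retry on failure
acks=all
retries=2147483647

# Topic/broker: keep multiple copies and require a quorum of them in sync
replication.factor=3
min.insync.replicas=2
unclean.leader.election.enable=false

# Consumer: commit offsets manually, only after processing succeeds
enable.auto.commit=false
```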
How does Kafka guarantee data idempotency?
The so-called message delivery reliability guarantee refers to Kafka’s commitment to the messages to be processed by producer and consumer. There are three common commitments:
- At most once: the message may be lost, but it will never be sent repeatedly.
- At least once: the message will not be lost, but it may be sent repeatedly.
- Exactly once: the message will not be lost or sent repeatedly.
How does Kafka achieve exactly-once? In short, through two mechanisms: idempotence and transactions.

Enabling producer idempotence is simple: set a single parameter, `enable.idempotence`. When `enable.idempotence` is true, the producer is automatically upgraded to an idempotent producer. The underlying principle is the classic space-for-time trade-off: the broker stores extra fields (a producer ID and sequence numbers), and when the producer sends messages with the same values, the broker recognizes them as duplicates and silently discards them. However, this only guarantees idempotence within a single partition and a single producer session.
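For example (the first setting is the one the text names; the commented transactional setting is how the guarantee is extended across partitions and sessions, at the cost of extra coordination):

```properties
# Idempotent producer: broker deduplicates by producer id + sequence number,
# but only within a single partition and producer session.
enable.idempotence=true

# To go beyond that, use transactions instead:
# transactional.id=my-app-tx-1
```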
Kafka core principle
Key concepts for understanding Kafka:

- Topic: classifies messages; every message entering Kafka is placed under a topic.
- Broker: the host server that stores the data. Each broker is one Kafka service instance; multiple brokers form a Kafka cluster. Messages published by producers are stored on brokers, and consumers pull messages from brokers.
- Partition: the messages in a topic are split into several partitions to improve processing efficiency. A topic can have multiple partitions; each partition is an ordered queue, and each message in a partition has a monotonically increasing offset.
- Offset: the message displacement, identifying each message's position within its partition; a monotonically increasing, immutable value.
- Replica: the same message can be copied to multiple places for data redundancy; these copies are replicas. Replicas are divided into leader and follower replicas with different roles. Replication is at the partition level: each partition can be configured with multiple replicas for high availability.
- Consumer Offset: the consumer displacement, recording a consumer's consumption progress; each consumer maintains its own.
- Producer: the message producer.
- Consumer: the message consumer.
- Consumer group: a group of consumer instances that consume multiple partitions in parallel to achieve high throughput.
- Coordinator: the broker-side component that handles consumer-group registration, membership, and offset metadata, and drives Kafka's famous "rebalance" process.
- ISR: in-sync replicas, the replicas that keep up with the leader. The ISR exists mainly so that, after a broker goes down, a new partition leader can be elected from it.
- AR: assigned replicas, i.e. all replicas.
- Rebalance: when a consumer instance in a group dies, the remaining instances automatically redistribute the partitions of the subscribed topics. Rebalance is an important means by which Kafka achieves high availability on the consumer side.
What is the controller responsible for?

- Topic management
- Partition management
- Cluster membership management
- Data services (serving cluster metadata)

Kafka involves three kinds of elections:

- Controller election
- Partition leader (replica) election
- Consumer (group leader) election
Controller election. The controller is a broker. A Kafka cluster has multiple broker nodes; the controller watches the state of the other brokers, including partition state, ISR lists, and replicas. Brokers race to become controller: whoever registers first in ZooKeeper takes the position, and if the controller dies, the remaining brokers race again. If an ordinary broker dies, the controller reads the dead broker's state from ZooKeeper and notifies the other brokers; if the dead broker hosted any partition leaders, the controller also triggers leader elections for those partitions.

In short: one controller is elected from all brokers, and that controller decides the leader election for every partition. Elect the master controller first; the elected controller then governs the partition elections.

The controller notifies the brokers affected by a leader change directly via RPC (more efficient than ZooKeeper queues).
What are the advantages?
- Prevents split-brain
- Prevents the herd effect
Partition leader election. A partition has multiple replicas spread across brokers; one leader replica is elected to serve requests, and the other replicas forward any requests they receive to the leader.

Consumer election. Among the consumers in a group, a leader is elected to coordinate partition assignment. When a consumer exits, its partitions are reassigned to the remaining consumers in the group.

ISR: in-sync replicas, the set of replicas that keep up with the leader.

AR: assigned replicas, i.e. all replicas.
Kafka consumer groups

The relationship between consumer groups, consumers, topics, and partitions:

1. A single partition of a topic can be subscribed to by only one consumer within a given consumer group.
2. A consumer group can subscribe to multiple topics, though this is not recommended.

Ideally the number of consumer instances equals the total number of partitions the group subscribes to; one consumer can also consume several partitions.
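The "one partition, one consumer" rule can be pictured with a toy round-robin assignor (real Kafka assignors such as range or round-robin are selected via `partition.assignment.strategy`):

```python
def assign(partitions, consumers):
    """Toy round-robin assignment: each partition goes to exactly one
    consumer in the group; a consumer may own several partitions."""
    assignment = {c: [] for c in consumers}
    for i, p in enumerate(partitions):
        assignment[consumers[i % len(consumers)]].append(p)
    return assignment

print(assign([0, 1, 2, 3, 4], ["c1", "c2"]))
# -> {'c1': [0, 2, 4], 'c2': [1, 3]}
```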
Where do consumers store their offsets?

- Inside Kafka itself, in the internal topic `__consumer_offsets`.

What is rebalance, essentially? It is a protocol specifying how the consumers in a group agree on allocating the partitions of the subscribed topics, re-matching consumer resources to partition queues to balance the load.
When will the consumer group rebalance? There are three trigger conditions for rebalance.
- Number of members changed
- The number of subscription topics has changed
- The number of partitions for the subscription topic has changed
During rebalancing, all consumer instances will stop consuming and wait for rebalancing to complete.
When a broker starts, it creates and starts its coordinator component; that is, every broker has its own coordinator.

When a consumer application commits offsets, it actually submits them to the broker hosting its group's coordinator. Likewise, on startup the consumer sends its requests to that broker, and the coordinator handles consumer-group registration, member management, and related metadata operations.
What are the downsides of rebalance? Mainly:

- It is stop-the-world (STW): all consumers pause, hurting consumption speed.
- It is inefficient: every member must participate in a full reassignment. Couldn't something like consistent hashing be used instead?
To avoid members unexpectedly leaving the group because of unreasonable parameters or logic, the main related parameters are:

- `session.timeout.ms` (session timeout)
- `heartbeat.interval.ms` (heartbeat interval)
- `max.poll.interval.ms` (maximum interval between two polls)
- GC parameters
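A commonly cited rule of thumb is to keep `heartbeat.interval.ms` at no more than one third of `session.timeout.ms`; the values below are illustrative, not prescriptive:

```properties
# Consumer settings to avoid spurious rebalances (illustrative values)
session.timeout.ms=10000
heartbeat.interval.ms=3000
max.poll.interval.ms=300000
```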
Why doesn't Kafka let follower replicas serve reads alongside the leader, i.e. read-write splitting as in Redis and MySQL?

- To keep reads consistent with writes
- To avoid master-slave replication lag issues
What is the relationship between I / O model and Kafka?
Under the hood, the Kafka client uses Java's NIO `Selector`. On Linux the selector is implemented with epoll, while on Windows it is implemented with select. This is one reason deploying Kafka on Linux is advantageous: it gets more efficient I/O.
Why does Kafka have partitions at all? Wouldn't using multiple topics directly be enough?

- Partitions provide load balancing
- They make the system highly scalable
- They allow message ordering at the business (key) level
Unclean leader election

This can be understood as electing a leader from "unclean" replicas. A normal election picks the new leader from the ISR, because ISR replicas are in sync with the old leader. There are always some replicas that perform poorly and lag too far behind the leader; those never make it into the ISR. But what if every ISR replica is down and none are left? To keep the cluster available, a leader must then be chosen from the lagging replicas; this is unclean leader election. Its drawback is that enabling it can cause data loss.
What is the high watermark? (to be studied further)

It is a special message displacement marker that:

- Defines message visibility: it marks which messages in a partition are consumable by consumers
- Helps Kafka complete replica synchronization
Application of Kafka in XXX (omitted)
References:

- Geek Time column "Kafka Core Technology and Practice"
- MQ comparison and selection: https://note.dolyw.com/mq/00-…
- Understanding Kafka zero copy: https://blog.csdn.net/qq_3786…