This article was first published on the official account "Five Minutes Big Data".
In interviews, I have found that many interviewers like to ask Kafka-related questions, which is easy to understand: Kafka is the undisputed king of message queues in the big data field. With single-machine throughput in the hundred-thousand range and millisecond-level latency, who wouldn't love this natively distributed message queue?
In a recent interview, the interviewer saw Kafka listed among the projects on the resume and asked about nothing but Kafka. Let's take a look at the questions:
(The following answers were compiled after the interview; only about one third of them were given in the actual interview.)
1. Why use Kafka?
- Buffering and peak shaving: when upstream data arrives in bursts, the downstream may not be able to keep up, or may not have enough machines to provide redundancy. Kafka acts as a buffer in the middle: messages are held temporarily in Kafka, and downstream services process them at their own pace.
- Decoupling and scalability: at the start of a project, the specific requirements cannot be pinned down. A message queue can serve as an interface layer that decouples important business processes. Each side only needs to follow the agreed contract and program against the data to gain the ability to scale.
- Redundancy: a one-to-many model can be used. A producer publishes a message once, and multiple unrelated services that subscribe to the topic can each consume it.
- Robustness: the message queue can accumulate requests, so even if a consumer service goes down for a short time, the main business keeps running normally.
- Asynchronous communication: often, users do not want or need to process a message immediately. A message queue provides an asynchronous processing mechanism that lets users put a message on the queue without processing it right away. Put as many messages on the queue as you like, then process them when needed.
2. How can Kafka re-consume messages that have already been consumed?
In older Kafka versions, the consumer offset is stored in ZooKeeper. If you want to consume Kafka messages repeatedly, you can record offset checkpoints (n of them) in Redis. When you need to replay messages, read the checkpoint from Redis and reset the offset stored in ZooKeeper to that value. The consumer then reads the same messages again.
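The checkpoint-and-reset idea can be sketched without a running cluster. The following is a minimal illustration only: `ReplayableConsumer` is a hypothetical class, a Python list stands in for one partition, and a plain dict stands in for Redis. It shows why resetting the offset to a saved checkpoint replays the same messages.

```python
class ReplayableConsumer:
    """Toy consumer over an in-memory list standing in for one partition."""

    def __init__(self, log, checkpoint_store):
        self.log = log                        # list of messages; index == offset
        self.offset = 0                       # offset of the next message to read
        self.checkpoints = checkpoint_store   # dict standing in for Redis

    def poll(self, max_records):
        """Read up to max_records messages and advance the offset."""
        batch = self.log[self.offset:self.offset + max_records]
        self.offset += len(batch)
        return batch

    def save_checkpoint(self, name):
        """Record the current offset under a name."""
        self.checkpoints[name] = self.offset

    def seek_to_checkpoint(self, name):
        """Reset the offset to a previously recorded checkpoint."""
        self.offset = self.checkpoints[name]


consumer = ReplayableConsumer(["m0", "m1", "m2", "m3", "m4"], checkpoint_store={})
first = consumer.poll(2)           # reads m0, m1
consumer.save_checkpoint("n")      # checkpoint at offset 2
second = consumer.poll(3)          # reads m2, m3, m4
consumer.seek_to_checkpoint("n")   # rewind to offset 2
replayed = consumer.poll(3)        # reads m2, m3, m4 again
print(first, second, replayed)
```

In a real deployment the checkpoint would be written to an external store and the reset would go through whatever mechanism the Kafka version in use provides for changing a consumer's offset.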
3. Does Kafka store its data on disk or in memory, and why is it so fast?
Kafka uses disk storage.
The speed is fast because:
- Sequential writes: a hard disk is a mechanical device, so every read or write involves seeking to a position first. Seeking is a "mechanical action" and is time-consuming, which is why hard disks dislike random I/O and favor sequential I/O. To speed up disk reads and writes, Kafka uses sequential I/O.
- Memory-mapped files: on a 64-bit operating system, a memory-mapped file can generally represent a data file of around 20 GB. It works by using the operating system's pages to map the file directly onto physical memory. Once the mapping is in place, operations on that memory are synchronized to the hard disk by the operating system.
- Kafka's efficient file storage design: Kafka splits the single large file of a partition in a topic into multiple small segment files. With multiple small segment files, it is easy to periodically clean up or delete files that have already been consumed, which reduces disk usage. The index information makes it possible to locate a message quickly and to determine the size of the response. Because the index metadata is fully mapped into memory (as a memory-mapped file), disk I/O on the segment file can be avoided. Because the index file is stored sparsely, the space taken by index metadata is greatly reduced.
- One way Kafka improves lookup efficiency is by segmenting the data file. For example, suppose there are 100 messages with offsets 0 to 99 and the data file is divided into five segments: the first holds 0-19, the second 20-39, and so on. Each segment is stored in its own data file, and the data file is named after the smallest offset in the segment. When searching for a message with a given offset, a binary search over the file names locates the segment that contains it.
- Indexing the data file: segmentation makes it possible to look for a given offset in a smaller data file, but finding the message still requires a sequential scan. To further improve lookup efficiency, Kafka builds an index file for each segment data file. The index file has the same name as the data file, with the extension .index.
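The two lookup steps described above, binary search over segment base offsets followed by a sparse-index lookup and a short sequential scan, can be sketched as follows. This is a simplified model, not Kafka's implementation: the function names, the fixed 10-byte message size, and the "index every 4th message" interval are assumptions chosen for illustration. The zero-padded 20-digit file name follows Kafka's segment naming.

```python
from bisect import bisect_right

# Base offsets of the five segments in the example (0-19, 20-39, ...).
segment_base_offsets = [0, 20, 40, 60, 80]


def find_segment(base_offsets, offset):
    """Binary-search the sorted base offsets for the segment holding `offset`."""
    i = bisect_right(base_offsets, offset) - 1
    if i < 0:
        raise KeyError(f"offset {offset} is before the first segment")
    return base_offsets[i]


def segment_file_name(base_offset):
    """Segment data files are named after their smallest offset, zero-padded."""
    return f"{base_offset:020d}.log"


def build_sparse_index(message_sizes, every=4):
    """Index only every `every`-th message: (relative offset, byte position)."""
    index, position = [], 0
    for rel, size in enumerate(message_sizes):
        if rel % every == 0:
            index.append((rel, position))
        position += size
    return index


def locate(index, message_sizes, rel_offset):
    """Find the nearest indexed entry at or before rel_offset, then scan forward."""
    keys = [rel for rel, _ in index]
    rel, position = index[bisect_right(keys, rel_offset) - 1]
    while rel < rel_offset:              # short sequential scan inside the segment
        position += message_sizes[rel]
        rel += 1
    return position


sizes = [10] * 20                        # hypothetical: 20 messages of 10 bytes each
sparse_index = build_sparse_index(sizes, every=4)
base = find_segment(segment_base_offsets, 37)
print(segment_file_name(base), locate(sparse_index, sizes, 37 - base))
```

The sparse index trades a short scan (at most `every - 1` messages here) for an index that is a fraction of the size of a dense one, which is what keeps it small enough to map into memory.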
4. How does Kafka ensure that data is not lost?
There are three sides to consider: the producer side, the consumer side, and the broker side.
- No loss of producer data
Kafka's ack mechanism: every time the producer sends a message, an acknowledgement mechanism confirms that the message was received. The ack setting takes the values 0, 1, and -1.
For synchronous mode:
Setting ack to 0 is very risky and generally not recommended. Even with ack set to 1, data is lost if the leader goes down before the followers have copied the message. To strictly guarantee that producer-side data is not lost, set it to -1.
For asynchronous mode:
The ack setting also applies. In addition, asynchronous mode uses a buffer to control sending, governed by two thresholds: a time threshold and a message-count threshold. If the buffer is full and the data has not yet been sent, an option controls whether the buffer is cleared immediately. This option can be set to -1, which blocks permanently so that no more data is produced. Even with that setting, data can still be lost in asynchronous mode through improper operation by the programmer, such as kill -9, but that is an exceptional case.
ACK = 0: the producer does not wait for any confirmation from the broker and continues sending the next message (or batch).
ACK = 1 (the default before Kafka 3.0): the producer waits until the leader has successfully received the data and confirmed it before sending the next message.
ACK = -1: the producer sends the next message only after the followers have also confirmed receipt.
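The difference between the three settings shows up when the leader fails right after a write. The toy model below is a sketch under simplifying assumptions: `Partition`, `send`, and `survives_leader_failure` are hypothetical names, every follower counts as in sync, and replication happens only when the producer asks for it (ack = -1). It does not call any Kafka API.

```python
class Partition:
    """Toy model of one partition: a leader log and follower logs."""

    def __init__(self, follower_count=2):
        self.leader = []
        self.followers = [[] for _ in range(follower_count)]

    def replicate(self):
        """Followers copy everything the leader currently holds."""
        for follower in self.followers:
            follower[:] = self.leader

    def fail_leader(self):
        """The leader dies and the first follower takes over as leader."""
        self.leader = self.followers.pop(0)


def send(partition, message, acks):
    """Write to the leader and report what the producer has been told."""
    partition.leader.append(message)
    if acks == 0:
        return "unconfirmed"            # producer does not wait at all
    if acks == 1:
        return "leader confirmed"       # followers may not have the message yet
    partition.replicate()               # acks == -1: wait for the followers
    return "replicas confirmed"


def survives_leader_failure(acks):
    """Send one message, kill the leader, and check whether the message remains."""
    partition = Partition()
    send(partition, "order-1", acks)
    partition.fail_leader()
    return "order-1" in partition.leader


for acks in (0, 1, -1):
    print(acks, survives_leader_failure(acks))
```

In this model only ack = -1 keeps the message, because it is the only setting where the acknowledgement is tied to the followers already holding a copy.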
- No loss of consumer data
Kafka records the offset of each consumption itself; the next time the consumer resumes, it continues from the last recorded offset.
The offset information was saved in ZooKeeper in early Kafka versions (by default before 0.9) and in an internal Kafka topic in later versions. Even if the consumer goes down while running, it looks up the offset value when it restarts, finds the position of the last consumed message, and continues consuming from there. Because the offset is not written after every single message, this can cause repeated consumption, but messages are not lost.
The only exception is when two consumer groups in the program that were meant to perform different functions are given the same group id (through the setGroupId call on the KafkaSpoutConfig builder). The two groups then share one copy of the data: group A consumes the messages in partition1 and partition2, while group B consumes the messages in partition3. Each group therefore sees only part of the messages, so its data is incomplete. To ensure that each group receives its own full copy of the message data, group ids must not be repeated.
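The "repeated consumption but no loss" behavior follows directly from committing the offset less often than messages are processed. The sketch below is a hypothetical simulation (`run_consumer`, `commit_every`, and `crash_after` are illustrative names, and a list stands in for the partition): the consumer commits every three messages, crashes after the fourth, and restarts from the last committed offset.

```python
def run_consumer(log, committed, commit_every, crash_after=None):
    """Process messages starting at `committed`; commit every `commit_every` messages.

    Returns (messages processed in this run, committed offset at the end).
    If `crash_after` is set, the run stops after that many messages and any
    progress since the last commit is forgotten.
    """
    processed = []
    position = committed
    while position < len(log):
        if crash_after is not None and len(processed) == crash_after:
            return processed, committed          # crash: uncommitted progress lost
        processed.append(log[position])          # process first ...
        position += 1
        if position - committed >= commit_every:
            committed = position                 # ... commit afterwards
    return processed, position                   # clean finish commits the final position


log = ["m0", "m1", "m2", "m3", "m4", "m5"]
first_run, committed = run_consumer(log, committed=0, commit_every=3, crash_after=4)
second_run, committed = run_consumer(log, committed=committed, commit_every=3)
print(first_run, second_run, committed)
```

Here the message at offset 3 is processed in both runs, because the crash happened after it was processed but before its offset was committed. Every message is still seen at least once.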
- No loss of broker data in the Kafka cluster
We usually set a replication factor for the partitions on the brokers. When the producer writes, it first writes to the leader according to the distribution policy (if a partition is specified, use that partition; otherwise, if a key is given, choose the partition by the key; if neither is given, use round-robin). The followers then synchronize the data from the leader. With these backup copies, message data is not lost.
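The three-step distribution policy can be written down compactly. This is a sketch only: `SimplePartitioner` is a hypothetical class, and `zlib.crc32` is used as a stand-in hash (Kafka's own default partitioner hashes keys with murmur2, so real partition numbers will differ).

```python
import zlib
from itertools import count


class SimplePartitioner:
    """Picks a partition: explicit partition, then key hash, then round-robin."""

    def __init__(self, num_partitions):
        self.num_partitions = num_partitions
        self._counter = count()              # drives the round-robin fallback

    def choose(self, key=None, partition=None):
        if partition is not None:
            return partition                 # 1. caller named the partition
        if key is not None:
            # 2. same key always maps to the same partition (crc32 is a stand-in hash)
            return zlib.crc32(key) % self.num_partitions
        # 3. no partition and no key: rotate through the partitions
        return next(self._counter) % self.num_partitions


partitioner = SimplePartitioner(num_partitions=3)
explicit = partitioner.choose(partition=2)
keyed_a = partitioner.choose(key=b"user-42")
keyed_b = partitioner.choose(key=b"user-42")
rotating = [partitioner.choose() for _ in range(4)]
print(explicit, keyed_a == keyed_b, rotating)
```

Hashing the key is what gives per-key ordering: all messages with the same key land in one partition and are therefore read in the order they were written.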
5. Why choose Kafka for data collection?
Flume, Kafka, and similar technologies can be used in the collection layer.
Flume: Flume works as a pipeline. It provides many built-in implementations, letting users deploy it through configuration parameters and extend it through its API.
Kafka: Kafka is a persistent distributed message queue. Kafka is a very general-purpose system: many producers and many consumers can share multiple topics.
Flume, by contrast, is a specialized tool designed to send data to HDFS and HBase. It has specific optimizations for HDFS and integrates Hadoop's security features.
Therefore, Cloudera recommends Kafka if the data is consumed by multiple systems, and Flume if the data is intended for Hadoop.
6. Will restarting Kafka cause data loss?
- Kafka writes data to disk, so data is generally not lost.
- However, if consumers are consuming messages while Kafka restarts and an offset cannot be committed in time, the data may become inaccurate (lost or consumed repeatedly).
7. How do you handle Kafka downtime?
- First consider whether the business is affected
When Kafka goes down, the first thing to consider is whether the services being provided are affected by the failed machine. If service is unaffected and the cluster's disaster recovery mechanism is well implemented, there is no need to worry.
- Troubleshoot and recover the node
To recover a cluster node, the main step is to analyze the logs to find the cause of the node failure, fix the problem, and bring the node back up.
8. Why does Kafka not support read-write separation?
In Kafka, both the producer writing messages and the consumer reading messages interact with the leader replica, so Kafka implements a leader-write, leader-read model of production and consumption.
Kafka does not support master-write, slave-read, because that model has two obvious disadvantages:
- Data consistency problem: there is inevitably a delay window while data travels from the master node to the slave node, and during that window the two can disagree. Suppose that at some moment the value of data item A is X on both the master and the slave, and A is then changed to Y on the master. Before that change reaches the slave, an application reading A from the slave gets X rather than the latest value Y, which is a data inconsistency.
- Delay problem: for a component like Redis, getting data from a write on the master to the slave takes the path network → master node memory → network → slave node memory, and the whole process takes a certain amount of time. In Kafka, master-slave synchronization is more time-consuming than in Redis, because it goes through network → master node memory → master node disk → network → slave node memory → slave node disk. For delay-sensitive applications, master-write, slave-read is not suitable.
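The consistency window in the first point can be reproduced with a few lines. This is a toy model, not how any real replication protocol works: `ReplicatedValue` is a hypothetical class in which a write lands on the master immediately and reaches the slave only when `sync()` is called.

```python
class ReplicatedValue:
    """Toy master/slave pair for a single data item with delayed replication."""

    def __init__(self, value):
        self.master = value
        self.slave = value
        self._pending = []                 # writes not yet applied on the slave

    def write(self, value):
        """Writes go to the master; the slave learns about them later."""
        self.master = value
        self._pending.append(value)

    def sync(self):
        """Apply all pending writes on the slave, in order."""
        while self._pending:
            self.slave = self._pending.pop(0)


item_a = ReplicatedValue("X")
item_a.write("Y")
stale_read = item_a.slave                  # read from the slave before replication
fresh_read = item_a.master                 # read from the master
item_a.sync()
read_after_sync = item_a.slave
print(stale_read, fresh_read, read_after_sync)
```

Reading from the master, as Kafka does with the leader, avoids the stale read entirely, at the cost of not spreading reads across replicas.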
Kafka's leader-write, leader-read model, by contrast, has several advantages:
- It simplifies the implementation logic of the code and reduces the chance of errors;
- Compared with master-write, slave-read, it balances load more effectively, and the load distribution is under the user's control;
- It is not affected by replication delay;
- When the replicas are stable, data inconsistency does not occur.