Kafka adventure – Introduction to architecture

Time: 2021-03-08

In this Kafka series I will focus on the overall architecture of the system, its design, and its code implementation, digging through the source code with you and sharing skills and knowledge along the way. I hope you keep following along and that we witness this growth together!
I believe that on the road of technology, ten years pass like a single day, and ten years go into sharpening one sword!

Brief introduction

Kafka is a distributed, publish/subscribe-based messaging system. It was originally developed at LinkedIn and open-sourced in early 2011. In October 2012 it graduated from the Apache Incubator and became a top-level Apache project.

Kafka was originally designed for LinkedIn's traffic and operational data analysis. Traffic data includes PV (page views), UV (unique visitors), search data, detail-page data, and so on. In high-concurrency scenarios, gathering these statistics in real time is not as simple as incrementing a field in a database, and under a huge traffic peak the main business flow must not be blocked just to collect statistics. So this data is usually written to files or a big-data storage engine and analyzed periodically.

Kafka is favored by more and more companies mainly because of the following characteristics and advantages:

  • Message persistence with O(1) time complexity, guaranteeing O(1) access efficiency even for TB-scale data
  • Support for batched reads and writes, plus data compression, ensuring high throughput
  • Support for message partitioning, distributed production and consumption, easy horizontal scaling, and high concurrency

Application scenarios

So why do we need a message queue at all, and in which scenarios is Kafka a good fit?

Decoupling

In big-data, high-concurrency scenarios, systems are scaled horizontally and split vertically to break through performance bottlenecks, turning one complex system into several independent, focused subsystems. Data then flows between these systems, but if one service processes data too slowly it drags down the whole chain, becomes a bottleneck, and lowers the performance of the entire system, leaving some services starved while others are flooded.

Take a simple example: when an order is placed on Taobao, the trading system completes the payment deduction, and many follow-up actions are triggered: reminding the seller to ship, generating the seller's workflow, writing off coupons, adding loyalty points, and so on. If all of these steps were written into the trading system's deduction code, the trading system would very likely be dragged down, and a failure in any downstream step would also roll back the deduction. Adding a new action would require heavy changes to the trading system, which is clearly an unreasonable design. Instead, after processing the deduction, the trading system publishes a "deduction completed" message that downstream systems consume. A downstream failure no longer fails the core flow, each system's boundary is clearer, and the layering is more sensible, as the sketch below shows.
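As a minimal sketch of this pattern (the topic name, key, and payload here are my own illustrative assumptions, not from the original example), the trading system could publish the event with the standard Kafka Java producer:

```java
import org.apache.kafka.clients.producer.KafkaProducer;
import org.apache.kafka.clients.producer.ProducerRecord;
import org.apache.kafka.common.serialization.StringSerializer;

import java.util.Properties;

public class DeductionEventProducer {
    public static void main(String[] args) {
        Properties props = new Properties();
        props.put("bootstrap.servers", "localhost:9092"); // assumed local broker
        props.put("key.serializer", StringSerializer.class.getName());
        props.put("value.serializer", StringSerializer.class.getName());

        try (KafkaProducer<String, String> producer = new KafkaProducer<>(props)) {
            // Publish one "deduction completed" event. Downstream systems
            // (shipping, coupons, points, ...) subscribe to this topic instead
            // of being called inline by the trading system.
            ProducerRecord<String, String> record = new ProducerRecord<>(
                    "deduction-completed",          // hypothetical topic name
                    "order-1001",                   // key: the order id
                    "{\"orderId\":\"order-1001\",\"amount\":99.9}");
            producer.send(record, (metadata, exception) -> {
                if (exception != null) {
                    exception.printStackTrace(); // real code would retry or alert
                }
            });
        }
    }
}
```

With this in place, adding a new downstream action means adding a new subscriber, with no change to the deduction code at all.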

Data persistence

Today's applications almost always involve integration between multiple systems, with data passed between them via RPC. A failure during processing means data loss unless the data has been persisted to disk. Kafka, by contrast, persists all the data that needs to be transferred to disk, ensuring it is not lost. Another very important capability is preserving the scene for later troubleshooting and tracing: only those who have lived through a system failure that could not be reproduced know that pain!

To keep the data on disk from growing without bound, Kafka provides log retention and log compaction features to clean up historical data.
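As an illustration of how those policies attach to a topic (the topic name, partition count, and retention values below are my own assumptions), they can be set at creation time with the Java AdminClient:

```java
import org.apache.kafka.clients.admin.AdminClient;
import org.apache.kafka.clients.admin.NewTopic;

import java.util.Collections;
import java.util.HashMap;
import java.util.Map;
import java.util.Properties;

public class CreateTopicWithRetention {
    public static void main(String[] args) throws Exception {
        Properties props = new Properties();
        props.put("bootstrap.servers", "localhost:9092"); // assumed local broker

        try (AdminClient admin = AdminClient.create(props)) {
            // 3 partitions, replication factor 2 (illustrative values).
            NewTopic topic = new NewTopic("page-views", 3, (short) 2);

            Map<String, String> configs = new HashMap<>();
            configs.put("retention.ms", "604800000"); // delete segments older than 7 days
            configs.put("cleanup.policy", "delete");  // "compact" would instead keep only
                                                      // the latest value per key
            topic.configs(configs);

            admin.createTopics(Collections.singletonList(topic)).all().get();
        }
    }
}
```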

Scalability

When application traffic surges, code optimization is often not as timely as simply scaling out. Diagnosis, analysis, fixing, optimizing, and verifying form a long, complex process, which makes code optimization a long-term project; in the moment, the only ways to stop the bleeding are degradation, rate limiting, and adding machines. Kafka's scalability shows in the fact that capacity can be expanded without changing parameters or code: new brokers simply register with the cluster. Not every system can improve performance as easily as turning up a volume knob; scaling usually raises a series of complicated questions, such as whether the existing data has hot spots, how new nodes synchronize with the cluster, and how traffic gets redistributed.

Disaster recovery

The failure of some components must not bring down the whole system. A message queue reduces the coupling between processes: if an upstream or downstream service crashes, the other systems keep running, and after the service comes back online it can continue processing the data it had not yet handled. There will be some delay, but eventual business correctness is guaranteed.

Ordering

Brother Qiang: is this melon guaranteed ripe? Oh, no, wrong line. Are you queueing in order?
In most scenarios the order of data processing is crucial, and out-of-order processing can produce wrong results, unless the processing is stateless and the message merely triggers a downstream computation. Kafka guarantees ordering within a partition, but not globally across partitions, as the sketch below shows.
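A common way to exploit the per-partition guarantee (a sketch under my own assumptions; the topic and key names are invented) is to give all related messages the same key: the default partitioner hashes a non-null key to a fixed partition, so those messages stay in order relative to each other:

```java
import org.apache.kafka.clients.producer.KafkaProducer;
import org.apache.kafka.clients.producer.ProducerRecord;
import org.apache.kafka.common.serialization.StringSerializer;

import java.util.Properties;

public class OrderedEventsProducer {
    public static void main(String[] args) {
        Properties props = new Properties();
        props.put("bootstrap.servers", "localhost:9092"); // assumed local broker
        props.put("key.serializer", StringSerializer.class.getName());
        props.put("value.serializer", StringSerializer.class.getName());

        try (KafkaProducer<String, String> producer = new KafkaProducer<>(props)) {
            // All three events share the key "order-1001", so they land on the
            // same partition and a consumer sees them in exactly this order.
            // Events for other orders may go to other partitions; Kafka makes
            // no ordering promise across partitions.
            for (String event : new String[]{"created", "paid", "shipped"}) {
                producer.send(new ProducerRecord<>("order-events", "order-1001", event));
            }
        }
    }
}
```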

Core concepts

Architecture diagram

[Figure: Kafka architecture diagram]

The figure above shows a typical Kafka architecture. On the left are the message producers, which send messages to specific topics. Because of Kafka's distributed design, each topic is divided into multiple partitions, so the messages sent to a topic are stored across its partitions. In addition, if replication is configured for a topic, each partition has corresponding replicas. Topics are subscribed to by different consumers; if two consumers are in the same consumer group, each of them is assigned a fixed subset of the partitions.

Taking topic-A as an example: producer-1 sends messages to topic-A, and they are stored in two partitions on broker-2 and broker-3. Since replication is enabled for topic-A, each partition is also backed up on another node (topic-A'). The data on the brokers is then consumed by subscribers. Because consumer-1 and consumer-2 are in the same consumer group, each consumes only its assigned partition: consumer-1 receives only messages from topic-A partition-1, and consumer-2 only messages from topic-A partition-0.

Broker

In a Kafka cluster, each Kafka server is a broker. Producers deliver messages to brokers, and the broker is responsible for message persistence, fault tolerance, correctness, and so on. It also accepts subscriptions from consumers and delivers messages to them. Generally speaking, in a production environment each Kafka server runs exactly one broker.

Topic & Partition & Log

A topic can be regarded as a logical concept for storing messages; you can simply think of it as a mailbox. When sending a message you must specify which topic to send it to, and when consuming you must also specify which topic to consume from.

To improve Kafka's scalability and throughput, each topic is divided into multiple partitions, and each partition corresponds to a log. A log is a logical concept that maps to a folder on the server, which holds all the message data and message indexes for that partition. To avoid I/O bottlenecks in the face of massive data, Kafka splits each log into multiple segments; each segment contains a log file and an index file, named after the offset of the first message in the segment. This is hard to convey in words alone; a sketch of the on-disk layout follows, and in the diagram below, pay close attention to the labels and numbers of each part, which should make this paragraph easy to understand.
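For intuition, here is an illustrative layout of one partition's log directory (the directory name and offsets are made up; segment files really are named after their 20-digit, zero-padded base offset):

```
topic-a-0/                         # log directory for partition 0 of topic-a
├── 00000000000000000000.log      # segment holding offsets 0 .. 170409
├── 00000000000000000000.index    # offset index for that segment
├── 00000000000000170410.log      # next segment, starting at offset 170410
└── 00000000000000170410.index
```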

In addition, Kafka uses sequential I/O, which is extremely efficient, even faster than random writes to memory; this is one of the reasons for Kafka's high performance.

[Figure: topic, partition, and log segment diagram]

Replication

In production environments we usually enable Kafka's message redundancy feature: each partition has one or more copies, which we call replicas. When a partition has only one replica, only one copy of the partition's data is kept. Among the replicas of each partition, one leader is elected; the leader is the "point of contact" for all read and write requests, and the remaining replicas are followers. Followers have two jobs: pulling the leader's log data to serve as backups, and standing as candidates in the leader election after the leader fails.
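To see the leader/follower split in practice (reusing the hypothetical topic-a from the diagram and assuming a local broker), the Java AdminClient can report each partition's leader and replicas:

```java
import org.apache.kafka.clients.admin.AdminClient;
import org.apache.kafka.clients.admin.TopicDescription;
import org.apache.kafka.common.TopicPartitionInfo;

import java.util.Collections;
import java.util.Properties;

public class ShowPartitionLeaders {
    public static void main(String[] args) throws Exception {
        Properties props = new Properties();
        props.put("bootstrap.servers", "localhost:9092"); // assumed local broker

        try (AdminClient admin = AdminClient.create(props)) {
            TopicDescription desc = admin
                    .describeTopics(Collections.singletonList("topic-a"))
                    .all().get().get("topic-a");
            for (TopicPartitionInfo p : desc.partitions()) {
                // The leader serves all reads and writes for this partition;
                // the remaining replicas are followers mirroring its log.
                System.out.printf("partition %d: leader=%s replicas=%s%n",
                        p.partition(), p.leader(), p.replicas());
            }
        }
    }
}
```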

Producer

Producers are the source of messages; they push messages to the partitions of a topic according to some strategy. The push strategy here is the message routing mechanism. Kafka has several built-in strategies, for example routing by message key or round-robin, and users can even write extension code to define their own routing strategy, as sketched below.
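As a hedged sketch of such an extension (the class name and the "vip-" routing rule are invented for illustration), a custom strategy implements the producer's Partitioner interface and is registered through the partitioner.class config:

```java
import org.apache.kafka.clients.producer.Partitioner;
import org.apache.kafka.common.Cluster;

import java.util.Map;

// Route keys starting with "vip-" to partition 0; hash everything else.
public class VipPartitioner implements Partitioner {
    @Override
    public int partition(String topic, Object key, byte[] keyBytes,
                         Object value, byte[] valueBytes, Cluster cluster) {
        int numPartitions = cluster.partitionsForTopic(topic).size();
        String k = (key == null) ? "" : key.toString();
        if (k.startsWith("vip-")) {
            return 0;
        }
        // Mask the sign bit so the result of % is never negative.
        return (k.hashCode() & 0x7fffffff) % numPartitions;
    }

    @Override
    public void configure(Map<String, ?> configs) {}

    @Override
    public void close() {}
}
```

It would then be enabled on the producer with props.put("partitioner.class", VipPartitioner.class.getName()).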

Consumer & Consumer Group

A consumer's main job is to pull messages from the broker and process them. Each consumer maintains its own consumption progress (offset). This design has many advantages: each consumer's progress is easy to track separately, and a single consumer's offset can be adjusted to skip or re-consume certain messages, avoiding the single point of failure that comes with managing offsets centrally.

Most of today's applications are distributed systems: one application has dozens or hundreds of servers running the same code. If each of those servers executed the consumption logic once for every arriving message, wouldn't that cause huge problems?

So Kafka introduces a new concept: the consumer group. We can put an application's servers into the same consumer group, and Kafka guarantees that a message is consumed by only one consumer within the same group; this neatly avoids duplicate consumption in a distributed deployment. The situation above, where a message is handled exclusively by one server, is the unicast case. If instead we want the message to be broadcast, so that every server receiving it processes it (for example, a message telling every instance to clean up its logs), we simply put each consumer into a different consumer group.

By introducing the consumer group concept, Kafka solves both unicast and broadcast cleanly without distinguishing subscription types: one logical concept hides what would otherwise be multiple subscription implementations.
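Here is a minimal consumer sketch under the same assumptions as the earlier examples (broker address, topic, and group names are invented): running several copies with the same group.id gives unicast behavior, while giving each copy its own group.id gives broadcast:

```java
import org.apache.kafka.clients.consumer.ConsumerRecord;
import org.apache.kafka.clients.consumer.ConsumerRecords;
import org.apache.kafka.clients.consumer.KafkaConsumer;
import org.apache.kafka.common.serialization.StringDeserializer;

import java.time.Duration;
import java.util.Collections;
import java.util.Properties;

public class OrderEventsConsumer {
    public static void main(String[] args) {
        Properties props = new Properties();
        props.put("bootstrap.servers", "localhost:9092"); // assumed local broker
        // Instances sharing this group.id split the partitions among them, so
        // each message is processed once (unicast). A unique group.id per
        // instance would make every instance receive every message (broadcast).
        props.put("group.id", "order-service");
        props.put("key.deserializer", StringDeserializer.class.getName());
        props.put("value.deserializer", StringDeserializer.class.getName());

        try (KafkaConsumer<String, String> consumer = new KafkaConsumer<>(props)) {
            consumer.subscribe(Collections.singletonList("order-events"));
            while (true) {
                ConsumerRecords<String, String> records = consumer.poll(Duration.ofMillis(500));
                for (ConsumerRecord<String, String> record : records) {
                    System.out.printf("partition=%d offset=%d value=%s%n",
                            record.partition(), record.offset(), record.value());
                }
            }
        }
    }
}
```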

In addition, the partition assignment for consumers within the same consumer group is fixed; partitions are reassigned only when the group's membership changes. For example, with four partitions and three consumers, one consumer will be assigned two partitions. But with three partitions and four consumers, some consumers will sit idle and go to waste, so as a rule the number of consumers should not exceed the number of partitions in the topic.

[Figure: consumer group partition assignment diagram]

The end (a bit of rambling)

This is my first blog post of 2021. Reviewing last year at its end, I realized how badly I did: there was neither input nor output. Although work has entered a new stage and will only get busier, being busy is no excuse for refusing to grow. I must guarantee the input of one or two books a month, and the output of one high-quality article every week or two. The longest road belongs to the lonely heart; I share this with you.

In the next issue I will go through Kafka producers as a whole, including the message-sending client and send-side data buffering, looking at the design patterns and code-organization techniques from the perspective of the source code.