Hello, I’m Kafka. Maybe you’ve heard of me. I was born at LinkedIn in 2011, and my capabilities have grown ever since. As a complete platform, you can store a huge amount of data on me redundantly, and I provide a message bus with enormous throughput (millions of messages per second) on which you can process the data passing through me in real time.
If you think that’s all I can do, you’re only scratching the surface.
All of the above is true, but it doesn’t touch my core. Here are a few keywords: distributed, horizontally scalable, fault tolerant, commit log.
I will explain these abstract words one by one and tell you how I work.
Inner monologue: I originally wanted to write this whole article in the first person, but I found I could only keep it up for the paragraphs above. So rather than embarrass myself further, the rest is written in the third person.
A distributed system is composed of several computer systems running together as a cluster, which appears to end users as a single node.
Kafka is also distributed: it stores, receives, and sends messages across different nodes (known as brokers). The benefits are high scalability and fault tolerance.
Before going further, let’s look at vertical scalability. Suppose you have a traditional database server that is starting to overload. The solution is to add resources (CPU, memory, SSD) to the server; this is called vertical scaling. It has two big disadvantages:
- Hardware has limits, so you cannot upgrade a single machine indefinitely
- It requires downtime, which many companies cannot tolerate
Horizontal scalability solves the same problem by adding more machines. Adding a new machine requires no downtime, and there is no limit to the number of machines in a cluster. The catch is that not all systems support horizontal scaling, because they were not designed to work in a cluster, where coordination is more complex.
The most fatal weakness of a non-distributed system is the single point of failure: if your only server fails, so does everything that depends on it.
Distributed systems are designed to tolerate failure to a configurable degree. In a five-node Kafka cluster, you can keep working even if two of the nodes go down.
Note that fault tolerance is directly at odds with performance: the more fault tolerant you make a system, the worse its performance tends to be.
A commit log (also known as a write-ahead log or transaction log) is a persistent, ordered data structure that only supports appends. You cannot modify or delete records. It is read from left to right, which guarantees the ordering of entries.
Surprised that Kafka’s data structure is this simple?
Yes — in many ways, this data structure is the core of Kafka. Its records are ordered, and that ordering makes deterministic processing possible; both properties are hard problems in distributed systems.
Kafka actually stores all of its messages on disk, and ordering them in this structure lets it take advantage of sequential disk reads.
- Reads and writes are both constant time O(1) (given a record’s offset), a huge advantage over the O(log n) operations of other on-disk structures, since every disk seek is expensive
- Reads and writes do not affect each other: writing does not block reading, and vice versa
These two points give Kafka a huge advantage, because data size is completely decoupled from performance. Whether you have 100 KB or 100 TB of data on your server, Kafka performs the same.
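The structure described above can be sketched in a few lines. This is a toy in-memory model, not Kafka's actual on-disk segment format — it only illustrates the append-only, offset-addressed semantics:

```python
class CommitLog:
    """A minimal sketch of Kafka's core structure: an append-only,
    ordered log where records are addressed by their offset."""

    def __init__(self):
        self._records = []

    def append(self, record):
        # Append-only write; returns the new record's offset. O(1).
        self._records.append(record)
        return len(self._records) - 1

    def read(self, offset):
        # Read by offset. O(1) given the offset -- no tree traversal
        # per lookup, unlike O(log n) index structures.
        return self._records[offset]

log = CommitLog()
assert log.append("first") == 0   # offsets increment monotonically
assert log.append("second") == 1
assert log.read(0) == "first"     # records are never modified or deleted
```

There is no `update` or `delete` method on purpose: immutability of past records is what makes the ordering guarantee cheap to maintain.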
How it works
Producers send messages (records) to the Kafka server (broker), where they are processed by other applications (consumers). Messages are stored in topics, and consumers subscribe to a topic to receive new messages. Does this feel like code you’ve written before? It’s the classic producer-consumer model.
As topics become very large, they are split into smaller partitions for better performance and scalability (for example, if you were storing the messages users send to each other, you could partition by the first letter of the user name). Kafka guarantees that all messages within a partition are ordered by arrival. A specific message is identified by its offset, which you can think of as a normal array index: a sequence number incremented for each new message in the partition.
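A hedged sketch of keyed partitioning: Kafka's default partitioner hashes the message key (with murmur2) modulo the partition count; here `md5` is just a deterministic stand-in, and the topic/key names are made up for illustration:

```python
import hashlib

NUM_PARTITIONS = 3

def partition_for(key: str) -> int:
    # Deterministic hash of the key, modulo the partition count.
    # (Kafka's default partitioner uses murmur2; md5 here is only
    # an illustrative stand-in.)
    digest = hashlib.md5(key.encode()).digest()
    return int.from_bytes(digest[:4], "big") % NUM_PARTITIONS

# The same key always lands in the same partition, so all messages
# for one user stay ordered relative to each other.
assert partition_for("alice") == partition_for("alice")

# Each partition is itself an ordered log with incrementing offsets.
partitions = [[] for _ in range(NUM_PARTITIONS)]
for msg in ["alice:hi", "bob:hello", "alice:bye"]:
    key = msg.split(":")[0]
    partitions[partition_for(key)].append(msg)  # offset = index within partition
```

Note the trade-off this implies: ordering is guaranteed per partition and per key, not across the whole topic.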
Kafka follows the principle of dumb broker, smart consumer. This means Kafka does not track which records consumers have read and delete them; instead it retains messages for a certain period (for example one day, controlled by the log.retention settings) or until a size threshold is reached. Consumers poll Kafka for new messages and tell it which records they want to read. This lets them increment or decrement their offset as they wish, so they can replay and reprocess events.
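As an illustration, the retention thresholds are broker-level settings (also overridable per topic); a minimal fragment of `server.properties` using their default values might look like:

```properties
# Delete log segments older than 7 days (the default)...
log.retention.hours=168
# ...or once a partition exceeds this many bytes, whichever
# comes first. -1 disables the size-based limit (the default).
log.retention.bytes=-1
```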
Note that consumers belong to consumer groups, each with one or more consumers. To avoid two processes reading the same message twice, each partition is consumed by only one consumer within a group.
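The partition-to-consumer rule can be sketched as a simple round-robin assignment. This is a simplified stand-in for what Kafka's group coordinator does (real assignors are pluggable, e.g. range or sticky), with invented partition and consumer names:

```python
def assign(partitions, consumers):
    """Round-robin partitions across the consumers in a group so that
    each partition is owned by exactly one consumer (a simplified
    sketch of Kafka's group coordinator)."""
    assignment = {c: [] for c in consumers}
    for i, p in enumerate(partitions):
        assignment[consumers[i % len(consumers)]].append(p)
    return assignment

partitions = ["topic-0", "topic-1", "topic-2", "topic-3"]
result = assign(partitions, ["consumer-a", "consumer-b"])

# Every partition is read by exactly one consumer in the group,
# so no message is processed twice within the group.
assert result == {"consumer-a": ["topic-0", "topic-2"],
                  "consumer-b": ["topic-1", "topic-3"]}
```

A consequence worth noticing: with more consumers in a group than partitions, the extra consumers sit idle.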
Persistence to disk
As mentioned earlier, Kafka actually stores all of its records on disk and keeps nothing in RAM. You might wonder how this can possibly be fast. There are many optimizations behind it that make this approach practical:
- Kafka has a protocol for batching messages, which groups messages together per network request and reduces network overhead. The server, in turn, persists a batch of messages at once, and consumers fetch large linear chunks at a time
- Linear reads and writes on disk are very fast. The popular notion that disks are slow comes from heavy disk seeking, which is not an issue for large linear operations
- Operating systems heavily optimize linear operations via read-ahead (prefetching large blocks) and write-behind (grouping small logical writes into large physical writes)
Operating systems cache disk files in free RAM. This is called the page cache, and Kafka relies on it for both reads and writes:
- When writing a message, it first goes from the Java process into the page cache; an asynchronous thread then flushes it from the page cache to disk
- When reading a message, Kafka first looks in the page cache; if the data is there, it is sent directly to the socket, and if not, it is loaded from disk into the page cache and then sent to the socket
- Because Kafka stores messages in an unmodified, standardized binary format along the entire path (producer → broker → consumer), it can use the zero-copy optimization: the operating system copies data directly from the page cache to the socket, effectively bypassing the Kafka broker process
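Zero-copy is exposed to applications as the `sendfile` system call (this is what the JVM's `FileChannel.transferTo` uses under the hood). A minimal sketch on Linux/macOS, with a temp file standing in for a log segment and a socket pair standing in for the broker-to-consumer connection:

```python
import os
import socket
import tempfile

# A "log segment" on disk, standing in for a Kafka partition file.
with tempfile.NamedTemporaryFile(delete=False) as f:
    f.write(b"record-1\nrecord-2\n")
    segment_path = f.name

# A connected socket pair stands in for the broker->consumer link.
broker_side, consumer_side = socket.socketpair()

with open(segment_path, "rb") as segment:
    # os.sendfile asks the kernel to copy file bytes straight into
    # the socket buffer -- the data never enters user space.
    sent = os.sendfile(broker_side.fileno(), segment.fileno(), 0, 18)

data = consumer_side.recv(64)
assert data == b"record-1\nrecord-2\n"
os.unlink(segment_path)
```

The key point is the absence of any `read()` into an application buffer: the broker never touches the payload bytes it forwards.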
All of these optimizations allow Kafka to deliver messages at close to network speed.
Data distribution and replication
Let’s talk about how Kafka achieves fault tolerance and how it distributes data among nodes.
To preserve data when a broker fails, partition data is replicated across multiple brokers.
At any time, one broker owns a partition: it is the node through which applications read and write that partition, and it is called the partition leader. It replicates the data it receives to n other brokers, called followers, which also store the data and stand ready to be elected leader if the current leader dies.
This ensures that a message you have successfully published will not be lost. By choosing the replication factor, you can trade performance for durability, depending on how important the data is.
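The leader/follower failover behavior can be modeled in a few lines. This is a deliberately naive sketch (invented broker names, no in-sync-replica bookkeeping or controller election, which real Kafka has):

```python
class Partition:
    """Toy model of a replicated partition: one leader plus followers,
    all holding the same records."""

    def __init__(self, brokers, replication_factor=3):
        self.replicas = brokers[:replication_factor]
        self.leader = self.replicas[0]
        self.log = {b: [] for b in self.replicas}

    def write(self, record):
        # All writes go through the leader, which replicates the
        # record to its followers before it counts as committed.
        for broker in self.replicas:
            self.log[broker].append(record)

    def fail(self, broker):
        self.replicas.remove(broker)
        del self.log[broker]
        if broker == self.leader:
            # A surviving follower takes over as leader.
            self.leader = self.replicas[0]

p = Partition(["broker-1", "broker-2", "broker-3"])
p.write("hello")
p.fail("broker-1")                    # the leader dies...
assert p.leader == "broker-2"         # ...a follower takes over
assert p.log[p.leader] == ["hello"]   # no committed data is lost
```

With replication factor 3, the partition survives two broker failures, at the cost of writing every record three times.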
But you may ask: how does a producer or consumer know who the partition leader is?
For a producer or consumer to write to or read from a partition, it needs to know which broker is the leader, right? This information has to come from somewhere. Kafka uses ZooKeeper to store this metadata.
What is ZooKeeper?
ZooKeeper is a distributed key-value store. It is highly optimized for reads but slower for writes. It is most commonly used to store metadata and handle cluster housekeeping (heartbeats, distributing updates and configuration, and so on).
It lets clients of the service (the Kafka brokers) subscribe to changes and be notified when they happen; this is how Kafka knows when to switch partition leaders. ZooKeeper itself runs as a cluster, so it is highly fault tolerant, as it should be, since Kafka depends on it heavily.
ZooKeeper is used to store all kinds of metadata, including but not limited to:
- The offset of each partition for a consumer group (although modern clients store offsets in a separate internal Kafka topic)
- ACLs — access control lists
- Producer/consumer quotas — the maximum data rate per second for producing or consuming; see Kafka’s quota feature
- Partition leaders and their health
So how does a producer or consumer know who the partition leader is?
Producers and consumers used to connect directly to ZooKeeper to get this information, but Kafka removed this tight coupling in versions 0.8 and 0.9. Clients now fetch metadata directly from the Kafka brokers, and the brokers in turn get it from ZooKeeper.
For more information on ZooKeeper, see the article “Comics: What is ZooKeeper?”
In Kafka, a stream processor takes a continuous stream of data from input topics, performs some processing on that input, and produces a stream of data to output topics (or to external services, databases, and so on).
What is a data stream? A data stream is an abstract representation of an unbounded data set. Unbounded means infinite and ever-growing: the data set is infinite because new records keep arriving over time. Credit card transactions, stock trades, and similar events can all be represented as data streams.
We can do simple processing directly with the producer/consumer APIs, but for more complex transformations, such as joining streams together, Kafka provides the integrated Streams API library.
This API lives inside your own codebase; it does not run on the broker. It works similarly to the consumer API and helps you scale stream processing across multiple applications (similar to consumer groups).
Stateless stream processing is deterministic processing that does not depend on anything external. For any given record, it always produces the same output, independent of everything else. For example, a simple data transformation: “Zhangsan” → “Hello, Zhangsan”.
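The transformation above, sketched as a per-record function applied over a (here finite, in reality unbounded) stream:

```python
def greet(name: str) -> str:
    # Stateless: the output depends only on this one input record,
    # never on previous records or external state.
    return f"Hello, {name}"

stream = ["Zhangsan", "Lisi"]          # stand-in for an input topic
out = [greet(name) for name in stream]  # stand-in for an output topic
assert out == ["Hello, Zhangsan", "Hello, Lisi"]
```

Because each record is independent, stateless operators are trivially parallelizable and need nothing recovered after a crash.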
Stream-table duality
It is important to realize that streams and tables are essentially the same thing: a stream can be interpreted as a table, and a table can be interpreted as a stream.
Stream as table
A stream can be interpreted as a series of updates to data, where the aggregated result is the final state of a table. This technique is called event sourcing.
If you are familiar with database replication, you’ll know it is implemented as streaming replication, which sends every change to a table to a replica server — for example, the AOF in Redis and the binlog in MySQL.
A Kafka stream can be interpreted the same way: events that accumulate into a final state. This kind of stream aggregation is stored in a local RocksDB instance (by default) and is called a KTable.
Table as stream
You can think of a table as a snapshot of the latest value for each key in a stream. Just as stream records can produce a table, table updates can produce a changelog stream.
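Both directions of the duality fit in a short sketch (with invented keys and values; a real KTable would persist this in RocksDB rather than a dict):

```python
# Stream -> table: replaying a changelog of (key, value) updates
# and keeping only the latest value per key yields the table.
changelog = [("alice", 1), ("bob", 5), ("alice", 3)]

table = {}
for key, value in changelog:
    table[key] = value
assert table == {"alice": 3, "bob": 5}  # snapshot of latest values

# Table -> stream: every update to the table emits a change event,
# reproducing the changelog.
emitted = []
def update(key, value):
    table[key] = value
    emitted.append((key, value))

update("bob", 7)
assert emitted == [("bob", 7)]
assert table["bob"] == 7
```

Replaying the full changelog always reconstructs the same table, which is exactly the property the fault-tolerance discussion below relies on.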
Some operations we often use, such as map() or filter(), are stateless and do not require you to keep anything around. But in practice, most operations are stateful (such as count()), because you need to store the accumulated state somewhere.
The problem with maintaining state on a stream processor is that the stream processor can fail! Where should this state live in order to be fault tolerant?
A naive approach is to store all state in a remote database reached over the network. The problem is that the network round-trips will slow your application down. A subtler but important issue is that your stream processing job’s uptime becomes tightly coupled to the remote database, and the job is no longer self-contained (another team changing the database can break your processing).
So what’s a better way?
Think back to the duality of tables and streams. It lets us convert the stream into a table co-located with our processing, and it gives us a mechanism for fault tolerance: storing that stream in the Kafka brokers.
A stream processor can keep its state in a local table, for example in RocksDB, updated from the input stream (possibly after some arbitrary transformation). When the process fails, it can recover its state by replaying the stream.
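The recovery mechanism can be sketched with a stateful count() operator. The "local store" here is just a dict, and the retained stream plays the role of the changelog topic Kafka Streams would keep for you:

```python
# The input stream, retained by the broker, doubles as the recovery log.
stream = ["click", "click", "view", "click"]

def run_counter(events):
    """A stateful count() operator: accumulates totals per key in a
    local store (here a plain dict standing in for RocksDB)."""
    counts = {}
    for e in events:
        counts[e] = counts.get(e, 0) + 1
    return counts

state = run_counter(stream)
assert state == {"click": 3, "view": 1}

# If the processor crashes and its local store is lost, replaying the
# retained stream from offset 0 rebuilds identical state -- no remote
# database required.
recovered = run_counter(stream)
assert recovered == state
```

In practice Kafka Streams also checkpoints, so recovery replays only the tail of the changelog rather than the whole history; the principle is the same.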
You can even use a remote database as the producer of the stream, effectively broadcasting a changelog from which the table can be rebuilt locally.
Usually we have to write stream processing in a JVM language, because that is where the only official Kafka Streams API client lives.
In April 2018, KSQL was released: a new feature that lets you write simple stream jobs in a familiar SQL-like language. You install the KSQL server and query and manage it interactively through a CLI. It uses the same abstractions (KStream and KTable) and guarantees the same benefits as the Streams API (scalability, fault tolerance), while greatly simplifying stream work.
This may not sound like much, but in practice it is far more useful for trying things out, and it even allows people outside development (such as product owners) to use stream processing. See Confluent’s article on the use of KSQL.
When to use Kafka
As we’ve covered, Kafka lets you pass a huge number of messages through a centralized medium and store them without worrying about performance or data loss.
This makes it an excellent fit for the core of a system’s architecture, acting as a centralized medium that connects different applications. Kafka can be the central piece of an event-driven architecture, letting you truly decouple applications from one another.
Kafka makes it easy to decouple communication between different (micro)services. With the Streams API, it is now easier than ever to write business logic that enriches Kafka topic data for services to consume. I urge you to explore how companies are using Kafka.
Apache Kafka is a distributed streaming platform capable of handling trillions of events a day. Kafka provides a low-latency, high-throughput, fault-tolerant publish-subscribe pipeline for processing streams of events. We reviewed its basic semantics (producers, brokers, consumers, topics), looked at some of its optimizations (the page cache), saw how it achieves fault tolerance by replicating data, and introduced its ever-growing streaming capabilities. Kafka has been widely adopted by thousands of companies worldwide, including a third of the Fortune 500. With Kafka under active development and its first major version, 1.0, released on November 1, 2017, it is predicted that this streaming platform will become as important and central to a data platform as relational databases are. I hope this introduction helps you get familiar with Apache Kafka.