Apache Kafka is a distributed publish-subscribe messaging system. It was originally developed at LinkedIn and later became an Apache project. Kafka is a fast, scalable, inherently distributed, partitioned, and replicated commit log service.
Its architecture includes the following components:
- Topic: a named category of message stream. A message is a payload of bytes, and the topic is the category or feed name the message is published to.
- Producer: any object that can publish messages to a topic.
- Broker: published messages are stored in a set of servers called brokers, which together form a Kafka cluster.
- Consumer: subscribes to one or more topics and pulls data from the brokers to consume the published messages.
Kafka storage strategy
1) Kafka uses topics to manage messages. Each topic contains multiple partitions; each partition corresponds to a logical log and consists of multiple segments.
2) Each segment stores multiple messages (see the figure below). A message's ID is determined by its logical position, so the ID maps directly to the message's storage location, avoiding an extra ID-to-location lookup.
3) Each partition maintains an in-memory index that records the offset of the first message in each segment.
4) Messages published to a topic are distributed evenly across its partitions (or according to routing rules specified by the producer). The broker receives a published message and appends it to the last segment of the corresponding partition. When the number of messages in a segment reaches a configured value, or a configured time threshold has passed since publication, the segment is flushed to disk; only messages that have been flushed to disk are visible to subscribers. When a segment reaches a certain size, no more data is written to it and the broker creates a new segment.
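The storage scheme above can be sketched in a few lines. This is an illustrative model, not Kafka's actual implementation: a partition is a list of segments, the in-memory index holds each segment's base offset, and only flushed messages are readable. The segment capacity of 3 is an arbitrary assumption for the example.

```python
import bisect

class Partition:
    def __init__(self, segment_capacity=3):
        self.segment_capacity = segment_capacity  # messages per segment (assumed)
        self.segments = [[]]       # each segment is a list of messages
        self.base_offsets = [0]    # offset of the first message in each segment
        self.flushed_up_to = 0     # only flushed messages are visible

    def append(self, message):
        if len(self.segments[-1]) == self.segment_capacity:
            # active segment is full: roll a new segment
            self.base_offsets.append(self.base_offsets[-1] + self.segment_capacity)
            self.segments.append([])
        self.segments[-1].append(message)

    def flush(self):
        # expose everything written so far to consumers
        self.flushed_up_to = self.base_offsets[-1] + len(self.segments[-1])

    def read(self, offset):
        if offset >= self.flushed_up_to:
            return None  # not yet flushed, invisible to subscribers
        # binary-search the base-offset index to find the right segment,
        # then index directly into it -- the offset *is* the location
        i = bisect.bisect_right(self.base_offsets, offset) - 1
        return self.segments[i][offset - self.base_offsets[i]]

p = Partition()
for m in ["a", "b", "c", "d"]:
    p.append(m)
p.flush()
print(p.read(3))  # "d" lives in the second segment
```

The `read` path shows point 2: no ID-to-location map is needed, because the offset plus the small base-offset index locates the message directly.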
Kafka deletion policy
1) Delete messages older than N days.
2) Retain only the most recent N GB of data.
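Both policies correspond to real broker configuration keys; the values below are examples, not recommendations:

```properties
# Time-based retention: delete segments older than 7 days (the broker default)
log.retention.hours=168
# Size-based retention: keep at most this many bytes per partition
# (the default of -1 disables the size limit; 1 GB shown as an example)
log.retention.bytes=1073741824
```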
Unlike other messaging systems, the Kafka broker is stateless. This means consumers must maintain their own consumption state (the offsets they have consumed); the broker does not track it.
- Deleting a message from a broker therefore becomes tricky, because the broker does not know whether consumers have already read the message. Kafka solves this by applying a simple time-based SLA as its retention policy: when a message has been in the broker longer than a configured period, it is automatically deleted.
- This design has a great advantage: a consumer can deliberately rewind to an old offset and consume the data again. This violates the common conventions of a queue, but has proven to be an essential feature for many consumers.
The following design goals are excerpted from Kafka's official documentation:
1) High throughput to support high-volume event stream processing
2) Support for loading data from offline systems
3) Low-latency message delivery
Persistence
1) Kafka relies on the file system, persisting data to local disk.
2) All data is persisted to the log.
Efficiency
1) Solving the “small I/O problem”:
Use “message sets” to group messages together.
The server writes “chunks of messages” to the log at once.
The consumer fetches a large block of messages at a time.
2) Solving “byte copying”:
A unified binary message format is shared by producer, broker, and consumer.
Use the operating system's pagecache.
Use sendfile to transfer log data, avoiding extra copies.
End-to-end batch compression
Kafka supports gzip and snappy compression protocols.
1) The producer can customize the routing rule that decides which partition a message is sent to. The default rule is hash(key) % numPartitions; if the key is null, a partition is chosen at random.
2) Custom routing: if the key is a user ID, all messages from the same user are sent to the same partition, and a consumer can then read that user's messages from a single partition.
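The default rule can be sketched directly. Note the hash function here is illustrative: the real Kafka client hashes the serialized key with murmur2, but the structure — hash modulo partition count, random when the key is null — is the same:

```python
import random

def choose_partition(key, num_partitions):
    """Default routing rule: hash(key) % numPartitions; random if key is None."""
    if key is None:
        return random.randrange(num_partitions)
    return hash(key) % num_partitions

# The same key always maps to the same partition within a run,
# so all of one user's messages land on one partition:
assert choose_partition("user-42", 8) == choose_partition("user-42", 8)
```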
Asynchronous batch sending
Batch sending: the producer buffers messages and sends them together once a configured number of messages has accumulated or a configured delay has elapsed.
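In the Kafka producer this trade-off is expressed with two real configuration keys (note the size threshold is in bytes, not a message count); the values are examples:

```properties
# Send a batch once it reaches batch.size bytes...
batch.size=16384
# ...or once it has waited linger.ms milliseconds, whichever comes first
linger.ms=5
```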
The consumer controls the reading of messages.
Push vs Pull
1) The producer pushes data to the broker; the consumer pulls data from the broker.
2) Advantage of consumer pull: the consumer controls the rate and amount of messages it reads.
3) Disadvantage of consumer pull: if the broker has no data, the consumer may busy-wait, polling repeatedly until data arrives. Kafka lets the consumer use a long poll, blocking until data is available.
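The long poll is controlled by two real consumer configuration keys: the broker holds a fetch request until enough data is available or the wait time elapses. Example values:

```properties
# Broker answers the fetch as soon as at least this many bytes are available...
fetch.min.bytes=1
# ...or after this many milliseconds, even with less data
fetch.max.wait.ms=500
```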
1) Most messaging systems have the broker record which messages have been consumed; Kafka does not.
2) In Kafka the consumer controls message consumption, and can even rewind to an old offset and consume messages again.
Message Delivery Semantics
At most once—Messages may be lost but are never redelivered.
At least once—Messages are never lost but may be redelivered.
Exactly once—this is what people actually want: each message is delivered once and only once.
Producer: the “acks” configuration controls when the leader responds to the producer that a write succeeded.
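For reference, “acks” is a real producer configuration key with three settings:

```properties
# acks=0: producer does not wait for any acknowledgment
# acks=1: leader responds after writing the record to its own log
# acks=all: leader responds after the full in-sync replica set has acknowledged
acks=all
```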
* Read the message, write the log (commit the offset), then process the message. If processing fails after the log has been written, the failed message cannot be processed again; this corresponds to “at most once”.
* Read the message, process it, then write the log. If processing succeeds but writing the log fails, the message will be processed twice; this corresponds to “at least once”.
* Read the message, then process it and write the result and the log together. This ensures that the result and the log are updated together or fail together, corresponding to “exactly once”.
Kafka guarantees at-least-once delivery by default and allows users to implement at-most-once semantics. Implementing exactly-once depends on the destination storage system: Kafka provides the read offset, so storing the offset atomically with the result makes the implementation possible.
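The difference between the first two orderings above comes down to whether the offset is committed before or after processing. This toy simulation (not Kafka code) injects a crash between the two steps to show the consequence:

```python
def consume(messages, commit_first, crash_at=None):
    """Simulate a consumer loop; crash_at is the index where it dies mid-step."""
    processed, committed = [], 0
    for i, msg in enumerate(messages):
        if commit_first:                 # commit, then process: at-most-once
            committed = i + 1            # 1) record the offset
            if i == crash_at:
                break                    # crash before processing -> msg lost
            processed.append(msg)        # 2) process
        else:                            # process, then commit: at-least-once
            processed.append(msg)        # 1) process
            if i == crash_at:
                break                    # crash before commit -> msg redelivered
            committed = i + 1            # 2) record the offset
    return processed, committed

msgs = ["m0", "m1", "m2"]
# At most once: offset 2 is committed but m1 was never processed -> lost on restart.
assert consume(msgs, commit_first=True, crash_at=1) == (["m0"], 2)
# At least once: m1 was processed but not committed -> reprocessed on restart.
assert consume(msgs, commit_first=False, crash_at=1) == (["m0", "m1"], 1)
```

On restart, a consumer resumes from the committed offset, which is exactly where the two semantics diverge.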
1) The replication factor of a partition counts the partition's leader itself.
2) All reads and writes to a partition go through the leader.
3) Followers fetch the log (messages and offsets) from the leader by pulling.
4) If a follower crashes, gets stuck, or falls too far behind, the leader removes it from the “in-sync replicas” (ISR) list.
5) A message is considered “committed” once every follower in the ISR has written it to its own log.
6) If all replicas of a partition fail, Kafka chooses the first node to come back up as the leader (this node is not necessarily in the ISR).
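Point 5 means the committed position is simply the minimum log-end offset across the ISR; a small sketch (illustrative, not Kafka's implementation):

```python
def committed_offset(isr_log_end_offsets):
    """A message is committed once every ISR member has written it,
    so the committed offset is the minimum log-end offset in the ISR."""
    return min(isr_log_end_offsets.values())

# Leader and one follower have 10 messages; another follower lags at 8:
isr = {"leader": 10, "follower-1": 10, "follower-2": 8}
assert committed_offset(isr) == 8  # only the first 8 messages are committed
# Per point 4, dropping the slow follower from the ISR lets commits advance:
del isr["follower-2"]
assert committed_offset(isr) == 10
```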
Log compaction
1) For a topic partition, compaction guarantees that Kafka retains at least the last value corresponding to each key.
2) Compaction never reorders messages.
3) The offset of a message never changes.
4) Offsets remain sequential.
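All four properties can be seen in a toy model of compaction (illustrative only): keep each key's last record, never touch offsets or order:

```python
def compact(log):
    """log: list of (offset, key, value). Keep only each key's last record,
    preserving the original order and the original offsets."""
    last = {key: off for off, key, _ in log}      # last offset seen per key
    return [rec for rec in log if last[rec[1]] == rec[0]]

log = [(0, "k1", "a"), (1, "k2", "b"), (2, "k1", "c")]
# k1's old value "a" is dropped; surviving offsets are unchanged and in order:
assert compact(log) == [(1, "k2", "b"), (2, "k1", "c")]
```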
Consumer Offset Tracking
1) The high-level consumer records the maximum offset it has consumed in each partition and periodically commits it to the offset manager (a broker).
2) The simple consumer must manage offsets manually. Currently the simple consumer Java API only supports committing offsets to ZooKeeper.
Consumers and Consumer Groups
1) Consumers register themselves in ZooKeeper.
2) Consumers in the same group (same group ID) divide the partitions evenly among themselves, and each partition is consumed by exactly one consumer in the group.
3) A consumer rebalance occurs when the state of a broker, or of another consumer in the same group, changes.
Zookeeper coordinated control
1) Manage brokers and consumers dynamically joining and leaving.
2) Trigger load rebalancing: when a broker or consumer joins or leaves, the rebalancing algorithm redistributes the subscription load among the consumers in a consumer group.
3) Maintain the consumption relationships and the consumption state of each partition.