Kafka Documentation (Introduction)

Time: 2020-10-17

Introduction

Apache Kafka® is a distributed streaming platform. What exactly does this mean?

A streaming platform has three key capabilities:

  • Publish and subscribe to streams of records, similar to a message queue or enterprise messaging system
  • Store streams of records in a fault-tolerant, durable way
  • Process streams of records as they occur

Kafka is commonly used in two broad categories of applications:

  • Building real-time streaming data pipelines that reliably move data between systems or applications
  • Building real-time streaming applications that transform or react to streams of data

To understand how Kafka does these things, let’s explore Kafka’s capabilities from the bottom up.

First, a few concepts:

  • Kafka runs as a cluster on one or more servers that can span multiple data centers
  • The Kafka cluster stores streams of records in categories called topics
  • Each record consists of a key, a value, and a timestamp, as sketched below
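
To make the last point concrete, here is a minimal sketch of a record built with the Java client. The topic name, key, and value are hypothetical examples, and the explicit timestamp is optional; if it is omitted, the client or broker assigns one.

```java
import org.apache.kafka.clients.producer.ProducerRecord;

public class RecordExample {
    public static void main(String[] args) {
        // A record is a key, a value, and a timestamp; topic, key, and value
        // below are hypothetical examples.
        ProducerRecord<String, String> record = new ProducerRecord<>(
                "page-views",                // topic
                null,                        // partition (null lets the partitioner choose)
                System.currentTimeMillis(),  // timestamp
                "user-42",                   // key
                "/index.html");              // value
        System.out.println(record);
    }
}
```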

Kafka has four core APIs:

  • The Producer API allows applications to publish streams of records to one or more Kafka topics.
  • The Consumer API allows applications to subscribe to one or more topics and process the stream of records produced to them.
  • The Streams API allows applications to act as stream processors, consuming input streams from one or more topics and producing output streams to one or more output topics, effectively transforming input streams into output streams.
  • The Connector API allows you to build and run reusable producers or consumers that connect Kafka topics to existing applications or data systems. For example, a connector to a relational database might capture every change to a table.

In Kafka, communication between clients and servers is done with a simple, high-performance, language-agnostic TCP protocol. The protocol is versioned and maintains backward compatibility with older versions. We provide a Java client for Kafka, but clients are available in many languages.

Topics and logs

Let's first dive into the core abstraction Kafka provides for a stream of records: the topic.

A topic is a category or feed name to which records are published. Topics in Kafka are always multi-subscriber; that is, a topic can have zero, one, or many consumers that subscribe to the data written to it.
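
As a rough illustration, a topic can be created programmatically with the Java admin client (newer clients expose Admin.create; older ones use AdminClient.create). The topic name, partition count, replication factor, and broker address below are example values, not prescriptions.

```java
import java.util.Collections;
import java.util.Properties;
import org.apache.kafka.clients.admin.Admin;
import org.apache.kafka.clients.admin.NewTopic;

public class CreateTopicExample {
    public static void main(String[] args) throws Exception {
        Properties props = new Properties();
        props.put("bootstrap.servers", "localhost:9092"); // assumed broker address
        try (Admin admin = Admin.create(props)) {
            // Topic name, partition count, and replication factor are example values.
            NewTopic topic = new NewTopic("page-views", 3, (short) 1);
            admin.createTopics(Collections.singleton(topic)).all().get();
        }
    }
}
```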

For each topic, the Kafka cluster maintains a partitioned log that looks like this:

[Figure: anatomy of a topic, a log divided into partitions]

Each partition is an ordered, immutable sequence of records that is continually appended to, forming a structured commit log. The records in a partition are each assigned a sequential ID number called the offset, which uniquely identifies each record within the partition.

The Kafka cluster durably persists all published records, whether or not they have been consumed, using a configurable retention period. For example, if the retention policy is set to two days, a record is available for consumption for two days after it is published, after which it is discarded to free up space. Kafka's performance is effectively constant with respect to data size, so storing data for a long time is not a problem.
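
The retention period is set per topic via the retention.ms configuration. As a hedged sketch, here is how an existing topic's retention could be set to two days with the Java admin client's incrementalAlterConfigs call (available in reasonably recent clients); the topic name and broker address are assumptions.

```java
import java.util.Collections;
import java.util.Map;
import java.util.Properties;
import org.apache.kafka.clients.admin.Admin;
import org.apache.kafka.clients.admin.AlterConfigOp;
import org.apache.kafka.clients.admin.ConfigEntry;
import org.apache.kafka.common.config.ConfigResource;

public class RetentionExample {
    public static void main(String[] args) throws Exception {
        Properties props = new Properties();
        props.put("bootstrap.servers", "localhost:9092"); // assumed broker address
        try (Admin admin = Admin.create(props)) {
            ConfigResource topic = new ConfigResource(ConfigResource.Type.TOPIC, "events");
            // retention.ms = 172800000 ms, i.e. two days.
            AlterConfigOp setRetention = new AlterConfigOp(
                    new ConfigEntry("retention.ms", "172800000"),
                    AlterConfigOp.OpType.SET);
            admin.incrementalAlterConfigs(
                    Map.of(topic, Collections.singleton(setRetention))).all().get();
        }
    }
}
```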

In fact, the only metadata retained on a per-consumer basis is that consumer's offset, or position, in the log. This offset is controlled by the consumer: normally a consumer advances its offset linearly as it reads records, but since the position is controlled by the consumer, it can consume records in any order it likes. For example, a consumer can reset to an older offset to reprocess past data, or skip ahead to the most recent record and start consuming from "now".

This combination of features means that Kafka consumers are very cheap: they can come and go with little impact on the cluster or on other consumers. For example, you can use our command-line tools to "tail" the contents of any topic without changing what is consumed by any existing consumer.
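
As a minimal sketch of this consumer-controlled position, the Java consumer below assigns itself a single partition and rewinds to the beginning before reading; the topic name, partition number, and broker address are hypothetical.

```java
import java.time.Duration;
import java.util.Collections;
import java.util.Properties;
import org.apache.kafka.clients.consumer.ConsumerRecord;
import org.apache.kafka.clients.consumer.ConsumerRecords;
import org.apache.kafka.clients.consumer.KafkaConsumer;
import org.apache.kafka.common.TopicPartition;

public class ReplayExample {
    public static void main(String[] args) {
        Properties props = new Properties();
        props.put("bootstrap.servers", "localhost:9092"); // assumed broker address
        props.put("key.deserializer",
                "org.apache.kafka.common.serialization.StringDeserializer");
        props.put("value.deserializer",
                "org.apache.kafka.common.serialization.StringDeserializer");
        try (KafkaConsumer<String, String> consumer = new KafkaConsumer<>(props)) {
            // assign() takes explicit partitions, so no consumer group is needed here.
            TopicPartition p0 = new TopicPartition("events", 0);
            consumer.assign(Collections.singleton(p0));
            consumer.seekToBeginning(Collections.singleton(p0)); // rewind and reprocess
            // consumer.seek(p0, 42L);                           // or jump to a specific offset
            ConsumerRecords<String, String> records = consumer.poll(Duration.ofSeconds(1));
            for (ConsumerRecord<String, String> r : records) {
                System.out.printf("offset=%d key=%s value=%s%n", r.offset(), r.key(), r.value());
            }
        }
    }
}
```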

The partitions in the log serve several purposes. First, they allow the log to scale beyond a size that will fit on a single server: each individual partition must fit on the server that hosts it, but a topic may have many partitions, so it can handle an arbitrary amount of data. Second, they act as the unit of parallelism; more on that in a bit.

Distribution

The partitions of the log are distributed over the servers in the Kafka cluster, with each server handling data and requests for a share of the partitions. Each partition is replicated across a configurable number of servers for fault tolerance.

Each partition has one server that acts as the "leader" and zero or more servers that act as "followers". The leader handles all read and write requests for the partition, while the followers passively replicate the leader. If the leader fails, one of the followers automatically becomes the new leader. Each server acts as the leader for some of its partitions and as a follower for others, so load is well balanced within the cluster.
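
To see the leader and followers for each partition, the Java admin client can describe a topic, roughly as sketched below; the topic name and broker address are assumptions, and the exact result-accessor names vary slightly across client versions.

```java
import java.util.Collections;
import java.util.Properties;
import org.apache.kafka.clients.admin.Admin;
import org.apache.kafka.clients.admin.TopicDescription;

public class LeaderInfoExample {
    public static void main(String[] args) throws Exception {
        Properties props = new Properties();
        props.put("bootstrap.servers", "localhost:9092"); // assumed broker address
        try (Admin admin = Admin.create(props)) {
            TopicDescription desc = admin.describeTopics(Collections.singleton("events"))
                    .all().get().get("events");
            // Print the leader node and replica set for every partition.
            desc.partitions().forEach(p ->
                    System.out.printf("partition %d leader=%s replicas=%s%n",
                            p.partition(), p.leader(), p.replicas()));
        }
    }
}
```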

Geo-replication

Kafka MirrorMaker provides geo-replication support for your clusters. With MirrorMaker, messages are replicated across multiple data centers or cloud regions. You can use it for backup and recovery in active/passive scenarios, or in active/active scenarios to place data closer to your users or to support data-locality requirements.

Producers

Producers publish data to the topics of their choice. The producer is responsible for choosing which record to assign to which partition within the topic. This can be done in a round-robin fashion simply to balance load, or it can be done according to some semantic partition function (say, based on some key in the record). More on the use of partitioning in a moment!
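
A minimal producer sketch in Java, assuming a hypothetical topic "events" and a local broker: records that carry the same key are assigned to the same partition by the default partitioner, which preserves per-key ordering.

```java
import java.util.Properties;
import org.apache.kafka.clients.producer.KafkaProducer;
import org.apache.kafka.clients.producer.ProducerRecord;

public class ProducerExample {
    public static void main(String[] args) {
        Properties props = new Properties();
        props.put("bootstrap.servers", "localhost:9092"); // assumed broker address
        props.put("key.serializer",
                "org.apache.kafka.common.serialization.StringSerializer");
        props.put("value.serializer",
                "org.apache.kafka.common.serialization.StringSerializer");
        try (KafkaProducer<String, String> producer = new KafkaProducer<>(props)) {
            // Records with the same key hash to the same partition; records with a
            // null key are spread across partitions by the default partitioner.
            producer.send(new ProducerRecord<>("events", "user-42", "clicked"));
        }
    }
}
```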

Consumers

Consumers label themselves with a consumer group name, and each record published to a topic is delivered to one consumer instance within each subscribing consumer group. Consumer instances can be in separate processes or on separate machines.

If all the consumer instances have the same consumer group, then the records are effectively load-balanced over the consumer instances.

If all the consumer instances have different consumer groups, then each record is broadcast to all the consumer processes.

Figure: a two-server Kafka cluster hosting four partitions (P0-P3) with two consumer groups. Consumer group A has two consumer instances and group B has four.

More commonly, however, we find that topics have a small number of consumer groups, one for each "logical subscriber". Each group is composed of many consumer instances for scalability and fault tolerance. This is nothing more than publish-subscribe semantics where the subscriber is a cluster of consumers instead of a single process.

The way consumption is implemented in Kafka is by dividing up the partitions in the log over the consumer instances so that each instance is the exclusive consumer of a "fair share" of partitions at any point in time. This process of maintaining membership in the group is handled dynamically by the Kafka protocol. If new instances join the group, they will take over some partitions from other members of the group; if an instance dies, its partitions will be distributed to the remaining instances.
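
A minimal sketch of a group consumer in Java: each instance that subscribes with the same group.id receives a share of the topic's partitions, and Kafka rebalances those shares as instances come and go. The group name, topic, and broker address are hypothetical.

```java
import java.time.Duration;
import java.util.Collections;
import java.util.Properties;
import org.apache.kafka.clients.consumer.ConsumerRecord;
import org.apache.kafka.clients.consumer.KafkaConsumer;

public class GroupConsumerExample {
    public static void main(String[] args) {
        Properties props = new Properties();
        props.put("bootstrap.servers", "localhost:9092"); // assumed broker address
        props.put("group.id", "billing");                 // hypothetical group name
        props.put("key.deserializer",
                "org.apache.kafka.common.serialization.StringDeserializer");
        props.put("value.deserializer",
                "org.apache.kafka.common.serialization.StringDeserializer");
        try (KafkaConsumer<String, String> consumer = new KafkaConsumer<>(props)) {
            consumer.subscribe(Collections.singleton("events"));
            while (true) {
                // Each poll returns records only from the partitions assigned to this instance.
                for (ConsumerRecord<String, String> r : consumer.poll(Duration.ofMillis(500))) {
                    System.out.printf("partition=%d offset=%d value=%s%n",
                            r.partition(), r.offset(), r.value());
                }
            }
        }
    }
}
```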

Kafka only provides a total order over records within a partition, not between different partitions in a topic. Per-partition ordering, combined with the ability to partition data by key, is sufficient for most applications. However, if you require a total order over records, this can be achieved with a topic that has only one partition, though that will mean only one consumer process per consumer group.

Multi-tenancy

You can deploy Kafka as a multi-tenant solution. Multi-tenancy is enabled by configuring which topics can produce or consume data, and there is operational support for quotas: administrators can define and enforce quotas on requests to control the broker resources used by clients. For more information, see the security documentation.
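
As a hedged example, quotas can also be managed programmatically in newer Java clients (roughly Kafka 2.6 and later) via alterClientQuotas; the client id and rate below are made-up values.

```java
import java.util.Collections;
import java.util.Map;
import java.util.Properties;
import org.apache.kafka.clients.admin.Admin;
import org.apache.kafka.common.quota.ClientQuotaAlteration;
import org.apache.kafka.common.quota.ClientQuotaEntity;

public class QuotaExample {
    public static void main(String[] args) throws Exception {
        Properties props = new Properties();
        props.put("bootstrap.servers", "localhost:9092"); // assumed broker address
        try (Admin admin = Admin.create(props)) {
            // Cap the hypothetical client id "tenant-a" at 1 MB/s of produce throughput.
            ClientQuotaEntity entity = new ClientQuotaEntity(
                    Map.of(ClientQuotaEntity.CLIENT_ID, "tenant-a"));
            ClientQuotaAlteration.Op op =
                    new ClientQuotaAlteration.Op("producer_byte_rate", 1048576.0);
            admin.alterClientQuotas(Collections.singleton(
                    new ClientQuotaAlteration(entity, Collections.singleton(op)))).all().get();
        }
    }
}
```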

Guarantees

Kafka gives the following guarantees at a high level:

  • Messages sent by a producer to a particular topic partition will be appended in the order they are sent. That is, if a record M1 is sent by the same producer as a record M2, and M1 is sent first, then M1 will have a lower offset than M2 and appear earlier in the log.
  • A consumer instance sees records in the order they are stored in the log.
  • For a topic with replication factor N, we can tolerate up to N-1 server failures without losing any records committed to the log.

More details about these guarantees will be given in the design section of the document.

Kafka as a messaging system

How does Kafka's notion of streams compare to a traditional enterprise messaging system?

Messaging traditionally has two models: queuing and publish-subscribe. In a queue, a pool of consumers may read from a server, and each record goes to one of them; in publish-subscribe, the record is broadcast to all consumers. Each of these two models has a strength and a weakness. The strength of queuing is that it allows you to divide up the processing of data over multiple consumer instances, which lets you scale your processing. Unfortunately, queues are not multi-subscriber: once one process reads the data, it is gone. Publish-subscribe allows you to broadcast data to multiple processes, but has no way of scaling processing, since every message is delivered to every subscriber.

The consumer group concept in Kafka generalizes these two models. As with a queue, the consumer group allows you to divide up processing over a collection of processes (the members of the consumer group). As with publish-subscribe, Kafka allows you to broadcast messages to multiple consumer groups.

The advantage of Kafka's model is that every topic has both of these properties: it can scale processing and it is also multi-subscriber. There is no need to choose one or the other.

Kafka also has stronger ordering guarantees than a traditional messaging system.

A traditional queue retains records in order on the server, and if multiple consumers consume from the queue, the server hands out records in the order they are stored. However, although the server hands out records in order, the records are delivered to consumers asynchronously, so they may arrive out of order on different consumers. This effectively means the ordering of the records is lost in the presence of parallel consumption. Messaging systems often work around this with the notion of an "exclusive consumer" that allows only one process to consume from a queue, but of course this means there is no parallelism in processing.

Kafka does it better. By having a notion of parallelism within topics, the partition, Kafka is able to provide both ordering guarantees and load balancing over a pool of consumer processes. This is achieved by assigning the partitions in a topic to the consumers in the consumer group so that each partition is consumed by exactly one consumer in the group. By doing this we ensure that the consumer is the only reader of that partition and consumes the data in order. Since there are many partitions, this still balances load over many consumer instances. Note, however, that there cannot be more consumer instances in a consumer group than there are partitions.

Kafka as a storage system

Any message queue that allows publishing messages decoupled from consuming them is effectively acting as a storage system for the in-flight messages. What is different about Kafka is that it is a very good storage system.

Data written to Kafka is written to disk and replicated for fault tolerance. Kafka allows producers to wait on acknowledgement, so that a write is not considered complete until it is fully replicated and guaranteed to persist even if the server written to fails.
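
A minimal sketch of such a durable write in Java: with acks=all the producer waits until the write is fully replicated before the send is acknowledged. The topic name, key, value, and broker address are assumptions.

```java
import java.util.Properties;
import org.apache.kafka.clients.producer.KafkaProducer;
import org.apache.kafka.clients.producer.ProducerRecord;

public class DurableProducerExample {
    public static void main(String[] args) throws Exception {
        Properties props = new Properties();
        props.put("bootstrap.servers", "localhost:9092"); // assumed broker address
        props.put("acks", "all"); // wait until the write is fully replicated
        props.put("key.serializer",
                "org.apache.kafka.common.serialization.StringSerializer");
        props.put("value.serializer",
                "org.apache.kafka.common.serialization.StringSerializer");
        try (KafkaProducer<String, String> producer = new KafkaProducer<>(props)) {
            // get() blocks until the broker acknowledges the replicated write.
            producer.send(new ProducerRecord<>("orders", "order-1", "created")).get();
        }
    }
}
```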

Kafka makes good use of disk structures: it performs the same whether you have 50 KB or 50 TB of persistent data on the server.

As a result of taking storage seriously and allowing clients to control their read position, you can think of Kafka as a special-purpose distributed filesystem dedicated to high-performance, low-latency commit log storage, replication, and propagation.

For more information about Kafka’s commit log storage and replication design, read the design section.

Kafka’s stream processing

It is not enough to just read, write, and store streams of data; the purpose is to enable real-time processing of streams.

In Kafka, a stream processor takes continual streams of data from input topics, performs some processing on this input, and produces continual streams of data to output topics.

For example, a retail application might take in input streams of sales and shipments and output streams of reorders and price adjustments computed off this data.

Simple processing can be done directly using the producer and consumer APIs. However, for more complex transformations Kafka provides a fully integrated Streams API, which allows building applications that do non-trivial processing: computing aggregations off of streams or joining streams together.
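
As a rough illustration of the Streams API (it requires the kafka-streams dependency in addition to the client), the sketch below consumes a hypothetical input topic, transforms each value, and produces to an output topic; the application id and topic names are made-up.

```java
import java.util.Properties;
import org.apache.kafka.common.serialization.Serdes;
import org.apache.kafka.streams.KafkaStreams;
import org.apache.kafka.streams.StreamsBuilder;
import org.apache.kafka.streams.StreamsConfig;
import org.apache.kafka.streams.kstream.KStream;

public class StreamsExample {
    public static void main(String[] args) {
        Properties props = new Properties();
        props.put(StreamsConfig.APPLICATION_ID_CONFIG, "uppercase-demo"); // hypothetical app id
        props.put(StreamsConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092");
        props.put(StreamsConfig.DEFAULT_KEY_SERDE_CLASS_CONFIG, Serdes.String().getClass());
        props.put(StreamsConfig.DEFAULT_VALUE_SERDE_CLASS_CONFIG, Serdes.String().getClass());

        StreamsBuilder builder = new StreamsBuilder();
        KStream<String, String> input = builder.stream("input-topic");
        // Transform each value and write the result to an output topic.
        input.mapValues(v -> v.toUpperCase()).to("output-topic");

        KafkaStreams streams = new KafkaStreams(builder.build(), props);
        streams.start();
        Runtime.getRuntime().addShutdownHook(new Thread(streams::close));
    }
}
```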

This facility helps solve the hard problems this type of application faces: handling out-of-order data, reprocessing input as code changes, performing stateful computations, and so on.

The Streams API builds on the core primitives Kafka provides: it uses the producer and consumer APIs for input, uses Kafka for stateful storage, and uses the same group mechanism for fault tolerance among the stream processor instances.

Putting the pieces together

This combination of messaging, storage, and stream processing may seem unusual, but it is essential to Kafka's role as a streaming platform.

A distributed file system like HDFS allows storing static files for batch processing; such a system can effectively store and process historical data from the past.

A traditional enterprise messaging system allows processing future messages that arrive after you subscribe; applications built in this way process data as it arrives.

Kafka combines both of these capabilities, and the combination is critical both for Kafka's use as a platform for streaming applications and for streaming data pipelines.

By combining storage and low-latency subscriptions, streaming applications can treat past and future data in the same way. A single application can process historical, stored data, and rather than ending when it reaches the last record, it can keep processing as future data arrives. This is a generalized notion of stream processing that subsumes batch processing as well as message-driven applications.

Similarly, for streaming data pipelines, the combination of subscription to real-time events makes it possible to use Kafka for very low-latency pipelines; at the same time, the ability to store data reliably makes it possible to use it for critical data where delivery must be guaranteed, or for integration with offline systems that load data only periodically or may go down for extended periods for maintenance. The stream processing facilities make it possible to transform data as it arrives.

For more information about the guarantees, APIs, and features that Kafka provides, see the rest of the documentation.