Kafka in Practice (4): Kafka Streams in Detail

Time: 2021-3-23

1、 Background of Kafka Streams

1. Introduction to Kafka Streams

Kafka Streams is a library that provides stream processing and analysis of the data stored in Kafka.

Kafka Streams features:

  • Kafka Streams is a very simple and lightweight library that can be easily embedded in any Java application and packaged and deployed in any way
  • There are no external dependencies other than Kafka itself
  • It makes full use of Kafka's partitioning mechanism to achieve horizontal scalability and ordering guarantees
  • It supports efficient stateful operations (such as windowed joins and aggregations) through fault-tolerant state stores
  • It supports exactly-once processing semantics
  • It provides record-at-a-time processing, achieving millisecond-level latency
  • Late-arriving records are supported
  • It offers both low-level processing primitives, the Processor API (similar to Storm's Spout and Bolt), and a high-level DSL (similar to Spark's map/group/reduce)

2. Stream computing

In the stream computing model, the input is continuous, which means we can never obtain the full dataset before computing. The results are also produced continuously, that is, they are unbounded in time. Stream computing usually has high real-time requirements: the computation logic is defined first, and then applied to the data as it arrives. To improve efficiency, incremental computation is often used instead of full computation.
[Figure: stream computing model]
In the batch processing model, a full dataset usually exists first, the computation logic is then defined, and finally the computation is applied to the full dataset. Its characteristics are full computation and one-time output of the results.
[Figure: batch processing model]

3. Why Kafka Streams?

There are many stream processing systems today; the best-known and most widely used open-source ones are Spark Streaming and Apache Storm. Apache Storm has been developed for many years and is widely used; it provides record-at-a-time processing and now also supports SQL on streams. Spark Streaming, based on Apache Spark, integrates conveniently with graph computing and SQL processing, is very powerful, and has a low barrier to entry for users already familiar with Spark application development. In addition, the mainstream Hadoop distributions, such as MapR, Cloudera, and Hortonworks, all integrate Apache Storm and Apache Spark, making deployment easier.

Since Apache Spark and Apache Storm have so many advantages, why do we still need Kafka Streams? The main reasons are as follows.

First, Spark and Storm are stream processing frameworks, while Kafka Streams is a stream processing library built on Kafka. A framework requires developers to write their logic in a framework-specific way for the framework to invoke; it is hard for developers to understand exactly how the framework runs their code, which raises debugging costs and limits how it can be used. Kafka Streams, as a library, directly provides concrete classes for developers to call, and the way the whole application runs is largely controlled by the developer, which makes it easier to use and debug.
[Figure: framework vs. library]
Second, although Cloudera and Hortonworks make it easier to deploy Storm and Spark, deploying these frameworks is still relatively complex. Kafka Streams, as a library, can easily be embedded in an application and imposes no requirements on how the application is packaged or deployed. More importantly, Kafka Streams makes full use of Kafka's partition mechanism and the consumer rebalance mechanism, so it scales horizontally with ease, and each instance can be deployed differently. Specifically, each application instance running Kafka Streams contains Kafka consumer instances, and multiple instances of the same application process the data in parallel. For example, some instances can run in a web container while others run in Docker or Kubernetes.

Third, virtually every stream processing system supports Kafka as a data source. For example, Storm has a dedicated Kafka Spout, and Spark provides a dedicated spark-streaming-kafka module. In fact, Kafka is essentially the standard data source of mainstream stream processing systems. In other words, most deployments of streaming systems already include Kafka, so the cost of adopting Kafka Streams is very low.

Fourth, when using Storm or Spark Streaming, resources must be reserved for the framework's own processes, such as Storm's Supervisor and Spark on YARN's NodeManager. Even for application instances, the framework itself occupies part of the resources; for example, Spark Streaming needs to reserve memory for shuffle and storage.

Fifth, because Kafka itself provides data persistence, Kafka Streams offers rolling deployment, rolling upgrades, and recomputation.

Sixth, thanks to the Kafka consumer rebalance mechanism, Kafka Streams can adjust its parallelism dynamically online.

2、 Kafka Streams architecture

1. Overall architecture of Kafka Streams

The overall architecture of Kafka Streams is shown below.
[Figure: Kafka Streams overall architecture]

At present (Kafka 0.11.0.0), the data source of Kafka Streams can only be Kafka, as shown in the figure above. However, the processing results do not have to be written back to Kafka. In fact, instantiating either a KStream or a KTable requires specifying a topic.

// Create a KStream from a topic
KStream<String, String> stream = builder.stream("words-stream");
// Create a KTable from a topic, backed by the "words-store" state store
KTable<String, String> table = builder.table("words-table", "words-store");

In addition, the consumer and producer in the figure above do not need to be explicitly instantiated by the developer; Kafka Streams instantiates and manages them implicitly based on the configuration parameters, which lowers the barrier to entry. Developers only need to focus on the core business logic, i.e., the task part in the figure above.
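To illustrate, here is a minimal sketch of how such an application might be wired up, using the classic KStreamBuilder API that matches the snippet above. The topic names and application id are hypothetical, and the default byte-array serdes are assumed; note that no consumer or producer appears in the code.

import java.util.Properties;
import org.apache.kafka.streams.KafkaStreams;
import org.apache.kafka.streams.StreamsConfig;
import org.apache.kafka.streams.kstream.KStreamBuilder;

public class WordsApp {
    public static void main(String[] args) {
        Properties props = new Properties();
        // Also used as the group id of the implicitly created consumer
        props.put(StreamsConfig.APPLICATION_ID_CONFIG, "words-app");
        props.put(StreamsConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092");
        // Number of processing threads per instance (default is 1)
        props.put(StreamsConfig.NUM_STREAM_THREADS_CONFIG, 2);

        KStreamBuilder builder = new KStreamBuilder();
        // Reading from and writing to topics is handled internally
        builder.stream("words-stream").to("words-out");

        KafkaStreams streams = new KafkaStreams(builder, props);
        streams.start();
        Runtime.getRuntime().addShutdownHook(new Thread(streams::close));
    }
}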

2. Processor Topology

The business logic of a Kafka Streams application is implemented in what is called a processor topology. Similar to Storm's topology and Spark's DAG, it defines how data flows among the processing units (called processors in Kafka Streams), i.e., the processing logic of the data.
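As a sketch of the low-level Processor API (the DSL builds such a topology implicitly), the following defines a one-processor topology. The processor, node, and topic names are hypothetical, and String serdes are assumed to be configured:

import org.apache.kafka.streams.processor.AbstractProcessor;
import org.apache.kafka.streams.processor.TopologyBuilder;

public class UpperCaseTopology {
    // A processor that upper-cases each value and forwards it downstream
    // (assumes String serdes are configured for the application)
    static class UpperCaseProcessor extends AbstractProcessor<String, String> {
        @Override
        public void process(String key, String value) {
            context().forward(key, value.toUpperCase());
        }
    }

    public static TopologyBuilder build() {
        // source -> processor -> sink: this data flow is the processor topology
        return new TopologyBuilder()
                .addSource("Source", "input-topic")
                .addProcessor("UpperCase", UpperCaseProcessor::new, "Source")
                .addSink("Sink", "output-topic", "UpperCase");
    }
}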

3. Kafka Streams parallel model

In the parallel model of Kafka Streams, the smallest granularity is the task, and each task contains all the processors of a particular sub-topology. Therefore, the code executed by each task is exactly the same; the only difference is that the datasets they process are complementary. This is completely different from Storm's topology, where each task contains only one instance of a Spout or Bolt. Consequently, different tasks within a Storm topology need to exchange data over the network, while a Kafka Streams task contains a complete sub-topology, so no data needs to be passed between tasks and no network communication is required. This reduces system complexity and improves processing efficiency.

If a stream has multiple input topics (for example, two topics with 4 and 3 partitions respectively), the total number of tasks equals the partition count of the topic with the most partitions (max(4, 3) = 4). This is because Kafka Streams uses the consumer rebalance mechanism, and each partition corresponds to one task.

The following figure shows the parallel model of a Kafka Streams application with two topics (4 partitions each) as data sources, running in a single instance. As can be seen from the figure, since the default number of threads in a Kafka Streams application is 1, all four tasks run in one thread.

[Figure: parallel model, one instance with one thread]
To take full advantage of multi-threading, you can set the number of threads in Kafka Streams (the num.stream.threads configuration shown earlier). The following figure shows the parallel model with 2 threads.

[Figure: parallel model, one instance with two threads]
As mentioned earlier, Kafka Streams can be embedded in any Java application (in theory, any JVM-based application). The following figure shows the parallel model when the same Kafka Streams application is started in different processes on the same machine at the same time. Note that both processes must use exactly the same StreamsConfig.APPLICATION_ID_CONFIG, because Kafka Streams uses APPLICATION_ID_CONFIG as the group id of the implicitly started consumer. Only when APPLICATION_ID_CONFIG is the same do the consumers of the two processes belong to the same group, so that complementary datasets can be assigned through the consumer rebalance mechanism.
[Figure: parallel model, two processes on the same machine]
Now that multi-process deployment works, multi-machine deployment works the same way; it likewise requires all processes of the application to use exactly the same APPLICATION_ID_CONFIG. The figure also shows that the number of threads in each instance need not be the same. However, the total number of tasks is always the same regardless of the deployment.
[Figure: parallel model across multiple machines]

Here is a comparison between Kafka Streams' processor topology and Storm's topology.

  • Storm's topology is composed of Spouts and Bolts: Spouts provide the data source, while Bolts provide computation and data output. Kafka Streams' processor topology is composed entirely of processors, because its data source is fixed: Kafka topics.
  • Different Bolts in Storm run in different executors, possibly on different machines, and must transfer data over the network. The processors of a Kafka Streams sub-topology run in the same task, which means they run in the same thread, with no network communication.
  • A Storm topology can contain both shuffle and non-shuffle parts, and a topology is often a complete application. A physical sub-topology of Kafka Streams contains only the non-shuffle part; the shuffle part must be expressed explicitly through the through operation, which splits a large topology into two sub-topologies.
  • In a Storm topology, different Bolts/Spouts can have different parallelism, while all processors in a Kafka Streams sub-topology have exactly the same parallelism.
  • A task in Storm contains only one instance of a Spout or Bolt, while a task in Kafka Streams contains all processors of a sub-topology.

4. KTable vs. KStream

KTable and KStream are two very important concepts in Kafka Streams; they are the basis on which Kafka Streams implements its various semantics, so it is worth analyzing the difference between them.

A KStream is a data stream: every record can be regarded as an insert-only append to the stream. A KTable represents a complete dataset and can be understood as a table in a database. Since each record is a key-value pair, the key can be understood as the primary key and the value as one row of the table. The data in a KTable can be regarded as update-only: if a newly arriving record in the topic backing the KTable has a key that already exists, only the latest value for that key is retained in the KTable; in effect, the new record updates the old one.

Take the following figure as an example. Suppose a KStream and a KTable are created from the same topic, which contains the five records shown below. Traversing the KStream yields all 5 records, exactly matching the data in the topic and in the same order. Traversing the KTable, however, yields only 3 records, because the five records contain three distinct keys: each key maps to its latest value, and the three records keep their original relative order in the topic. This matches the behavior of Kafka's log compaction.

[Figure: the same five records viewed as a KStream vs. a KTable]
If you now group the KStream and the KTable by key and sum the values, the results differ. The KStream yields <Jack, 4>, <Lily, 7>, <Mike, 4>, while the KTable yields <Mike, 4>, <Jack, 3>, <Lily, 5>.
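To make the difference concrete, here is a sketch of both aggregations using the classic API, assuming stream and table are a KStream and KTable over the topic in the figure with String keys and Integer values; the store names are hypothetical:

// Summing over the KStream: every record contributes, so the KStream gives <Jack, 4>
KTable<String, Integer> streamSum = stream
        .groupByKey(Serdes.String(), Serdes.Integer())
        .reduce((v1, v2) -> v1 + v2, "stream-sum-store");

// Summing over the KTable: only the latest value per key survives,
// so the KTable gives <Jack, 3>
KTable<String, Integer> tableSum = table
        .groupBy((k, v) -> new KeyValue<>(k, v), Serdes.String(), Serdes.Integer())
        .reduce((v1, v2) -> v1 + v2,  // adder: a new value arrives for the group
                (v1, v2) -> v1 - v2,  // subtractor: an old value is replaced
                "table-sum-store");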

5. State store

In stream processing, some operations are stateless, such as filtering (implemented by the filter method in the Kafka Streams DSL). Other operations are stateful and need to record intermediate state, such as window operations and aggregations. The state store is used to store this intermediate state. It can be a persistent key-value store, an in-memory HashMap, or a database. Kafka provides topic-based state storage.

The records stored in such a topic are key-value pairs. Kafka's log compaction mechanism compacts the historical data, keeping only the latest value for each key, which reduces the total amount of data and improves query efficiency without losing any keys.

When constructing a KTable, you need to specify its state store name. By default, this name is also the name of the topic used to store the KTable's state. Traversing the KTable is actually traversing its corresponding state store, or equivalently, traversing all the keys of the topic and taking the latest value of each key. To make this process more efficient, the topic is compacted by default.
In addition, beyond KTable, all stateful computations need to specify a state store name to record their intermediate state.
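Once the application is running, the state store behind a KTable can also be queried directly. A sketch using the interactive-queries API, where streams is the running KafkaStreams instance, the store name matches the earlier snippet, and the key is hypothetical:

// Look up the latest value for a key in the "words-store" state store
ReadOnlyKeyValueStore<String, String> store =
        streams.store("words-store", QueryableStoreTypes.keyValueStore());
String latest = store.get("jack");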

3、 How Kafka Streams solves key problems in stream processing

1. Time

In stream processing, time is a very important attribute of the data. Starting from Kafka 0.10, each record carries a timestamp attribute in addition to its key and value. Kafka Streams currently supports three notions of time:

  • Event time: the time at which the event occurred, carried in the data record. It is specified by the producer when the ProducerRecord is constructed, and it takes effect when the broker's or topic's message.timestamp.type is set to CreateTime (the default).
  • Ingestion time: the time at which the message is stored on the broker. It is used when the broker's or topic's message.timestamp.type is set to LogAppendTime; the broker then sets the timestamp attribute to the current machine time after receiving the message and before writing it to disk. Ingestion time is generally close to the event time and can substitute for it in some scenarios.
  • Processing time: the time at which Kafka Streams processes the message.

Note: Kafka Streams can customize the record time by implementing the org.apache.kafka.streams.processor.TimestampExtractor interface.
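A sketch of such a custom extractor, assuming a hypothetical Order payload type that carries its own event time; it would be registered through the timestamp.extractor configuration property:

import org.apache.kafka.clients.consumer.ConsumerRecord;
import org.apache.kafka.streams.processor.TimestampExtractor;

public class OrderTimestampExtractor implements TimestampExtractor {
    @Override
    public long extract(ConsumerRecord<Object, Object> record, long previousTimestamp) {
        Object value = record.value();
        if (value instanceof Order) {                 // Order is a hypothetical payload type
            return ((Order) value).getEventTimeMs();  // event time embedded in the payload
        }
        return record.timestamp();                    // fall back to the record's own timestamp
    }
}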

2. Window

As mentioned earlier, streaming data is unbounded in time, while aggregation operations can only work on specific, i.e., bounded, datasets. Therefore, bounded data must be selected from the unbounded dataset in some way, according to specific semantics. The window is a very common way to set computation boundaries. Different streaming systems support similar but slightly different windows.

Kafka Streams supports the following windows.

(1) Hopping time window: its definition is shown in the figure below. It has two properties: window size and advance interval. The window size specifies the size of each window, i.e., the dataset of each computation, while the advance interval defines the output interval. A typical application scenario is to output the website's PV or UV over the past hour every 5 seconds (see the sketch after this list).

[Figure: hopping time window]

(2) Tumbling time window: its definition is shown in the figure below. It can be considered a special case of the hopping time window where the window size equals the advance interval. Its defining characteristic is that the windows are completely disjoint.

[Figure: tumbling time window]

(3) Sliding window: this window is only used when two KStreams are joined. The window size defines the maximum time difference between records from the two KStreams for them to be considered in the same window. Assuming a window size of 5 seconds, any two records from the joined KStreams whose timestamps differ by less than 5 seconds are considered to be in the same window and may be joined.

(4) Session window: this window is used to aggregate keyed data after grouping. The data in each group is divided into windows whose start and end points are defined according to business requirements. A typical case is computing the time a user spends on a website: for a specific user (represented by the key), the window starts when a login event occurs and ends when a logout event or a timeout occurs; at the end of the window, the user's visit duration or click count can be computed.
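As a sketch of the hopping-window scenario from (1), counting page views over the past hour with a new result every 5 seconds; pageViews is a hypothetical KStream keyed by page, and the store name is also hypothetical:

// Page views over the past hour, with a new result every 5 seconds
KTable<Windowed<String>, Long> pv = pageViews
        .groupByKey()
        .count(TimeWindows.of(TimeUnit.HOURS.toMillis(1))
                          .advanceBy(TimeUnit.SECONDS.toMillis(5)),
               "pv-store");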

3. Join

Since Kafka Streams has both KStream and KTable datasets, it provides the following join operations:

  • KTable join KTable: the result is still a KTable. Whenever either side is updated, the resulting KTable is updated as well.
  • KStream join KStream: the result is a KStream. A window must be specified, otherwise the join operation would never finish.
  • KStream join KTable/GlobalKTable: the result is a KStream. The join is only triggered, and a result output, when new data arrives in the KStream; updates to the KTable do not trigger the join or output data, and only take effect for subsequent joins. A typical usage scenario is joining order records in a KStream with user information in a KTable, as sketched below.
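A sketch of the order-enrichment case, with hypothetical topics where both sides are keyed by user id:

// Orders (KStream) enriched with user info (KTable), both keyed by user id
KStream<String, String> orders = builder.stream("orders");
KTable<String, String> users = builder.table("users", "users-store");
KStream<String, String> enriched = orders.leftJoin(users,
        (order, user) -> order + " placed by " + (user == null ? "unknown" : user));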

For a join to produce correct results, records with the same key in the KTables or KStreams participating in the join must be assigned to the same task. Concretely:

  • The key types of the joined KTables or KStreams are the same (and, in practice, should have the same business meaning)
  • The topics backing the joined KTables or KStreams have the same number of partitions
  • The partitioning strategies produce equivalent results (the implementations need not be identical, as long as the effect is the same), i.e., records with the same key are assigned to partitions with the same id

If these conditions are not met, the following method can be called to repartition the data so that they are.

KStream<K, V> through(Serde<K> keySerde, Serde<V> valSerde, StreamPartitioner<K, V> partitioner, String topic)
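For example, one side of a join can be pushed through an intermediate topic so that its partitioning matches the other side; a sketch in which the partitioner implementation and the intermediate topic are hypothetical:

// Re-partition the orders stream via an intermediate topic so that records with
// the same key end up in the same partition as the other side of the join
KStream<String, String> repartitioned = orders.through(
        Serdes.String(),            // key serde
        Serdes.String(),            // value serde
        new MyStreamPartitioner(),  // hypothetical StreamPartitioner implementation
        "orders-repartitioned");    // intermediate topic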

4. Aggregation and out-of-order handling

Aggregation operations can be applied to both KStream and KTable. When aggregating a KStream, a window must be specified to bound the target dataset of the computation.

Note that the result of an aggregation operation is always a KTable. Because a KTable is updatable, the resulting KTable can be updated when late data arrives (that is, when data is out of order).

Here is an example. Suppose a KStream performs a count over tumbling time windows of 5 seconds, and records arrive with timestamps of 1 second, 3 seconds, and 5 seconds. At that point the 5-second window has reached its upper bound, so Kafka Streams closes the window, triggers the count, and outputs the result 3 to the KTable (say the result is expressed as <1-5, 3>). If a record with a timestamp of 2 seconds arrives one second later, the 1-5 second window has already been closed. If that record were simply discarded, the earlier result <1-5, 3> would be considered inaccurate. If the complete result <1-5, 4> were output directly to a KStream, the KStream would contain two records for the same window, <1-5, 3> and <1-5, 4>, i.e., dirty data. Therefore, Kafka Streams stores the aggregation result in a KTable, where the new result <1-5, 4> replaces the old result <1-5, 3>. Users get a complete and correct result.

This approach ensures the accuracy of the data while improving fault tolerance.

Note, however, that Kafka Streams does not recompute and update the result set for arbitrarily late data. Instead, it lets the user set a retention period, during which each window's result set is kept in memory. When late data falls into a retained window, it is merged into the computation directly and the resulting KTable is updated. Once the retention period has passed, the window's result is deleted from memory, and late data falling into that window is simply discarded, even if it belongs to the window.
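The retention period can be set on the window itself. A sketch matching the 5-second tumbling-window count above, with a hypothetical store name:

// Tumbling 5-second windows; keep each window's result for 1 minute so that
// late records arriving within that minute can still update the result KTable
KTable<Windowed<String>, Long> counts = stream
        .groupByKey()
        .count(TimeWindows.of(5000L).until(60000L), "counts-store");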

5. Fault tolerance

Kafka Streams implements fault tolerance in the following ways:

  • Highly available partitions ensure no data loss. Each task computes one partition, and Kafka's data replication mechanism keeps the data in each partition highly available, so there is no risk of data loss. And since the data is persisted, it can be recomputed even if a task fails.
  • State stores enable fast failure recovery and resumption from the point of failure. A state store can hold, for instance, aggregation state and join state. Even if a failure or consumer rebalance occurs, the intermediate state can be recovered from the state store, so the computation can continue from the point reached before the failure or rebalance.
  • KTables and the retention period provide the ability to handle out-of-order data.

4、 Summary

  • The parallel model of Kafka Streams is based entirely on Kafka's partition mechanism and rebalance mechanism, enabling online dynamic adjustment of parallelism
  • A task contains all the processors of a sub-topology, so all processing logic is completed within the same thread, avoiding unnecessary network communication overhead and improving efficiency
  • The through method provides a shuffle mechanism similar to Spark's, which makes joining datasets with different partitioning strategies possible
  • Log compaction improves the loading efficiency of Kafka-based state stores
  • State stores make stateful computation possible
  • Offset-based progress management and state-store-based intermediate state management make it possible to resume processing from the breakpoint after a consumer rebalance or failure, guaranteeing fault tolerance
  • The introduction of KTable gives aggregate computations the ability to handle out-of-order data