Apache Kafka is an open source, distributed event streaming platform that delivers data with high throughput and low latency. At scale, it can process trillions of records per day while providing fault tolerance, replication, and automatic disaster recovery.
Although Kafka has many use cases, the most common is acting as a message broker between applications. Kafka can receive, process, and redistribute messages from multiple upstream sources to multiple downstream consumers without requiring the applications to be reconfigured. This makes it possible to stream large volumes of data while keeping applications loosely coupled, supporting scenarios such as distributed computing, logging and monitoring, website activity tracking, and Internet of Things (IoT) device communication.
Since Kafka provides a critical pipeline between applications, its reliability is paramount. We need plans to mitigate several possible failure modes, including:
- Message broker outages and other abnormal cluster conditions
- Failures of Apache ZooKeeper, a key dependency of Kafka
- Failures in upstream and downstream applications
Rather than waiting for these faults to occur in pre-production or production, we can proactively test for them with chaos engineering and develop strategies to reduce their impact. In this article, we will demonstrate how chaos engineering can help improve the reliability of a Kafka deployment. To do so, we will use Gremlin, an enterprise SaaS chaos engineering platform, to create and run four chaos experiments. By the end of this article, you will understand the different ways a Kafka cluster can fail, how to design chaos experiments to test these failure modes, and how to use the observations to improve the cluster's reliability.
We will run these chaos experiments on the Confluent Platform, the enterprise event streaming platform from Kafka's original creators. The Confluent Platform builds on Kafka and adds enterprise features such as a web-based GUI, comprehensive security controls, and easy deployment of multi-region clusters. However, the experiments in this article apply to any Kafka cluster.
Apache Kafka Architecture Overview
To understand how Kafka benefits from chaos engineering, we should first look at Kafka's architecture.
Kafka uses a publisher/subscriber (pub/sub) messaging model to transfer data. Upstream applications (called publishers or producers in Kafka) generate messages that are sent to Kafka servers (called brokers). Downstream applications (called subscribers or consumers in Kafka) then fetch these messages from the brokers. Messages are organized into categories called topics, and consumers can subscribe to one or more topics to consume their messages. By acting as an intermediary between producers and consumers, Kafka lets us manage upstream and downstream applications independently of each other.
Kafka subdivides each topic into partitions. Partitions can be mirrored across multiple brokers to provide replication, which also allows multiple consumers (more specifically, consumer groups) to work on a topic at the same time. To prevent multiple producers from writing to a single partition, each partition has one broker acting as its leader and zero or more brokers acting as followers. New messages are written to the leader and replicated by the followers. When a follower is fully caught up, it is called an in-sync replica (ISR).
This process is coordinated by Apache ZooKeeper, which manages metadata about the Kafka cluster, such as which partitions are assigned to which brokers. ZooKeeper is a required dependency of Kafka (editor's note: ZooKeeper is no longer required as of version 2.8), but it runs as a completely separate service on its own cluster. Improving the reliability of a Kafka cluster therefore also means improving the reliability of its associated ZooKeeper cluster.
Kafka and the Confluent Platform have other components, but these are the most important ones to consider when improving reliability. We will explain other components in more detail as this article introduces them.
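The leader/follower relationship described above can be sketched with a toy model (plain Python, no Kafka required; all broker and message names are illustrative):

```python
# Toy model of a Kafka partition: one leader log and several follower logs.
# A follower whose log has fully caught up with the leader's is an
# in-sync replica (ISR).

def in_sync_replicas(leader_log, follower_logs):
    """Return the ids of followers whose logs fully match the leader's."""
    return [fid for fid, log in follower_logs.items() if log == leader_log]

leader = ["m1", "m2", "m3"]
followers = {
    "broker2": ["m1", "m2", "m3"],  # fully replicated -> in sync
    "broker3": ["m1", "m2"],        # lagging one message -> out of sync
}

print(in_sync_replicas(leader, followers))  # -> ['broker2']
```

In real Kafka, a follower is considered in sync as long as it is within `replica.lag.time.max.ms` of the leader, not only when its log is byte-for-byte identical; the model above simplifies that to exact equality.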
Why run chaos engineering on Kafka?
Chaos engineering is the practice of proactively testing systems for failure in order to make them more resilient. By injecting small, controlled amounts of failure into a system, we can observe the impact and fix the problems we find. This lets us find solutions before problems affect users, and it teaches us more about how the system behaves under various conditions.
Distributed systems such as Kafka are difficult to manage and operate efficiently due to their many configuration options, the flexible ways producers and consumers can be deployed, and many other factors. It is not enough to keep our broker and ZooKeeper nodes from failing; we need to consider the more subtle and unpredictable problems that can occur in applications, replicas, and other infrastructure components. These can affect the entire deployment in unexpected ways and, if they occur in production, can require significant troubleshooting effort.
With chaos engineering, we can proactively test for these types of failures and address them before deploying to production, reducing the risk of outages and emergencies.
Running chaos experiments on Kafka
In this section, we will walk through deploying and running four different chaos experiments on the Confluent Platform. A chaos experiment is a planned process of injecting failure into a system to understand how it responds. Before running any experiment on a system, the experiment should be fully thought through and designed.
When creating an experiment:
- The first step is to form a hypothesis: the question to answer and the expected result. For example, if the experiment tests the ability to withstand a broker outage, the hypothesis might state: "If a broker node fails, messages will be automatically routed to other brokers without data loss."
- The second step is to define the blast radius: the infrastructure components affected by the experiment. Reducing the blast radius limits the potential harm the experiment can do to the infrastructure while keeping the focus on specific systems. We strongly recommend starting with the smallest possible blast radius and increasing it as you grow more comfortable running chaos experiments. You should also define the magnitude, i.e., the scale or intensity of the attack being injected. For example, a low-magnitude experiment might add 20 milliseconds of latency to network traffic between producers and brokers, while a high-magnitude experiment might add 500 milliseconds, since that would noticeably affect performance and throughput. As with the blast radius, start with a low magnitude and increase it gradually.
- The third step is to monitor the infrastructure. Determine which metrics will help you draw conclusions about your hypothesis, take measurements before the test to establish a baseline, and record these metrics throughout the test so that both expected and unexpected changes can be observed. With the Confluent Platform, we can use Control Center to visually observe the cluster's performance in real time from a web browser.
- The fourth step is to run the experiment. Gremlin lets you run experiments on applications and infrastructure in a simple, safe, and reliable way. We do this by running attacks, which provide a variety of ways to inject failure into a system. We also define abort conditions: the conditions under which we should stop the test to avoid unintended damage. With Gremlin, we can define Status Checks as part of a Scenario. A Status Check verifies the health of a service while the attack is running; if the infrastructure is unhealthy and the Status Check fails, the experiment stops automatically. We can also stop the experiment immediately with the built-in halt button.
- The fifth step is to draw conclusions from the observations. Do they confirm or refute the original hypothesis? Use the results to adapt the infrastructure, then design new experiments around these improvements. Repeating this process over time will help make the Kafka deployment more resilient. The experiments in this article are not exhaustive; treat them as a starting point for experimenting on your own systems. Remember that although we run these experiments on the Confluent Platform, they can be performed on any Kafka cluster.
Note that we are using Confluent Platform 5.5.0, which is based on Kafka 2.5.0. Screenshots and configuration details may vary by version.
Experiment 1: the impact of broker load on processing latency
Resource utilization can have a significant impact on message throughput. If a broker is experiencing high CPU, memory, or disk I/O utilization, its ability to process messages will be limited. Since Kafka's efficiency depends on its slowest component, latency can cascade through the entire pipeline and cause failure conditions such as producer backups and replication lag. High load also affects cluster operations such as broker health checks, partition reassignments, and leader elections, putting the whole cluster into an abnormal state.
Two of the most important metrics to consider when tuning Kafka are network latency and disk I/O. Brokers constantly read and write data to local storage, and as message rates and cluster size grow, bandwidth usage can become a limiting factor. When sizing a cluster, we should determine the point at which resource utilization begins to adversely affect performance and stability.
To find that point, we will run a chaos experiment that gradually raises disk I/O utilization on the brokers and observe its impact on throughput. While running this experiment, we will use the Kafka Music demo application to send a continuous stream of data. The application sends messages to multiple topics distributed across all three brokers and uses Kafka Streams to aggregate and process the messages.
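A back-of-envelope calculation can help with sizing. The sketch below (illustrative numbers, not measurements from the article) shows how write bandwidth scales with message rate, message size, and the replication factor:

```python
# Rough estimate of the disk/network write bandwidth a cluster must absorb.
# Every message is written once by the leader and once by each follower,
# so total writes scale with the replication factor.

def write_bandwidth_mb_per_s(msgs_per_sec, avg_msg_bytes, replication_factor):
    """Total cluster write bandwidth in MB/s (1 MB = 1,000,000 bytes here)."""
    return msgs_per_sec * avg_msg_bytes * replication_factor / 1_000_000

# e.g. 10,000 msgs/s of 1 KB messages, replicated 3 ways:
print(write_bandwidth_mb_per_s(10_000, 1_000, 3))  # -> 30.0 (MB/s)
```

Comparing such an estimate against the utilization level at which your chaos experiments show degradation tells you how much headroom the cluster actually has.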
Generating broker load using the IO Gremlin
In this experiment, we will use the IO Gremlin to generate a large number of disk I/O requests on the broker nodes. We will create a Scenario that increases the intensity of the attack over four stages. Each attack runs for three minutes with one minute in between, so we can easily correlate changes in I/O utilization with changes in throughput.
We will also create a Status Check that uses the Kafka Monitoring API to check broker health between stages. A Status Check sends an automated HTTP request to a chosen endpoint, in this case our cluster's REST API server. We will use the topic endpoint to retrieve broker state and parse the JSON response to determine whether the brokers are currently in sync. If any broker falls out of sync, we immediately halt the experiment and mark it as failed. We will also use Confluent Control Center to monitor throughput and latency while the Scenario runs.
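The parsing half of that Status Check might look like the sketch below. The exact response shape depends on your REST API version, so the sample payload here is a simplified assumption, not the real Confluent response:

```python
import json

# Sketch of the Status Check logic: parse a (simplified, assumed) JSON
# response describing partition replicas and report any broker hosting
# an out-of-sync replica.

SAMPLE_RESPONSE = json.dumps([
    {"partition": 0, "leader": 1,
     "replicas": [{"broker": 1, "in_sync": True}, {"broker": 2, "in_sync": True}]},
    {"partition": 1, "leader": 2,
     "replicas": [{"broker": 2, "in_sync": True}, {"broker": 3, "in_sync": False}]},
])

def out_of_sync_brokers(response_body):
    """Return a sorted list of broker ids with at least one out-of-sync replica."""
    partitions = json.loads(response_body)
    return sorted({r["broker"]
                   for p in partitions for r in p["replicas"]
                   if not r["in_sync"]})

bad = out_of_sync_brokers(SAMPLE_RESPONSE)
if bad:
    print(f"abort experiment: brokers out of sync: {bad}")  # -> [3]
```

In a Gremlin Status Check the equivalent logic runs against the live endpoint; a non-empty result would fail the check and halt the Scenario.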
- Hypothesis: increasing disk I/O will cause a corresponding drop in throughput.
- Conclusion: even when disk I/O was raised above 150 MB/s, the attack had no significant impact on throughput or latency. Both metrics remained stable, none of our brokers fell out of sync or became under-replicated, and no messages were lost or corrupted.
This leaves us plenty of headroom for now, but throughput requirements may grow as the application's scope expands. We should keep a close eye on disk I/O utilization to make sure we have room to scale. If you begin to notice rising disk I/O and falling throughput, consider:
- Using faster storage devices, such as higher-RPM disks or solid-state storage
- Using a more efficient compression algorithm, such as Snappy or LZ4
Experiment 2: risk of data loss caused by dropped messages
To ensure that messages are delivered successfully, producers and brokers use an acknowledgment mechanism. When a broker commits a message to its local log, it responds to the producer with an acknowledgment confirming that the message was received, and the producer can send the next message. This reduces the risk of losing messages between producers and brokers, but it does not prevent message loss between brokers.
For example, suppose a broker leader has just received a message from a producer and sent an acknowledgment. Each of the broker's followers should now fetch the message and commit it to its own local log. However, the broker fails unexpectedly before any of its followers fetch the latest message. None of the followers knows the producer sent the message, but the producer has already received an acknowledgment and moved on to the next message. Unless we can recover the failed broker or find some other way to resend the message, it is effectively lost.
How do we determine the risk of this happening to our cluster? With chaos engineering, we can simulate a broker leader failure and monitor the message flow to determine potential data loss.
Simulating a broker leader outage using the Blackhole Gremlin
In this experiment, we will use the Blackhole Gremlin to drop all network traffic to and from a broker. This experiment depends heavily on timing, since we want the broker to fail after it receives a message but before its followers can replicate it. This can be done in two ways:
- Send a continuous stream of messages at a shorter interval than the followers' fetch wait time (`replica.fetch.wait.max.ms`), start the experiment, and look for gaps in the consumer's output.
- Trigger the chaos experiment from the producer application via the Gremlin API immediately after sending a message.
In this experiment, we will use the first approach. The application generates a new message every 100 milliseconds. The output of the message stream is recorded as a JSON list and analyzed for gaps or timing inconsistencies. We will run the attack for 30 seconds, during which roughly 300 messages will be generated (one every 100 milliseconds).
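The post-experiment analysis can be a simple scan of the recorded timestamps for spacings much larger than the 100 ms production interval. A minimal sketch (the sample timestamps are made up for illustration):

```python
# Scan recorded message timestamps (in milliseconds) for gaps much larger
# than the expected 100 ms production interval; a large gap would indicate
# lost messages during the attack.

def find_gaps(timestamps_ms, expected_interval_ms=100, tolerance=2.0):
    """Return (prev, curr) timestamp pairs spaced more than tolerance x interval apart."""
    ts = sorted(timestamps_ms)  # Kafka does not guarantee cross-partition ordering
    limit = expected_interval_ms * tolerance
    return [(a, b) for a, b in zip(ts, ts[1:]) if b - a > limit]

# 100 ms apart except for a 500 ms hole between 300 and 800:
sample = [0, 100, 200, 300, 800, 900]
print(find_gaps(sample))  # -> [(300, 800)]
```

Sorting first matters: out-of-order delivery across partitions is normal and should not be flagged as loss, only genuinely missing intervals should.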
- Hypothesis: we will lose a few messages due to the leader's failure, but Kafka will quickly elect a new leader and successfully replicate messages again.
- Result: despite the leader's sudden failure, the message list showed every message as successfully delivered. Because additional messages were recorded before and after the experiment, our pipeline generated a total of 336 events, each with a timestamp roughly 100 milliseconds after the previous one. The messages did not appear in chronological order, but that is fine because Kafka does not guarantee message ordering across partitions. Here is an example of the output:
If you want to guarantee that all messages are saved, set acks=all in the producer configuration. This tells the producer not to send a new message until the current message has been replicated to the partition leader and all of its in-sync replicas. This is the safest option, but it limits throughput to the speed of the slowest broker and can therefore have a significant impact on performance and latency.
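As a configuration sketch, the change is a single producer property (`acks` is a standard Kafka producer setting; the timeout line is an optional, commonly paired setting):

```properties
# producer.properties: wait for the leader and all in-sync replicas to
# acknowledge each message before the send is considered complete
acks=all
# optional: bound how long the producer keeps retrying before reporting failure
delivery.timeout.ms=120000
```

Pairing acks=all with a topic-level min.insync.replicas setting of 2 or more is what actually guarantees that an acknowledged message survives the loss of a single broker.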
Experiment 3: avoiding a split brain
Kafka, ZooKeeper, and similar distributed systems are susceptible to a problem known as "split brain." In a split brain, two nodes in the same cluster fall out of sync and partition, producing two separate and potentially incompatible views of the cluster. This can lead to inconsistent data, data corruption, and even the formation of a second cluster.
How does this happen? In Kafka, a single broker node is assigned the controller role. The controller is responsible for detecting changes in cluster state, such as failed brokers, leader elections, and partition assignments. Each cluster has one and only one controller in order to maintain a single consistent view of the cluster. Although this makes the controller a single point of failure, Kafka has a process for handling such failures.
All brokers regularly check in with ZooKeeper to prove their health. If a broker takes longer to respond than the zookeeper.session.timeout.ms setting (18,000 ms by default), ZooKeeper marks the broker as unhealthy. If that broker is the controller, a controller election is triggered and another broker becomes the new controller. The new controller is assigned a number called the controller epoch, which tracks the most recent controller election. If the failed controller comes back online, it compares its own controller epoch with the epoch stored in ZooKeeper, recognizes that a new controller has been elected, and falls back to being a normal broker.
This process protects against the failure of a single broker, but what if a majority of brokers suffer a major failure? Can we restart them without creating a split brain? We can verify this with chaos engineering.
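The epoch check that prevents a returning controller from creating a split brain can be sketched as a toy model (plain Python; the real mechanism lives inside the broker and ZooKeeper):

```python
# Toy model of the controller-epoch comparison described above: a returning
# controller compares its own epoch with the one stored in ZooKeeper and
# steps down if a newer controller was elected while it was unreachable.

def resolve_role(own_epoch, stored_epoch):
    """Return the role a returning broker should assume."""
    if own_epoch < stored_epoch:
        return "follower"    # a newer controller exists: rejoin as a normal broker
    return "controller"      # still the most recently elected controller

# broker1 was controller in epoch 1; broker3 was elected in epoch 2 while
# broker1 was partitioned away:
print(resolve_role(own_epoch=1, stored_epoch=2))  # -> 'follower'
```

Because the epoch only ever increases, both sides of a partition can always agree on which controller is newest once connectivity is restored, which is exactly what keeps the cluster from ending up with two active controllers.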
Restarting a majority of broker nodes using the Shutdown Gremlin
In this experiment, we will use the Shutdown Gremlin to restart two of the three broker nodes in the cluster. Since this experiment poses a potential risk to cluster stability (for example, we don't want to accidentally shut down all three brokers), we want to make sure all three are healthy before running it. We will create a Status Check that fetches the list of healthy brokers from the Kafka Monitoring API to verify that all three brokers are up and running.
Here is our fully configured Scenario, showing the Status Check and the Shutdown Gremlin:
- Hypothesis: Kafka's throughput will stop temporarily, but both broker nodes will rejoin the cluster without problems.
- Result: Control Center still listed three brokers, but showed two of them as out of sync and with under-replicated partitions. This is expected, since those nodes had lost contact with the remaining broker and with ZooKeeper.
When the previous controller (broker1) went offline, ZooKeeper immediately elected the remaining broker (broker3) as the new controller. Since the two brokers restarted without exceeding the ZooKeeper session timeout, the broker uptime chart shows them as having been online the whole time. However, when we moved our message pipeline to broker3 and looked at the throughput and replica charts, we saw that the restart had a significant impact on throughput and partition health.
Nevertheless, the brokers rejoined the cluster without incident. We can conclude that our cluster can withstand a temporary majority failure. Performance will degrade significantly, and the cluster will need to elect new leaders, reassign partitions, and replicate data between brokers, but it will not end up in a split-brain situation. The result might be different if it took longer to restore the brokers, so we want to make sure an incident response plan is in place for a major production outage.
Experiment 4: ZooKeeper outage
ZooKeeper is a core dependency of Kafka. It is responsible for activities such as identifying brokers, electing leaders, and tracking the distribution of partitions across brokers. A ZooKeeper outage will not necessarily cause Kafka to fail, but it can lead to unexpected problems the longer it goes unresolved.
In one example, HubSpot experienced a ZooKeeper failure caused by a short burst of backed-up requests. ZooKeeper could not recover for several minutes, which in turn caused Kafka nodes to crash. Data was corrupted as a result, and the team had to manually restore backup data to the servers. While this was an unusual situation that HubSpot resolved, it underscores the importance of testing ZooKeeper and Kafka both as individual services and as a whole.
Simulating a ZooKeeper outage using the Blackhole Gremlin
In this experiment, we want to verify that our Kafka cluster can survive an unexpected ZooKeeper outage. We will use the Blackhole Gremlin to drop all traffic to and from the ZooKeeper nodes, running the attack for five minutes while monitoring cluster state in Control Center.
- Hypothesis: Kafka can tolerate a short-term ZooKeeper outage without crashing, losing data, or corrupting data. However, any changes to cluster state will not be resolved until ZooKeeper is back online.
- Result: running the experiment had no effect on message throughput or broker availability. As hypothesized, messages continued to be produced and consumed without unexpected problems.
If one of the brokers fails during the outage, it cannot rejoin the cluster until ZooKeeper comes back online. This alone is unlikely to cause a failure, but it can lead to another problem: cascading failures. For example, a broker failure will cause producers to shift load onto the other brokers. If those brokers are already near capacity, they could collapse in turn. Even if we bring the failed broker back online, it will not be able to rejoin the cluster until ZooKeeper is available again.
This experiment shows that we can tolerate a temporary ZooKeeper failure, but we should still work quickly to bring it back online. We should also look for ways to mitigate the risk of a complete outage, such as distributing ZooKeeper nodes across multiple regions for redundancy. Although this increases operating costs and adds latency, it is a small price compared to a production failure of the entire cluster.
Further improving Kafka's reliability
Kafka is a complex platform with many interdependent components and processes. Operating Kafka reliably takes planning, continuous monitoring, and proactive failure testing. This applies not only to our Kafka and ZooKeeper clusters but also to the applications that use Kafka. Chaos engineering lets us find reliability problems in a Kafka deployment safely and effectively. Preparing for today's failures prevents or reduces the risk and impact of future failures, saving the organization time and effort and preserving customers' trust.
We have now shown four different chaos experiments for Kafka. Try signing up for a free Gremlin account to run these experiments yourself. When creating experiments, consider potential points of failure in your Kafka deployment (for example, the connection between brokers and consumers) and observe how they respond to different attacks. If you find a problem, implement a fix and repeat the experiment to verify that it solves the problem. As the system becomes more reliable, gradually increase the magnitude (the intensity of the experiment) and the blast radius (the number of systems affected) to test the entire deployment thoroughly.
Source: Chaos Engineering Practice; author: Li Dashan
Original author: Andree Newman; original source: gremlin.com
Original title: The first 4 chaos experiments to run on Apache Kafka