Chaos Mesh: Let Your Applications Dance with Chaos on Kubernetes


Author: Yin Chengwen

On December 31, 2019, we officially open sourced Chaos Mesh on GitHub. As a cloud-native chaos testing platform, Chaos Mesh provides chaos engineering capabilities on Kubernetes. This article introduces the origin and design of Chaos Mesh and walks through concrete cases to explore the world of chaos testing.


In the real world, failures of all kinds can occur anytime and anywhere, and many of them cannot be avoided: a disk suddenly fails, or a data center abruptly loses network connectivity and power. Such failures can cause huge losses, so improving a system's tolerance for failures has become a goal for many engineers.

To make it easier to verify a system's tolerance for various faults, Netflix created a "monkey" named Chaos Monkey and released it into its AWS cloud environment to inject faults into infrastructure and business systems. This monkey is the origin of chaos engineering.

We faced the same problem at PingCAP, so we started exploring chaos engineering very early and gradually put it into practice inside the company.

In our initial practice, we built a custom automated test platform for TiDB, in which we could define our own test scenarios and simulate various error conditions. However, as the TiDB ecosystem matured and peripheral tools such as TiDB Binlog, TiDB Data Migration, and TiDB Lightning emerged, testing requirements kept growing, and each component gradually developed its own test framework. Because the requirements of chaos experiments are common across components, a general-purpose chaos tool became especially important. We eventually separated the chaos-related implementation from the automated test platform, which became the original prototype of Chaos Mesh. After redesign and improvement, we finally open sourced it on GitHub.

What can Chaos Mesh do?


(Figure: abnormal QPS recovery time observed after TiKV node downtime is injected with Chaos Mesh)

Here we use Chaos Mesh to observe how QPS changes when TiKV goes down. TiKV is the distributed storage engine of TiDB. Our expectation is that in most cases, when a TiKV node goes down, QPS may jitter briefly, but once the TiKV node recovers, QPS should return to its pre-fault level within a short time. The monitoring curve shows that in the first two runs, QPS indeed returned to normal shortly after the TiKV node recovered. In the last experiment, however, QPS did not return to normal within a short time after the node recovered, which did not match expectations. After investigation, we confirmed that the then-current version (v3.0.1) of the TiDB cluster did have problems handling TiKV downtime; these have been fixed in newer versions by PRs tidb/11391 and tidb/11344.

The scenario described above is just one of our routine chaos experiments. Chaos Mesh also supports many other kinds of fault injection:

  • pod-kill: simulates a Kubernetes pod being killed.
  • pod-failure: simulates a Kubernetes pod being continuously unavailable, which can be used to simulate node downtime.
  • network delay: simulates network latency.
  • network loss: simulates network packet loss.
  • network duplication: simulates network packet duplication.
  • network corrupt: simulates network packet corruption.
  • network partition: simulates a network partition.
  • I/O delay: simulates file system I/O latency.
  • I/O errno: simulates file system I/O errors.
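Each of these fault types is declared the same way, as a Kubernetes object. As a taste of what that looks like, here is a sketch of a network-delay experiment; the field names follow the v1alpha1 CRDs discussed later in this post, and the latency values and target namespace are purely illustrative assumptions:

```yaml
apiVersion: pingcap.com/v1alpha1
kind: NetworkChaos
metadata:
  name: network-delay-example
  namespace: chaos-testing
spec:
  action: delay           # one of the network fault types listed above
  mode: one               # pick one pod at random from the matched set
  selector:
    namespaces:
      - tidb-cluster-demo
  delay:
    latency: "90ms"       # illustrative values
    jitter: "90ms"
  duration: "30s"
  scheduler:
    cron: "@every 2m"
```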

Thinking behind

From the introduction above, you can see that the goal of Chaos Mesh is to be a general-purpose chaos testing tool, so we laid down several principles at the beginning.

Ease of use

  • No special dependencies: it can be deployed directly on a Kubernetes cluster, including Minikube.
  • Ideally, applications can be deployed into the chaos environment without any modification.
  • It is easy to orchestrate the fault-injection behavior of an experiment, view its status and results, and quickly roll back injected faults.
  • The details of the underlying implementation are hidden, so users can focus on orchestrating their own experiments.

Extensibility


  • New fault-injection types can be added easily on top of the existing implementation.
  • It is easy to integrate into other test frameworks.

As a general-purpose tool, ease of use is essential. No matter how many features a tool has or how powerful it is, if it is not easy to use, it will eventually lose its users, and with them its value.

On the other hand, given ease of use, extensibility is also essential. Today's distributed systems are increasingly complex, and new problems emerge endlessly. The goal of Chaos Mesh is that when new requirements appear, they can be implemented easily within Chaos Mesh, rather than requiring a whole new wheel to be built.

Under the hood

Why Kubernetes?

In the container world, Kubernetes is the undisputed protagonist. Its growth has been far faster than expected, and it has decisively won the war of container orchestration and management. In other words, Kubernetes is becoming something like an operating system for the cloud.

TiDB is a truly cloud-native distributed open source database. Our internal automated test platform was built on Kubernetes from the beginning; every day it runs hundreds of TiDB clusters, carrying out all kinds of experiments, including functional tests, performance tests, and, for a large part, chaos tests that simulate situations that might occur in reality. To support these chaos experiments, combining chaos with Kubernetes was a natural choice.


Design of the CRDs

In Chaos Mesh, CRDs (Custom Resource Definitions) are used to define chaos objects. In the Kubernetes ecosystem, CRD is a mature mechanism for implementing custom resources, with very mature implementation cases and tool sets available. Using CRDs lets us harness the power of the ecosystem, avoid reinventing the wheel, and integrate better with Kubernetes.

Our original idea was to define all fault-injection types in a single unified CRD object. During the actual design, however, we found that this could not work: different fault-injection types differ too much, and you cannot predict what types might be added later. It is hard for one structure to cover all scenarios; it would eventually become extremely complex and large, and would easily introduce potential bugs.

Therefore, CRDs in Chaos Mesh can be defined freely: separate CRD objects are defined for different fault-injection types. If a newly added fault injection fits an existing CRD object's definition, you can extend that CRD object; if it is a completely different type of fault injection, you can add a new CRD. This design separates the definitions and logic of different fault-injection types from the top down, making the code structure clearer and reducing both coupling and the probability of errors. In addition, controller-runtime provides a good encapsulation of controller implementation, so we do not need to implement a separate set of controller logic for each CRD object, avoiding a lot of repetitive work.

At present, three CRD objects are defined in Chaos Mesh: PodChaos, NetworkChaos, and IoChaos. From the naming, it is easy to tell which fault-injection types these CRD objects correspond to.

Take PodChaos as an example:

 apiVersion: pingcap.com/v1alpha1
 kind: PodChaos
 metadata:
   name: pod-kill-example
   namespace: chaos-testing
 spec:
   action: pod-kill
   mode: one
   selector:
     namespaces:
       - tidb-cluster-demo
     labelSelectors:
       "app.kubernetes.io/component": "tikv"
   scheduler:
     cron: "@every 2m"

PodChaos objects are used to inject faults related to pods themselves. The action field defines the specific fault; for example, pod-kill defines the behavior of randomly killing pods. In Kubernetes, a pod going down is a very common issue, and many native resource objects handle it automatically, for instance by pulling up a new pod. But can our application really cope with such errors? And what if the pod simply won't come back?

PodChaos can simulate this behavior very well. The selector option delimits the scope into which you want to inject chaos, and the scheduler option defines the time and frequency of the chaos experiment. For more details, please refer to the Chaos Mesh usage documentation.
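For instance, to simulate a pod staying down for a while rather than being killed once, a pod-failure experiment pairs a duration with the schedule. This sketch assumes the same v1alpha1 schema as the pod-kill example above; the duration value and the label key are illustrative assumptions:

```yaml
apiVersion: pingcap.com/v1alpha1
kind: PodChaos
metadata:
  name: pod-failure-example
  namespace: chaos-testing
spec:
  action: pod-failure
  mode: one
  duration: "30s"          # how long each injected outage lasts (illustrative)
  selector:
    namespaces:
      - tidb-cluster-demo
    labelSelectors:
      "app.kubernetes.io/component": "tikv"
  scheduler:
    cron: "@every 2m"      # start a new experiment every two minutes
```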

Let’s go a little further and talk about how chaos mesh works.

Principle analysis

(Figure: basic workflow of Chaos Mesh)

The figure above is a schematic of the basic workflow of Chaos Mesh:

  • Controller-manager

    At present, the controller-manager has two parts: one schedules and manages the lifecycle of CRD object instances; the other is the admission webhooks, which dynamically inject sidecar containers into pods.

  • Chaos-daemon

    Chaos-daemon runs as a daemonset with privileged mode, which allows it to operate on the network devices and cgroups of specific nodes.

  • Sidecar

    A sidecar container is a special kind of container that is dynamically injected into the target pod by the admission webhooks. Chaos Mesh currently implements the chaosfs sidecar container, which runs a fuse daemon to hijack the I/O operations of the application container.

The overall workflow is as follows:

  1. Users create or update chaos objects through the Kubernetes API server, using a YAML file or a Kubernetes client.
  2. Chaos Mesh watches the chaos objects in the API server for create, update, and delete events, and maintains the running state and lifecycle of specific chaos experiments. In this process, the controller-manager, chaos-daemon, and sidecar containers work together to provide the fault-injection capability.
  3. The admission webhooks form an HTTP callback service that receives admission requests. When a pod-creation request is received, the pod object to be created is dynamically modified, for example by injecting the sidecar container into the pod. Step 3 can also happen before step 2, when the application is first deployed.

Getting practical

The sections above explained how Chaos Mesh works. In this section, we will look at how to use it.

Chaos Mesh requires Kubernetes v1.12 or above. Deployment and management of Chaos Mesh is done through Helm, the package management tool for Kubernetes. Before running Chaos Mesh, make sure Helm is properly installed in the Kubernetes cluster.

If you do not have a Kubernetes cluster, you can quickly start a multi-node Kubernetes cluster locally using the script provided by Chaos Mesh:

# Install kind (the release URL was elided here; see the kind releases page)
curl -Lo ./kind <kind-release-binary-url>/kind-$(uname)-amd64
chmod +x ./kind
mv ./kind /some-dir-in-your-PATH/kind

# Get the script
git clone https://github.com/pingcap/chaos-mesh.git
cd chaos-mesh
# Start the cluster (run the kind bootstrap script provided in the repo)

Note that if the Kubernetes cluster is started locally, network-related fault injection will be affected.

After the Kubernetes cluster is ready, you can install and deploy Chaos Mesh through Helm and kubectl:

git clone https://github.com/pingcap/chaos-mesh.git
cd chaos-mesh
# Create the CRD resources
kubectl apply -f manifests/
# Install Chaos Mesh (Helm v2 syntax)
helm install helm/chaos-mesh --name=chaos-mesh --namespace=chaos-testing
# Check the status of Chaos Mesh
kubectl get pods --namespace chaos-testing -l app.kubernetes.io/instance=chaos-mesh

Once all the components of Chaos Mesh are ready, you can play as much as you like!

Currently, there are two ways to use Chaos Mesh.

Defining chaos with a YAML file

A YAML file is a very convenient way to define your own chaos experiment. Once your application is deployed, a chaos experiment can be launched in the shortest possible time.

For example, suppose we have deployed a TiDB cluster called chaos-demo-1 (TiDB can be deployed using TiDB Operator). To simulate a scenario where TiKV pods are frequently deleted, we can write the following definition:

 apiVersion: pingcap.com/v1alpha1
 kind: PodChaos
 metadata:
   name: pod-kill-chaos-demo
   namespace: chaos-testing
 spec:
   action: pod-kill
   mode: one
   selector:
     namespaces:
       - chaos-demo-1
     labelSelectors:
       "app.kubernetes.io/component": "tikv"
   scheduler:
     cron: "@every 1m"

Save the definition above to a file named kill-tikv.yaml, then execute kubectl apply -f kill-tikv.yaml, and the corresponding fault will be injected into the chaos-demo-1 cluster.

(Demo: sysbench QPS during the pod-kill experiment)

In the demo above, a sysbench program continuously tests the TiDB cluster. When the fault is injected, the sysbench QPS jitters noticeably; inspecting the pods shows that one TiKV pod has been deleted and that Kubernetes has recreated a new TiKV pod for the TiDB cluster.

More YAML file examples can be found in the Chaos Mesh repository.

Using the Kubernetes API

Chaos Mesh uses CRDs to define chaos objects, so we can operate on these CRD objects directly through the Kubernetes API. This makes it very convenient to drive Chaos Mesh from our own programs, customize test scenarios, and run chaos experiments automatically and continuously.

For example, in the test-infra project, we use Chaos Mesh to simulate etcd cluster exceptions in a Kubernetes environment, such as node restarts, network failures, and file system failures.

Kubernetes API example:

import (
    "context"

    chaosv1alpha1 "github.com/pingcap/chaos-mesh/api/v1alpha1"
    "k8s.io/client-go/kubernetes/scheme"
    "sigs.k8s.io/controller-runtime/pkg/client"
)

func main() {
    // Define the chaos object to inject, e.g. a network delay.
    delay := &chaosv1alpha1.NetworkChaos{
        Spec: chaosv1alpha1.NetworkChaosSpec{...},
    }
    // conf is a *rest.Config pointing at the target cluster.
    k8sClient, _ := client.New(conf, client.Options{Scheme: scheme.Scheme})
    // Create the chaos experiment, and later delete it to stop the injection.
    k8sClient.Create(context.TODO(), delay)
    k8sClient.Delete(context.TODO(), delay)
}
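Since chaos objects are plain Kubernetes resources, experiments can also be generated dynamically before being submitted to the API server. Below is a minimal, self-contained sketch of that idea: it builds a PodChaos manifest as a map and serializes it to JSON. The fields follow the PodChaos examples shown earlier; the apiVersion, label key, and function name are assumptions for illustration, not part of any Chaos Mesh API.

```go
package main

import (
	"encoding/json"
	"fmt"
)

// podKillManifest builds a PodChaos manifest equivalent to the YAML examples
// in this post. The apiVersion and the label key here are assumptions based
// on those examples.
func podKillManifest(name, ns, targetNS, cron string) map[string]interface{} {
	return map[string]interface{}{
		"apiVersion": "pingcap.com/v1alpha1",
		"kind":       "PodChaos",
		"metadata":   map[string]string{"name": name, "namespace": ns},
		"spec": map[string]interface{}{
			"action": "pod-kill",
			"mode":   "one",
			"selector": map[string]interface{}{
				"namespaces":     []string{targetNS},
				"labelSelectors": map[string]string{"app.kubernetes.io/component": "tikv"},
			},
			"scheduler": map[string]string{"cron": cron},
		},
	}
}

func main() {
	// Serialize the manifest; the result could be written to a file for
	// kubectl apply, or decoded into a typed object for a client library.
	out, err := json.MarshalIndent(
		podKillManifest("pod-kill-chaos-demo", "chaos-testing", "chaos-demo-1", "@every 1m"),
		"", "  ")
	if err != nil {
		panic(err)
	}
	fmt.Println(string(out))
}
```

A generator like this is handy when the same experiment must be stamped out for many clusters or schedules, which is exactly the kind of automation the Kubernetes API approach enables.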

Looking ahead

Beyond the infrastructure-layer chaos described above, we can inject fault types at a broader and finer-grained level.

With the help of eBPF and other tools, we can inject specific errors at the system call and kernel level, and more conveniently simulate scenarios such as a physical machine losing power.

By integrating failpoints, we can even inject specific error types into application functions and statements, covering scenarios that conventional injection methods cannot reach. Most attractively, all of these fault injections, from the application level down to the system level, can be driven through a consistent interface.

In addition, we will support and improve the Chaos Mesh dashboard to better visualize the impact of fault injection on business, and provide an easy-to-use fault orchestration interface that helps teams run fault injection more easily and understand their applications' tolerance for different types of errors and their ability to recover from faults.

Beyond verifying an application's fault tolerance, we also hope to quantify how long a service takes to recover after fault injection, and to bring the chaos capability to cloud platforms around the world. These requirements will give rise to components closely built around the chaos capability, such as Chaos Mesh Verifier and Chaos Mesh Cloud, enabling a more comprehensive inspection of distributed systems.

Come and join us!

Having said all this, last but not least: the Chaos Mesh project has only just begun, and open sourcing it is just the starting point. We need you to participate. Let our applications dance with chaos on Kubernetes together!

If you find a bug or a missing feature while using Chaos Mesh, feel free to open an issue or a PR on GitHub and join the discussion.

