The Way of MatrixOne Chaos Testing


Author: Su Dong, MO Test Engineer


Migrating to the cloud and building cloud-native applications on a distributed architecture has been a major trend in recent years, and the trend is accelerating. Its main drivers are greatly reduced application downtime, high elasticity, and high resource utilization, all of which add business value. However, this architecture brings new challenges to software testing: testing a cloud-native, distributed system differs from the traditional methods used to test other kinds of application systems, such as monolithic applications, and it gets complicated. Applications tend to be more dynamic and distributed, are released at a faster pace as microservices, and have failure modes that are hard to predict and trace.
Traditional testing techniques are stretched thin by such problems, so a seemingly new kind of testing, chaos testing, came into being. Since then, more and more concepts, theories, and tools related to chaos testing have been proposed, discussed, and implemented. So what exactly is chaos testing, what problems can it solve, and how should it be carried out? There are currently no authoritative answers, and there may never be a one-size-fits-all standard.
MatrixOne is a cloud-native, distributed database, so it naturally has a strong need for chaos testing. This article shares the overall picture of the MatrixOne testing team’s chaos testing at the theoretical and methodological level.

Part 1 How to understand chaos testing

In the industry, chaos testing is often called a fault drill: a testing method based on fault simulation and injection, used to deal with chaos in large-scale distributed systems. However, chaos testing is fundamentally different from other test types; it is not limited to testing and is closer to an engineering practice. Different industries and products have different classification standards and requirements for software testing, but in essence they are similar. To start, here is how the MO test team divides its tests in order to better develop and manage product testing.
[Figure: the MO test team’s test classification]
You will find that chaos testing does not fit neatly into any one test type: it is related to every test type, yet identical to none. Here we borrow the definition formula for “software testing” given by Professor Zhu Shaomin to explain how the MO testing team defines chaos testing:
Test = detect the known + experiment on the unknown
“Known” means that the test objectives, test requirements, and verification criteria are clear and the item has good testability. “Unknown” means that the test objectives, test requirements, and verification criteria are not clear and are hard to verify directly; continuous experiments are needed to learn whether the implemented features behave correctly.
In plain terms, “known” items are within the reach of manual effort, while “unknown” items lie beyond it, and this boundary is strict. All the tests in the figure above target “known” items, whereas chaos testing targets the “unknown” ones. We believe the following principles apply to these “unknown” items:
1. The “unknown” may exist in any test dimension, occur in any test stage, and be hidden in any test object or test method;
2. The scope of the “unknown” is bounded: it covers only the quality elements that the team cares about but cannot effectively resolve through manual effort;
3. The “unknown” must be measurable, otherwise any test against it is meaningless; the measurement standard, however, may be vague, expressed as a range, or refined progressively;
4. Experiments on the “unknown” must be carried out with the help of tools, or of engineering practices supported by a set of tools;
5. As experiments on the “unknown” progress, more and more “unknown” items become “known” items.
The following figure helps to further understand chaos testing:
[Figure: chaos testing and the “unknown” items]
Therefore, in MO’s testing system, every testing effort that makes the quality elements of the current product easier to perceive, manage, and evaluate belongs to chaos testing. Chaos testing is not a new testing technology or a subversive testing concept. It still follows the essence of software testing, which is to provide quality information and confidence for product and business activities. The ultimate goal of chaos testing is therefore to turn “chaos” into “non-chaos”.

Part 2 How to conduct chaos testing

As explained above, chaos testing has to solve the problem of the “unknown”, so its first prerequisite is finding better ways to discover these “unknown” items. The industry largely agrees on how to do this, and it comes down to two words: fault injection.
Most current work on chaos testing also focuses on building better fault injection tooling, for example:
1. Chaos Monkey, a tool developed by Netflix to test server stability. Its core idea is to deliberately take servers offline and then test how well the cloud environment recovers;
2. Chaos Mesh, an open-source, cloud-native chaos engineering platform from PingCAP that provides a rich set of fault simulation types and powerful fault scenario orchestration;
3. ChaosBlade, a chaos engineering tool open-sourced by Alibaba that follows the experimental principles of chaos engineering, provides rich fault scenario implementations, including injection of low-level faults, and helps distributed systems improve fault tolerance and recoverability.
These are excellent fault injection tools, and they are widely used for chaos testing in the industry. However, solving fault injection alone is far from enough; tools are only a supporting element. Doing chaos testing well requires engineering thinking in how it is designed and laid out. After repeated experiments, the MO test team arrived at the chaos testing architecture shown in the figure below:
[Figure: architecture of the MO test team’s chaos testing]
Core modules:
01 Fault injection behavior
Fault injection is carried out with fault injection tools. Its purpose is to identify and uncover as many potential triggers of “unknown” problems in the system under test as possible. Faults can be injected purely at random or according to a fault injection strategy. Our practice has shown that injecting faults according to a predefined strategy is more conducive to testing, because the strategy is how the blast radius is kept to a minimum while chaos tests run.
The definition of the fault injection strategy usually depends on the focus of the current chaos test. To repeat, chaos testing is bounded and purposeful, not blind. For example, to verify the impact of network packet loss on the transaction success rate, the fault injection strategy needs to be adjusted accordingly: increase the proportion of network delay faults and select the key fault injection points.
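A strategy of this kind can be sketched as a weighted draw over fault types combined with a cap on how many injection points are hit at once. The sketch below is only an illustration and not the MO team’s implementation: the fault kinds, the target names, and the `pick_faults` helper are all hypothetical, and it only produces a plan rather than calling any real fault injection tool.

```python
import random
from dataclasses import dataclass


@dataclass(frozen=True)
class Fault:
    kind: str    # hypothetical fault category, e.g. "network_delay"
    target: str  # hypothetical injection point, e.g. a node name


def pick_faults(weights, targets, rounds, max_targets_per_round=1, seed=None):
    """Draw a fault plan from a weighted strategy.

    weights: maps fault kind -> relative weight (its share of injections).
    targets: list of candidate injection points.
    max_targets_per_round: caps how many points are hit in one round,
        which keeps the blast radius of each round small.
    """
    if max_targets_per_round > len(targets):
        raise ValueError("max_targets_per_round exceeds number of targets")
    rng = random.Random(seed)
    kinds = list(weights)
    kind_weights = [weights[k] for k in kinds]
    plan = []
    for _ in range(rounds):
        # One fault kind per round, chosen in proportion to its weight.
        kind = rng.choices(kinds, weights=kind_weights, k=1)[0]
        # Distinct injection points, never more than the configured cap.
        chosen = rng.sample(targets, k=max_targets_per_round)
        plan.append([Fault(kind, t) for t in chosen])
    return plan


if __name__ == "__main__":
    # Hypothetical strategy that favors network delay faults.
    strategy = {"network_delay": 6, "network_loss": 2, "pod_kill": 1, "disk_io": 1}
    plan = pick_faults(strategy, ["node-a", "node-b", "node-c"], rounds=5, seed=42)
    for round_no, faults in enumerate(plan, start=1):
        print(round_no, faults)
```

Shifting the focus of a test run then means changing the weights and the target list, while the seed keeps a run reproducible for later analysis.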
02 Steady-state model definition
The steady-state model describes the valid state of the system under test. This state can be described by a set of quality element rules that the system under test must comply with. For example, if functional availability must satisfy r1, performance indicators must satisfy r2, and so on, then the steady-state model can be expressed as: M{r1, r2, …, rn}
In other words, no matter which faults are injected and which tests are run during chaos testing, the system under test must not violate any of the quality element rules, namely:
[Figure: formula expressing that every rule in M must hold]
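The requirement stated in the text can be written down in one possible form as follows. This is an illustrative reading, not necessarily the original formula: F (the set of injected faults), T (the set of executed tests), and the notation r_i(f, t) for evaluating rule r_i on the results observed under fault f and test t are symbols assumed here.

```latex
\forall f \in F,\ \forall t \in T,\ \forall r_i \in M:\quad r_i(f, t) = \text{true}
```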

For the definition of the steady-state model, the following principles need to be followed:
1. The selection of quality elements depends on the quality dimensions that product testing itself needs to cover;
2. A quality element rule does not have to be precise, but may be expressed as a range;
3. Quality element rules must be computable from test results;
4. Quality element rules are expressed quantitatively and can be compared by tools;
5. Quality element rules depend on the maturity of product capabilities and testing capabilities, and are optimized iteratively.
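A minimal sketch of such a model, assuming each rule is a range check on one metric computed from test results. The `Rule` type, the metric names, and the thresholds are hypothetical and chosen only for illustration.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List


@dataclass(frozen=True)
class Rule:
    name: str                       # quality element the rule covers
    metric: str                     # key to look up in the test results
    check: Callable[[float], bool]  # range-based, tool-comparable predicate


def in_range(low: float, high: float) -> Callable[[float], bool]:
    """Build a predicate that accepts values within [low, high]."""
    return lambda value: low <= value <= high


def evaluate(model: List[Rule], results: Dict[str, float]) -> List[str]:
    """Return the names of violated rules; an empty list means steady state."""
    violated = []
    for rule in model:
        value = results.get(rule.metric)
        # A metric that was never measured cannot be shown to hold,
        # so it is treated as a violation.
        if value is None or not rule.check(value):
            violated.append(rule.name)
    return violated


if __name__ == "__main__":
    # Hypothetical model M{r1, r2} with illustrative thresholds.
    model = [
        Rule("r1: availability", "success_rate", in_range(0.99, 1.0)),
        Rule("r2: performance", "p99_latency_ms", in_range(0.0, 500.0)),
    ]
    results = {"success_rate": 0.995, "p99_latency_ms": 730.0}
    print(evaluate(model, results))
```

Keeping each rule as a range rather than an exact value leaves room to tighten the bounds iteratively as product and testing capabilities mature.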
03 Real event simulation execution
The selection of test events depends on the definition of the steady-state model: the actual state of each quality element in the steady-state model must be computable from the results of the test events that were executed.
The execution of test events must also be automatable, which is a necessary prerequisite for chaos testing. Effective chaos testing therefore requires a certain maturity of testing capabilities, and in turn it drives those capabilities to mature further.
04 Behavior logging
Behavior logging is a core element that is easily overlooked. Without it, even if chaos testing finds a problem, there is no way to know clearly what triggered it, and a problem that cannot be located and solved is effectively still an unknown problem.
Therefore, the behaviors and information in a chaos test must be recorded clearly, such as fault injection points, fault recovery points, executed test items, and resource usage, so that there is enough material for locating and analyzing problems afterwards.
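One simple way to keep such a record is an append-only, timestamped event log that can later be queried by time window. The sketch below is an assumption-laden illustration: the `BehaviorLog` class, the event kinds, and the field names are hypothetical, and a real setup would also persist the log and collect resource usage.

```python
import json
import time


class BehaviorLog:
    """Append-only, timestamped record of what happened during a chaos run."""

    def __init__(self, clock=time.time):
        self._clock = clock  # injectable so runs can be replayed or tested
        self._events = []

    def record(self, kind, **details):
        """Store one event, e.g. a fault injection or an executed test item."""
        event = {"ts": self._clock(), "kind": kind, **details}
        self._events.append(event)
        return event

    def window(self, start, end):
        """Events whose timestamp falls within [start, end]."""
        return [e for e in self._events if start <= e["ts"] <= end]

    def dump(self):
        """Serialize the log as JSON Lines, one event per line."""
        return "\n".join(json.dumps(e, sort_keys=True) for e in self._events)


if __name__ == "__main__":
    log = BehaviorLog()
    log.record("fault_injected", fault="network_delay", target="node-a")
    log.record("test_executed", case="transfer_txn", passed=False)
    log.record("fault_recovered", fault="network_delay", target="node-a")
    print(log.dump())
```

When a test item fails, querying the window between the nearest injection and recovery events shows which faults were active at that moment, which is the material needed to locate the trigger.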
05 Reverse optimization
Using the results of chaos testing to optimize the test case set in reverse and to further improve the failure mode library is also one of the important purposes of chaos testing. How each of these links is designed in MatrixOne’s chaos testing practice will not be covered in detail here; other articles will share more.

Part 3 How to evaluate chaos testing

As for evaluating the maturity of chaos testing, the industry has done some exploration, such as Alibaba’s CMM model, but that model is too complicated and theoretical. With this in mind, the MO testing team has customized an evaluation model of its own, and the development and evolution of its chaos testing as a whole is carried out on the basis of this model.
[Figure: the MO testing team’s chaos testing evaluation model]