Distributed system testing — error injection

Time: 2021-11-25

This series is compiled from the live transcript of "A Deep Exploration of Distributed System Testing", shared by Liu Qi at the 26th PingCAP NewSQL Meetup. The article is long, so to make it easier to read it has been split into three parts; this is the middle part.

Continued:
Of course, testing may make your code less beautiful. For example:

[Figure: the Kubernetes DaemonSet controller code, with three injected test points]

This is code from the well-known Kubernetes project: its DaemonSet controller has three test points injected into it. For example, one place injects a handler; you can think of every injected point as an interface. Consider a trivial program where 1 + 1 = 2: suppose we write a calculator whose only function is summing, then it is hard to inject errors at all, so you have to inject test logic into your correct code. For example, someone calls your add function: will it ever return an error? The problem is that it may never return one on its own, so you have to inject the error by hand and see whether the application behaves correctly. After addition, let's do division. We all know division can raise an exception: is it handled properly? Maybe not: a test of 6 ÷ 3 passes and the report says coverage is 100%, yet a division-by-zero exception crashes the system. That is when you need to inject errors. The well-known Kubernetes uses a similar method to test its various exception paths: the struct is not long, about a dozen members, and three points are injected into it for injecting errors. A sketch of the pattern follows.
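
Here is a minimal Go sketch of that pattern (the names are illustrative, not the real DaemonSet controller's): the behavior a test needs to intercept lives in a struct field, so a test can swap in its own handler and force the error path to run.

```go
package main

import "fmt"

// Controller keeps its sync logic in a field so tests can replace it.
type Controller struct {
	// syncHandler is normally the real sync function; tests inject a
	// failing one here to exercise the error path.
	syncHandler func(key string) error
}

func NewController() *Controller {
	c := &Controller{}
	c.syncHandler = c.sync // default: the real implementation
	return c
}

func (c *Controller) sync(key string) error {
	fmt.Println("syncing", key)
	return nil
}

func (c *Controller) process(key string) {
	if err := c.syncHandler(key); err != nil {
		fmt.Println("sync failed:", err) // the error handling under test
	}
}

func main() {
	c := NewController()
	c.process("default/ds-1") // normal path

	// Test injection: make the handler fail and watch the behavior.
	c.syncHandler = func(key string) error {
		return fmt.Errorf("injected error for %s", key)
	}
	c.process("default/ds-1") // error path
}
```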

So how did we think about testing when designing TiDB? First of all, tests on the scale of a million cases cannot be written by hand. That is to say, if you define your own SQL dialect or query language, you need to build up a million-level test suite, and even if the whole company wrote tests for two years it would not be enough, so that approach is obviously unworkable. Unless, of course, your query language is particularly simple, like early MongoDB with its "greater than" or "equals" conditional queries; then you really don't need a million tests. But if you want to build an SQL database, you need this very, very complex test suite, and you can't have the whole company spend two years writing it, can you? So what is the way out? Any MySQL-compatible system can reuse MySQL's tests, so we made TiDB compatible with the MySQL protocol, which means we could obtain a huge number of MySQL tests. I don't know whether anyone has counted how many tests MySQL has; the number of production-level tests is frightening, in the tens of millions. Then there are the many ORMs, and the various applications that support MySQL all have their own tests. As we all know, every language builds its own ORMs, often several per language, and since MySQL typically ranks first or second in popularity, nearly all of them support it, so we could take them all and use them to test our system.

But some applications are more difficult: you have to set the application up, operate it, say WordPress, and then look at the results. To avoid the manual testing just described, we built automatic record and replay. The first time a program runs, we record all the SQL statements it executes. The next time we need to run the program, we don't need to run it at all, we don't even need to start it: we just replay the SQL recorded the first time, which is equivalent to simulating the program's whole behavior. That is how we automated this part; a sketch of the idea follows.
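
A minimal Go sketch of the record and replay idea, under our own assumptions (all names here are illustrative): on the first run every SQL statement the application issues is appended to a log, and later runs skip the application and replay the log directly.

```go
package main

import (
	"bufio"
	"fmt"
	"strings"
)

// fakeDB stands in for the real database connection.
type fakeDB struct{}

func (fakeDB) Exec(sql string) error {
	fmt.Println("exec:", sql)
	return nil
}

// replay feeds a previously recorded statement log back to the database,
// simulating the application's whole behavior without starting it.
func replay(db fakeDB, log string) error {
	s := bufio.NewScanner(strings.NewReader(log))
	for s.Scan() {
		if err := db.Exec(s.Text()); err != nil {
			return err
		}
	}
	return s.Err()
}

func main() {
	db := fakeDB{}
	var recorded strings.Builder

	// First run: execute the application's statements and record them.
	for _, sql := range []string{"INSERT INTO t VALUES (1)", "SELECT * FROM t"} {
		recorded.WriteString(sql + "\n") // record
		db.Exec(sql)                     // forward to the database
	}

	// Second "run": no application needed, just replay the log.
	replay(db, recorded.String())
}
```
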
So after saying all this, what have we actually done? All of these tests run on the correct path; the millions of tests are also on the correct path. But what about the error path? The typical question is how to do fault injection. For hardware it is simple and crude: to simulate a network failure, you can unplug the network cable. But this method is extremely inefficient and cannot scale, because it requires human participation.

Then take the CPU: the probability of CPU damage is actually quite high, especially on machines that are out of warranty. Then there are disks: the failure rate of disks is about 8.0% over three years, a figure given in a paper. I remember Google also published data on the failure rates of CPUs, network cards and disks over some number of years.

Another thing we don’t pay much attention to is the clock. Previously, we found that the system clock bounced back, and then we decided to add a monitoring module to the program. Once the system clock bounced back, we immediately detected it. Of course, when we first monitored this thing, the user thought it was impossible. Will the clock bounce back? I said it didn’t matter. First I opened our program to monitor it, and then it was detected after a period of time that the system clock recently bounced back. So how to match NTP is very important. Then there are more, such as the file system. Have you considered what happens when you write to the disk? OK, there is no error when writing the disk. It succeeds. Then one sector of the disk is broken and the read data is damaged. What should I do? Do you have a checksum? Without checksum, we directly use this data and return it directly to the user. At this time, it may be fatal. If this data just stores metadata, and the metadata points to other data, and then you write another data according to the metadata information, it will be even worse. The data may be further damaged.

So what’s the better way?

  • Fault injection

    • Hardware

      • disk error

      • network card

      • CPU

      • clock

    • Software

      • file system

      • network & protocol

  • Simulate everything

Simulate everything. If the disk is simulated and the network is simulated, then we can observe them, and you can inject all kinds of errors at any time, in any scenario, whatever error you want. For example, when you write to the disk, I can tell you the disk is full, or that the disk is broken; I can also make the write hang, say sleep for more than 50 seconds. We really did see this on the cloud: a write hung for 53 seconds and then finally went through. It must have been a network disk, right? This kind of thing is genuinely scary, and nobody plans for a disk write to take 53 seconds, but when those 53 seconds happen, what is the behavior of the whole program? TiDB uses Raft heavily, and here is what happened: after 53 seconds all the machines started voting, declared that something must be wrong, and re-elected all the leaders; then the node that had been stuck for 53 seconds said "I've finished writing", and by then the whole system state had already migrated away. What is the benefit of this kind of error injection? You know how serious an error can be when it happens, and it is predictable; the whole system should be predictable. If you haven't tested the error path, then here is a simple question: suppose the system goes down one of those error paths, what is the behavior of the whole system? Not knowing is what's scary. You don't know whether data might be destroyed, whether the business side will block, or whether the business layer will retry. A sketch of an injectable simulated disk follows.
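
Here is a minimal Go sketch of such an injectable simulated disk (the names are ours): the program writes through a Disk interface, and the simulator can answer with "disk full", an I/O error, or a long stall like the 53-second hang described above.

```go
package main

import (
	"errors"
	"fmt"
	"time"
)

// Disk is the abstraction the program writes through; in a simulator the
// real disk is replaced by one that can misbehave on demand.
type Disk interface {
	Write(p []byte) error
}

type realDisk struct{}

func (realDisk) Write(p []byte) error {
	fmt.Println("wrote", len(p), "bytes")
	return nil
}

// faultyDisk wraps a Disk and injects failures before delegating.
type faultyDisk struct {
	inner Disk
	stall time.Duration // injected delay before the write completes
	err   error         // injected error, e.g. "disk full"
}

func (d faultyDisk) Write(p []byte) error {
	if d.stall > 0 {
		time.Sleep(d.stall) // what does Raft do while we hang here?
	}
	if d.err != nil {
		return d.err
	}
	return d.inner.Write(p)
}

func main() {
	// Use 53*time.Second to mimic the real incident; shortened here.
	d := faultyDisk{inner: realDisk{}, stall: 2 * time.Second}
	fmt.Println("result:", d.Write([]byte("hello")))

	full := faultyDisk{inner: realDisk{}, err: errors.New("injected: disk full")}
	fmt.Println("result:", full.Write([]byte("hello")))
}
```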

I ran into an interesting problem in the past. We were building a message-push system, with a huge number of connections to a single machine, about 800,000. And I remember swap was enabled at the time. What does that mean? When more connections come in, your memory blows up, right? When memory blows up, swap kicks in automatically, and once swap kicks in the system slows to a crawl. An external user disconnects after failing and has to reconnect, but it may take another 30 seconds before the program can respond normally, so the user again decides it has timed out, cuts the connection and reconnects. What state does that lead to? The system is forever retrying and never succeeds. Was that behavior predictable? Had that error path been tested well at the time? These are very important lessons.

Hardware testing used to be done like this (joke):
[Figure: manual hardware fault testing]

Suppose one of my disks is broken, or one of my machines is down, or something in between that is neither clearly broken nor clearly hung. For instance, what happens when a machine catches fire? A couple of months ago, at a bank in Switzerland or thereabouts, a guy did something quite funny: he blew air at the servers by hand to watch the monitoring data change, and the alarms went off immediately. And that was just blowing air. For more complex tests, for example where the fire starts, whether it reaches the hard disk first or the network card first, the results may differ too. Of course the cost of this is very high, it is not a scheme that can scale, and it is also hard to reproduce.

This is not just hardware monitoring; it is fault injection too. For example, what happens if I burn down a whole cluster right now? Fire is the classic case: important data centers have all kinds of fire-prevention and waterproofing strategies, but what do we do when a fire actually breaks out? Of course you can't really set one; a real fire could destroy more than one machine. So we need fault injection to simulate it.

Let me introduce what fault injection is with an intuitive example. We have all used UNIX or Linux systems, and for many people the first command after opening a terminal is ls, to list the files in a directory. But have you ever thought about an interesting question: how would you test that the ls command is implemented correctly? If there is no source code, how should it be tested? If you treat it as a black box, how should it be tested? What if a disk error occurs while ls is running? What happens if reading a sector fails?
Here is a very fun tool I recommend you play with. Before doing deeper testing, you can use it to understand what fault injection is and to experience its power firsthand. Later we will use it to find a MySQL bug.

libfiu – Fault injection in userspace

It can be used to perform fault injection in the POSIX API without having to modify the application's source code, which helps test failure handling in an easy and reproducible way.

This tool mainly hooks those POSIX APIs. Very importantly, it also provides a library that can be embedded into your own program to hook those APIs. For example, when you read a file, it can tell you the file does not exist, or return a disk error, and so on. Most importantly, the failures are reproducible. Below is a rough Go analogue of the idea.
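
libfiu itself hooks the POSIX API of a C program; in this Go sketch we can only wrap our own I/O helper, but the idea is the same: with probability p, return an injected failure instead of doing the real read. Everything here is our own illustration, not libfiu's API.

```go
package main

import (
	"errors"
	"fmt"
	"math/rand"
	"os"
)

var errInjected = errors.New("injected I/O error")

// readFileWithFaults fails with probability p, otherwise really reads.
func readFileWithFaults(path string, p float64) ([]byte, error) {
	if rand.Float64() < p {
		return nil, errInjected
	}
	return os.ReadFile(path)
}

func main() {
	for i := 0; i < 5; i++ {
		_, err := readFileWithFaults("/etc/hosts", 0.05) // 5% failure rate
		fmt.Println("attempt", i, "error:", err)
	}
}
```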

For example, under normal circumstances, typing the ls command lists the current directory.

[Figure: ls run repeatedly under fiu-run with a 5% random I/O failure rate]

What does this run do? We run ls under fiu-run with a parameter, enable_random, which means all of the POSIX I/O APIs underneath fail with 5% probability. The first time we were lucky and hit no failure, so the whole directory was listed. Then we ran it again; this time one read failed, hitting a bad file descriptor while reading the directory, and you can see fewer files listed than above, because one path was made to fail. Going a step further, a directory had just been listed when the next read returned an error. Then on a later run we were lucky again and listed everything. And this is only a simulated 5% failure rate: each read or open has a 5% probability of failing. You can see that the behavior of the ls command is still very stable; in particular, no segmentation faults.

You may say this isn't very exciting, just checking whether the ls command has bugs. So let's reproduce a MySQL bug and play with that.

Bug #76020

InnoDB does not report filename in I/O error message for reads

fiu-run -x -c "enable_random name=posix/io/*,probability=0.05" bin/mysqld --basedir=/data/ushastry/server/mysql-5.6.24 --datadir=/data/ushastry/server/mysql-5.6.24/76020 --core-file --socket=/tmp/mysql_ushastry.sock --port=15000

2015-05-20 19:12:07 31030 [ERROR] InnoDB: Error in system call pread(). The operating system error number is 5.

2015-05-20 19:12:07 7f7986efc720 InnoDB: Operating system error number 5 in a file operation.

InnoDB: Error number 5 means ‘Input/output error’.

2015-05-20 19:12:07 31030 [ERROR] InnoDB: File (unknown):

‘read’ returned OS error 105. Cannot continue operation

This is a MySQL bug found with libfiu, bug number 76020: InnoDB did not report the file name when an I/O error occurred on read. A user reports this error to you and you're left dumbfounded, right? What went wrong? Where did it come from? You can see it was produced with the fiu-run we just introduced, simulating failures with the same probability; the parameters have not changed. Start mysqld under it, run, and there it is: InnoDB does not report the filename, only "File (unknown): 'read' returned OS error", so you have no idea which file it was.

To put it another way, suppose this tool did not exist: what would it cost to reproduce this bug? Think about how you would make MySQL's reads fail without it. On the normal path it is extremely hard to make a read fail; the bug might not appear for many years. And zooming out, the bug is also present in MySQL 5.7, so it is likely this bug went unencountered for more than ten years, yet with this tool it can be found immediately. This is a very important advantage that fault injection brings: it makes a problem much easier to reproduce. And this was with a simulated 5% probability; I put this example together last night to give you an intuitive feel, but error injection in distributed systems is more complex than this. Besides, if you hit a mistake that hasn't appeared in ten years, aren't you terribly lonely? You may remember the film: Will Smith stars as the last man living alone in the world, whose only companion is a dog.

[Figure: still from the film I Am Legend]

In fact, that's not quite true: people in even more pain than us do exist.

Take Netflix as an example. The following figure shows the Netflix system.

[Figure: the Netflix microservice call graph; blue boxes are injection points, black lines are network calls]

In October 2014 they wrote a blog post called "Failure Injection Testing", describing how their whole system does error injection. They called it Internet scale, meaning the scale of the whole multi-data-center Internet; you may remember that when Spanner first came out, Google called theirs global scale. In the figure, blue marks the injection points and black marks the network calls carrying all these requests; under these conditions, any of those blue boxes can be made to fail. Think about it: in a microservice architecture, one business request may involve dozens of system calls. What happens if the first one fails? The second? The third? Has any system ever been tested like that? Whether your own program contains machinery to verify that every foreseeable error is actually predictable becomes very important. Take the cache as an example: every visit to Cassandra may produce an error, so that is where an injection point goes. A sketch of such a call-boundary injection point follows.
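
Here is a Go sketch of an injection point at a service-call boundary, in the spirit of Netflix's failure injection testing. This is only the shape of the idea, not Netflix's actual API: every outbound call goes through one chokepoint where a test can register a failure per downstream service.

```go
package main

import (
	"errors"
	"fmt"
)

// injectedFaults maps a downstream service name to an injected error.
var injectedFaults = map[string]error{}

// call routes every outbound request through the injection point.
func call(service string, real func() error) error {
	if err, ok := injectedFaults[service]; ok {
		return err // the downstream is "failing" for this test
	}
	return real()
}

func main() {
	// Normal path: the Cassandra call succeeds.
	fmt.Println(call("cassandra", func() error {
		fmt.Println("querying cassandra")
		return nil
	}))

	// Injected path: what does the caller do when Cassandra errors?
	injectedFaults["cassandra"] = errors.New("injected: cassandra unavailable")
	fmt.Println(call("cassandra", func() error { return nil }))
}
```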

Then let's talk about OpenStack.

OpenStack fault-injection library:

https://pypi.python.org/pypi/os-faults/0.1.2

The famous OpenStack also has a fault injection library; I'm posting the example here, and if you are interested you can take a look at OpenStack's fault injection. We may not have paid much attention to this in the past, but it is in fact a real pain point. Plenty of people complain about OpenStack's poor stability these days; in fact they have worked very hard, the whole system is just unusually complex because there are so many components. And injecting lots of errors brings another problem: the error points can be combined. A fails first, then B fails, or A and B fail together; with just a few cases that's fine, but if you have 100,000 error points, how do you handle the combinations (a toy illustration follows)? There are new papers studying this; I believe a 2015 paper detects the program's execution paths and injects errors along the corresponding path.
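
A toy Go illustration of why combined fault points blow up (illustrative bookkeeping only, not the os-faults API): with n injection points, even just the ordered pairs "A fails, then B fails" number n×(n-1), before considering longer failure sequences or simultaneous failures.

```go
package main

import "fmt"

func main() {
	points := []string{"A", "B", "C"}
	pairs := 0
	for _, first := range points {
		for _, second := range points {
			if first == second {
				continue
			}
			fmt.Printf("inject %s, then %s\n", first, second)
			pairs++
		}
	}
	// 100,000 points would give ~10 billion ordered pairs.
	fmt.Printf("%d points -> %d ordered pairs\n", len(points), pairs)
}
```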

Let’s talk about Jepsen

Jepsen: Distributed Systems Safety Analysis

[Figure: Jepsen]

Basically every well-known open-source distributed system you have heard of has had bugs found by it. Before that, everyone thought they were fine and their systems were fairly stable; when a new tool or new method appears, like the path-based error-injection paper I just mentioned, its bug-finding power is amazing, because it detects errors for you automatically. Later I will also introduce another very powerful tool, NAMAZU. But let's start with Jepsen, which is the heavy weapon: it has found bugs in ZooKeeper, MongoDB, Redis and virtually every database in use today. Its biggest problem is that it is written in Clojure, a niche language, which makes it a bit troublesome to extend. Let me first describe Jepsen's basic principle: a typical Jepsen test runs the relevant Clojure program on a control node, and the control node logs into the system nodes (which Jepsen calls DB nodes) over SSH to perform test operations.

When our distributed system is started, the control node starts many processes, and each process can use a specific client to access our distributed system. A generator produces a series of operations for each process to execute, such as get / set / cas, and every operation is recorded in the history. While the operations are running, a separate nemesis process tries to disrupt the distributed system, for example cutting network connections with iptables. When all operations are finished, Jepsen uses a checker to analyze the history and verify whether the system's behavior meets expectations. Tang Liu, chief architect at PingCAP, has written two articles on how we actually use Jepsen to test TiDB; you can search for them, I won't expand on it here. A rough sketch of the architecture follows.
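
Here is a rough Go sketch of that architecture (Jepsen itself is Clojure; everything here is illustrative): several client processes run generated get/set/cas operations and record them into a shared history, a nemesis would disturb the system meanwhile, and a checker examines the history afterwards.

```go
package main

import (
	"fmt"
	"math/rand"
	"sync"
)

type op struct {
	proc int
	kind string
	ok   bool
}

func main() {
	var (
		mu      sync.Mutex
		history []op
		wg      sync.WaitGroup
	)
	kinds := []string{"get", "set", "cas"}

	for proc := 0; proc < 3; proc++ {
		wg.Add(1)
		go func(proc int) { // one client process
			defer wg.Done()
			for i := 0; i < 5; i++ {
				// The generator would produce these ops; we fake results.
				o := op{proc: proc, kind: kinds[rand.Intn(len(kinds))], ok: rand.Intn(10) > 0}
				mu.Lock()
				history = append(history, o) // every operation is recorded
				mu.Unlock()
			}
		}(proc)
	}
	// A nemesis would run here, e.g. cutting links with iptables over SSH.
	wg.Wait()

	// Checker: a real one verifies the history against a model
	// (linearizability etc.); here we only count failed operations.
	failed := 0
	for _, o := range history {
		if !o.ok {
			failed++
		}
	}
	fmt.Printf("history: %d ops, %d failed\n", len(history), failed)
}
```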

  • FoundationDB

    • It is difficult to be deterministic

      • Random

      • Disk Size

      • File Length

      • Time

      • Multithread

FoundationDB is the pioneer here; it was acquired by Apple in 2015. They did a great deal of work to solve the problem of error injection, or rather of reproducing errors. The key idea is determinism: if I give you the same input and run it several times, do I get the same output? That sounds perfectly natural and scientific, but in fact most of our programs cannot do it. Does your program use random numbers? Does it use multiple threads? Does it check disk space? Does it check the time? Will you get the same answer if you check again? You run again with the same input, but the behavior is different: for example, you generate a random number, or you check the disk space and the result differs from the previous check.

So they spent about two years building a library in pursuit of "give me the same input and I will give you the same output". The library is single-threaded with pseudo-concurrency. Why? Because if you use multiple threads, how do you turn the same input into the same output? Who acquires the lock first? There are too many problems, so they chose a single thread, even though single threads bring problems of their own. The language matters too: in Go, for example, even a single-threaded program is concurrent, and the language specification says that if a select acts on two channels and both channels are ready, one is chosen at random; so at the level of the language specification you cannot get determinism. Fortunately, FoundationDB is written in C++. The snippet below demonstrates that select nondeterminism.
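
The Go specification says that when more than one case of a select is ready, one is chosen uniformly at random. Both channels below are ready, so repeated runs of this program can print different results: determinism is lost at the language-specification level.

```go
package main

import "fmt"

func main() {
	a := make(chan string, 1)
	b := make(chan string, 1)
	a <- "from a"
	b <- "from b"

	// Both cases are ready; the spec mandates a pseudo-random choice,
	// so the same input does not guarantee the same output.
	select {
	case v := <-a:
		fmt.Println(v)
	case v := <-b:
		fmt.Println(v)
	}
}
```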

  • FoundationDB

    • Single-threaded pseudo-concurrency

    • Simulated the implementation of all the external communication

    • Determinism

    • Disasters happen more frequently here than in the real world.

In addition, FoundationDB simulates all external communication: two components think they are talking over the network, but in fact they talk through a layer FoundationDB simulates itself. There is a very important point here. If disk damage has an 8% probability over three years, then for a user it occurs with that same 8%-over-three-years probability, but when it does occur it can prove very serious. So how do they deal with this? They make it happen constantly in their own simulation: they generate disk damage every two minutes, which makes it hundreds of thousands of times more likely than in reality; as they put it, disasters happen more frequently there than in the real world. What is the probability of a network card being damaged? Extremely low, but with this system you can generate it every minute. This way your system encounters these errors much, much more often than in reality, and then you can reproduce them: an error that reproduces once every three years in reality might reproduce every 30 seconds. A sketch of the time-compression idea follows.
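
A minimal Go sketch of the time-compression idea (our own illustration): instead of waiting years for a real disk to fail, the simulator fires a failure event on a short, fixed timer. FoundationDB's figure above is every couple of minutes; we shorten it further so the example terminates quickly.

```go
package main

import (
	"fmt"
	"time"
)

func main() {
	faults := time.NewTicker(200 * time.Millisecond) // "every two minutes", compressed
	defer faults.Stop()

	for round := 1; round <= 3; round++ {
		<-faults.C
		fmt.Println("round", round, ": injecting simulated disk failure")
		// ...run the workload here and check that it survives...
	}
}
```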

But what is the most terrible thing about a bug? That it cannot be reproduced. I found a bug, I later say I fixed it, but it cannot be reproduced, so did I fix it? Nobody knows; that is very frightening. With determinism, reproduction is guaranteed: I record every input, and as long as the bug has appeared once, I replay the input and it appears again. Of course the price is very high, so the academic community is now taking another road: not complete determinism, just something reasonable. For example, being able to reproduce a bug within 30 minutes is good enough; it doesn't need to reproduce within three seconds. Every step further costs correspondingly more.

To be continued