Distributed system testing — destruction and reconstruction of confidence

Time: 2021-11-24

This series is compiled from the live recording of the talk "Deep Exploration of Distributed System Testing" given by Liu Qi at the 26th PingCAP NewSQL Meetup. Because the material is long, it has been split into three parts for easier reading; this is the final part.

Continued from part II:
ScyllaDB has open-sourced a tool dedicated to fault injection for file systems, called CharybdeFS. If you want to test your system, you can use it to simulate all kinds of file system failures: writes to disk fail, the driver fails to allocate memory, the file already exists, and so on.

CharybdeFS: A new fault-injecting file system for software testing

CharybdeFS can simulate the following errors (a small sketch of how they surface to application code follows this list):

  • disk IO error (EIO)

  • driver out of memory error (ENOMEM)

  • file already exists (EEXIST)

  • disk quota exceeded (EDQUOT)
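Whichever injector you use, your application code needs an explicit answer for each of these errors. Below is a minimal Rust sketch of what that handling might look like; the mount point, file name, and policy are hypothetical, and the numeric errno values in the comments are the usual Linux ones, not anything specific to CharybdeFS.

```rust
use std::fs::File;
use std::io::{self, Write};

// Append a record so that every error a fault injector can return maps to an
// explicit decision instead of a blind unwrap(). Path and policy are made up.
fn append_record(path: &str, record: &[u8]) -> io::Result<()> {
    let mut file = File::options().create(true).append(true).open(path)?;
    if let Err(e) = file.write_all(record) {
        return match e.raw_os_error() {
            // EIO (5 on Linux): disk I/O error; propagate it so the caller
            // fails the request instead of silently losing the write.
            Some(5) => Err(e),
            // ENOSPC (28) / EDQUOT (122): out of space or quota exceeded;
            // tag it so the caller can stop accepting writes and alert.
            Some(28) | Some(122) => Err(io::Error::new(
                io::ErrorKind::Other,
                format!("storage full, refusing further writes: {e}"),
            )),
            _ => Err(e),
        };
    }
    // fsync can fail under injection too; it must not be ignored.
    file.sync_all()
}

fn main() {
    // With CharybdeFS (or any injector) mounted over /mnt/faulty (a
    // hypothetical mount point), this exercises the error paths above.
    if let Err(e) = append_record("/mnt/faulty/wal.log", b"entry-1\n") {
        eprintln!("write failed as injected: {e}");
    }
}
```

Run the same test once against a normal directory and once against the fault-injecting mount; the difference in behavior is exactly what your assertions should pin down.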

Now let's look at Cloudera. The following figure shows the overall structure of Cloudera's fault injection framework.

[Figure: Cloudera's fault injection framework]

On one side are the tools, and on the other is the layered structure: a cluster contains many hosts, and various services run on each host. The framework is mainly used to test HDFS, which itself puts a lot of effort into effective testing. An AgenTEST agent is deployed on every machine to inject the possible faults.

Let's see how powerful it is.

Cloudera: Simulate the following errors:

  • Packet loss/corrupt/reorder/duplicate/delay

  • Bandwidth limit: Limit the network bandwidth for the specified address and port.

  • DNSFail: Apply an injection to let the DNS fail.

  • FLOOD: Starts a DoS attack on the specified port.

  • BLOCK: Blocks all the packets directed to 10.0.0.0/8 (used internally by EC2).

  • SIGSTOP: Pause a given process in its current state.

  • BurnCPU/BurnIO/FillDISK/RONLY/FillMEM/CorruptHDFS

  • HANG: Hang a host by running a fork bomb.

  • PANIC: Force a kernel panic.

  • Suicide: Shut down the machine.

Packets can be dropped, corrupted, and reordered. For example, if you send A and then B, the tool can reorder them so the other side receives B and then A, and you check whether your application handles that correctly. It can also duplicate packets, delivering the same message twice, and it can delay them. These are relatively simple, and most of them are already implemented in TiKV's test framework. Then there is bandwidth limiting, for example squeezing your bandwidth down to 1 Mbps. We once ran into this with Redis: the instance was shared by many users, a single user could fill up its entire bandwidth, and everyone else's access became painfully slow. What is Redis's behavior under that kind of congestion? You don't need a user to actually saturate the link; with this tool you can create the situation instantly, say by limiting the bandwidth to 1% of the original and assuming everyone else is competing with you for the rest, and then observe how your program behaves. The answer comes out right away, without building a very complex environment. This greatly improves testing efficiency and lets you cover many corner cases at once.
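To give a feel for what "reorder, duplicate, drop" means to the receiving side, here is a toy Rust sketch of a chaos channel that disturbs an in-memory message queue before delivery; the probabilities and the crude pseudo-random generator are arbitrary choices for the sketch, not how any of the tools above actually work.

```rust
use std::collections::VecDeque;
use std::time::{SystemTime, UNIX_EPOCH};

// Crude pseudo-random generator so the sketch has no dependencies.
struct Lcg(u64);
impl Lcg {
    fn next(&mut self) -> u64 {
        self.0 = self.0.wrapping_mul(6364136223846793005).wrapping_add(1442695040888963407);
        self.0 >> 33
    }
}

// A toy "chaos channel": before delivery, each message may be dropped,
// duplicated, or swapped with its neighbor, the same kinds of disturbance
// the injection tools apply on a real network.
fn disturb<T: Clone>(mut pending: VecDeque<T>, rng: &mut Lcg) -> Vec<T> {
    let mut out = Vec::new();
    while let Some(msg) = pending.pop_front() {
        match rng.next() % 10 {
            0 => continue, // drop the message entirely
            1 => {
                out.push(msg.clone()); // deliver a duplicate as well
                out.push(msg);
            }
            2 => {
                // reorder: deliver the following message first, if any
                if let Some(next) = pending.pop_front() {
                    out.push(next);
                }
                out.push(msg);
            }
            _ => out.push(msg), // deliver normally
        }
    }
    out
}

fn main() {
    let seed = SystemTime::now().duration_since(UNIX_EPOCH).unwrap().as_nanos() as u64;
    let mut rng = Lcg(seed);
    let sent: VecDeque<u32> = (1..=10).collect();
    let received = disturb(sent, &mut rng);
    // In a real test the assertion is: does the system under test still
    // behave correctly when it sees this sequence?
    println!("receiver saw: {received:?}");
}
```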

Then DNS failure. What happens when DNS fails? Have you ever tested it? You may never have thought about it, but in a real distributed system every single point can go wrong. And FLOOD: suppose you are under a DoS attack right now, how does the whole system behave? And what if you accidentally get blocked by iptables? We really hit this: as soon as we started up, 20,000 connections were opened at once, and then we found that most of them could not connect. It turned out iptables had automatically triggered a protection mechanism and blocked us. It took us about half an hour to find the cause, but this is something that really should be considered at design time.

What if your process is suspended? For example, everyone runs in VMs on the cloud; to do an upgrade, the whole VM may be paused first, so what happens when you are resumed after the upgrade? Put more simply, if your program has GC and a GC pause freezes it for five seconds, is its behavior still correct? What about fifty seconds? This is a very interesting question.

BurnCPU means writing another program that occupies all the CPU, leaving your program only a small share, and then checking whether it still behaves correctly. Normally you might say CPU is not my bottleneck, IO is; but when someone steals your CPU and CPU does become the bottleneck, is your program still correct? BurnIO does the same for disk, competing for your read and write bandwidth. FillDISK fills the disk until there is almost no space left to write; for example, what happens to the database when the disk is full while you are writing the redo log? RONLY suddenly makes the disk read-only so writes start failing, but are reads still served correctly? A typical example: if a write hits a full disk, can external read requests still be answered normally? FillMEM squeezes memory in an instant so that your next malloc may fail. CorruptHDFS is business-specific: it corrupts HDFS files. And then there are HANG, PANIC, and Suicide, which simply shuts the machine down. In every case the question is the same: what does the whole system do?
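To make the BurnCPU idea concrete, here is a minimal Rust sketch that simply spins a busy loop on every core for a while; run it next to the process under test and observe how that process behaves while it is starved of CPU. The duration and the way it is launched are arbitrary choices for the sketch.

```rust
use std::thread;
use std::time::{Duration, Instant};

// Minimal "BurnCPU" sketch: occupy every core with a busy loop for a while,
// so the process under test has to live with almost no CPU.
fn main() {
    let cores = thread::available_parallelism().map(|n| n.get()).unwrap_or(1);
    let burn_for = Duration::from_secs(60); // arbitrary duration for the sketch
    let handles: Vec<_> = (0..cores)
        .map(|_| {
            thread::spawn(move || {
                let start = Instant::now();
                let mut x: u64 = 0;
                while start.elapsed() < burn_for {
                    // Busy work the optimizer cannot remove entirely.
                    x = x.wrapping_mul(6364136223846793005).wrapping_add(1);
                    std::hint::black_box(x);
                }
            })
        })
        .collect();
    for h in handles {
        h.join().unwrap();
    }
}
```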

The painful part right now is that everyone does this on their own. Each project builds its own set of tools, and there is no common one that all of us can share. We built our own set as well, but it really cannot be shared across languages. The libfiu library mentioned earlier is written in C, so anything C-related can call it.

Distributed testing

  • Namazu

    • ZooKeeper:

      • Found ZOOKEEPER-2212, ZOOKEEPER-2080 (race): (blog article)

    • Etcd:

      • Found etcdctl bug #3517 (timing specification), fixed in #3530. The fix also provided a hint for #3611. Reproduced flaky tests {#4006, #4039}

    • YARN: Found YARN-4301 (fault tolerance), Reproduced flaky tests {1978, 4168, 4543, 4548, 4556}

Now Namazu. Everyone assumes ZooKeeper is very stable: Facebook uses it, Alibaba uses it, JD uses it. We all thought it was rock solid, until this tool appeared and bugs suddenly became easy to find. In fact, every system we consider especially stable still has plenty of bugs. It shatters your worldview: things you believe are stable simply are not. Above you can see several etcd bugs found with Namazu, and several in YARN; there are others as well.

How TiKV uses Namazu

  • Use nmz container / non-container mode to disturb cluster.

    • Run container mode in CI for each commit. (1 hour)

    • Run non-container mode for a stable version. (1 week+)

  • Use extreme policy for process inspector

    • Pick some processes and run them with the SCHED_RR scheduler; the others run with the SCHED_BATCH scheduler

  • Use [0, 30s] delay for filesystem inspector

Next, some of our experience using Namazu with TiKV. On the cloud we once saw a single write to disk take more than 50 seconds, so we need dedicated tools to simulate disk jitter: occasionally a write just takes a very long time, and we want to know whether that is handled. If you put all of these tools to use, I think you can find a lot of bugs in many open source systems.

Let me briefly describe the basic policy we run now. For example, we use a delay of 0 to 30 seconds for the filesystem inspector: every time the process interacts with the file system, whether reading or writing, a random delay of 0 to 30 seconds is injected. Beyond that, we still need to test delays of 30 seconds up to a few minutes, to see whether the whole system collapses.
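This is not Namazu's actual configuration; it is just a small Rust sketch of the same idea, a test helper that sleeps for a random interval in [0, 30s) before every file system call. The helper name and the crude pseudo-random source are invented for the sketch.

```rust
use std::fs;
use std::io;
use std::thread;
use std::time::{Duration, SystemTime, UNIX_EPOCH};

// Inject a random delay in [0, 30s) before touching the file system,
// imitating an inspector that sits between the process and the disk.
fn jittered<T>(op: impl FnOnce() -> io::Result<T>) -> io::Result<T> {
    // Crude pseudo-randomness so the sketch stays dependency-free.
    let nanos = SystemTime::now()
        .duration_since(UNIX_EPOCH)
        .unwrap()
        .subsec_nanos() as u64;
    let delay_ms = nanos % 30_000;
    thread::sleep(Duration::from_millis(delay_ms));
    op()
}

fn main() -> io::Result<()> {
    // Every read or write goes through the jitter wrapper.
    jittered(|| fs::write("/tmp/jitter-test", b"hello"))?;
    let data = jittered(|| fs::read("/tmp/jitter-test"))?;
    println!("read back {} bytes after a random delay", data.len());
    Ok(())
}
```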

How TiKV simulates network transport

  • Drop/Delay messages randomly

  • Isolate Node

  • Partition [1, 2, 3, 4, 5] -> [1, 2, 3] + [4, 5]

  • Out of order messages

  • Filter messages

  • Duplicate and send redundant messages

How do we simulate the network? Suppose you have a cluster of five machines and you want to create a split brain. What can you do? You can't go and pull the network cable. In the TiKV test framework we can split the five nodes into two groups directly through an API, so that nodes 1, 2 and 3 can reach each other, nodes 4 and 5 can reach each other, and the two partitions are isolated from each other. It is very convenient. The principle is simple: the situation is simulated by the program itself. If the packets you send are silently dropped, or you are told directly that the peer is unreachable, then you know the network is partitioned, and the question becomes what you do next. We can also allow only certain types of messages through and discard the rest, which guarantees that certain bugs will reproduce. This framework gives us great confidence: we can simulate and reproduce all kinds of corner cases and make sure they are covered every time in unit tests.
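TiKV's real test transport is more elaborate, but the core idea can be sketched in a few lines: wrap the send path in a chain of filters that decide, per message, whether it may be delivered. The types and names below are invented for this sketch and are not TiKV's actual API.

```rust
use std::collections::HashSet;

// A message between nodes; only the fields the filters need.
#[derive(Clone, Debug)]
struct Msg {
    from: u64,
    to: u64,
    kind: &'static str,
}

// A filter looks at a message and says whether it may be delivered.
trait Filter {
    fn allow(&self, msg: &Msg) -> bool;
}

// Split the cluster into two groups; messages crossing the cut are dropped.
struct Partition {
    group_a: HashSet<u64>,
}

impl Filter for Partition {
    fn allow(&self, msg: &Msg) -> bool {
        self.group_a.contains(&msg.from) == self.group_a.contains(&msg.to)
    }
}

// Only let one kind of message through, drop everything else.
struct OnlyKind(&'static str);

impl Filter for OnlyKind {
    fn allow(&self, msg: &Msg) -> bool {
        msg.kind == self.0
    }
}

fn deliver(filters: &[Box<dyn Filter>], msg: &Msg) -> bool {
    filters.iter().all(|f| f.allow(msg))
}

fn main() {
    // Partition [1, 2, 3, 4, 5] into [1, 2, 3] and [4, 5].
    let filters: Vec<Box<dyn Filter>> = vec![
        Box::new(Partition { group_a: HashSet::from([1, 2, 3]) }),
        Box::new(OnlyKind("heartbeat")),
    ];
    let a = Msg { from: 1, to: 2, kind: "heartbeat" };
    let b = Msg { from: 3, to: 4, kind: "heartbeat" };
    println!("1 -> 2 delivered: {}", deliver(&filters, &a)); // true
    println!("3 -> 4 delivered: {}", deliver(&filters, &b)); // false, crosses the cut
}
```

Removing the partition filter "heals" the network, so a test can alternate between split and healed states and assert that the system converges each time.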

How to test RocksDB

  • Treat storage as a black box.

  • Three steps (running 7×24):

    • Fill data, Random kill -9

    • Restart

    • Consistency check.

  • Results:

    • Found 2 bugs. Both fixed

Now let's talk about how we test RocksDB. RocksDB is very stable in everyone's impression, but we recently found two bugs in it. The method is: keep filling data into RocksDB, kill -9 it at a random moment, restart it, and check whether the data written before the kill is still consistent. This found two bugs that could cause data loss; the upstream response was very fast and both were fixed within a few days. But if such a stable system yields bugs this easily, why weren't they found earlier? Simply because this path had never been covered by such a test; if it had been, these two bugs would probably have been found long ago.

This is our most basic test: treat the storage purely as a black box. Testing a database is basically black box testing. Take MySQL: write data and then kill it. If I commit a transaction and the database tells me the commit succeeded, then after I kill the database I should still be able to read the data I just committed. That is the correct behavior; if the data cannot be found, the whole system has a problem.
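A minimal sketch of such a kill-and-check loop using only std::process is shown below; ./server, ./fill-data, and ./check-consistency are placeholder commands, and a real harness would verify the actual data acknowledged before the kill rather than just an exit status.

```rust
use std::process::{Child, Command};
use std::thread;
use std::time::{Duration, SystemTime, UNIX_EPOCH};

// Black-box crash loop: start the server, let a workload fill data for a
// random period, kill -9 the server, restart it, and run a consistency check.
fn start(cmd: &str) -> Child {
    Command::new(cmd)
        .spawn()
        .unwrap_or_else(|e| panic!("failed to start {cmd}: {e}"))
}

fn random_secs(max: u64) -> u64 {
    // Crude pseudo-randomness so the sketch stays dependency-free.
    let nanos = SystemTime::now().duration_since(UNIX_EPOCH).unwrap().subsec_nanos();
    u64::from(nanos) % max
}

fn main() {
    loop {
        let mut server = start("./server");
        let mut filler = start("./fill-data");

        // Let it run for up to ten minutes, then SIGKILL the server mid-write.
        thread::sleep(Duration::from_secs(random_secs(600)));
        server.kill().ok();
        server.wait().ok();
        filler.kill().ok();
        filler.wait().ok();

        // Restart and verify that everything acknowledged before the kill
        // is still there and consistent.
        let mut restarted = start("./server");
        let check = Command::new("./check-consistency").status().expect("check did not run");
        assert!(check.success(), "consistency check failed after crash");

        restarted.kill().ok();
        restarted.wait().ok();
    }
}
```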

More tools

  • american fuzzy lop


There are actually some more advanced tools. Think of the software we usually consider particularly stable and unbreakable: nginx, ntpd, tcpdump, LibreOffice (students who use Linux will know these), as well as Flash and SQLite. When this tool appeared, everyone got very excited: how did it find so many bugs at once, and why were these previously stable systems suddenly so fragile?

The tool is actually quite clever. Suppose your program has an if branch. If the program were just 100 straight-line instructions, it runs straight through; when it reaches a branch, the tool keeps exploring. If one side of the branch has not been reached yet, it keeps feeding random input until the branch is finally taken, and then it records that input: next time it knows this input gets it into that branch, and it can continue from there. If there is another if inside that branch, traditional methods may never reach it, but this tool does. It remembers how to get in, starts over, goes in again, and every time it discovers a new branch it remembers it and digs deeper.

So when it came out, everyone said it was really powerful, it found so many bugs at once. But the most excited people were not testers, they were hackers. Why? Because suddenly a lot of stack overflow and heap overflow vulnerabilities were exposed, and you could write a pile of tools to attack all the systems running out there. Many of the early advances here were in fact made by hackers, whose goal was not necessarily to find bugs but to break into systems. Either way, this tool is very powerful and interesting, and you can take it and study your own system with it.
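To make the branch-exploration idea concrete, here is a tiny, deliberately buggy Rust target of the kind a coverage-guided fuzzer chews through. Blind random input almost never reaches the innermost branch, but a fuzzer that keeps every input that unlocked a new branch gets there step by step. The byte values and the panic are inventions for the sketch.

```rust
use std::io::{self, Read};

// A deliberately buggy parser: the panic hides behind three nested branches,
// so blind random input almost never reaches it, but a coverage-guided fuzzer
// that remembers which inputs unlocked which branches finds it step by step.
fn parse(data: &[u8]) {
    if data.starts_with(b"FUZZ") {
        if data.len() >= 6 && data[4] == b'!' {
            if data[5] == 0x7f {
                panic!("reached the deeply nested buggy branch");
            }
        }
    }
}

fn main() -> io::Result<()> {
    // AFL-style harness: read the fuzzer-generated input from stdin.
    let mut input = Vec::new();
    io::stdin().read_to_end(&mut input)?;
    parse(&input);
    Ok(())
}
```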

Everyone was also convinced that all kinds of file systems were very stable, but they were stunned when tested with american fuzzy lop: Btrfs fell over within 5 seconds, and ext4, the one we use most, was the strongest. It held out for only two hours!!!


Now about Google. Google does not say much about how it tests, but recently the Chrome team open-sourced their fuzzing infrastructure, OSS-Fuzz. The strength of this tool lies in its excellent automation:

  • Automatically create an issue when a bug is found

  • Automatically verify after the bug is solved

What is even more impressive is that the OSS-Fuzz cluster can run roughly 4 trillion test cases a week. For more details, see this article: Announcing OSS-Fuzz: Continuous Fuzzing for Open Source Software

In addition, some tools can make the life of distributed system developers a little better.

Tracing tools may help you

  • Google Dapper

  • Zipkin

  • OpenTracing

There is also tracing. For example, one query passes through many layers and many machines; how much time does it spend in each place, on each hop? Distributed systems have dedicated tools for exactly this: distributed tracing tools. A trace shows on a single timeline how long your request spent in each stage, and if it fans out into several parts running in parallel on several machines, how long each part took. The general structure looks like this:

[Figure: the general structure of a distributed trace]

Here is a specific example:

[Figure: a concrete trace example]

It is very clear; you can see it at a glance and don't have to dig through logs. This is not new at all: Google built a distributed tracing tool, Dapper, more than ten years ago. The open source community then produced an implementation called Zipkin, written in Java. A newer effort is OpenTracing, a vendor-neutral tracing API with implementations in multiple languages, including Go. We are now preparing to use this kind of system to track the response time of TiDB requests at each stage.
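As an illustration of what instrumenting the stages looks like in code, here is a small Rust sketch using the `tracing` and `tracing-subscriber` crates (an assumption for this sketch, not the tools named above); the stage names and the sleeps standing in for real work are made up.

```rust
use std::thread;
use std::time::Duration;
use tracing::{info_span, instrument};
use tracing_subscriber::fmt::format::FmtSpan;

// Each stage of a request gets its own span; a collector can then show how
// long every stage took and how the spans nest.
#[instrument]
fn parse_query(sql: &str) {
    thread::sleep(Duration::from_millis(sql.len() as u64));
}

#[instrument]
fn fetch_rows(region: u64) {
    thread::sleep(Duration::from_millis(10 * region));
}

fn handle_request(sql: &str) {
    let span = info_span!("handle_request", sql = sql);
    let _guard = span.enter();
    parse_query(sql);
    fetch_rows(1);
    fetch_rows(2);
}

fn main() {
    // Print span timings to stdout when each span closes; a real deployment
    // would export them to a tracing backend instead.
    tracing_subscriber::fmt()
        .with_span_events(FmtSpan::CLOSE)
        .init();
    handle_request("SELECT 1");
}
```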

Finally, I want to say this: after you dig into a system and find many bugs, don't lose confidence in it. The bugs were always there; they just weren't being found before, and now many more of them are. On the whole, these new testing methods make the quality of our systems much better than before. It seems we have run a little over time, so let's stop here; there are many details we couldn't get into, and we'll talk about them next time.
(end of this series)