Distributed system testing: Ideas


This article is based on a transcript of the talk “An In-Depth Exploration of Distributed System Testing”, given by Liu Qi at the 26th PingCAP NewSQL Meetup.
Because the talk is long, it has been split into three parts for easier reading; this is the first.

Today we are mainly going to talk about distributed system testing. From where PingCAP stands today, we think that testing a distributed system well is harder than building one. Writing it is not the hardest part; testing it well is. Do you think that is an exaggeration? Let’s start with the simplest program, the Hello world that everyone can write.

A simple “Hello world” is a miracle

We should walk through all of the bugs in:

  • Compiler

  • Linker

  • VM (maybe)

  • OS

In fact, it is a miracle that Hello world runs correctly every time. Why? First, there must be no bug in the compiler and no bug in the linker. Then we may be running on a VM, so the VM must have no bugs. Hello world makes a syscall, so the operating system must have no bugs either. On top of that, the hardware must have no bugs. So for even the simplest program to run normally, it has to travel a very long path, and nothing along that path may go wrong. Only then do we see the simplest Hello world.

In a distributed system it is more complicated still. Take a very typical microservice as an example. Suppose you provide a microservice whose only function is to output Hello world, and others call it.

An RPC “Hello world” is a miracle

We should walk through all of the bugs in:

  • Coordinator (zookeeper, etcd)

  • RPC implementation

  • Network stack

  • Encoding/Decoding library

  • Compiler for programming languages or [protocol buffers, avro, msgpack, capn]

Let’s look at the path it takes. At the very least we rely on a coordinator such as ZooKeeper or etcd for service discovery. We tend to feel that these things are stable, right? But read their release notes and see which bugs each release fixes. All these things that you think are very stable are upgraded all the time, and every upgrade contains bug fixes. Put another way, we are very lucky, because most of the time we don’t run into those bugs. Next, the RPC implementation must not have any problems. Of course, if you use an RPC framework such as gRPC in depth, you will find that it has plenty of bugs. Then there is the system’s network stack. Last year a checksum problem was found in TCP, in the Linux TCP stack, which in my mind was something that would never go wrong. If you have experience with Go, look at the change history of Go’s JSON package since its release and you will find bugs there too. Then there are the compilers: the one for your programming language and the code generators for formats such as protocol buffers, which can have bugs of their own. Only after all of that can our program more or less run. And we have not even considered bugs in the hardware itself.

In fact, in terms of probability, a program that runs correctly is a very lucky one. No system is perfect, so why do things usually run smoothly? Because our tests always take the correct path: a simple test almost always exercises the happy path, while the many error paths are never hit. I don’t know if you have noticed, but when you write a Go program, error handling is usually written as if err != nil followed by returning the error, and I don’t know how many times people have written that. In other languages it is try/catch, followed by handling all kinds of errors. In a truly complete system, the error handling code usually ends up larger than the normal logic, yet our tests usually cover only the normal logic. In other words, what our tests cover is actually a small part of the code.
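To make the point about untested error paths concrete, here is a minimal Python sketch. Everything in it (FlakyStore, save_greeting, the "disk full" fault) is hypothetical and not taken from the talk; it only illustrates that the error branch runs when a failure is injected on purpose, and never in a happy-path test.

```python
class FlakyStore:
    """In-memory key-value store that can be told to fail (hypothetical)."""

    def __init__(self, fail_on_write=False):
        self.data = {}
        self.fail_on_write = fail_on_write

    def put(self, key, value):
        if self.fail_on_write:
            # Injected fault: simulate a full disk.
            raise OSError("disk full")
        self.data[key] = value


def save_greeting(store, name):
    """Return 'ok' on success and 'retry-later' when the write fails."""
    try:
        store.put(name, "Hello world, " + name)
    except OSError:
        # Error path: a happy-path test never executes this branch.
        return "retry-later"
    return "ok"


# Happy path: what a simple test usually covers.
healthy = FlakyStore()
happy_result = save_greeting(healthy, "ann")

# Error path: reached only because the fault is injected on purpose.
broken = FlakyStore(fail_on_write=True)
error_result = save_greeting(broken, "bob")

print(happy_result, error_result)
```

A test suite that only ever builds the healthy store would report success while leaving the except branch completely unexercised.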
Let’s first correct a few ideas about testing. How do we get a high-quality program, or a high-quality system?

Who is the tester?

  • Quality comes from solid engineering.

  • Stop talking and go build things.

  • Don’t hire too many testers.

    • Testing is owned by the entire team. It is a culture, not a process.

  • Are testers software engineers? Yes.

  • Hiring good people is the first step. And then keep them challenged.

Our view is that solid engineering comes first. I think this is almost beyond doubt. What is your experience? The next point is: no idle talk, build things as early as possible and get them running. Some time ago I wrote a little passage that goes: “One of you writes Rust, the other writes Java. You have chatted for so long that the Rust program (slow to compile) has already finished compiling, and the Java one has not even been started.” The original version goes like this: “You are a woodcutter, he is a shepherd. You chatted all day; his sheep are full, but where is your firewood?” There has also been a particularly controversial topic recently: what a CTO should do. Opinions differ on whether a CTO should write code. Everyone is limited by their own environment, so everyone’s view is different. I think it is a bit like the story above: it is the same chat, but different people take different things away from it.

Test automation

  • Allow developers to get unit test results immediately.

  • Allow developers to run all unit tests in one go.

  • Allow code coverage calculations.

  • Show the testing evolution on the dashboards.

  • Automate everything.

One interesting thing about us is that PingCAP has had no dedicated testers so far, which may seem incredible to other companies. Why do we do this? Because it is no longer possible for us to test by hand. How complicated is it? Let me give a few basic figures so you can get a feel for it: we now have more than six million tests, all fully automated. On top of that we have a large collection of ORM tests from the community, which I will talk about in a moment. Tests at this scale cannot be written by hand, no matter how many people you have. For example, given a valid syntax tree, you can generate output from it: change the variable names, change the expressions, and so on. In this way you can generate a large number of SQL statements.
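The idea of producing many statements from one valid syntax tree can be sketched roughly as follows. This is an illustrative toy, not PingCAP’s generator: the tuple-based tree, the table t with columns a, b and c, and the functions render, mutate and generate are all assumptions made for the example.

```python
import random

# A tiny, made-up syntax tree for: SELECT a FROM t WHERE (a > 1)
SEED_TREE = ("select", "a", "t", ("cmp", ">", ("col", "a"), ("lit", 1)))

COLUMNS = ["a", "b", "c"]
OPERATORS = ["<", "<=", "=", ">=", ">", "<>"]


def render(node):
    """Turn a tree node back into SQL text."""
    kind = node[0]
    if kind == "select":
        _, column, table, where = node
        return f"SELECT {column} FROM {table} WHERE {render(where)}"
    if kind == "cmp":
        _, op, left, right = node
        return f"({render(left)} {op} {render(right)})"
    if kind == "col":
        return node[1]
    if kind == "lit":
        return str(node[1])
    raise ValueError(f"unknown node kind: {kind}")


def mutate(node, rng):
    """Return a copy of the tree with columns, operators and literals varied."""
    kind = node[0]
    if kind == "select":
        _, _, table, where = node
        return ("select", rng.choice(COLUMNS), table, mutate(where, rng))
    if kind == "cmp":
        _, _, left, right = node
        return ("cmp", rng.choice(OPERATORS), mutate(left, rng), mutate(right, rng))
    if kind == "col":
        return ("col", rng.choice(COLUMNS))
    if kind == "lit":
        return ("lit", rng.randint(-100, 100))
    raise ValueError(f"unknown node kind: {kind}")


def generate(count, seed=0):
    """Produce `count` syntactically valid statements from the seed tree."""
    rng = random.Random(seed)
    return [render(mutate(SEED_TREE, rng)) for _ in range(count)]


statements = generate(5)
for sql in statements:
    print(sql)
```

Because the shape of the tree never changes, every variant stays syntactically valid, and the fixed seed makes any failing statement reproducible.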

Google Spanner uses this technique: a dedicated program automatically generates statements that conform to SQL syntax and hands them to the system to execute. If the system crashes during execution, there must be a bug in it. But there is another problem here: you have generated a valid SQL statement, but you do not know what its result should be. How do you judge whether the result is correct? Of course, there are smart people in the industry. Their approach is to send the statement to several databases at the same time; if they return consistent results, the result is considered basically right. If a statement comes in and my result differs from everyone else’s, then I must be wrong. Even if you are actually right, you will probably be treated as wrong, because when everyone else returns one result and you return another, everyone will think the mistake is yours.
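Comparing the results of several implementations on the same generated statement is often called differential testing. The following is a minimal sketch under stated assumptions: SQLite (from Python’s standard library) plays the system under test, and a plain Python evaluator stands in for the “other databases”. In a real setup the reference would be one or more independent database engines; the function names here are hypothetical.

```python
import random
import sqlite3


def random_expr(rng, depth):
    """Build a random integer expression as (sql_text, expected_value)."""
    if depth == 0 or rng.random() < 0.3:
        value = rng.randint(0, 9)
        return str(value), value
    left_sql, left_val = random_expr(rng, depth - 1)
    right_sql, right_val = random_expr(rng, depth - 1)
    op = rng.choice(["+", "-", "*"])
    if op == "+":
        value = left_val + right_val
    elif op == "-":
        value = left_val - right_val
    else:
        value = left_val * right_val
    return f"({left_sql} {op} {right_sql})", value


def differential_test(rounds, seed=0):
    """Run each generated statement on both sides; collect disagreements."""
    rng = random.Random(seed)
    conn = sqlite3.connect(":memory:")
    mismatches = []
    for _ in range(rounds):
        expr_sql, reference = random_expr(rng, depth=3)
        sql = f"SELECT {expr_sql}"
        (actual,) = conn.execute(sql).fetchone()
        if actual != reference:
            # Keep the statement so the disagreement can be reproduced.
            mismatches.append((sql, actual, reference))
    conn.close()
    return mismatches


mismatches = differential_test(200)
print(f"{len(mismatches)} mismatches out of 200 statements")
```

The operands are kept small and division is left out on purpose, so that integer overflow and differing division semantics cannot produce spurious disagreements.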

So generating tests automatically is very important. Last year a new saying appeared in the United States: “how to find bugs while you sleep.” The important thing about testing is exactly that: automated tests can find bugs while you sleep. We have also mentioned fault injection and fuzz testing. And everyone who tests is an engineer, because only then can nobody shift the blame onto someone else.

This is something we firmly believe: all tests must be highly automated, with no human intervention at all. Then the most important thing is to hire the best people and keep them challenged. Without challenges, these people get idle and distracted, and it becomes hard to work together. What characterizes the present era? Complex engineering needs a large number of excellent people. If they are not concentrated in one place, the complex engineering cannot be done. I was reading about Loongson today. It has been under development for ten years and is now almost as good as Intel’s processors. They must have excellent people, but for now we still have to admit that there is a big gap between our hardware and that of other countries. In software there is also a big gap. For example, we are seven years behind Spanner, which was already in large-scale use at Google in 2012. We have always admired such excellent work.

I have emphasized automation over and over. I wonder how much test coverage you usually have? If coverage is always below 50%, half of your code is never exercised by tests, and it may cause problems in production at any time. Of course, we also need a better approach: replaying production cases before going live. In theory, the longer a build runs offline against replayed cases without being changed, the safer it is. But if it has been running for, say, two months and the business logic was modified slightly in the meantime, and those two months did not cover the modification, new problems can appear. So we need to automate everything, including the monitoring mentioned just now. For example, as soon as you bring a system online, it automatically works out which items need to be monitored and sets up the alerts. Sounds amazing? In fact this is commonplace at Google, and PingCAP is doing it now as well.

Well… still not enough?

  • Each layer can be tested independently.

  • Make sure you are building the right tests.

  • Don’t bother great people unless the testing fails.

  • Write unit tests for every bug.

It is not enough to divide the whole system into many modules and test them layer by layer. Something else matters just as much, and we found it out early on in an interesting way. We had built a lot of tests, and our program passed them easily. Later we found that one of the tests was wrong. What does that mean? It means our program had been wrong all along, and the test was covering for it. At one point we were convinced we had written correct code, yet the result was not correct. We went back to check and found that a test had been written incorrectly. So having correct tests is very important; otherwise you are kept in the dark about the error indefinitely, because the test keeps telling you everything is right.
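One way to check that a test is the right test is to run it against a deliberately broken implementation and confirm that it fails. The sketch below is an illustration of that idea only, with hypothetical names; it is not the procedure the speaker’s team used.

```python
def add(a, b):
    """Correct implementation."""
    return a + b


def broken_add(a, b):
    """Deliberately wrong implementation, used only to check the tests."""
    return a - b


def weak_check(impl):
    """A badly written test: 0 + 0 cannot tell + from -."""
    return impl(0, 0) == 0


def strong_check(impl):
    """A better test: the inputs distinguish + from -."""
    return impl(2, 3) == 5 and impl(0, 0) == 0


def catches_bug(check, good_impl, bad_impl):
    """A useful test passes on the good code and fails on the broken code."""
    return check(good_impl) and not check(bad_impl)


weak_is_useful = catches_bug(weak_check, add, broken_add)
strong_is_useful = catches_bug(strong_check, add, broken_add)
print(weak_is_useful, strong_is_useful)
```

The weak check passes on both implementations, which is exactly the situation described above: the test reports success while the program is wrong.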

And why automate? So that you do not disturb these smart people. They are very smart; if nothing is wrong, leave them alone. If you keep saying “come over here and run a test for me”, you are interrupting them again and again, which hurts their performance and keeps them from taking on their own challenges.

This one is very important: for every bug that has ever occurred, you must write a test that covers it. Everyone should already know this rule. I think most people here today are of an age to have watched Saint Seiya, right? The Saints have one characteristic: any move works against them only once. The same applies here. It ensures that you are not bitten or trapped by the same bug again. I suspect many people fix bugs like this: a bug shows up, I fix it, but I add no test, and later it shows up again. That is baffling when it happens, and the more such bugs pile up, the more miserable things get.
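As a small illustration of “one test per bug”, the sketch below pairs a fix with a regression test. The function parse_limit and the bug it once had (crashing on empty input) are invented for the example.

```python
import unittest


def parse_limit(text):
    """Parse a LIMIT value; empty input means 'no limit' (None)."""
    text = text.strip()
    if text == "":
        # Fix for the (hypothetical) bug: this used to crash in int("").
        return None
    return int(text)


class ParseLimitTest(unittest.TestCase):
    def test_plain_number(self):
        self.assertEqual(parse_limit("10"), 10)

    def test_regression_empty_input(self):
        # Added together with the fix, so the same bug cannot return unnoticed.
        self.assertIsNone(parse_limit("   "))


suite = unittest.defaultTestLoader.loadTestsFromTestCase(ParseLimitTest)
result = unittest.TextTestRunner(verbosity=0).run(suite)
```

If someone later removes the empty-input check, the regression test fails immediately instead of the bug resurfacing in production.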

This is a practice that mainstream open source communities follow without exception. If an open source project said “we found a bug and fixed it, but we did not add a test to cover it”, nobody would dare to use the project afterwards.

Code review

  • At least two LGTMs (Looks good to me) from the maintainers.

  • Address comments.

  • Squash commit logs.

  • Travis CI/Circle CI for PRs.

Let’s talk about code review, which is also related to testing. Why? Because in code review you submit a new PR, and the PR must pass the tests, typically on Travis CI or CircleCI. Why do we do this? To make sure problems are found before the code is merged into master. If broken code has already been merged into master, first, you have to revert it, which looks particularly bad in the commit history. Second, it is best to find the problem before anything goes wrong, because many tools build automatically from master. For example, we automatically build a Docker image from master: as soon as your code is committed to master, a new Docker image comes out. Your users then see that there is a new update and want to use it immediately. If there was no CI run beforehand, you are in trouble at that point. So if CI has not passed, you must not enter the CD stage.

Who to blame in case of bugs?

The entire team.

Another idea to correct is who is responsible for a bug. Most people I have met say something like: “This bug has nothing to do with me. It is a bug in his module.” PingCAP sees it differently. When there is a bug, the whole team is responsible, because you have your own code review mechanism. At least two people have looked at the code. If there is a problem, it cannot be one person’s problem.

Besides the kinds of bugs just mentioned, there are some that are hard to define. Are they bugs at all? The system runs slowly: is that a bug? How do you define a bug? The way we define it now is that the user has the final say. We may not think something is a bug, it is just a bit slow, but if the user says “this is too slow, we cannot bear it”, then it is a bug, and you have to optimize it. This has happened in our team: someone says “it already runs fast, fast enough”. Sorry, if the user says it is slow, it is slow, and you have to improve it. In short, you do not get to set the standard yourself. If you set the standard yourself, it becomes “I’m fine, I don’t need to change anything”, and that does not work.


  • Profile everything, even on production

    • once-in-a-lifetime chance

  • Bench testing

On profiling, we emphasize that you must be able to profile even in production. The cost of profiling is actually very small. It is quite possible that a production system suddenly stalls badly, and if you restart it, you may never be able to reproduce the problem. That really can be a once-in-a-lifetime event: if you do not catch it then, you may never get another chance. We will introduce some methods for making such problems reappear, but some of them are closely tied to the business and show up only when a particular environment happens to arise. So you must seize the opportunity. It is like a criminal who offends only once in a lifetime: miss him that time and you never get another chance to catch him.
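Capturing a profile from a running system at the moment it misbehaves might look roughly like the sketch below. It uses cProfile only because it ships with Python; this is an assumption made for illustration, not a statement about which profiler the speaker’s team uses, and production systems often prefer lower-overhead sampling profilers. The helper profile_call and the stand-in handler slow_request are hypothetical.

```python
import cProfile
import io
import pstats


def profile_call(func, *args, top=5):
    """Run func under cProfile and return (result, text report)."""
    profiler = cProfile.Profile()
    profiler.enable()
    try:
        result = func(*args)
    finally:
        # Always stop profiling, even if the call raises.
        profiler.disable()
    buffer = io.StringIO()
    stats = pstats.Stats(profiler, stream=buffer)
    stats.sort_stats("cumulative").print_stats(top)
    return result, buffer.getvalue()


def slow_request(n):
    """Stand-in for a request handler that is unexpectedly slow."""
    return sum(i * i for i in range(n))


result, report = profile_call(slow_request, 100_000)
print(report)
```

The report is captured as text so it can be written to a log before anyone restarts the process and loses the chance to observe the problem.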

Embed testing into your design

  • Design for testing or Die without good tests

  • Tests may make your code less beautiful

Let’s talk about the relationship between testing and design. Testing must be built into your design: when you design something, you must think about how it will be tested. If at design time you cannot work out how to test it, then its correctness cannot actually be verified, which is a terrible thing. This is how we see the importance of testing: either you design for testing, or you fail. There is no other choice. In other words, we rank its importance at the very top.
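One common way to design for testing is to pass dependencies such as the clock in from outside, so a test can control them. The sketch below is a hypothetical illustration of that design choice (LeaseManager and FakeClock are invented names), not code from the talk.

```python
import time


class LeaseManager:
    """Grants a lease that expires after `ttl` seconds (hypothetical example)."""

    def __init__(self, ttl, clock=time.monotonic):
        self.ttl = ttl
        self.clock = clock  # injected so that tests can control time
        self.expires_at = None

    def acquire(self):
        self.expires_at = self.clock() + self.ttl

    def is_valid(self):
        return self.expires_at is not None and self.clock() < self.expires_at


class FakeClock:
    """Test double: time moves only when the test says so."""

    def __init__(self):
        self.now = 0.0

    def __call__(self):
        return self.now

    def advance(self, seconds):
        self.now += seconds


clock = FakeClock()
lease = LeaseManager(ttl=10, clock=clock)
lease.acquire()
valid_before = lease.is_valid()
clock.advance(11)
valid_after = lease.is_valid()
print(valid_before, valid_after)
```

With the clock hard-coded inside the class, verifying expiry would require real waiting; with it injected, the expiry path can be checked instantly and deterministically. The extra constructor parameter is the kind of cost the slide means by “tests may make your code less beautiful”.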
(to be continued)
