Operation and Maintenance Master: Summary of Troubleshooting Experience


Tencent Blue Whale Zhiyun, or Blue Whale for short, is a PaaS development framework built and used by Tencent Interactive Entertainment Group (IEG) for constructing an integrated enterprise R&D and operations system. It provides unattended basic services for operation and maintenance (release and change, monitoring and handling, value adjustment, data extraction, and so on). It also gives operation and maintenance personnel tools that they can adapt at any time, so that they avoid repetitive manual work.


Troubleshooting seemingly unstructured problems may be one of the most tense and difficult jobs in the world, especially when the business at stake earns very high revenue and the service runs at massive scale. The pressure brings a real sense of panic and sends adrenaline soaring, and under that stress we are prone to low-level mistakes. To overcome this instinct, we have to hold our nerves in check and force ourselves to try things one by one, methodically. In a sense, operation and maintenance work is training in temperament: staying calm enough to handle problems without falling into chaos.

Tracking a problem down to its root cause and fixing it is, I personally think, very fulfilling. Someone once asked me: "How did you figure out that the problem was in xxx? How did you confirm that the root cause was xxx?" I could only answer lightly, "From experience," and felt rather pleased with how that sounded. But "from experience" is very vague. Everyone has long assumed that troubleshooting depends on experience, yet nobody can say exactly what kind of experience is involved, and so troubleshooting has gradually come to look like a form of mysticism. In reality, troubleshooting usually follows some general, unwritten practical rules; it is not mystical at all. I have summarized them here from my own experience and hope they help in your day-to-day work.

Since entering the industry, we have run into all sorts of strange problems. Every business and every system is different, though. We can often search for the solution to one problem or one class of problems, but I personally feel that ways of thinking and experience are hard to copy. So instead I will sketch out a methodology (a routine, if you like) for troubleshooting, and I hope it resonates with you.

1. Troubleshooting is like solving a case

Troubleshooting a production problem is like the police solving a case: a process of continually analyzing clues and reasoning from them. Before starting to troubleshoot, though, we should understand three things:

Cognition is almost the only essential difference between people. —— Fu Sheng's "Cognitive Upgrade Trilogy"

  • It is normal for the system to malfunction
    Today's computer systems are extremely complex. A single user request may pass through the client sending the request, DNS resolution, carrier networks, load balancing, servers, and virtual machines (containers), and, depending on the complexity of the business logic, may also call components such as caches, storage, and databases. A problem can occur at any of these links, and some components are distributed, which makes troubleshooting much harder. So when a problem occurs, do not panic; keep a steady attitude.
  • The first task is to restore the system
    "In an emergency, the pilot's first task is to keep the aircraft flying. Compared with getting the passengers and the aircraft safely on the ground, locating and troubleshooting the fault are secondary goals." In the same way, restoring the online system is the first task, not immediately finding out why the problem occurred.
  • There is always only one truth
    Computing is a science, and the computer world is built from 0s and 1s. In this world there is only yes or no, with no middle ground. So in the computer world everything has a root cause; nothing happens by chance, and everything is inevitable.

2. Understand the case and assess the size

First assess the scope of the problem: is the whole network affected, only some regions, or a single link, or are many business lines having problems? Then assess the size of the case: is it an ordinary civil case or a criminal case?
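As a rough illustration of this assessment, the sketch below classifies the scope of a failure from a set of probe results. The `ProbeResult` type, the region and business-line labels, and the four scope categories are all assumptions made for the example; a real classification would depend on how your own monitoring is organized.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ProbeResult:
    region: str         # hypothetical label, e.g. "south-china"
    business_line: str  # hypothetical label, e.g. "login"
    ok: bool            # True if the probe succeeded


def assess_scope(results: list[ProbeResult]) -> str:
    """Classify how widely a failure has spread (assumed categories)."""
    failed = [r for r in results if not r.ok]
    if not failed:
        return "no impact"

    all_regions = {r.region for r in results}
    bad_regions = {r.region for r in failed}
    bad_lines = {r.business_line for r in failed}

    # Every probed region reports at least one failure.
    if len(all_regions) > 1 and bad_regions == all_regions:
        return "whole network"
    # More than one business line is failing, but not everywhere.
    if len(bad_lines) > 1:
        return "multiple business lines"
    # One business line is failing in several, but not all, regions.
    if len(bad_regions) > 1:
        return "some regions"
    return "single link"
```

The order of the checks reflects one possible priority (widest impact first); other orderings are equally defensible.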

3. Sort out clues, organize and analyze

Sort out the information and clues already at hand: network alarms on the monitoring system, users reporting that they cannot access the service, developers reporting a problem with a server, changes made in the same time period, and so on. Try not to miss clues that seem irrelevant. Organize them first, then analyze them together.

Reasoning means drawing a single conclusion from known clues through reasonable imagination and inference. Clues are the starting point of the whole reasoning process; whether they are good, and whether they contain errors, directly affects the quality of the reasoning. Gathering them is therefore the most basic and most important step. The most common mistakes when sorting out clues are insufficient information and subjective assumptions.
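One simple way to organize clues before analyzing them is to put them on a single timeline and look at which changes landed shortly before the incident. The sketch below assumes a hypothetical `Clue` record, a hypothetical "change log" source label, and a one-hour look-back window; none of these come from a particular tool.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta


@dataclass(frozen=True)
class Clue:
    time: datetime
    source: str  # e.g. "monitoring", "user feedback", "developer", "change log"
    detail: str


def build_timeline(clues):
    """Order clues from every source by the time they were observed."""
    return sorted(clues, key=lambda c: c.time)


def changes_before(clues, incident_start, window=timedelta(hours=1)):
    """Return change-log clues recorded within `window` before the incident."""
    return [
        c
        for c in build_timeline(clues)
        if c.source == "change log"
        and incident_start - window <= c.time <= incident_start
    ]
```

A change that appears in this window is only a suspect, not a verdict; it still has to be checked against the other clues.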

4. Expand Your Information

Actively widen the information you take in. Ask the developers or the algorithm colleagues whether anything was changed online today, and ask whether the network team made any major adjustments. Valuable information obtained this way is crucial for troubleshooting. Checking monitoring, looking closely at how a particular monitoring item changed, and following logs and debugging output are all ways to increase the amount of information.

Broaden your knowledge as well: in your spare time, learn more about related systems, including their architecture, deployment, and logic. When a fault occurs, that knowledge gives you ideas to bring to the discussion, lets you reason by analogy from one case to another, and speeds up the investigation and resolution of the problem.

5. Analyze testimonies to identify right and wrong

Information that comes from outside, such as business complaints and user feedback, is sometimes credible and sometimes not. For example, developers once reported that results had degraded, insisted that their own side was normal, and asked us to check the system; in the end the cause was a dynamic configuration that their code was calling. Sometimes the feedback has already been filtered and processed by the person describing it, and that person's own investigation and analysis may lead you astray. When collecting information, examine everyone's testimony with a critical and skeptical attitude.

Everyone's ability to learn is actually very strong. As experience accumulates, the ability to screen testimony gradually improves.

6. See the nature of the problem

"When you hear the sound of hooves, guess the horse, don't guess the zebra." When you see a phenomenon or a thing, look at its substance rather than just its surface. The point is to understand what is actually there, not to guess whether it is a zebra, a white horse, or a black horse.

The same is true of troubleshooting. Sometimes something that seems impossible, or something extremely simple, turns out to be the final cause. Do not rule out a cause lightly, even one such as "SSD data errors caused by cosmic rays".

A long time ago, I ran into high latency on a certain service (svr). I spent a long time checking and did some tuning, but nothing helped. In the end I found that the network card was simply saturated.
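A check like the one that would have caught this early can be sketched as follows: take two samples of an interface's byte counter and compare the throughput with the link speed. On Linux such counters can be read from `/proc/net/dev`; the function names, the 90% threshold, and the decision to ignore counter wrap-around are assumptions made for the illustration.

```python
def link_utilization(bytes_before, bytes_after, interval_s, link_speed_mbps):
    """Fraction of link capacity used between two byte-counter samples.

    Assumes the counter did not wrap or reset between the two samples.
    """
    if interval_s <= 0 or link_speed_mbps <= 0:
        raise ValueError("interval and link speed must be positive")
    bits_sent = (bytes_after - bytes_before) * 8
    capacity_bits = link_speed_mbps * 1_000_000 * interval_s
    return bits_sent / capacity_bits


def is_saturated(utilization, threshold=0.9):
    """Treat the link as full above an assumed 90% utilization threshold."""
    return utilization >= threshold
```

For example, 125,000,000 bytes in one second on a 1000 Mbit/s link is one billion bits against a capacity of one billion bits, so the link is fully used.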

7. Determine the direction and locate the problem

Decide on a direction for the investigation, for example from large to small and from top to bottom. From large to small: first check whether there are problems at the macro level, such as the IDC network or the state of the machine room, eliminate them one by one, and gradually narrow the scope of the problem. From top to bottom: starting from the top of the call chain where the symptom appears, check each link in turn and work steadily downward.

Not every problem should be approached from large to small and from top to bottom, however. A macro-level problem only causes a "qualitative change", and so attracts attention, once it reaches a certain level. On the way to that qualitative change, your business may already have been affected in ways that show up clearly. In that case, start with micro-level analysis and then work up gradually to the macro-level diagnosis.
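The large-to-small order can be sketched as an ordered list of checks that stops at the first layer that fails. The layer names and check functions here are placeholders; each check is assumed to return True when its layer is healthy. For the small-to-large case, the same function works with the list reversed.

```python
from typing import Callable, Optional

# A named health check: returns True when the layer looks healthy.
Check = tuple[str, Callable[[], bool]]


def first_failing_layer(checks: list[Check]) -> Optional[str]:
    """Run checks in the given order and return the first layer that fails."""
    for name, check in checks:
        if not check():
            return name
    return None  # every layer passed; the cause lies elsewhere
```

A typical ordering might be IDC network, load balancer, service, then database, but the right list is whatever your own call chain looks like.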

8. Summarizing records and filing cases

The palest ink is better than the best memory. In the middle of a chaotic problem analysis, it is admittedly impractical to expect operation and maintenance staff to calmly write down every problem and judgment. Even so, once the incident is over we can still keep a record of the analysis and summarize the steps taken and the solutions applied, which helps us and the team accumulate valuable handling experience.

The above method and process, translated into operation and maintenance terms:

[Figure: the troubleshooting method and process expressed in operation and maintenance terms]

9. Learn from every setback

Having a problem is not frightening. What is frightening is learning nothing from it and letting similar problems recur. To make problem location more efficient, several things are worth doing, such as:

  • Establish a long-lived error code mechanism, using numbers that carry statistical and visual meaning to describe briefly what an error means and how far it reaches. As the saying goes, what is concentrated is the essence, and error codes bear this out time and again.
  • Write useful error logs. The main purpose of writing error logs in ordinary programs is to make troubleshooting easier by providing important clues and guidance. In practice, however, the content and format of error logs vary widely, and an error message may be incomplete, lack the relevant context, or be unclear, which makes troubleshooting inconvenient and time-consuming. If developers take just a little care, a great deal of wasted troubleshooting effort can be avoided. Knowing how to write an effective error log, and establishing a logging standard, is very helpful for problem analysis.
  • Locate the problem in a way that avoids secondary damage. When a seemingly elusive problem occurs, the instinct may be to restart and get the system back to normal as quickly as possible. This approach often solves the problem and works quickly, but it can also push the situation into an abyss. Such fixes include restarting unstable systems, automatically replaying database logs, repairing file systems, and so on. They often do solve the problem and bring the system back into production, but they may also cause data recovery efforts to fail, destroy the opportunity to determine the root cause, and even significantly extend the downtime of critical systems. Preserving the scene is also very important, just as a crime scene requires on-site investigation, sample collection, and sealing off. For problems that are hard to reproduce, try to create the conditions to retain the data or the scene that can be used to reproduce the fault.

The online environment is complex and changeable. Preserving the scene does not directly help solve the problem at hand, but sticking to this practice creates the conditions for development and testing, reduces the number of hard-to-reproduce problems left hanging, and ultimately benefits the long-term stability of the business.
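Preserving the scene before a restart can be as simple as running a set of collectors and writing whatever they return to disk. The sketch below is one possible shape for this, not a standard tool: `preserve_scene`, the `scene.json` file name, and the idea of passing collectors as plain callables are all assumptions for the example.

```python
import json
import time
from pathlib import Path


def preserve_scene(collectors, out_dir):
    """Run each collector and save its output before any restart or repair.

    `collectors` maps a label to a zero-argument callable that returns
    something printable (process list, connection table, log tail, ...).
    """
    out = Path(out_dir)
    out.mkdir(parents=True, exist_ok=True)
    snapshot = {"captured_at": time.time(), "evidence": {}}
    for name, collect in collectors.items():
        try:
            snapshot["evidence"][name] = collect()
        except Exception as exc:
            # One failing collector must not cost us the rest of the evidence.
            snapshot["evidence"][name] = f"collection failed: {exc!r}"
    path = out / "scene.json"
    path.write_text(json.dumps(snapshot, indent=2, default=str), encoding="utf-8")
    return path
```

In practice the collectors might wrap tools such as top, jstack, or tcpdump; the point is only that the evidence is saved first and the restart comes second.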

  • Establish a centralized data visualization platform, rather than starting analysis only once a problem has occurred. Without enough understanding of the business and without data to rely on, you are likely to make the problem worse.
  • Establish a sandbox shadow system that simulates the complex and changeable production network environment, so that problems can be reproduced or stress tested without affecting online traffic, for example with TCPCopy or DubboCopy.
  • Build a log visualization solution on open source tools such as ELK or Log.io to solve the "last mile" problem.
  • To do good work, you must first sharpen your tools. Common system troubleshooting tools include Perf, IPTraf, Netperf, TcpDump, GDB, Pstack, jstack, strace, top, iotop, and Tsar.
  • … …
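To make the error code and error log items in this list more concrete, here is a minimal sketch of a numeric error code whose digits carry meaning, together with a log line that includes the decoded fields and the surrounding context. The five-digit layout (one scope digit, two module digits, two detail digits) and the scope and module tables are invented for the example; a real scheme would be agreed on by the team.

```python
import json

# Hypothetical layout: S MM EE (scope digit, module number, specific error).
SCOPES = {1: "caller input", 2: "downstream dependency", 3: "internal"}
MODULES = {1: "gateway", 2: "cache", 3: "database"}


def decode_error_code(code: int) -> dict:
    """Split a five-digit code into its scope, module, and detail parts."""
    if not 10000 <= code <= 99999:
        raise ValueError("error code must have exactly five digits")
    scope, rest = divmod(code, 10000)
    module, detail = divmod(rest, 100)
    return {
        "scope": SCOPES.get(scope, "unknown"),
        "module": MODULES.get(module, "unknown"),
        "detail": detail,
    }


def format_error_log(code: int, message: str, **context) -> str:
    """Build one structured log line: code, decoded meaning, and context."""
    record = {
        "code": code,
        **decode_error_code(code),
        "message": message,
        "context": context,
    }
    return json.dumps(record, sort_keys=True)
```

With this layout, `format_error_log(20307, "query timed out", host="db-1")` yields a line that says at a glance that a downstream dependency failed in the database module, and which host was involved. Because every line has the same fields, the logs are also easy to count and chart.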

10. Conclusion

Looking back over several years of handling problems, the ideas and experience can be distilled into the following sentences:

Collect information and record it as you go;
Coordinate resources and control the impact;
Judge calmly and analyze calmly;
Make bold assumptions and test them carefully;
Summarize actively for later use.

Becoming an operation and maintenance expert may be the dream of everyone in this line of work. Experts seem to have a keen nose that always sniffs out the root cause of a system failure. This ability to react quickly and locate problems accurately comes from years of experience and personal knowledge of handling problems in complex systems, and it is hard to replicate. No institution issues a certificate for it, yet it remains a "supernatural" ability that everyone is happy to pursue.

This article shares some experience and lessons from operations work. The methods, experience, and ideas presented here do not represent best practices.

Introduction to Blue Whale Zhiyun

Tencent Blue Whale Zhiyun (Blue Whale for short) is a set of PaaS-based technical solutions dedicated to building an industry-leading, one-stop automated operation and maintenance platform. A community edition and an enterprise edition are currently available, and you are welcome to try them.
Official website of Blue Whale: http://bk.tencent.com