This was the first time I experienced a complete crash of a cloud server: CPU usage suddenly hit 100%, and the console could neither force a restart nor force a shutdown. So far, neither the Tencent Cloud engineers nor I have found a specific cause.
The images in this article are hosted in a GitHub repository. If they load too slowly, please read the post at its original address or visit my site, godbmw.com
1. Incident summary
- Server: Tencent Cloud student plan, Ubuntu 16.04, 1 Mbps bandwidth
- Time: 13:40, October 9, 2018
- At around 12:37 on October 9, 2018, CPU utilization suddenly jumped to 100%, while bandwidth usage, intranet inbound and outbound packets, and memory showed nothing abnormal.
- At 13:28 on October 9, 2018, the server was officially down: the console could not force a restart (clearing the browser cache and switching browsers did not help), and it could not force a shutdown either.
- I submitted a support ticket. The discussion went nowhere, and then there was no further reply.
- At 14:00 on October 9, 2018, I went back to the console, tried a forced restart again, and it succeeded!
- I relaunched my own projects and some of my company's scripts. Fortunately, nothing was lost.
- I checked the logs and discussed them with the Tencent Cloud engineers on the support ticket. Both sides confirmed that the logs showed no problems, so the error could not be diagnosed.
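Since the logs held no trace of what pinned the CPU, a small sampler that records CPU usage over time could leave more evidence for a future incident. The following is a minimal sketch, not something that was running on this server: it assumes a Linux host such as Ubuntu 16.04, reads the aggregate counters from `/proc/stat`, and the names `read_cpu_times`, `cpu_percent` and `sample`, as well as the 90% threshold, are made up for illustration.

```python
import time


def read_cpu_times(path="/proc/stat"):
    """Return the aggregate CPU counters from the first line of /proc/stat."""
    with open(path) as f:
        fields = f.readline().split()
    # fields[0] is the literal "cpu". Keep the first 8 counters
    # (user, nice, system, idle, iowait, irq, softirq, steal);
    # guest time is already included in user/nice.
    return [int(x) for x in fields[1:9]]


def cpu_percent(prev, curr):
    """Busy percentage between two snapshots of the counters."""
    deltas = [c - p for p, c in zip(prev, curr)]
    total = sum(deltas)
    if total <= 0:
        return 0.0
    # Index 3 is idle, index 4 is iowait (when present).
    idle = deltas[3] + (deltas[4] if len(deltas) > 4 else 0)
    return 100.0 * (total - idle) / total


def sample(interval=5.0, threshold=90.0):
    """Take one measurement and return (timestamp, percent, over_threshold)."""
    before = read_cpu_times()
    time.sleep(interval)
    after = read_cpu_times()
    percent = cpu_percent(before, after)
    stamp = time.strftime("%Y-%m-%d %H:%M:%S")
    return stamp, percent, percent >= threshold
```

Calling `sample()` in a loop and appending the result to a file would give a timeline to compare against the cloud console's graphs. Recording which process was responsible would need extra tooling that this sketch leaves out.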
2. The scene of the crash
First, my personal website could not be accessed, as shown in the following figure:
Second, neither a forced restart nor a forced shutdown was possible, as shown in the figure below. Note the error message at the top of the screenshot:
The Tencent Cloud console itself states that a forced shutdown is a physical operation that cuts the power! Yet even that could not be done. I was speechless.
About 20 minutes passed after I submitted the support ticket without any reply. Then, at about 14:00 on October 9, 2018, after many attempts, the forced shutdown and restart finally worked, as shown in the figure:
Given all of the above, I can't help suspecting that an engineer manually cut the server's power :)
3. How to remedy?
After this incident, I realized just how important the stability of cloud services is! The Tencent Cloud engineers offered no explanation for it, so I can only guess that it was a physical problem with the server.
So this time I went ahead and set up two servers and started doing “load balancing” (and updated the website's ICP filing accordingly). Backing up data every day is also very important.
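As an illustration of the daily backup idea, here is a minimal sketch that packs a directory into a timestamped `.tar.gz` archive and keeps only the most recent few. It is not the setup actually used on these servers: the function names `make_backup` and `prune_backups` and the retention count of 7 are assumptions chosen for the example.

```python
import tarfile
from datetime import datetime
from pathlib import Path


def make_backup(src_dir, dest_dir, now=None):
    """Pack src_dir into dest_dir/<name>-<timestamp>.tar.gz and return the path."""
    src = Path(src_dir)
    dest = Path(dest_dir)
    dest.mkdir(parents=True, exist_ok=True)
    stamp = (now or datetime.now()).strftime("%Y%m%d-%H%M%S")
    archive = dest / f"{src.name}-{stamp}.tar.gz"
    with tarfile.open(archive, "w:gz") as tar:
        # arcname keeps paths inside the archive relative to the folder name.
        tar.add(src, arcname=src.name)
    return archive


def prune_backups(dest_dir, keep=7):
    """Delete all but the newest `keep` archives; return the removed paths."""
    # Timestamped names sort chronologically for a single source folder.
    archives = sorted(Path(dest_dir).glob("*.tar.gz"))
    old = archives[:-keep] if keep > 0 else archives
    for path in old:
        path.unlink()
    return old
```

A scheduler such as cron could run the two functions once a day. Because the lesson here is that a whole server can become unreachable, the archives should also be copied to another machine or to object storage, which this sketch leaves out.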
Finally, I hope cloud service providers keep their services as stable as possible. At the very least, when a server crashes, it should be possible to find the cause of the crash and fix it.