Station B (Bilibili) went down. How do we prevent similar accidents?


As everyone knows, although I am a programmer, I love sports, dancing for example. Every day after I get home, before going to bed, I study dances in the dance section of station B.

Yesterday was no exception. As soon as I finished washing up, I rushed to the computer and opened the dance section of station B to learn the new moves from Bite Meow, Xin Xiaomeng and Xiaoxian. I have to say my "wives" dance very well; even an introvert like me started moving along without realizing it.

Just as I was about to learn the next move, the page suddenly showed 404 Not Found.

Oh no. As a developer, my first instinct was that the system had gone down. I even suspected my own network, but my mobile network was fine and the computer could open other websites normally. That is when I knew the developers were going to take the blame.

I refreshed several times and nothing changed. I felt for the developers responsible; their year-end bonus is probably gone. (At the time of writing, the site had not yet been restored.)

As a former programmer, I habitually started thinking about how station B's website is architected and where things could have gone wrong in this incident. (An old professional habit.)

First, we can roughly draw the structure diagram of a simple website, and then we can guess where the problem may be.

I am writing this late at night, I have never worked at a company whose main business is video and live streaming, and I do not know that technology stack well, so I drew the sketch using the general logic of an e-commerce site. Please go easy on me.

From top to bottom: the entry point, CDN content distribution, front-end servers, back-end servers, distributed storage, big data analysis, risk control, and search and recommendation. I drew it casually, but I think the overall architecture should not be very different.

I looked up companies such as Douyu (Betta), station B and station A (AcFun) online. Their main technology stacks and technical difficulties are as follows:

Video access and storage

  • Streaming
  • Nearest node
  • Video codec
  • Resumable transfer (far more involved than the IO examples we usually write)
  • Database system & file system isolation

Concurrent access

  • Streaming media server (all the major vendors have one, and bandwidth costs are high)
  • Data cluster, distributed storage, cache
  • CDN content distribution
  • Load balancing
  • Search engine (word segmentation)

Bullet-comment (danmaku) system

  • Concurrency, threads
  • Kafka
  • NIO framework (Netty)

In fact, these are much the same technologies we all learn, except that their microservices are probably built largely with Go, PHP, Vue and Node.

Let's analyze the possible causes and locations of the failure:

1. Deleting the database and running away

This has happened before, at Weimob. I do not think companies give operations staff that much permission any more; for example, host permissions can directly prohibit commands such as rm -rf, fdisk and drop.

Moreover, the database is very likely deployed with multiple masters, multiple replicas and multiple backups, and disaster recovery should be well in place. Besides, if only the database had blown up, the static resources on the CDN should still load, rather than the whole page returning 404.

2. A single microservice goes down and drags down a large cluster

Front end and back end are separated these days. If only the back end goes down, much of the front end can still load; the data requests simply fail with errors. So for the whole site to go down, either the front end went down or the front end and back end went down together. But the same objection applies: right now it looks as if no static resources can be accessed at all.

Still, I think this is somewhat possible: some services go down, which produces a flood of errors and brings down the cluster. People then keep refreshing the page, which makes it even harder to restart the other services. But I consider this less likely than the last possibility I discuss.
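
One common defence against this kind of pile-up is a circuit breaker: after a few consecutive failures, callers stop hitting the sick service for a cool-down period instead of retrying endlessly. Below is a minimal sketch in Python. The class name, thresholds and the injectable clock are my own assumptions for illustration; I have no idea what station B actually runs.

```python
import time


class CircuitBreaker:
    """Fail fast after repeated errors so callers stop piling onto a sick service."""

    def __init__(self, max_failures=3, reset_after=30.0, clock=time.monotonic):
        self.max_failures = max_failures  # consecutive failures before opening
        self.reset_after = reset_after    # cool-down in seconds (assumed value)
        self.clock = clock                # injectable so the sketch can be tested
        self.failures = 0
        self.opened_at = None             # None means the circuit is closed

    def call(self, func, *args, **kwargs):
        if self.opened_at is not None:
            if self.clock() - self.opened_at < self.reset_after:
                raise RuntimeError("circuit open: failing fast")
            # Cool-down has elapsed: let one trial call through (half-open).
        try:
            result = func(*args, **kwargs)
        except Exception:
            self.failures += 1
            # A failed trial call, or too many failures, (re)opens the circuit.
            if self.opened_at is not None or self.failures >= self.max_failures:
                self.opened_at = self.clock()
            raise
        # Success closes the circuit and clears the failure count.
        self.failures = 0
        self.opened_at = None
        return result
```

While the circuit is open, the caller gets an immediate error it can turn into a degraded page, and the failing service gets breathing room to restart.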

3. The server vendor has a problem

Large websites like this are built as CDN + SLB + site clusters, with rate limiting, degradation and load balancing all well in place. That leaves the possibility that the hardware at the server vendor hosting these front-facing services has failed.
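
For readers who have not implemented rate limiting themselves, here is a rough token-bucket sketch in Python. It only illustrates the general technique; the class, its parameters and the injectable clock are assumptions of mine, not anything taken from station B.

```python
import time


class TokenBucket:
    """Allow short bursts up to `capacity`, refilled at `rate` tokens per second."""

    def __init__(self, rate, capacity, clock=time.monotonic):
        self.rate = rate
        self.capacity = capacity
        self.clock = clock              # injectable so the sketch can be tested
        self.tokens = float(capacity)   # start with a full bucket
        self.last = clock()

    def allow(self):
        now = self.clock()
        # Refill according to elapsed time, never beyond the bucket capacity.
        self.tokens = min(self.capacity, self.tokens + (now - self.last) * self.rate)
        self.last = now
        if self.tokens >= 1:
            self.tokens -= 1
            return True
        return False  # caller should reject the request or serve a degraded page
```

A gateway would typically keep one bucket per user or per API and answer with an error or a degraded response when `allow()` returns False.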

What puzzles me is that station B's BFF should route each user to a data center with a nearby access node. In that case, when users across the country refresh, some should succeed, some should fail, and some should succeed only intermittently. But right now it looks as if everyone is failing. Did they bet everything on a single region of one vendor?

I have also seen claims online that the Yunhai ("cloud sea") data center caught fire. I do not know whether that is true; I can only wait until I wake up and read station B's official announcement. In principle, station B should have safeguards at every layer, from CDN and distributed storage to big data and the search engine. Putting everything in one place would be really unwise.

My feeling is that not everything has been moved to the cloud, and that the on-premises servers are what failed, with key business happening to be among the services not on the cloud. Companies nowadays use a hybrid of public and private cloud, but the private cloud part hosts station B's internal business, so its own data center should not be the problem.

If, as I suspect, they bet on one server vendor and the physical machines failed, data recovery may be slow. I used to work on big data, and I know backups are done as incremental plus full. Recovery goes smoothly when part of the data can be pulled from nodes in other regions, but if everything sits in one place, it becomes troublesome.


Whatever the final cause turns out to be, what we as engineers and companies should think about is how to avoid this kind of incident.

Data backup: Backups are a must; otherwise any natural disaster will hurt badly. That is why many cloud vendors now build data centers in places with few natural disasters, such as my home province of Guizhou, or at the bottom of lakes and the sea (it is cooler there, which lowers costs considerably).

Full and incremental backups should basically always be running. Continuous incremental backups every day, week and month, together with full backups on a regular schedule, greatly reduce losses. The only real fear is that the mechanical disks in every region fail at once (with off-site disaster recovery, data can be recovered from anything short of the destruction of the Earth).
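
To make the full + incremental idea concrete, here is a small Python sketch of how a restore might pick which backups to replay: the latest full backup at or before the target time, followed by every incremental taken after it. The `Backup` record and the `restore_chain` helper are hypothetical names I made up for illustration; real backup tools track far more metadata.

```python
from dataclasses import dataclass
from datetime import datetime


@dataclass(frozen=True)
class Backup:
    kind: str           # "full" or "incremental"
    taken_at: datetime


def restore_chain(backups, target):
    """Return the backups to replay, in order, to restore the state as of `target`."""
    usable = sorted(
        (b for b in backups if b.taken_at <= target),
        key=lambda b: b.taken_at,
    )
    fulls = [b for b in usable if b.kind == "full"]
    if not fulls:
        raise ValueError("no full backup at or before the target time")
    base = fulls[-1]  # most recent full backup
    incrementals = [
        b for b in usable
        if b.kind == "incremental" and b.taken_at > base.taken_at
    ]
    return [base] + incrementals
```

The longer the gap between full backups, the longer this chain gets and the slower the restore, which is one reason to take full backups on a fixed schedule.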

Tighten operations permissions: I am still worried about someone deleting the database and running away. I use rm -rf on servers quite often myself, but if servers can only be reached through a jump host (bastion), dangerous commands can be blocked there.
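
As a toy illustration of blocking commands at the jump host, the Python sketch below refuses the commands I mentioned earlier (rm with both recursive and force flags, fdisk, and SQL containing DROP). The `is_blocked` function and its denylist are hypothetical; real bastion products do this with their own policy engines, and a denylist like this is easy to bypass, so it complements least-privilege accounts rather than replacing them.

```python
import shlex

# Hypothetical denylist, based only on the commands named in this article.
BLOCKED_PROGRAMS = {"fdisk"}


def is_blocked(command_line):
    """Return True if the jump host should refuse to forward this command."""
    try:
        tokens = shlex.split(command_line)
    except ValueError:
        return True  # unparseable input (e.g. unbalanced quotes): refuse to be safe
    if not tokens:
        return False
    program = tokens[0].rsplit("/", 1)[-1]  # "/bin/rm" -> "rm"
    if program in BLOCKED_PROGRAMS:
        return True
    if program == "rm":
        args = tokens[1:]
        short_flags = "".join(
            a[1:] for a in args if a.startswith("-") and not a.startswith("--")
        )
        long_flags = {a for a in args if a.startswith("--")}
        recursive = "r" in short_flags or "R" in short_flags or "--recursive" in long_flags
        force = "f" in short_flags or "--force" in long_flags
        return recursive and force
    # Crude check for SQL passed to a client, e.g. mysql -e "DROP DATABASE shop".
    return any(word.upper() == "DROP" for t in tokens for word in t.split())
```

For example, `is_blocked("rm -rf /data")` is True, while `is_blocked("rm notes.txt")` is False.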

Moving to the cloud + cloud native: Cloud products are now very mature, and enterprises should have enough trust in their cloud vendor, provided they choose the right one. Product capabilities are only one benefit; cloud vendors also have disaster recovery and emergency response mechanisms for critical moments that many companies lack.

Cloud native is a technology that has only drawn attention in recent years. Docker + k8s, combined with the capabilities of cloud computing, can deliver unattended operation, dynamic scaling and emergency response. But adopting the technology has its own trial-and-error cost, and I do not know whether it suits a video-based system like station B.

In-house strength: Whether or not we move to the cloud, I do not think we should rely too heavily on cloud vendors. We still need our own core technology and emergency mechanisms. What if the cloud vendor really turns out to be unreliable? Enterprise engineers need to think about how to achieve genuine high availability.

For example, many cloud vendors carve one physical machine into multiple virtual machines and sell them separately, so several tenants end up sharing a single physical host. If one tenant is an e-commerce site running its Double 11 sale and another is a game company, the first can eat up so much network bandwidth that the second suffers packet loss, which is a very poor experience for game players. That is why I say not to trust and rely on cloud vendors too much.

If the neighboring tenant bought its instance for mining, it is even worse: it drains the compute and runs the machine at full load, which hurts even more.

Fortunately for station B, this problem was exposed early, and it happened late in the evening, so there should be plenty of low-traffic time to recover. As I write this, most pages are back, though I notice some are only partially restored.

In any case, I hope this kind of failure can be ruled out completely next time. I expect station B will be busy reworking its architecture for a long while to achieve genuine high availability.

I hope I can watch the dance section reliably in the evenings, instead of staring blankly at the 2233 girls on the 502 and 404 pages.
