Some thoughts on understanding high availability


Source: this article comes from the WeChat public account "originality zero".


In today's Internet era, services must stay highly available even under the pressure of high concurrency. If a service is not highly available, it means:

  • The system cannot serve users 24 hours a day, 7 days a week, so the user experience is poor, and users who hit an outage may not come back.
  • Downtime damages the company's image. For companies such as BAT (Baidu, Alibaba, Tencent), technical reliability is part of the brand.
  • Most importantly, downtime directly costs money, and the losses accrue by the second. As I vaguely recall, Ctrip went down on May 28, 2015; based on the figures in Ctrip's first-quarter financial report, the outage cost an average of US $1.0648 million per hour.

High availability is a very complex topic, my own knowledge is limited, and I cannot cover all of it. What follows is only my own thinking about and understanding of high availability.

So how do we make a system highly available?

We cannot prevent servers from crashing or services from going down. So how do we make these unavoidable failures harmless? That is, machines may crash and services may fail, yet the system as a whole still serves requests.

First, if there are many machines and many service instances, losing a few of them does no harm, and the unavoidable failures are absorbed. Let us analyze this step by step. If a machine stores some specific value that only it holds, it cannot simply be replaced or scaled out, because requests that need that value must go to that particular machine, which may be the one that is down. Machines themselves are the easy part: a machine with the same configuration can replace a failed one.

The same should hold for application services. A service instance should not keep state or instance-specific data on itself; if it does, it cannot be scaled out easily. Only when every instance is identical can instances be replaced and added freely. This is called a stateless service.

Once services are stateless, how does the system learn dynamically that an instance has gone down? Otherwise requests will still be sent to the failed machine. And how are they redirected to a new machine? This calls for service registration and discovery.
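
As a rough illustration of registration and discovery, here is a minimal in-memory sketch. The class `ServiceRegistry`, its methods, and the lease-style `ttl_seconds` parameter are all hypothetical names chosen for this example; a real deployment would normally rely on a dedicated registry service rather than code like this.

```python
import time


class ServiceRegistry:
    """Toy in-memory registry (hypothetical): instances register and must
    renew before their lease (TTL) expires, or they drop out of lookups."""

    def __init__(self, ttl_seconds=10.0, clock=time.monotonic):
        self._ttl = ttl_seconds
        self._clock = clock          # injectable so the sketch is easy to test
        self._last_seen = {}         # (service, address) -> time of last renewal

    def register(self, service, address):
        self._last_seen[(service, address)] = self._clock()

    # In this sketch, renewing a lease is the same operation as registering.
    renew = register

    def deregister(self, service, address):
        self._last_seen.pop((service, address), None)

    def lookup(self, service):
        """Return the addresses whose last renewal is still within the TTL."""
        now = self._clock()
        return sorted(
            addr
            for (name, addr), seen in self._last_seen.items()
            if name == service and now - seen <= self._ttl
        )
```

Callers ask `lookup("pay")` for the current instances instead of hard-coding addresses, so an instance that stops renewing simply disappears from the results.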

With the above in place, the common cases are handled. But the Internet is complicated. So far we have covered broken machines and failed services; what about a brief network failure?

So there should be heartbeat detection: periodically check whether an instance can be reached. Whether the machine is broken, the service is down, or the network is blocked, the instance is unreachable either way, and service registration and discovery can handle that.

But sometimes the network drops only for an instant. What does that look like? Suppose service A has just sent a request to service B, and B has received it. The network then drops. B finishes its processing, but A never receives the response, and the front end times out. If the user triggers the request again and B runs the same logic a second time, is that a problem? For example, 200 yuan has already been paid; may we charge another 200 yuan? This is where idempotency comes in. An idempotent operation produces the same result no matter how many times it is executed. With an idempotent design this situation is harmless: if there is no response, simply retry, and nothing goes wrong.
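
The 200 yuan example can be sketched with a request identifier that the caller reuses on every retry. This is only one common way to get idempotency, and the names `PaymentService`, `pay`, and `request_id` are assumptions made for the illustration, not something the article prescribes.

```python
class PaymentService:
    """Toy payment handler (hypothetical) that is safe to retry."""

    def __init__(self):
        self.total_charged = 0
        self._results = {}  # request_id -> result of the first execution

    def pay(self, request_id, amount):
        # A retry with the same request_id returns the stored result
        # instead of charging a second time.
        if request_id in self._results:
            return self._results[request_id]
        self.total_charged += amount
        result = {"request_id": request_id, "charged": amount}
        self._results[request_id] = result
        return result
```

If service A times out and calls `pay("order-1", 200)` again, service B returns the original result and the total charged stays at 200. In a real system the stored results would have to live in shared, durable storage so that any instance can see them.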

With all of the above, machine failures, service failures, and network outages or blips are largely handled. But Internet traffic today is highly concurrent. How do we increase the system's capacity under high concurrency?

It is like moving furniture: one person is slow, and more helpers get it done faster. Because the architecture above lets us add machines and service instances, the obvious move is to add more of them; more machines will generally be faster than fewer. Say there are five machines and a flood of requests arrives. What strategy distributes them across the machines? This is load balancing, which can be done with hardware devices or at the software level. Either way it depends on service registration and discovery, otherwise there is no way to learn about node changes dynamically. It is also a natural place for controls such as blacklists and whitelists and access-frequency limits.

Adding machines may look unsophisticated, but it is often quite effective. Still, we cannot add machines forever, and in some cases adding machines does not solve the problem.
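
One of the simplest distribution strategies is round robin. The sketch below assumes the node list is refreshed from service discovery through a hypothetical `update_nodes` method; the class name `RoundRobinBalancer` is likewise made up for this example, and other strategies (weighted, least connections, hashing) are equally valid choices.

```python
import itertools


class RoundRobinBalancer:
    """Hands out nodes in turn; the node list can be swapped at any time."""

    def __init__(self, nodes):
        self._nodes = list(nodes)
        self._counter = itertools.count()

    def update_nodes(self, nodes):
        # Called when service discovery reports a change in live nodes.
        self._nodes = list(nodes)

    def pick(self):
        if not self._nodes:
            raise RuntimeError("no available nodes")
        return self._nodes[next(self._counter) % len(self._nodes)]
```

When a node fails, calling `update_nodes` with the remaining nodes is enough for subsequent `pick()` calls to skip it.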

More machines do make things faster. But if a call inside a service blocks, adding instances does not help, so we must pay attention to service timeouts. Since the service is idempotent, executing a call again does no harm. With a timeout in place, a stuck call (the downstream service is down, a thread is deadlocked, the downstream service is busy, and so on) does not hold up later requests for long.
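
A minimal sketch of bounding how long a caller waits, using the standard `concurrent.futures` module. The helper name `call_with_timeout` and the `fallback` parameter are assumptions for this example. Note the limitation: the caller stops waiting, but the worker thread itself keeps running until the slow call returns, so a real client should also set timeouts on the underlying network call.

```python
import concurrent.futures

# Shared worker pool for outbound calls (size chosen arbitrarily for the sketch).
_pool = concurrent.futures.ThreadPoolExecutor(max_workers=4)


def call_with_timeout(func, timeout_seconds, fallback=None):
    """Run func in the pool and wait at most timeout_seconds for its result."""
    future = _pool.submit(func)
    try:
        return future.result(timeout=timeout_seconds)
    except concurrent.futures.TimeoutError:
        # cancel() only has an effect if the task has not started yet;
        # a task that is already running continues in the background.
        future.cancel()
        return fallback
```

Because the downstream operation is assumed to be idempotent, the caller can safely retry after receiving the fallback.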

On synchronous versus asynchronous design: business scenarios that must execute in strict order have to be synchronous. Where ordering is not required, asynchronous processing supports more concurrency than synchronous processing. (Because a request passes through middleware and extra steps, a single request is not necessarily faster end to end, but seen as a whole the number of concurrent requests the system can handle is much larger.)

Within a single service, asynchrony means multithreading. Multithreading can improve CPU utilization and system performance, but it costs much more to implement. How can calls between different services be made asynchronous? With message-oriented middleware. This is hard: first, the processing must be truly asynchronous; second, messages must be neither duplicated nor lost, which is really difficult, especially with large data volumes. For network I/O in particular the asynchronous model deserves attention, and Netty encapsulates it very well.
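
To make the "no duplicates" half of that requirement concrete, here is a small in-process sketch in which a standard-library `queue.Queue` stands in for the message middleware. The function `start_consumer` and the `(msg_id, payload)` message shape are assumptions for this example. It only shows a consumer skipping messages it has already seen; it does not address message loss, which needs acknowledgements and persistence in a real broker.

```python
import queue
import threading


def start_consumer(q, handled, seen_ids):
    """Start a background thread that handles each message id at most once.

    Messages are (msg_id, payload) tuples; None tells the consumer to stop.
    """

    def run():
        while True:
            msg = q.get()
            try:
                if msg is None:
                    return
                msg_id, payload = msg
                if msg_id not in seen_ids:   # skip redelivered duplicates
                    seen_ids.add(msg_id)
                    handled.append(payload)
            finally:
                q.task_done()

    worker = threading.Thread(target=run, daemon=True)
    worker.start()
    return worker
```

The producer returns as soon as `q.put(...)` completes, which is what makes the interaction asynchronous from its point of view.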

Every machine and service has a ceiling. What should we do when traffic arrives like a flood and exceeds what the system can handle?

This problem shows up everywhere in daily life. The National Day holiday has just ended, and anyone who traveled home or went sightseeing saw it all the time. At a security checkpoint, for example, a guard holds up a sign and lets a batch of people through, makes those behind wait, and lets the next batch in once the checkpoint has nearly cleared. But passengers on higher-class trains, or on trains that are about to depart, are usually let through first.

In software architecture this is called rate limiting and service degradation. There are generally two control strategies: (1) reject some of the requests, and (2) shut down some of the services. Shutting down services may have been common in the past but is not recommended now (after all, staying up also reflects the company's technical strength); the current focus is on rejecting some requests. Where should the control be added? Wherever control is needed, which means it should be added at every layer.
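
Rejecting part of the requests is often implemented with a token bucket; that is one common choice, not something the article specifies. In the sketch below the class `TokenBucket` and its parameters `rate_per_second` and `capacity` are hypothetical, and the code is not thread-safe as written.

```python
import time


class TokenBucket:
    """Allows bursts up to `capacity` and a steady `rate_per_second` after that."""

    def __init__(self, rate_per_second, capacity, clock=time.monotonic):
        self._rate = rate_per_second
        self._capacity = capacity
        self._clock = clock              # injectable so the sketch is testable
        self._tokens = float(capacity)   # start with a full bucket
        self._last = clock()

    def allow(self):
        """Return True if the request may proceed, False if it should be rejected."""
        now = self._clock()
        # Refill according to the elapsed time, never above capacity.
        refill = (now - self._last) * self._rate
        self._tokens = min(self._capacity, self._tokens + refill)
        self._last = now
        if self._tokens >= 1:
            self._tokens -= 1
            return True
        return False
```

A request handler would call `allow()` first and return a "busy, try again later" response when it gets `False`, much like the guard asking people to wait.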

I vaguely remember an industry saying: the three magic weapons of high concurrency and high availability are rate limiting, degradation, and caching. Caching is probably the one you work with most. Internet workloads are typically read-heavy and write-light, which makes them well suited to caching.
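
For a read-heavy workload, a simple pattern is to check the cache first and fall back to the data source only on a miss. The sketch below is an in-process illustration under that assumption; `ReadThroughCache`, `load_from_source`, and `ttl_seconds` are hypothetical names, and a production system would more likely use a shared cache service.

```python
import time


class ReadThroughCache:
    """Serves repeated reads from memory and reloads entries after a TTL."""

    def __init__(self, load_from_source, ttl_seconds=60.0, clock=time.monotonic):
        self._load = load_from_source    # e.g. a database query function
        self._ttl = ttl_seconds
        self._clock = clock
        self._entries = {}               # key -> (value, expiry time)

    def get(self, key):
        now = self._clock()
        entry = self._entries.get(key)
        if entry is not None and entry[1] > now:
            return entry[0]              # cache hit
        value = self._load(key)          # cache miss or expired entry
        self._entries[key] = (value, now + self._ttl)
        return value

    def invalidate(self, key):
        # Call after a write so the next read reloads fresh data.
        self._entries.pop(key, None)
```

With many reads per write, most `get` calls never reach the data source, which is where the capacity gain comes from.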

When everything lives in a single service, it is hard to scale individual kinds of requests. Within the same service, some interfaces are called very often and others relatively rarely. By continually splitting the service apart, concurrency can be raised further.

Microservices: there are many ideas under the microservice umbrella. The first is vertical splitting, which is easy to understand. After that there may be many vertical business lines, which then need further horizontal splitting. (All splitting should be based on your company's business; the more deeply you understand the business, the better.)

With the above, services may go down, machines may break, and the network may be blocked or flaky, yet these problems are handled, and concurrency is increased to keep the service as available as possible. But these changes bring many new problems of their own, which we also need to solve:

  • Within a single service, transaction control used to be easy. After moving to microservices, transaction control becomes especially important. In many cases strong consistency is out of reach, but eventual consistency is good enough.
  • Call-chain monitoring is especially important, together with alerting.
  • Distributed logging is also very important.
  • Advanced use of tools such as jstack and BTrace is especially important in production.

Concluding remarks

My knowledge is limited, and some of my understanding is bound to be off. If you find mistakes, please point them out. Thank you!
