Anti-avalanche weapon: the principle and application of the Hystrix circuit breaker



In a distributed system, the unavailability of a single basic service can make the whole system unavailable. This phenomenon is called the service avalanche effect. A common way to deal with a service avalanche is manual service degradation; the emergence of Hystrix gives us another option.

Definition of the service avalanche effect

The service avalanche effect is a phenomenon in which the unavailability of a service provider causes the unavailability of its service callers, and this unavailability is gradually amplified. As shown below:

(Figure: the avalanche propagating from service A to B, then on to C and D)

In the figure above, A is the service provider, B is a service caller of A, and C and D are service callers of B. When A's unavailability causes B's unavailability, and the unavailability is amplified to C and D, a service avalanche is formed.

Causes of the service avalanche effect

I simplify the participants in a service avalanche into service providers and service callers, and divide the avalanche process into the following three stages to analyze the causes:

  1. The service provider becomes unavailable

  2. Retries increase the traffic

  3. The service caller becomes unavailable

(Figure: the three stages of a service avalanche)

Each stage of a service avalanche can be triggered by different causes. For example, the causes of service provider unavailability include:

  • Hardware failure

  • Program bug

  • Cache breakdown

  • Large number of user requests

A hardware failure, such as a damaged server host going down or broken network hardware, can make the service provider unreachable.
Cache breakdown usually occurs when a cache application restarts and all cached entries are lost, or when a large number of entries expire within a short time. The resulting cache misses let requests hit the backend directly, overloading the service provider and making the service unavailable.
Before a flash sale or big promotion starts, if preparation is insufficient, the flood of user requests can likewise make the service provider unavailable.

The causes behind retries increasing the traffic include:

  • User retries

  • Retry logic in code

After the service provider becomes unavailable, users cannot bear waiting on the page and keep refreshing or resubmitting their requests.
Service callers also contain plenty of retry logic for handling service exceptions.
These retries further increase the request traffic.

Finally, the main cause of the service caller becoming unavailable is:

  • Resource exhaustion caused by synchronous waiting

When a service caller uses synchronous calls, a large number of waiting threads consume system resources. Once thread resources are exhausted, the services offered by the caller become unavailable as well, and the service avalanche effect takes hold.
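The resource-exhaustion scenario above can be reproduced with plain java.util.concurrent (a minimal sketch; the pool size and class names are illustrative): once every caller thread is parked in a synchronous wait on a dead provider, even a perfectly healthy request can no longer be served.

```java
import java.util.concurrent.*;

public class SyncWaitDemo {
    // Returns what happens to a fresh request once all caller threads
    // are blocked in synchronous waits on an unavailable provider.
    static String demo() throws Exception {
        ExecutorService callerPool = Executors.newFixedThreadPool(2); // small caller pool
        CountDownLatch providerDown = new CountDownLatch(1);          // stands in for a hung provider
        // Both worker threads call the "provider" synchronously and block.
        for (int i = 0; i < 2; i++) {
            callerPool.submit(() -> { providerDown.await(); return null; });
        }
        // A new, unrelated request can no longer get a thread.
        Future<String> healthy = callerPool.submit(() -> "ok");
        try {
            return healthy.get(200, TimeUnit.MILLISECONDS);
        } catch (TimeoutException e) {
            return "caller exhausted";   // the caller itself is now unavailable
        } finally {
            providerDown.countDown();
            callerPool.shutdownNow();
        }
    }

    public static void main(String[] args) throws Exception {
        System.out.println(demo());
    }
}
```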

Coping strategies for service avalanches

According to the different causes of service avalanche, different coping strategies can be used

  1. Flow control

  2. Improved cache usage

  3. Automatic service scaling

  4. Service degradation by the caller

The specific measures for flow control include:

  • Gateway current limiting

  • User interaction current limiting

  • Turning off retries

Thanks to Nginx's high performance, Nginx + Lua gateways are widely used by first-tier Internet companies for flow control, and OpenResty is becoming increasingly popular.

Specific measures for user-interaction rate limiting include: 1. using loading animations to increase the time users are willing to wait; 2. adding a forced waiting period to the submit button.

The measures for improved cache usage include:

  • Cache preloading

  • Refreshing the cache asynchronously instead of synchronously
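The two cache measures above can be sketched together with standard library types (the class and method names are my own, not from any cache library): preload warms the cache before traffic arrives, and get serves the cached value immediately while reloading in the background, so callers are never blocked on a slow refresh.

```java
import java.util.Map;
import java.util.concurrent.*;
import java.util.function.Function;

// Minimal sketch: a cache whose refreshes are asynchronous and whose hot keys
// can be preloaded, so a restart never serves cold-cache traffic to the backend.
public class AsyncRefreshCache<K, V> {
    private final Map<K, V> store = new ConcurrentHashMap<>();
    private final ExecutorService refresher = Executors.newSingleThreadExecutor();
    private final Function<K, V> loader;   // loads a value from the backing service

    public AsyncRefreshCache(Function<K, V> loader) { this.loader = loader; }

    // Preload hot keys before traffic arrives.
    public void preload(Iterable<K> hotKeys) {
        for (K k : hotKeys) store.put(k, loader.apply(k));
    }

    // Serve the cached value immediately; trigger the (possibly slow) reload in background.
    public V get(K key) {
        V current = store.get(key);
        refresher.submit(() -> store.put(key, loader.apply(key)));
        return current;
    }

    public void shutdown() { refresher.shutdown(); }
}
```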

The main measure for automatic service scaling is:

  • AWS Auto Scaling

The measures for service degradation by the caller include:

  • Resource isolation

  • Classifying dependent services

  • Failing fast on calls to unavailable services

Resource isolation mainly means isolating the thread pools used to call each service.

Depending on the business, dependent services are classified as strong or weak dependencies. The unavailability of a strongly depended-on service halts the current business flow, while the unavailability of a weakly depended-on service does not.

Failing fast on calls to unavailable services is usually achieved with a timeout mechanism, a circuit breaker, and post-circuit-breaking degradation logic.
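A minimal sketch of fast failure with a timeout plus degradation, using only java.util.concurrent (the helper method and its parameters are illustrative, not a Hystrix API): the caller waits a bounded amount of time and then returns a degraded result instead of blocking indefinitely.

```java
import java.util.concurrent.*;

public class FastFail {
    // Bound the synchronous wait with a timeout; on timeout or failure,
    // return the degraded fallback instead of propagating the wait.
    static String callWithTimeout(Callable<String> service, long timeoutMs, String fallback) {
        ExecutorService pool = Executors.newSingleThreadExecutor();
        Future<String> f = pool.submit(service);
        try {
            return f.get(timeoutMs, TimeUnit.MILLISECONDS);
        } catch (Exception e) {      // timeout or service failure -> degrade
            f.cancel(true);
            return fallback;
        } finally {
            pool.shutdownNow();
        }
    }

    public static void main(String[] args) {
        // A healthy service answers in time; a hung service degrades quickly.
        System.out.println(callWithTimeout(() -> "real result", 200, "degraded"));
        System.out.println(callWithTimeout(() -> { Thread.sleep(10_000); return "late"; },
                                           200, "degraded"));
    }
}
```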

Using Hystrix to prevent service avalanches

Hystrix [hɪˈstrɪks] means porcupine. Because its back is covered with quills, a porcupine can protect itself; the Hystrix framework likewise has the ability to protect a system.

The design principles of hystrix include:

  • Resource isolation

  • Circuit breaking

  • Command pattern

Resource isolation

To prevent a leak or fire from spreading, a cargo ship divides its hold into several compartments, as shown below:

(Figure: a cargo ship's hold divided into watertight compartments)

This way of reducing risk through resource isolation is called the bulkhead pattern.
Hystrix applies the same pattern to service callers.

In a heavily service-oriented system, a piece of business logic usually depends on multiple services. For example:
a product detail service may depend on a product service, a price service, and a product review service.

(Figure: the product detail service depends on the product, price, and review services)

Calls to the three dependent services share the product detail service's thread pool. If the product review service becomes unavailable, all threads in the pool end up blocked waiting for its responses, causing a service avalanche.

(Figure: a shared thread pool fully blocked by the unavailable review service)

Hystrix avoids the service avalanche by allocating an independent thread pool to each dependent service.
As shown below, when the product review service is unavailable, even if all 20 threads independently allocated to it are stuck in synchronous waits, calls to the other dependent services are not affected.
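The bulkhead effect can be sketched with two independent fixed-size pools (pool names and sizes are illustrative): saturating the review service's pool leaves the price service's pool untouched, so the price call still answers.

```java
import java.util.concurrent.*;

// Sketch of the bulkhead idea: each dependency gets its own bounded pool,
// so a hung dependency exhausts only its own threads.
public class Bulkhead {
    static final ExecutorService reviewPool = Executors.newFixedThreadPool(2);
    static final ExecutorService pricePool  = Executors.newFixedThreadPool(2);

    static String demo() throws Exception {
        CountDownLatch reviewDown = new CountDownLatch(1);   // review service hangs
        // Saturate the review pool -- and only the review pool.
        for (int i = 0; i < 2; i++)
            reviewPool.submit(() -> { reviewDown.await(); return null; });
        // The price service still has free threads and answers normally.
        Future<String> price = pricePool.submit(() -> "price ok");
        String result = price.get(200, TimeUnit.MILLISECONDS);
        reviewDown.countDown();
        reviewPool.shutdownNow();
        pricePool.shutdownNow();
        return result;
    }

    public static void main(String[] args) throws Exception {
        System.out.println(demo());
    }
}
```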

(Figure: independent thread pools isolate the unavailable review service)

Circuit breaker pattern

The circuit breaker pattern defines the logic for switching the breaker between its states.

(Figure: the circuit breaker state machine)

Failure rate of a service = number of failed requests / total number of requests.
Whether the circuit breaker switches from closed to open is determined by comparing the current failure rate with a configured threshold.

  1. When the circuit breaker is closed, requests are allowed through. If the current failure rate stays below the configured threshold, the breaker remains closed; once the failure rate exceeds the threshold, the breaker switches to the open state.

  2. When the circuit breaker is open, requests are rejected.

  3. After the breaker has been open for a while, it automatically enters the half-open state and allows exactly one trial request through. If that request succeeds, the breaker returns to the closed state; if it fails, the breaker stays open and subsequent requests continue to be rejected.

The circuit breaker's switching ensures that when calling a failing service, the caller returns a result quickly instead of piling up synchronous waits. Moreover, because the breaker re-tests a request after a while, it offers a path for the service call to recover.
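The closed/open/half-open logic above can be sketched as a small state machine (a toy illustration, not Hystrix's implementation; the threshold, the sleep window, and passing time in explicitly are all illustrative choices):

```java
// Minimal sketch of the closed / open / half-open circuit breaker state machine.
public class CircuitBreaker {
    enum State { CLOSED, OPEN, HALF_OPEN }

    private State state = State.CLOSED;
    private int total, failed;             // naive counters (a real breaker uses a sliding window)
    private long openedAt;
    private final double errorThreshold;   // e.g. 0.5 = open at a 50% failure rate
    private final long sleepWindowMs;      // how long to stay open before half-opening

    CircuitBreaker(double errorThreshold, long sleepWindowMs) {
        this.errorThreshold = errorThreshold;
        this.sleepWindowMs = sleepWindowMs;
    }

    synchronized boolean allowRequest(long now) {
        if (state == State.OPEN && now - openedAt >= sleepWindowMs) {
            state = State.HALF_OPEN;       // let exactly one trial request through
            return true;
        }
        return state != State.OPEN;        // closed and half-open allow requests
    }

    synchronized void record(boolean success, long now) {
        if (state == State.HALF_OPEN) {    // the trial request decides the next state
            if (success) { state = State.CLOSED; total = 0; failed = 0; }
            else { state = State.OPEN; openedAt = now; }
            return;
        }
        total++;
        if (!success) failed++;
        if (total > 0 && (double) failed / total >= errorThreshold) {
            state = State.OPEN;            // failure rate crossed the threshold
            openedAt = now;
        }
    }

    synchronized State state() { return state; }
}
```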

Command pattern

Hystrix uses the command pattern (subclassing HystrixCommand) to wrap the concrete service-call logic (the run method), and adds degradation logic (getFallback) for when the call fails.
In the command's constructor, we can also configure the thread pool and circuit breaker parameters for the service.

public class Service1HystrixCommand extends HystrixCommand&lt;Response&gt; {
  private final Service1 service;
  private final Request request;

  public Service1HystrixCommand(Service1 service, Request request) {
    // The group key name below is illustrative; the property values are from the original text
    super(Setter.withGroupKey(HystrixCommandGroupKey.Factory.asKey("Service1Group"))
        .andThreadPoolPropertiesDefaults(HystrixThreadPoolProperties.Setter()
            .withCoreSize(20))                                    // size of the service's thread pool
        .andCommandPropertiesDefaults(HystrixCommandProperties.Setter()
            .withCircuitBreakerErrorThresholdPercentage(60)       // failure rate at which the breaker opens
            .withCircuitBreakerSleepWindowInMilliseconds(3000))); // how long the breaker stays open before half-opening
    this.service = service;
    this.request = request;
  }

  @Override
  protected Response run() {
    return service.call(request); // the actual service invocation (method name illustrative)
  }

  @Override
  protected Response getFallback() {
    return Response.dummy();      // degraded result returned when the call fails
  }
}
After the service call is wrapped in a command and invoked with execute(), it is protected by the circuit breaker and the thread pool.
(Figure: a service call wrapped with a circuit breaker and its own thread pool)

Internal processing logic of hystrix

The following figure shows the internal logic of a Hystrix service call:
(Figure: Hystrix's internal processing flow)

  1. Construct the Hystrix command object and call its execution method.

  2. Hystrix checks whether the circuit breaker for the current service is open. If it is open, the degradation method getFallback is executed.

  3. If the circuit breaker is closed, Hystrix checks whether the current service's thread pool can accept new requests. If the pool is full, getFallback is executed.

  4. If the thread pool accepts the request, Hystrix runs the service call's concrete logic in the run method.

  5. If the service call fails, getFallback is executed and the result is reported to Metrics to update the service's health status.

  6. If the service call times out, getFallback is executed and the result is reported to Metrics to update the service's health status.

  7. If the service call succeeds, the normal result is returned.

  8. If the degradation method getFallback succeeds, the degraded result is returned.

  9. If getFallback fails, an exception is thrown.
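The decision flow above can be condensed into a sketch (illustrative, not Hystrix's real code; the timeout value and method shape are my own): an open breaker or a full thread pool short-circuits to the fallback, and a bounded get() turns failures and timeouts into the fallback as well.

```java
import java.util.concurrent.*;

public class MiniCommand {
    // breakerOpen stands in for the circuit breaker check (step 2).
    static String execute(boolean breakerOpen, ExecutorService pool,
                          Callable<String> run, String fallback) {
        if (breakerOpen) return fallback;             // step 2: open breaker -> fallback
        Future<String> f;
        try {
            f = pool.submit(run);                     // step 3: try to get a pool thread
        } catch (RejectedExecutionException e) {
            return fallback;                          // pool full -> fallback
        }
        try {
            return f.get(200, TimeUnit.MILLISECONDS); // step 4: run with a timeout
        } catch (Exception e) {                       // steps 5-6: failure or timeout
            f.cancel(true);
            // (a real implementation would also report the outcome to metrics here)
            return fallback;
        }
    }

    public static void main(String[] args) {
        ExecutorService pool = Executors.newFixedThreadPool(1);
        System.out.println(execute(true,  pool, () -> "ok", "fallback")); // breaker open
        System.out.println(execute(false, pool, () -> "ok", "fallback")); // normal path
        pool.shutdownNow();
    }
}
```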

Implementation of Hystrix Metrics

Hystrix's Metrics stores the health status of the current service, including the total number of calls and the number of failed calls. From these counts, the circuit breaker computes the current failure rate and compares it with the configured threshold to drive its state transitions, so the implementation of Metrics is very important.

Sliding window implementation before 1.4

In these versions, Hystrix used a custom sliding-window data structure to count the various events (success, failure, timeout, thread pool rejection, etc.) within the current time window.
When an event occurred, the data structure decided, based on the current time, whether to reuse an old bucket or create a new one, and then updated the counters inside that bucket.
These updates ran concurrently on multiple threads, so the code contained many lock operations and the logic was complex.
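A simplified version of such a bucketed sliding window might look like this (a sketch of the idea, not Hystrix's actual data structure; here a single lock replaces its finer-grained concurrency control):

```java
// Time is divided into fixed-size buckets arranged in a ring; an event increments
// the bucket for "now", and a read sums only the buckets still inside the window.
public class SlidingWindowCounter {
    private final long bucketMs;
    private final int numBuckets;
    private final long[] bucketStart;  // start timestamp of the slot each bucket covers
    private final int[] counts;

    SlidingWindowCounter(long bucketMs, int numBuckets) {
        this.bucketMs = bucketMs;
        this.numBuckets = numBuckets;
        this.bucketStart = new long[numBuckets];
        this.counts = new int[numBuckets];
    }

    // The "reuse an old bucket or create a new one" decision described above:
    // a bucket whose slot timestamp is stale is reset before counting.
    synchronized void increment(long now) {
        int idx = (int) ((now / bucketMs) % numBuckets);
        long slotStart = (now / bucketMs) * bucketMs;
        if (bucketStart[idx] != slotStart) {   // stale bucket: start a new one
            bucketStart[idx] = slotStart;
            counts[idx] = 0;
        }
        counts[idx]++;
    }

    // Sum only buckets that fall inside the window ending at `now`.
    synchronized int sum(long now) {
        long windowStart = now - bucketMs * numBuckets;
        int total = 0;
        for (int i = 0; i < numBuckets; i++)
            if (bucketStart[i] > windowStart) total += counts[i];
        return total;
    }
}
```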

(Figure: the pre-1.4 sliding window bucket structure)

Sliding window implementation from 1.5 onward

Starting with these versions, Hystrix uses RxJava's Observable.window() to implement the sliding window.
RxJava's window operator uses a background thread to create new buckets, which avoids the problem of concurrent bucket creation.
At the same time, RxJava's single-threaded, lock-free characteristics ensure thread safety when the counts change, making the code much more concise.
Below is a simple sliding-window metrics implementation using RxJava's window method. It completes the statistics in just a few lines of code, which is enough to demonstrate the power of RxJava:

public void timeWindowTest() throws Exception {
  // Emit a random 0/1 event every 50 ms
  Observable&lt;Integer&gt; source = Observable.interval(50, TimeUnit.MILLISECONDS)
      .map(i -> RandomUtils.nextInt(2));
  // Slice the stream into 1-second windows and count each event type per window
  source.window(1, TimeUnit.SECONDS).subscribe(window -> {
    int[] metrics = new int[2];
    window.subscribe(i -> metrics[i]++,
        () -> System.out.println("window metrics:" + JSON.toJSONString(metrics)));
  });
  TimeUnit.SECONDS.sleep(3); // let a few windows complete before returning
}

By using Hystrix, we can easily guard against the avalanche effect and give the system the ability to degrade automatically and to recover service automatically.

Original text:How to Build a Cybersecurity Career How to build the cause of network security Normative guidelines for building a successful career in the field of information security fromDaniel miesslerstayinformation safetyCreated / updated: December 17, 2019 I’ve been doing itinformation safety(now many people call it network security) it’s been about 20 years, and I’ve spent […]