Anti-avalanche weapon: the principle and application of the Hystrix circuit breaker



In a distributed system, the unavailability of a single basic service can make the whole system unavailable. This phenomenon is called the service avalanche effect. A common way to deal with a service avalanche is manual service degradation; the emergence of Hystrix gives us another option.

Definition of the service avalanche effect

The service avalanche effect is a phenomenon in which the unavailability of a service provider causes the unavailability of its service callers, and this unavailability is gradually amplified. As shown below:

(Figure: the avalanche propagating from service A to B, then on to C and D)

In the figure above, A is the service provider, B is a service caller of A, and C and D are service callers of B. When A's unavailability causes B's unavailability, and the unavailability is amplified to C and D, a service avalanche is formed.

Causes of the service avalanche effect

I simplify the participants in a service avalanche into service providers and service callers, and divide the avalanche process into the following three stages to analyze the causes:

  1. The service provider becomes unavailable

  2. Retries increase the traffic

  3. The service caller becomes unavailable

(Figure: the three stages of a service avalanche)

Each stage of a service avalanche can be triggered by different causes. For example, the causes of service provider unavailability include:

  • Hardware failure

  • Program bug

  • Cache breakdown

  • Large number of user requests

A hardware failure, such as a damaged server host going down or broken network hardware, can make the service provider unreachable.
Cache breakdown usually occurs when a cache application restarts and all cached entries are lost, or when a large number of entries expire within a short time. The resulting cache misses let requests hit the backend directly, overloading the service provider and making the service unavailable.
Before a flash sale or big promotion starts, if preparation is insufficient, the flood of user requests can likewise make the service provider unavailable.

The causes behind retries increasing the traffic include:

  • User retries

  • Retry logic in code

After the service provider becomes unavailable, users cannot bear waiting on the page and keep refreshing or resubmitting their requests.
Service callers also contain plenty of retry logic for handling service exceptions.
These retries further increase the request traffic.

Finally, the main cause of the service caller becoming unavailable is:

  • Resource exhaustion caused by synchronous waiting

When a service caller uses synchronous calls, a large number of waiting threads consume system resources. Once thread resources are exhausted, the services offered by the caller become unavailable as well, and the service avalanche effect takes hold.
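The resource-exhaustion scenario above can be reproduced with plain java.util.concurrent (a minimal sketch; the pool size and class names are illustrative): once every caller thread is parked in a synchronous wait on a dead provider, even a perfectly healthy request can no longer be served.

```java
import java.util.concurrent.*;

public class SyncWaitDemo {
    // Returns what happens to a fresh request once all caller threads
    // are blocked in synchronous waits on an unavailable provider.
    static String demo() throws Exception {
        ExecutorService callerPool = Executors.newFixedThreadPool(2); // small caller pool
        CountDownLatch providerDown = new CountDownLatch(1);          // stands in for a hung provider
        // Both worker threads call the "provider" synchronously and block.
        for (int i = 0; i < 2; i++) {
            callerPool.submit(() -> { providerDown.await(); return null; });
        }
        // A new, unrelated request can no longer get a thread.
        Future<String> healthy = callerPool.submit(() -> "ok");
        try {
            return healthy.get(200, TimeUnit.MILLISECONDS);
        } catch (TimeoutException e) {
            return "caller exhausted";   // the caller itself is now unavailable
        } finally {
            providerDown.countDown();
            callerPool.shutdownNow();
        }
    }

    public static void main(String[] args) throws Exception {
        System.out.println(demo());
    }
}
```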

Coping strategies for service avalanches

According to the different causes of service avalanche, different coping strategies can be used

  1. Flow control

  2. Improved cache usage

  3. Automatic service scaling

  4. Service degradation by the caller

The specific measures for flow control include:

  • Gateway current limiting

  • User interaction current limiting

  • Turning off retries

Thanks to Nginx's high performance, Nginx + Lua gateways are widely used by first-tier Internet companies for flow control, and OpenResty is becoming increasingly popular.

Specific measures for user-interaction rate limiting include: 1. using loading animations to increase the time users are willing to wait; 2. adding a forced waiting period to the submit button.

The measures for improved cache usage include:

  • Cache preloading

  • Refreshing the cache asynchronously instead of synchronously
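The two cache measures above can be sketched together with standard library types (the class and method names are my own, not from any cache library): preload warms the cache before traffic arrives, and get serves the cached value immediately while reloading in the background, so callers are never blocked on a slow refresh.

```java
import java.util.Map;
import java.util.concurrent.*;
import java.util.function.Function;

// Minimal sketch: a cache whose refreshes are asynchronous and whose hot keys
// can be preloaded, so a restart never serves cold-cache traffic to the backend.
public class AsyncRefreshCache<K, V> {
    private final Map<K, V> store = new ConcurrentHashMap<>();
    private final ExecutorService refresher = Executors.newSingleThreadExecutor();
    private final Function<K, V> loader;   // loads a value from the backing service

    public AsyncRefreshCache(Function<K, V> loader) { this.loader = loader; }

    // Preload hot keys before traffic arrives.
    public void preload(Iterable<K> hotKeys) {
        for (K k : hotKeys) store.put(k, loader.apply(k));
    }

    // Serve the cached value immediately; trigger the (possibly slow) reload in background.
    public V get(K key) {
        V current = store.get(key);
        refresher.submit(() -> store.put(key, loader.apply(key)));
        return current;
    }

    public void shutdown() { refresher.shutdown(); }
}
```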

The main measure for automatic service scaling is:

  • AWS Auto Scaling

The measures for service degradation by the caller include:

  • Resource isolation

  • Classifying dependent services

  • Failing fast on calls to unavailable services

Resource isolation mainly means isolating the thread pools used to call each service.

Depending on the business, dependent services are classified as strong or weak dependencies. The unavailability of a strongly depended-on service halts the current business flow, while the unavailability of a weakly depended-on service does not.

Failing fast on calls to unavailable services is usually achieved with a timeout mechanism, a circuit breaker, and post-circuit-breaking degradation logic.
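A minimal sketch of fast failure with a timeout plus degradation, using only java.util.concurrent (the helper method and its parameters are illustrative, not a Hystrix API): the caller waits a bounded amount of time and then returns a degraded result instead of blocking indefinitely.

```java
import java.util.concurrent.*;

public class FastFail {
    // Bound the synchronous wait with a timeout; on timeout or failure,
    // return the degraded fallback instead of propagating the wait.
    static String callWithTimeout(Callable<String> service, long timeoutMs, String fallback) {
        ExecutorService pool = Executors.newSingleThreadExecutor();
        Future<String> f = pool.submit(service);
        try {
            return f.get(timeoutMs, TimeUnit.MILLISECONDS);
        } catch (Exception e) {      // timeout or service failure -> degrade
            f.cancel(true);
            return fallback;
        } finally {
            pool.shutdownNow();
        }
    }

    public static void main(String[] args) {
        // A healthy service answers in time; a hung service degrades quickly.
        System.out.println(callWithTimeout(() -> "real result", 200, "degraded"));
        System.out.println(callWithTimeout(() -> { Thread.sleep(10_000); return "late"; },
                                           200, "degraded"));
    }
}
```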

Using Hystrix to prevent service avalanches

Hystrix [hɪˈstrɪks] means porcupine. Because its back is covered with quills, a porcupine can protect itself; the Hystrix framework likewise has the ability to protect a system.

The design principles of hystrix include:

  • Resource isolation

  • Circuit breaking

  • Command pattern

Resource isolation

To prevent a leak or fire from spreading, a cargo ship divides its hold into several compartments, as shown below:

(Figure: a cargo ship's hold divided into watertight compartments)

This way of reducing risk through resource isolation is called the bulkhead pattern.
Hystrix applies the same pattern to service callers.

In a heavily service-oriented system, a piece of business logic usually depends on multiple services. For example:
a product detail service may depend on a product service, a price service, and a product review service.

(Figure: the product detail service depends on the product, price, and review services)

Calls to the three dependent services share the product detail service's thread pool. If the product review service becomes unavailable, all threads in the pool end up blocked waiting for its responses, causing a service avalanche.

(Figure: a shared thread pool fully blocked by the unavailable review service)

Hystrix avoids the service avalanche by allocating an independent thread pool to each dependent service.
As shown below, when the product review service is unavailable, even if all 20 threads independently allocated to it are stuck in synchronous waits, calls to the other dependent services are not affected.
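The bulkhead effect can be sketched with two independent fixed-size pools (pool names and sizes are illustrative): saturating the review service's pool leaves the price service's pool untouched, so the price call still answers.

```java
import java.util.concurrent.*;

// Sketch of the bulkhead idea: each dependency gets its own bounded pool,
// so a hung dependency exhausts only its own threads.
public class Bulkhead {
    static final ExecutorService reviewPool = Executors.newFixedThreadPool(2);
    static final ExecutorService pricePool  = Executors.newFixedThreadPool(2);

    static String demo() throws Exception {
        CountDownLatch reviewDown = new CountDownLatch(1);   // review service hangs
        // Saturate the review pool -- and only the review pool.
        for (int i = 0; i < 2; i++)
            reviewPool.submit(() -> { reviewDown.await(); return null; });
        // The price service still has free threads and answers normally.
        Future<String> price = pricePool.submit(() -> "price ok");
        String result = price.get(200, TimeUnit.MILLISECONDS);
        reviewDown.countDown();
        reviewPool.shutdownNow();
        pricePool.shutdownNow();
        return result;
    }

    public static void main(String[] args) throws Exception {
        System.out.println(demo());
    }
}
```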

(Figure: independent thread pools isolate the unavailable review service)

Circuit breaker pattern

The circuit breaker pattern defines the logic for switching the breaker between its states.

(Figure: the circuit breaker state machine)

Failure rate of a service = number of failed requests / total number of requests.
Whether the circuit breaker switches from closed to open is determined by comparing the current failure rate with a configured threshold.

  1. When the circuit breaker is closed, requests are allowed through. If the current failure rate stays below the configured threshold, the breaker remains closed; once the failure rate exceeds the threshold, the breaker switches to the open state.

  2. When the circuit breaker is open, requests are rejected.

  3. After the breaker has been open for a while, it automatically enters the half-open state and allows exactly one trial request through. If that request succeeds, the breaker returns to the closed state; if it fails, the breaker stays open and subsequent requests continue to be rejected.

The circuit breaker's switching ensures that when calling a failing service, the caller returns a result quickly instead of piling up synchronous waits. Moreover, because the breaker re-tests a request after a while, it offers a path for the service call to recover.
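The closed/open/half-open logic above can be sketched as a small state machine (a toy illustration, not Hystrix's implementation; the threshold, the sleep window, and passing time in explicitly are all illustrative choices):

```java
// Minimal sketch of the closed / open / half-open circuit breaker state machine.
public class CircuitBreaker {
    enum State { CLOSED, OPEN, HALF_OPEN }

    private State state = State.CLOSED;
    private int total, failed;             // naive counters (a real breaker uses a sliding window)
    private long openedAt;
    private final double errorThreshold;   // e.g. 0.5 = open at a 50% failure rate
    private final long sleepWindowMs;      // how long to stay open before half-opening

    CircuitBreaker(double errorThreshold, long sleepWindowMs) {
        this.errorThreshold = errorThreshold;
        this.sleepWindowMs = sleepWindowMs;
    }

    synchronized boolean allowRequest(long now) {
        if (state == State.OPEN && now - openedAt >= sleepWindowMs) {
            state = State.HALF_OPEN;       // let exactly one trial request through
            return true;
        }
        return state != State.OPEN;        // closed and half-open allow requests
    }

    synchronized void record(boolean success, long now) {
        if (state == State.HALF_OPEN) {    // the trial request decides the next state
            if (success) { state = State.CLOSED; total = 0; failed = 0; }
            else { state = State.OPEN; openedAt = now; }
            return;
        }
        total++;
        if (!success) failed++;
        if (total > 0 && (double) failed / total >= errorThreshold) {
            state = State.OPEN;            // failure rate crossed the threshold
            openedAt = now;
        }
    }

    synchronized State state() { return state; }
}
```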

Command pattern

Hystrix uses the command pattern (subclassing HystrixCommand) to wrap the concrete service-call logic (the run method), and adds degradation logic (getFallback) for when the call fails.
In the command's constructor, we can also configure the thread pool and circuit breaker parameters for the service.

public class Service1HystrixCommand extends HystrixCommand&lt;Response&gt; {
  private final Service1 service;
  private final Request request;

  public Service1HystrixCommand(Service1 service, Request request) {
    // The group key name below is illustrative; the property values are from the original text
    super(Setter.withGroupKey(HystrixCommandGroupKey.Factory.asKey("Service1Group"))
        .andThreadPoolPropertiesDefaults(HystrixThreadPoolProperties.Setter()
            .withCoreSize(20))                                    // size of the service's thread pool
        .andCommandPropertiesDefaults(HystrixCommandProperties.Setter()
            .withCircuitBreakerErrorThresholdPercentage(60)       // failure rate at which the breaker opens
            .withCircuitBreakerSleepWindowInMilliseconds(3000))); // how long the breaker stays open before half-opening
    this.service = service;
    this.request = request;
  }

  @Override
  protected Response run() {
    return service.call(request); // the actual service invocation (method name illustrative)
  }

  @Override
  protected Response getFallback() {
    return Response.dummy();      // degraded result returned when the call fails
  }
}
After the service call is wrapped in a command and invoked with execute(), it is protected by the circuit breaker and the thread pool.
(Figure: a service call wrapped with a circuit breaker and its own thread pool)

Internal processing logic of hystrix

The following figure shows the internal logic of a Hystrix service call:
(Figure: Hystrix's internal processing flow)

  1. Construct the Hystrix command object and call its execution method.

  2. Hystrix checks whether the circuit breaker for the current service is open. If it is open, the degradation method getFallback is executed.

  3. If the circuit breaker is closed, Hystrix checks whether the current service's thread pool can accept new requests. If the pool is full, getFallback is executed.

  4. If the thread pool accepts the request, Hystrix runs the service call's concrete logic in the run method.

  5. If the service call fails, getFallback is executed and the result is reported to Metrics to update the service's health status.

  6. If the service call times out, getFallback is executed and the result is reported to Metrics to update the service's health status.

  7. If the service call succeeds, the normal result is returned.

  8. If the degradation method getFallback succeeds, the degraded result is returned.

  9. If getFallback fails, an exception is thrown.
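The decision flow above can be condensed into a sketch (illustrative, not Hystrix's real code; the timeout value and method shape are my own): an open breaker or a full thread pool short-circuits to the fallback, and a bounded get() turns failures and timeouts into the fallback as well.

```java
import java.util.concurrent.*;

public class MiniCommand {
    // breakerOpen stands in for the circuit breaker check (step 2).
    static String execute(boolean breakerOpen, ExecutorService pool,
                          Callable<String> run, String fallback) {
        if (breakerOpen) return fallback;             // step 2: open breaker -> fallback
        Future<String> f;
        try {
            f = pool.submit(run);                     // step 3: try to get a pool thread
        } catch (RejectedExecutionException e) {
            return fallback;                          // pool full -> fallback
        }
        try {
            return f.get(200, TimeUnit.MILLISECONDS); // step 4: run with a timeout
        } catch (Exception e) {                       // steps 5-6: failure or timeout
            f.cancel(true);
            // (a real implementation would also report the outcome to metrics here)
            return fallback;
        }
    }

    public static void main(String[] args) {
        ExecutorService pool = Executors.newFixedThreadPool(1);
        System.out.println(execute(true,  pool, () -> "ok", "fallback")); // breaker open
        System.out.println(execute(false, pool, () -> "ok", "fallback")); // normal path
        pool.shutdownNow();
    }
}
```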

Implementation of Hystrix Metrics

Hystrix's Metrics stores the health status of the current service, including the total number of calls and the number of failed calls. From these counts, the circuit breaker computes the current failure rate and compares it with the configured threshold to drive its state transitions, so the implementation of Metrics is very important.

Sliding window implementation before 1.4

In these versions, Hystrix used a custom sliding-window data structure to count the various events (success, failure, timeout, thread pool rejection, etc.) within the current time window.
When an event occurred, the data structure decided, based on the current time, whether to reuse an old bucket or create a new one, and then updated the counters inside that bucket.
These updates ran concurrently on multiple threads, so the code contained many lock operations and the logic was complex.
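A simplified version of such a bucketed sliding window might look like this (a sketch of the idea, not Hystrix's actual data structure; here a single lock replaces its finer-grained concurrency control):

```java
// Time is divided into fixed-size buckets arranged in a ring; an event increments
// the bucket for "now", and a read sums only the buckets still inside the window.
public class SlidingWindowCounter {
    private final long bucketMs;
    private final int numBuckets;
    private final long[] bucketStart;  // start timestamp of the slot each bucket covers
    private final int[] counts;

    SlidingWindowCounter(long bucketMs, int numBuckets) {
        this.bucketMs = bucketMs;
        this.numBuckets = numBuckets;
        this.bucketStart = new long[numBuckets];
        this.counts = new int[numBuckets];
    }

    // The "reuse an old bucket or create a new one" decision described above:
    // a bucket whose slot timestamp is stale is reset before counting.
    synchronized void increment(long now) {
        int idx = (int) ((now / bucketMs) % numBuckets);
        long slotStart = (now / bucketMs) * bucketMs;
        if (bucketStart[idx] != slotStart) {   // stale bucket: start a new one
            bucketStart[idx] = slotStart;
            counts[idx] = 0;
        }
        counts[idx]++;
    }

    // Sum only buckets that fall inside the window ending at `now`.
    synchronized int sum(long now) {
        long windowStart = now - bucketMs * numBuckets;
        int total = 0;
        for (int i = 0; i < numBuckets; i++)
            if (bucketStart[i] > windowStart) total += counts[i];
        return total;
    }
}
```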

(Figure: the pre-1.4 sliding window bucket structure)

Sliding window implementation from 1.5 onward

Starting with these versions, Hystrix uses RxJava's Observable.window() to implement the sliding window.
RxJava's window operator uses a background thread to create new buckets, which avoids the problem of concurrent bucket creation.
At the same time, RxJava's single-threaded, lock-free characteristics ensure thread safety when the counts change, making the code much more concise.
Below is a simple sliding-window metrics implementation using RxJava's window method. It completes the statistics in just a few lines of code, which is enough to demonstrate the power of RxJava:

public void timeWindowTest() throws Exception {
  // Emit a random 0/1 event every 50 ms
  Observable&lt;Integer&gt; source = Observable.interval(50, TimeUnit.MILLISECONDS)
      .map(i -> RandomUtils.nextInt(2));
  // Slice the stream into 1-second windows and count each event type per window
  source.window(1, TimeUnit.SECONDS).subscribe(window -> {
    int[] metrics = new int[2];
    window.subscribe(i -> metrics[i]++,
        () -> System.out.println("window metrics:" + JSON.toJSONString(metrics)));
  });
  TimeUnit.SECONDS.sleep(3); // let a few windows complete before returning
}

By using Hystrix, we can easily guard against the avalanche effect and give the system the ability to degrade automatically and to recover service automatically.

Original text:How to Build a Cybersecurity Career How to build the cause of network security Normative guidelines for building a successful career in the field of information security fromDaniel miesslerstayinformation safetyCreated / updated: December 17, 2019 I’ve been doing itinformation safety(now many people call it network security) it’s been about 20 years, and I’ve spent […]