Principle and implementation of the fuse (circuit breaker)

Time:2021-7-22

In a microservice architecture, dependencies between services are very common. For example, the review service depends on the audit service, and the audit service depends on the anti-spam service. When the review service calls the audit service, the audit service in turn calls the anti-spam service. Suppose the anti-spam service starts timing out. Because the audit service depends on it, the audit service's requests block while waiting for a response. Meanwhile the review service keeps calling the audit service, so requests pile up in the audit service and may eventually bring it down.

[Figure: call chain]

In such a call chain, a failure in one link can cause problems for the upstream callers and can even take down every service in the chain. A service that calls another service therefore needs to protect itself against failures in its callee, and a common means of self-protection is fusing, also known as circuit breaking.

Principle of the fuse

The fusing mechanism is modeled on the fuse in a household electrical circuit: when the circuit is overloaded, the fuse blows and disconnects the circuit so that the appliances on it are not damaged. In service governance, fusing means that once the error rate returned by the callee exceeds a certain threshold, subsequent requests are no longer actually sent; the caller returns an error immediately instead.

In this mode, the caller maintains a state machine for each service it calls (each call path). The state machine has three states:

  • Closed: in this state, a counter records the number of failed calls and the total number of requests. If the failure rate reaches the preset threshold within a given time window, the fuse switches to the open state and starts a timeout timer; when the timer expires, the fuse switches to the half-open state. The timeout gives the system a chance to fix the error that caused the calls to fail and to return to normal operation. In the closed state the failure count is time-based and is reset at fixed intervals, which prevents occasional errors from tripping the fuse.
  • Open: in this state, every request fails immediately with an error. A timeout timer is normally started on entering this state, and when it expires the fuse switches to the half-open state. Alternatively, a timer can periodically probe whether the service has recovered.
  • Half-open: in this state, the application is allowed to send a limited number of requests to the callee. If these calls succeed, the callee is considered to have recovered, so the fuse switches to the closed state and resets its counters. If any of them fails, the callee is considered not yet recovered, so the fuse switches back to the open state, resets the counter, and restarts the timeout timer. The half-open state prevents a recovering service from being knocked over again by a sudden flood of requests.

[Figure: breaker state transitions]
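The three states and their transitions can be sketched in a few dozen lines of Go. This is a minimal illustration, not the zrpc implementation: the `fuse` type, its fields (`minRequests`, `threshold`, `openFor`), and the `scenario` helper are hypothetical names, the sketch is not safe for concurrent use, and it omits the periodic reset of the closed-state counters to stay short.

```go
package main

import (
	"errors"
	"fmt"
	"time"
)

type state int

const (
	closed state = iota
	open
	halfOpen
)

var errOpen = errors.New("fuse is open")

// fuse is a hypothetical, single-goroutine three-state breaker.
type fuse struct {
	state       state
	failures    int              // failures seen in the closed state
	total       int              // requests seen in the closed state
	minRequests int              // do not trip before this many requests
	threshold   float64          // failure rate that trips the fuse
	openFor     time.Duration    // how long to stay open before probing
	openedAt    time.Time        // when the fuse last tripped
	now         func() time.Time // injectable clock
}

// trip moves the fuse to the open state and resets the counters.
func (f *fuse) trip() {
	f.state, f.openedAt = open, f.now()
	f.failures, f.total = 0, 0
}

func (f *fuse) call(fn func() error) error {
	if f.state == open {
		if f.now().Sub(f.openedAt) < f.openFor {
			return errOpen // fail fast without calling the callee
		}
		f.state = halfOpen // timeout elapsed: allow a probe request
	}
	err := fn()
	switch {
	case f.state == halfOpen && err != nil:
		f.trip() // probe failed: back to open
	case f.state == halfOpen:
		f.state = closed // probe succeeded: resume normal operation
		f.failures, f.total = 0, 0
	default: // closed: update the counters and check the failure rate
		f.total++
		if err != nil {
			f.failures++
		}
		if f.total >= f.minRequests &&
			float64(f.failures)/float64(f.total) >= f.threshold {
			f.trip()
		}
	}
	return err
}

// scenario trips the fuse, checks fail-fast, then lets the timeout elapse.
func scenario() (rejectedWhileOpen bool, probeErr error, closedAgain bool) {
	clock := time.Unix(0, 0)
	f := &fuse{minRequests: 4, threshold: 0.5, openFor: time.Second,
		now: func() time.Time { return clock }}
	fail := func() error { return errors.New("timeout") }
	ok := func() error { return nil }

	for i := 0; i < 4; i++ {
		_ = f.call(fail) // 4 of 4 calls fail, so the fuse trips
	}
	rejectedWhileOpen = f.call(ok) == errOpen
	clock = clock.Add(2 * time.Second) // let the open timeout elapse
	probeErr = f.call(ok)              // the half-open probe succeeds
	return rejectedWhileOpen, probeErr, f.state == closed
}

func main() {
	fmt.Println(scenario()) // true <nil> true
}
```

Injecting the clock through `now` keeps the example deterministic; a real breaker would also need locking and a bounded number of half-open probes.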

Introducing a fusing mechanism into service governance makes the system more stable and resilient: it gives the system stability while it recovers from errors, reduces the impact of errors on system performance, and quickly rejects calls that are likely to fail instead of waiting for the real error to come back.

Introducing the fuse

The previous section covered the principle of the fuse. How do we introduce it into a system? One option is to add fuses to the business logic, but that is neither elegant nor general. A better option is to integrate the fuse into the framework, and the zrpc framework has one built in.

The fuse mainly protects the caller: every request must pass through the fuse before it is sent, and a client interceptor is exactly the right hook for this. In zrpc the fuse is therefore implemented as a client interceptor. The interceptor works as follows:

[Figure: interceptor]

The corresponding code is:

func BreakerInterceptor(ctx context.Context, method string, req, reply interface{},
	cc *grpc.ClientConn, invoker grpc.UnaryInvoker, opts ...grpc.CallOption) error {
	// One fuse per target and request method
	breakerName := path.Join(cc.Target(), method)
	// codes.Acceptable decides which errors are added to the fusing error count
	return breaker.DoWithAcceptable(breakerName, func() error {
		// The real call
		return invoker(ctx, method, req, reply, cc, opts...)
	}, codes.Acceptable)
}
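The interceptor keys each fuse by `path.Join(cc.Target(), method)`, so every call path gets its own statistics. The small sketch below shows what such keys look like; the target `localhost:8080` and the method names are made-up values for illustration, and `breakerName` is a hypothetical helper that only mirrors the one line from the interceptor.

```go
package main

import (
	"fmt"
	"path"
)

// breakerName mirrors the key construction in BreakerInterceptor:
// one fuse per (target, method) pair.
func breakerName(target, method string) string {
	return path.Join(target, method)
}

func main() {
	// Hypothetical target and gRPC full method names.
	fmt.Println(breakerName("localhost:8080", "/review.Review/Get"))
	fmt.Println(breakerName("localhost:8080", "/review.Review/List"))
}
```

Because the method is part of the key, a failing `Get` call would trip only its own fuse and leave `List` on the same target unaffected.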

Fuse implementation

The fuse in zrpc follows the overload protection algorithm from Google SRE, which tracks two values on the caller side:

  • requests: the total number of requests initiated by the caller
  • accepts: the number of requests processed normally by the callee

Under normal circumstances the two values are equal. When the callee runs into trouble and starts rejecting requests, accepts gradually falls below requests. The caller keeps sending requests as usual until requests = K * accepts. Once this limit is exceeded, the fuse opens: each new request is discarded locally with a certain probability and an error is returned immediately. The probability is calculated as follows:

rejection probability = max(0, (requests - K * accepts) / (requests + 1))

The sensitivity of the fuse is tuned through K. Reducing K makes the adaptive fusing algorithm more sensitive; increasing K makes it less sensitive. For example, lowering the limit from requests = 2 * accepts to requests = 1.1 * accepts means the caller starts dropping requests locally as soon as roughly one in every ten of its requests is rejected by the callee.
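A few concrete numbers make the effect of K easier to see. The sketch below evaluates the rejection probability max(0, (requests - K * accepts) / (requests + 1)) for 100 requests under three conditions: a healthy callee (100 accepted), a degraded one (60 accepted), and one that is down (0 accepted). The function name `dropRatio` and the sample counts are chosen for illustration only.

```go
package main

import (
	"fmt"
	"math"
)

// dropRatio is the local rejection probability:
// max(0, (requests - k*accepts) / (requests + 1)).
func dropRatio(requests, accepts int64, k float64) float64 {
	return math.Max(0, (float64(requests)-k*float64(accepts))/float64(requests+1))
}

func main() {
	for _, k := range []float64{1.1, 1.5, 2} {
		fmt.Printf("k=%.1f healthy=%.3f degraded=%.3f down=%.3f\n", k,
			dropRatio(100, 100, k), dropRatio(100, 60, k), dropRatio(100, 0, k))
	}
	// k=1.1 healthy=0.000 degraded=0.337 down=0.990
	// k=1.5 healthy=0.000 degraded=0.099 down=0.990
	// k=2.0 healthy=0.000 degraded=0.000 down=0.990
}
```

With 40% of requests failing, K = 1.1 already drops about a third of new requests locally, K = 1.5 drops about a tenth, and K = 2 drops none yet. When the callee is completely down, all three settings converge on rejecting almost everything.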

The code is located in go-zero/core/breaker:

type googleBreaker struct {
	k     float64                   // the default value is 1.5
	stat  *collection.RollingWindow // sliding time window that counts failed and successful requests
	proba *mathx.Proba              // dynamic probability
}
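The `stat` field is what keeps the counts time-based: outcomes older than the window stop influencing the decision. The following is a simplified sketch of that idea and not go-zero's `collection.RollingWindow`; the `window` and `bucket` types are hypothetical, timestamps are passed in as milliseconds to keep the example deterministic, and the code is not safe for concurrent use.

```go
package main

import "fmt"

// bucket holds the counts for one slice of the window.
type bucket struct {
	accepts, total int64
}

// window is a hypothetical fixed-size sliding window over time.
type window struct {
	buckets    []bucket
	intervalMs int64 // time covered by one bucket, in milliseconds
	lastSlot   int64 // absolute index of the newest bucket
}

func newWindow(size int, intervalMs int64) *window {
	return &window{buckets: make([]bucket, size), intervalMs: intervalMs}
}

// advance clears the buckets that expired since the last update and
// returns the index of the bucket that covers nowMs.
func (w *window) advance(nowMs int64) int64 {
	slot := nowMs / w.intervalMs
	n := int64(len(w.buckets))
	for s := w.lastSlot + 1; s <= slot && s <= w.lastSlot+n; s++ {
		w.buckets[s%n] = bucket{}
	}
	if slot > w.lastSlot {
		w.lastSlot = slot
	}
	return slot % n
}

// add records one request outcome at time nowMs.
func (w *window) add(nowMs int64, accepted bool) {
	b := &w.buckets[w.advance(nowMs)]
	b.total++
	if accepted {
		b.accepts++
	}
}

// history returns the accepted and total counts still inside the window.
func (w *window) history(nowMs int64) (accepts, total int64) {
	w.advance(nowMs)
	for _, b := range w.buckets {
		accepts += b.accepts
		total += b.total
	}
	return accepts, total
}

func main() {
	w := newWindow(4, 250) // 4 buckets of 250 ms: a 1 s window
	for i := 0; i < 3; i++ {
		w.add(0, true)
	}
	w.add(0, false)
	w.add(500, false)
	fmt.Println(w.history(500))  // 3 5
	fmt.Println(w.history(1000)) // 0 1
	fmt.Println(w.history(2000)) // 0 0
}
```

Splitting the window into buckets lets old data expire gradually instead of all at once, which avoids a sudden jump in the computed drop probability at window boundaries.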

The adaptive fusing algorithm is implemented as follows:

func (b *googleBreaker) accept() error {
	// Number of accepted requests and total number of requests in the window
	accepts, total := b.history()
	weightedAccepts := b.k * float64(accepts)
	// Calculate the probability of dropping the request; protection is a
	// constant in the breaker package, and subtracting it keeps the fuse
	// from tripping on a handful of requests
	dropRatio := math.Max(0, (float64(total-protection)-weightedAccepts)/float64(total+1))
	if dropRatio <= 0 {
		return nil
	}

	// Decide probabilistically whether the fuse is triggered
	if b.proba.TrueOnProba(dropRatio) {
		return ErrServiceUnavailable
	}

	return nil
}
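The last step of `accept` turns the drop probability into a per-request decision. The sketch below simulates that step under the assumption that `TrueOnProba` behaves like comparing a uniform random number in [0, 1) against the probability; `trueOnProba`, `droppedFraction`, and `newRand` are hypothetical helpers written for this example.

```go
package main

import (
	"fmt"
	"math/rand"
)

// newRand returns a seeded generator so that runs are repeatable.
func newRand(seed int64) *rand.Rand {
	return rand.New(rand.NewSource(seed))
}

// trueOnProba reports true with probability p. go-zero's
// mathx.Proba.TrueOnProba is assumed to behave the same way.
func trueOnProba(r *rand.Rand, p float64) bool {
	return r.Float64() < p
}

// droppedFraction simulates n requests at a fixed drop probability and
// returns the share that would be rejected locally.
func droppedFraction(r *rand.Rand, dropRatio float64, n int) float64 {
	dropped := 0
	for i := 0; i < n; i++ {
		if trueOnProba(r, dropRatio) {
			dropped++
		}
	}
	return float64(dropped) / float64(n)
}

func main() {
	r := newRand(1)
	// The printed share should be close to the requested 0.30.
	fmt.Printf("%.2f\n", droppedFraction(r, 0.3, 100000))
}
```

Dropping probabilistically rather than rejecting everything means some requests still reach the callee while it is unhealthy. Their outcomes keep feeding the statistics, so the drop probability falls by itself as the callee recovers.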

Each time a request is initiated, the doReq method is called. It first calls accept to check whether the fuse has been triggered. The acceptable function passed to doReq determines which errors count as failures; it is defined as follows:

func Acceptable(err error) bool {
	switch status.Code(err) {
	// Errors that count as abnormal requests
	case codes.DeadlineExceeded, codes.Internal, codes.Unavailable, codes.DataLoss:
		return false
	default:
		return true
	}
}

If the request is normal, markSuccess increases both the number of requests and the number of accepted requests by one. If the request is abnormal, markFailure increases only the number of requests by one.

func (b *googleBreaker) doReq(req func() error, fallback func(err error) error, acceptable Acceptable) error {
	// Check whether the fuse is triggered
	if err := b.accept(); err != nil {
		if fallback != nil {
			return fallback(err)
		}

		return err
	}

	// Count a panic in the real call as a failure, then re-panic
	defer func() {
		if e := recover(); e != nil {
			b.markFailure()
			panic(e)
		}
	}()

	// Perform the real call
	err := req()
	if acceptable(err) {
		// Normal request count
		b.markSuccess()
	} else {
		// Abnormal request count
		b.markFailure()
	}

	return err
}

Summary

With a fusing mechanism, callers can protect themselves from failing downstream services and keep their own business logic from being dragged down by slow calls. Many full-featured microservice frameworks ship with a built-in fuse. Fuses are useful not only between microservices but also when calling resources that a service depends on, such as MySQL and Redis.

Project address:
https://github.com/tal-tech/go-zero
