Microservice governance practice: how to automatically remove single point exceptions

Time:2020-7-5

In a microservice architecture, stability and high availability are perennial concerns. In day-to-day governance, we may run into scenarios such as the following:

  • During a gray-scale (canary) release that reaches only a few machines first, a bug in the code logic fills up the thread pool and the application runs abnormally.
  • In the server cluster, some machines become overloaded because of a full disk or host resource contention, and client calls time out.
  • In the server cluster, the thread pools on some machines fill up, leading to full garbage collection.

In all three scenarios, the client cannot detect which servers are in trouble, so it keeps sending requests to those machines and business calls fail. A short-lived fault on one downstream machine can drag down the upstream machines, creating the risk of an application avalanche.

In such scenarios, degrading the entire service just for this reason would hurt the application too much. If, instead, we can detect the faulty machines in the service cluster and isolate them for a short time, we can effectively preserve the high availability of the service and the stability of the system, while also buying valuable buffer time for operations staff to locate and fix the problem.

As the first article in the “Microservice governance practice” series, this article introduces how to implement outlier instance removal. The series is based on the microservice practice of EDAS, a commercial Alibaba Cloud product. If your team has strong microservice governance capabilities, we hope our practice and the thinking behind it can serve as a reference.

Microservice outlier instance removal

  • What is outlier instance removal

When a single provider instance becomes abnormal, the consumer can detect this on its own initiative, remove that provider instance for a short time and stop sending requests to it, then resume access after a certain interval. The mechanism also has a global view of exceptions: when the number of abnormal provider instances is too large and exceeds a configured proportion, the overall service quality of the provider is considered low, and the mechanism removes only up to that proportion of instances.

  • The function of removing outlier instances

It strengthens service stability through fault tolerance at the service layer and effectively addresses the single point of failure problem.

  • Difference from circuit breaking

Circuit breaking cuts off load when the input load of a service rises sharply, to avoid the avalanche effect caused by the service collapsing rapidly. It generally consists of a request-judgment algorithm, a recovery mechanism, and an alarm module. Isolation is an architectural approach that partitions the system so that a failure in a dependent service does not spread.

If the problem is only a single point exception in the server cluster, circuit breaking and degradation would do too much harm to the application, whereas outlier instance removal solves the single point exception effectively and preserves service quality. If the overall service quality of the provider is low, outlier removal is no longer very effective, and circuit breaking and degradation can be used instead.

  • Outlier instance removal supported versions

As long as your application version is in the list, you can use the outlier removal function without changing a line of code.

At present, most microservice scenarios on the market are covered. In the future, we will continue to support the latest open source Dubbo / Spring Cloud versions.

We provide outlier removal for both Dubbo and Spring Cloud scenarios. This article first introduces the practice and effect of outlier removal for Dubbo microservices.

Examples

Next, we demonstrate the function and effect of Dubbo outlier removal on EDAS.

Enterprise Distributed Application Service (EDAS) is a PaaS platform for application hosting and microservice management. It provides full-stack solutions for application development, deployment, monitoring, and operations, and supports microservice runtime environments such as Dubbo and Spring Cloud.

https://www.aliyun.com/product/edas

Preparation

We use the microservice demo as an example to demonstrate outlier removal. Readers can download it from GitHub to verify the results themselves:

https://github.com/aliyun/alibabacloud-microservice-demo/tree/master/src

The microservice demo is a simple e-commerce project; the following figure shows the project structure. cartservice is the shopping cart service provider built on the Dubbo framework, productservice is the product service provider built on Spring Cloud, and frontend is the web controller, that is, the front-end display page, which can be thought of as the consumer.

We will take cartservice (the Dubbo server) as an example to show outlier instance removal.

Deploying the microservice demo on EDAS

First, run cd cartservice to switch to the cartservice directory, then run mvn clean install to build the package. Run cd cartservice-provider/target to switch to the target directory, where you will find the newly generated cartservice-provider-1.0.0-SNAPSHOT.jar. Then create a cartservice application on EDAS.

After clicking Next, upload the jar package just built, namely cartservice-provider/target/cartservice-provider-1.0.0-SNAPSHOT.jar. Keep clicking Next, make a note of the login password, and continue until the application is created successfully.

Then start the application. At this point we have launched one cartservice provider. Scale out using the same instance specification so that the service is deployed on two instances.

In the provider’s com.alibabacloud.hipstershop.provider.CartServiceImpl class, we can see that the provider exposes two shopping cart services, viewCart and addItemToCart. We add some logic to viewCart to simulate a runtime exception.

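    // Bound to the exception.ip item in the ACM configuration center
    @Value("${exception.ip}")
    private String exceptionIp;

    @Override
    public List<CartItem> viewCart(String userID) {

        // Simulate a business exception on the instance whose IP matches exception.ip
        if (exceptionIp != null && exceptionIp.equals(getLocalIp())) {
            throw new RuntimeException();
        }

        return cartStore.getOrDefault(userID, Collections.emptyList());
    }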

exceptionIp is bound to the exception.ip item in the ACM configuration center. If this item is set to the local IP, the service throws a RuntimeException to simulate a business exception.

  • You have probably guessed why cartservice was scaled out to two instances: at runtime, we simulate a single-instance exception by setting the IP of one of the instances in the ACM configuration center.

Next, deploy frontend and productservice in the same way, uploading frontend/target/frontend-1.0.0-SNAPSHOT.jar and productservice/productservice-provider/target/productservice-provider-1.0.0-SNAPSHOT.jar respectively.

As the figure below shows, the microservice demo is now deployed on EDAS.

Simulate business exception

Entering the frontend application, we can see that the public IP address of the instance is 47.99.150.33.

Visit http://47.99.150.33:8080/ in a browser.

Click View Cart to open http://47.99.150.33:8080/cart

As you can see, the service is normal at this time.

Next, in the ACM configuration center, set exception.ip to 172.16.205.180 (the IP of one of the cartservice instances).

Then keep visiting http://47.99.150.33:8080/cart. About 50% of the requests now return an error page.

At this point, we write a script that requests http://47.99.150.33:8080/cart a large number of times to simulate traffic.

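#!/bin/bash
# Repeatedly request the URL passed as $1 and report the outcome of each call.
while :
do
        result=$(curl -s "$1")
        # Treat a response body containing "500" as a failed call and print it.
        if [[ "$result" == *"500"* ]]; then
                echo "$(date +%F-%T) $result"
        else
                echo "$(date +%F-%T) success"
        fi

        # Roughly 10 requests per second
        sleep 0.1
done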

Then run bash curlservice.sh http://47.99.150.33:8080/cart

With roughly 10 calls per second, we see a success rate of about 50%.

In other words, when one machine of the called service becomes abnormal, the service quality seen by its callers drops sharply, and the calling service may even be dragged down by the (system or business) exceptions of just a few of those machines.

Enabling the outlier removal policy

Next, we demonstrate how to enable the outlier removal policy and show its effect.

Create

We go to the [Outlier Instance Removal] page under [Microservice Management] in the left-hand navigation of EDAS and select Create Outlier Instance Removal Policy.

Then follow the prompts step by step to create a strategy for outlier removal.

Basic information

As shown in the figure above, you can select the namespace, fill in the policy name, and select the framework type (Dubbo / spring cloud) supported by the policy.

Select the applications the policy applies to

Given the current call path, we only need to apply the policy to the frontend application, which is the consumer, to protect its calls to the downstream application.

Configure the policy

These parameters have default values; adjust them to suit your application. Because the RuntimeException we want to protect against is a business exception, select network exception + business exception. (Note that even if the upper limit on the proportion of removed instances is so low that the computed number, rounded down, is less than 1, one instance will still be removed when the cluster has more than one instance and an instance is abnormal.)

Creation completed

You can see the information of the policy, and the creation is completed.

Policy

We can see that the outlier removal policy we created targets the Dubbo framework and the exception type network exception + business exception.

Verify the effect of outlier removal

At this point, once the exception has been detected again, outlier removal takes effect: after a short period of requests, all calls return correct results.

Repeatedly refreshing http://47.99.150.33:8080/cart in the browser now returns normal pages.

After the client detects that a server instance is abnormal, it removes that instance on its own initiative and calls only the provider instances whose business is running normally. We can also observe the improvement in service quality and the removal of traffic from the abnormal provider through ARMS (the EDAS monitoring system).

For the Dubbo framework, you can search for the “outlierrouter” keyword in the logs under the /home/admin/.opt/ArmsAgent/logs directory to see the series of events logged for outlier instances.

Modifying / disabling the outlier removal policy

For applications on EDAS, the outlier removal policy can be modified and deleted dynamically through the console.

  • Modification of corresponding policy rules

Click Modify Effective Applications or Edit Policy.

Then add or remove applications or adjust the parameters; the changes take effect immediately after confirmation.

  • Delete corresponding policy

The operation of the console takes effect in real time for the configuration in the application. If the policy is deleted, the relevant policy will be closed by default.

We can also turn on ARMS monitoring to observe the calls in detail.

ARMS monitoring

If we turn on monitoring, we will see the traffic and request error information intuitively.

Before outlier removal is enabled

Turn monitoring on as shown in the figure below, then jump to the ARMS (EDAS monitoring system) application monitoring page. Advanced monitoring needs to be turned on for all three applications.

The result is easy to see in the following figure, which shows the application monitoring page of ARMS (the EDAS monitoring system).

From the following topology, we can see that traffic is constantly accessing the cartservice service.

After outlier removal is enabled

The simple example above already shows the effect of outlier removal. The improvement in service quality can also be observed clearly through ARMS (the EDAS monitoring system).

We can see that the error rate dropped significantly from 50% after outlier removal was turned on.

The two small spikes appear because, some time after an outlier is removed, the consumer tries to access the removed endpoint again. If the error rate is still above the threshold, the endpoint stays isolated and the isolation interval becomes longer.

Specific control logic of outlier instance removal

We have seen how outlier instance removal improves application stability. Next, we analyze its control logic, which will help you understand what each parameter means and tune the parameters to configure the most suitable outlier removal policy for your own application.

For the Dubbo / spring cloud framework:

  • QPS lower limit (default: 1)

Outlier removal protection is triggered for an instance only when its current QPS is greater than 1.

  • Error rate lower limit (default: 50%)

The system considers an instance of the server cluster to be in an abnormal state only if the call error rate of that instance is higher than 50%.

  • Upper limit on the proportion of removed instances (default: 20%)

If more than 20% of the instance nodes in the current service cluster are in an abnormal state, the system removes abnormal instances only up to 20% of the total number of instances in the cluster.

  • Exception type

If the exception type is network exception, the system will only count the network exception errors into the error rate statistics, ignoring the business exceptions; otherwise, if you select network exception + business exception, the system will count all the exceptions as errors in the error rate statistics.
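The per-instance decision described by the QPS lower limit, the error rate lower limit, and the exception type can be sketched roughly as follows. This is only an illustration under stated assumptions, not the EDAS implementation: the class OutlierRule, the enum ErrorKind, and the method isOutlier are hypothetical names, and the strict "greater than" comparisons simply follow the wording above.

```java
import java.util.EnumSet;
import java.util.Set;

// Hypothetical sketch of the per-instance decision; not the EDAS implementation.
class OutlierRule {
    // Illustrative error categories matching the two policy options in the text.
    enum ErrorKind { NETWORK, BUSINESS }

    private final double qpsLowerLimit;        // default in the text: 1
    private final double errorRateLowerLimit;  // default in the text: 0.5 (50%)
    private final Set<ErrorKind> countedKinds; // exception types counted as errors

    OutlierRule(double qpsLowerLimit, double errorRateLowerLimit, Set<ErrorKind> countedKinds) {
        this.qpsLowerLimit = qpsLowerLimit;
        this.errorRateLowerLimit = errorRateLowerLimit;
        this.countedKinds = countedKinds;
    }

    // Returns true when the instance should be treated as an outlier for this window.
    boolean isOutlier(long windowMillis, long totalCalls, long networkErrors, long businessErrors) {
        if (windowMillis <= 0 || totalCalls <= 0) {
            return false; // no traffic, nothing to judge
        }
        double qps = totalCalls * 1000.0 / windowMillis;
        if (qps <= qpsLowerLimit) {
            return false; // QPS not above the lower limit: protection is not triggered
        }
        long errors = 0;
        if (countedKinds.contains(ErrorKind.NETWORK)) {
            errors += networkErrors;
        }
        if (countedKinds.contains(ErrorKind.BUSINESS)) {
            errors += businessErrors;
        }
        double errorRate = (double) errors / totalCalls;
        return errorRate > errorRateLowerLimit;
    }

    public static void main(String[] args) {
        OutlierRule rule = new OutlierRule(1.0, 0.5,
                EnumSet.of(ErrorKind.NETWORK, ErrorKind.BUSINESS));
        // 50 calls in a 10 s window, all failing with business exceptions.
        System.out.println(rule.isOutlier(10_000, 50, 0, 50));
    }
}
```

With a network-only policy, the same 50 business failures would not count as errors, which is why the demo selects network exception + business exception.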

  • Recovery detection unit time (default: 30000 ms, i.e. 30 s) and upper limit on the cumulative number of failed recoveries (default: 40)

The first removal lasts 0.5 minutes. When the time is up, the consumer resumes calling the provider; if the provider’s service quality is still poor, it is removed again. The removal duration grows linearly with the number of consecutive removals, by 0.5 minutes each time, up to a maximum of 20 minutes (30 s × 40). If service quality recovers after calls resume, the instance is regarded as healthy again. The next time an exception degrades its service quality, it is isolated for 0.5 minutes again and the rules above start over.
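The linear growth of the removal duration can be written down as a small calculation. The sketch below assumes the default values quoted above (30000 ms unit, cap of 40); the class EjectionBackoff and the method ejectionTime are hypothetical names used only for illustration.

```java
import java.time.Duration;

// Hypothetical sketch of the linear removal-duration rule; names are illustrative.
class EjectionBackoff {
    static final Duration UNIT = Duration.ofMillis(30_000); // recovery detection unit time
    static final int MAX_CONSECUTIVE = 40;                  // cap on cumulative failed recoveries

    // Removal duration for the n-th consecutive removal (n starts at 1).
    static Duration ejectionTime(int consecutiveRemovals) {
        if (consecutiveRemovals < 1) {
            throw new IllegalArgumentException("consecutiveRemovals must be >= 1");
        }
        int capped = Math.min(consecutiveRemovals, MAX_CONSECUTIVE);
        return UNIT.multipliedBy(capped);
    }

    public static void main(String[] args) {
        for (int n : new int[] {1, 2, 40, 41}) {
            System.out.println(n + " -> " + ejectionTime(n).getSeconds() + " s");
        }
    }
}
```

Once an instance is judged healthy again, the counter would be reset so that the next removal starts from one unit.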

  • The bottom line

When the service called by the client has only one instance, that instance will not be isolated.

If the service called by the current client has more than one instance, and the number of removable instances computed from the configured removal proportion is less than 1, one instance will still be removed when the server cluster has a single point of failure.
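Taken together, the upper limit on the proportion of removed instances and the two safeguards above amount to a small calculation. The following is a minimal sketch under that reading of the text; the class EjectionLimit and the method maxEjectable are hypothetical, and the percentage is passed as an integer so that rounding down is plain integer division.

```java
// Hypothetical sketch of how many instances may be removed at most.
class EjectionLimit {
    static int maxEjectable(int totalInstances, int maxEjectionPercent) {
        if (totalInstances <= 1) {
            return 0; // a single instance is never isolated
        }
        // Integer division rounds down, matching "rounded down" in the text.
        int byRatio = totalInstances * maxEjectionPercent / 100;
        // With more than one instance, at least one abnormal instance can be removed.
        return Math.max(1, byRatio);
    }

    public static void main(String[] args) {
        // The demo cluster: two cartservice instances with the default 20% limit.
        System.out.println(maxEjectable(2, 20));
    }
}
```

For the two-instance cartservice cluster in the demo, 20% of 2 rounds down to 0, so the "at least one" safeguard is what allows the faulty instance to be removed.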

Every “instance” mentioned above should be understood as an endpoint (the dimension is IP + port).

  • Common best practices

A common setup is a relatively high error rate threshold (50%) together with a relatively low upper limit on the proportion of removed instances (10%), enabled across the full call chain.

Technical details of outlier instance removal

Non-intrusive technology

The non-intrusive approach is implemented with agent technology. In short, bytecode enhancement is used to insert our code at runtime and change the application’s original logic; it can be thought of as runtime AOP. By inserting a Filter / Router into Dubbo’s call chain and enhancing the LoadBalance logic in Spring Cloud, we implement the routing control logic we want. Because the enhancement is done by an agent, the call chain is basically unchanged across Dubbo versions, and the Spring Cloud model is uniform, we can cover essentially all versions at low cost.

For users, this means enjoying enhanced stability without changing a line of code or a line of configuration.

Outlier instance removal technology

Outlier detection

Data statistics are based on time window.

Two implementations

1. For Dubbo 2.7, a MetricsFilter is embedded in the call chain. It instruments each request / response on the chain, records the RT, whether the call succeeded, and the exception type, and stores the statistics keyed by endpoint (IP + port).

2. HTTP requests passing through the agent base layer are counted, and statistics for the most recent time window are computed from the URL, RT, status code, exception type, and similar data (the window is currently hard-coded to 10 seconds and is not exposed for configuration for the time being).

Call information for the previous N seconds is aggregated in real time and used as the basis for removing outlier instances.
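To make the time-window statistics more concrete, here is a minimal sketch of per-endpoint bookkeeping over a sliding window. It is an illustration only and not the agent’s actual data structure: the class EndpointWindowStats and its methods are hypothetical, timestamps are passed in explicitly to keep the example deterministic, calls are assumed to be recorded in time order, and the class is not thread-safe.

```java
import java.util.ArrayDeque;
import java.util.Deque;
import java.util.HashMap;
import java.util.Map;

// Hypothetical sliding-window statistics keyed by endpoint ("ip:port").
class EndpointWindowStats {
    private static final class Call {
        final long timeMillis;
        final boolean failed;

        Call(long timeMillis, boolean failed) {
            this.timeMillis = timeMillis;
            this.failed = failed;
        }
    }

    private final long windowMillis;
    private final Map<String, Deque<Call>> callsByEndpoint = new HashMap<>();

    EndpointWindowStats(long windowMillis) {
        this.windowMillis = windowMillis;
    }

    // Record one call result for the endpoint at the given time.
    void record(String endpoint, long nowMillis, boolean failed) {
        Deque<Call> calls = callsByEndpoint.computeIfAbsent(endpoint, key -> new ArrayDeque<>());
        calls.addLast(new Call(nowMillis, failed));
        evictExpired(calls, nowMillis);
    }

    // Number of calls to the endpoint that are still inside the window.
    long totalCalls(String endpoint, long nowMillis) {
        return window(endpoint, nowMillis).size();
    }

    // Fraction of failed calls inside the window; 0.0 when there is no traffic.
    double errorRate(String endpoint, long nowMillis) {
        Deque<Call> calls = window(endpoint, nowMillis);
        if (calls.isEmpty()) {
            return 0.0;
        }
        long failed = calls.stream().filter(call -> call.failed).count();
        return (double) failed / calls.size();
    }

    private Deque<Call> window(String endpoint, long nowMillis) {
        Deque<Call> calls = callsByEndpoint.get(endpoint);
        if (calls == null) {
            return new ArrayDeque<>();
        }
        evictExpired(calls, nowMillis);
        return calls;
    }

    // Drop calls that are windowMillis old or older.
    private void evictExpired(Deque<Call> calls, long nowMillis) {
        while (!calls.isEmpty() && calls.peekFirst().timeMillis <= nowMillis - windowMillis) {
            calls.removeFirst();
        }
    }
}
```

A 10-second window, as mentioned above, would be created with new EndpointWindowStats(10_000); the total call count and error rate it reports are the kind of inputs the removal decision works from.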

Outlier instance removal

For Dubbo, removal is implemented on top of the Dubbo Router. From all invokers of the upstream service being called, the “unhealthy” nodes are pulled out and the removal information is recorded.

It can be used to determine whether there are two kinds of requests in the background, such as whether there are two kinds of requests in the background.

For Spring Cloud, removal is implemented by extending the load balancer; the principle is similar.
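The routing step itself, pulling unhealthy nodes out of the candidate list before load balancing, can be illustrated without any framework API. The sketch below deliberately does not use the Dubbo Router or Spring Cloud load balancer interfaces: endpoints are plain "ip:port" strings, and the class OutlierRouterSketch, the method route, and the parameter maxEjectable are all hypothetical.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

// Hypothetical, framework-free sketch of the routing step.
class OutlierRouterSketch {
    // Returns the endpoints that remain candidates after removing at most
    // maxEjectable of the endpoints currently marked unhealthy.
    static List<String> route(List<String> endpoints, Set<String> unhealthy, int maxEjectable) {
        List<String> candidates = new ArrayList<>();
        int ejected = 0;
        for (String endpoint : endpoints) {
            if (unhealthy.contains(endpoint) && ejected < maxEjectable) {
                ejected++; // pulled out: no requests are routed here for now
                continue;
            }
            candidates.add(endpoint);
        }
        return candidates;
    }

    public static void main(String[] args) {
        // Illustrative addresses; the second IP and the port are made up.
        List<String> endpoints = Arrays.asList("172.16.205.180:20880", "172.16.205.181:20880");
        Set<String> unhealthy = new HashSet<>(Arrays.asList("172.16.205.180:20880"));
        System.out.println(route(endpoints, unhealthy, 1));
    }
}
```

Capping the number of removed endpoints keeps some capacity in place even when many instances look unhealthy at once, which matches the global safeguard described earlier.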

The microservice commercialization team is hiring ~

For Dubbo / Spring Cloud commercialization, in addition to EDAS, we also have independent products such as ARMS (Application Real-Time Monitoring Service), MSE (Microservice Engine), ACM (Application Configuration Management), and SAE (Serverless App Engine). What keeps us busy? Polishing these products carefully is our job.

The team’s goal is to deliver Alibaba’s best practices in service governance to enterprise customers on Alibaba Cloud in product form, helping customers keep their business always online.

Recruitment email: shengwei.psw@alibaba-inc.com

Author information: Pan Shengwei (alias Shimian) is an R&D engineer on the middleware technology microservice product team, responsible for developing the commercial Dubbo / Spring Cloud products. He currently focuses on cloud native and microservice technologies.


Author: Pan Shengwei

This article is the content of Alibaba cloud and can not be reproduced without permission.
