author | Yizhan
Microservices keep gaining popularity, and more and more companies are adopting microservice frameworks. With high cohesion, low coupling, and better fault tolerance, microservices suit fast business iteration and make life easier for developers. But as the business grows, services are split in ever more complex ways, and governing them becomes a headache. You have probably run into some of the following scenarios.
Scenario 1: Releasing is a big deal. Every time you release, half of the requests are interrupted, and upstream services keep calling nodes that have already gone offline, which causes errors and hurts the user experience. After the release you still have to repair the dirty data.
And in that scenario the new version itself is fine. If the new version has a problem, a large amount of business traffic goes straight to the faulty version. The consequences range from data repair to a seriously degraded user experience, or even asset loss. In the end I had to release every version at two or three in the morning, nervous and short on sleep.
Scenario 2: In the middle of the night a service node becomes abnormal, but upstream services keep calling it, which produces a pile of exceptions and alarm messages. Woken up by the alarms, I find the problem hard to fix online. I would like to preserve the scene for analysis, but I am afraid the node will drag down the whole application, so I restart it first.
But a restart treats the symptom, not the root cause. The problem is hard to reproduce, so it cannot be located, and tomorrow night I may be woken up to restart again. All of this also assumes a well-built alarm system. Without one, a single bad machine can, in serious cases, bring down the whole business system.
Scenario 3: As the company grows and its organization becomes more complex, there are more and more microservice modules. I don't know who calls the services I have published, so I don't know whether I can safely take one offline. One interface of my application is sensitive: I want only authorized applications to call it, rather than letting anyone fetch my address from the service registry and call it directly. At the moment there seems to be no way to do that.
These three scenarios are real pain points of using microservices. Now suppose someone tells you they have rich experience and know exactly how to solve them. You would be delighted.
So you hire them at a high salary, and they really are good: the architecture diagrams, framework principles, and framework modification points are all crystal clear, and the feature set is complete. Then comes the estimate of what it costs to change the current system: three new sets of middleware servers, four new middleware dependencies, and tens of thousands of lines of code and configuration to modify.
“Sorry, the business still comes first. The requirements from the product manager are not done yet. Those scenarios are not that painful after all, just a few small problems. Really, it's fine.”
At this point EDAS tells you that its microservice solution solves the problems in all three scenarios without any change to code or configuration.
You don't believe it?
You heard right. As long as your application is built on a Spring Cloud or Dubbo version from the last five years, you can use the complete EDAS microservice governance capability directly, without modifying any code or configuration.
Why can EDAS users release with ease?
- The traditional release process is error-prone
In the traditional release process the service provider is stopped and restarted, and the service consumer learns that a provider node has stopped as follows:
1. Before the release, consumers call the service providers according to the load balancing rules, and everything works normally.
2. Service provider B needs to release a new version. The release starts with one of its nodes, whose Java process is stopped.
3. The stopped node is deregistered, either actively or passively. Active deregistration is near real time. How long passive deregistration takes depends on the registry, up to 1 minute in the worst case.
If the application stops normally, the shutdown hooks of the Spring Cloud and Dubbo frameworks run, and the time this step takes is negligible.
If the application stops abnormally, the provider does not deregister the node itself. This happens when the process is stopped with kill -9, or when the Docker image is built so that the Java application is not PID 1 and the kill signal is never passed on to it. The registry then removes the node passively once its heartbeat times out.
4. The service registry notifies consumers that a provider node has gone offline, either by push or by polling. Push is near real time. With polling, the delay depends on the consumer's polling interval, up to 1 minute in the worst case.
5. The consumer refreshes its service list and sees that the provider has taken a node offline. The Dubbo framework has no such step, but Ribbon, the load balancing component of Spring Cloud, refreshes every 30 seconds by default, so this step takes up to 30 seconds.
6. Consumers no longer call the offline node.
From step 2 to step 6, the worst case is 2 minutes with Eureka and 50 seconds with Nacos. Requests made in that window may fail, which is why releases produce all kinds of errors, hurt the user experience, and leave dirty data to repair afterwards. That is how I ended up releasing every version at two or three in the morning, nervous and short on sleep.
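To make the cost of that window concrete, here is a minimal sketch. It is not taken from EDAS or from any registry client: it simply assumes a consumer that round-robins over a cached provider list and keeps using the stale list for the whole window. The function name, node addresses, and traffic numbers are all hypothetical.

```python
from itertools import cycle


def failed_calls_during_window(providers, stopped, qps, window_seconds):
    """Count calls that hit a stopped node while the consumer's list is stale.

    Simplifying assumptions: plain round-robin load balancing, and the
    stale provider list is used for the entire window.
    """
    total_calls = qps * window_seconds
    targets = cycle(providers)  # the consumer still sees every node
    return sum(1 for _ in range(total_calls) if next(targets) == stopped)


# Hypothetical cluster: four provider nodes, one of them already stopped.
nodes = ["10.0.0.1:20880", "10.0.0.2:20880", "10.0.0.3:20880", "10.0.0.4:20880"]

# 100 calls per second during a 50-second stale window.
failures = failed_calls_during_window(
    nodes, "10.0.0.2:20880", qps=100, window_seconds=50
)
print(failures)  # 1250 of the 5000 calls go to the stopped node
```

Under these assumptions a quarter of all calls fail until the consumer notices, which is why shortening the window matters more than anything else in the release process.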
- Why EDAS users don't need to repair data
When your application is deployed to EDAS, the lossless offline feature automatically enhances the release of a new version as follows. Focus on the green part of the figure:
1. Before the application is stopped, it actively deregisters from the registry and marks itself as offline.
2. When a request from a service consumer arrives, the service provider first processes the call normally and then tells the consumer that this node is offline. The consumer immediately removes the node from its call list.
3. From then on, the consumer no longer calls the offline node.
The lossless offline feature moves deregistration from the process-stop phase to the preStop phase. It also replaces the notification that used to depend on registry push: the service provider tells consumers directly to remove it from their call lists. This shortens the time consumers need to notice an offline node from minutes to near real time, so taking your application offline does not cost the business anything.
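The three steps above can be sketched in a few lines. This is only an illustration of the idea, not how EDAS implements it: the `Provider` and `Consumer` classes, the `pre_stop` method, and the `provider_offline` response field are hypothetical names chosen for the example.

```python
class Provider:
    def __init__(self, address):
        self.address = address
        self.offline = False

    def pre_stop(self):
        # Runs before the process is stopped: from now on every response
        # also carries an offline notice for the caller.
        self.offline = True

    def handle(self, request):
        # The call itself is still processed normally.
        return {"result": f"echo:{request}", "provider_offline": self.offline}


class Consumer:
    def __init__(self, providers):
        self.providers = list(providers)
        self._next = 0

    def call(self, request):
        if not self.providers:
            raise RuntimeError("no provider available")
        provider = self.providers[self._next % len(self.providers)]
        self._next += 1
        response = provider.handle(request)
        if response["provider_offline"]:
            # Drop the node right away instead of waiting for the registry.
            self.providers.remove(provider)
        return response["result"]


a, b = Provider("node-a"), Provider("node-b")
consumer = Consumer([a, b])
b.pre_stop()  # node-b is about to be released

results = [consumer.call(i) for i in range(4)]
print(results)                                 # all four calls succeed
print([p.address for p in consumer.providers])  # ['node-a']
```

The last call that reaches node-b is still answered, and it is also the call that removes node-b from the consumer's list, so no request is lost.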
- Canary release gives EDAS users extra protection
In an ordinary release, traffic is distributed evenly across the nodes by default.
Suppose there are four service provider nodes. As soon as one node runs the new version, 25% of the traffic reaches it. If the new version has a problem, 25% of online traffic is affected: data has to be repaired, the user experience suffers badly, and assets may even be lost.
The canary release feature lets EDAS users configure canary rules before they release a new version. Only traffic that matches the configured characteristics is routed to the new version, so you control exactly which traffic reaches it and can verify the new version safely.
As shown in the figure, EDAS users can configure canary rules before releasing.
Take Dubbo as an example. The configuration in the figure below means that, among the calls to com.alibaba.edas.demo.echoservice.echo (string string), only those whose parameter is “Hello world” are routed to the new version.
Before the service provider registers the new version with the registry, EDAS has already pushed the corresponding canary rules to the service consumers. On each call, the consumer evaluates the traffic against the canary rules, compares the result with the metadata in the provider list, and selects the right address to call.
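A minimal sketch of that consumer-side selection might look like this. The rule format, the `version` metadata key, the `Node` and `CanaryRule` classes, and the fallback to stable nodes are all assumptions made for illustration. They are not the EDAS rule schema.

```python
import random
from dataclasses import dataclass


@dataclass
class Node:
    address: str
    version: str  # metadata registered with the node: "stable" or "canary"


@dataclass
class CanaryRule:
    service: str
    method: str
    arg_index: int    # which argument of the call to inspect
    expected: object  # value that argument must equal

    def matches(self, service, method, args):
        return (
            service == self.service
            and method == self.method
            and len(args) > self.arg_index
            and args[self.arg_index] == self.expected
        )


def select_node(nodes, rule, service, method, args):
    """Route matching traffic to canary nodes and everything else to stable ones."""
    wanted = "canary" if rule.matches(service, method, args) else "stable"
    candidates = [n for n in nodes if n.version == wanted]
    if not candidates:  # assumed fallback: no canary node registered yet
        candidates = [n for n in nodes if n.version == "stable"]
    return random.choice(candidates)


SERVICE = "com.alibaba.edas.demo.echoservice"
nodes = [
    Node("10.0.0.1:20880", "stable"),
    Node("10.0.0.2:20880", "stable"),
    Node("10.0.0.3:20880", "stable"),
    Node("10.0.0.4:20880", "canary"),
]
rule = CanaryRule(SERVICE, "echo", arg_index=0, expected="Hello world")

print(select_node(nodes, rule, SERVICE, "echo", ["Hello world"]).address)  # 10.0.0.4:20880
print(select_node(nodes, rule, SERVICE, "echo", ["anything else"]).version)  # stable
```

Because the decision is made per call from the traffic itself, the share of traffic reaching the new version no longer depends on how many nodes run it.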
Besides the simple parameter comparison shown in the figure above, EDAS can also parse more complex structures in rule configuration. And if controlling the percentage of traffic is enough for your scenario, EDAS users can also run the canary directly by traffic percentage.
With EDAS canary release, the traffic that reaches the new version is no longer determined by the new version's share of the nodes but by the characteristics of the traffic, which you control freely. For example, you can route only the traffic of internal test accounts to the new version, so that you release carefully and verify boldly. So come to EDAS for easy releases.
Why don't EDAS users need to wake up in the middle of the night to restart machines?
- An open source framework can be dragged down by a single abnormal node
In a microservice architecture, when an instance of a service provider becomes abnormal and the consumers do not notice in time, service calls are affected, and with them the consumer's performance and even its availability.
In the example scenario above, the system consists of four applications, A, B, C, and D. Application A calls applications B, C, and D. When some instances of B, C, or D are abnormal (in the figure, applications B, C, and D have one or two abnormal instances) and application A cannot detect this, some of its calls fail. If the business code does not handle such failures gracefully, the performance of application A and even the availability of the whole system may suffer.
- Outlier removal safeguards the stability of the business system
To protect the performance and availability of an application, EDAS detects the availability of application instances and adjusts dynamically, so that service calls succeed and the stability and service quality of the business improve.
As shown in the figure below, EDAS users can configure application A on the console as follows to keep it stable.
- Exception type: a network exception means an IOException. A business exception means an HTTP status code of 500 in the response for the Spring Cloud framework, and an exception in the return value for the Dubbo framework.
- QPS lower limit: with too few calls, chance has too much influence on the judgment. You can therefore set a lower limit on QPS, and outlier removal is evaluated only once QPS reaches it. The default value is 1, and it can be set to 0.
- Lower limit of error rate: when the error rate in a service provider's responses exceeds the configured value, the provider is judged to need removal.
- Upper limit of the proportion of removed instances: to avoid removing so many nodes that the remaining ones are overloaded, an upper limit on the removal ratio must be configured. It cannot exceed 50%.
- Recovery detection unit time: removing an outlier node is temporary. After the unit time has passed, the consumer probes the node again. If the node has recovered, it is put back into the call list. If it keeps being removed, the removal time grows linearly up to a maximum.
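How the parameters above interact is easier to see in code. The sketch below is a simplified model, not the EDAS algorithm: the class and function names are hypothetical, and the default error-rate threshold, base removal time, and maximum removal time are assumed values picked only for the example.

```python
from dataclasses import dataclass


@dataclass
class NodeStats:
    address: str
    calls: int          # calls observed in the last second (QPS)
    errors: int         # failed calls in the same window
    ejections: int = 0  # how many times this node has been removed so far


def pick_outliers(stats, min_qps=1, error_rate_threshold=0.5, max_eject_ratio=0.5):
    """Return the nodes to remove, worst error rate first."""
    suspects = [
        s for s in stats
        if s.calls >= min_qps       # QPS lower limit
        and s.calls > 0             # avoid dividing by zero when min_qps is 0
        and s.errors / s.calls > error_rate_threshold
    ]
    suspects.sort(key=lambda s: s.errors / s.calls, reverse=True)
    limit = int(len(stats) * max_eject_ratio)  # cap on removed instances
    return suspects[:limit]


def ejection_seconds(node, base_seconds=30, max_seconds=300):
    """Removal time grows linearly with each removal, up to a maximum."""
    return min(base_seconds * (node.ejections + 1), max_seconds)


stats = [
    NodeStats("b-1", calls=100, errors=90),
    NodeStats("b-2", calls=100, errors=80),
    NodeStats("b-3", calls=100, errors=2),
    NodeStats("b-4", calls=100, errors=70),
]

print([n.address for n in pick_outliers(stats)])                       # ['b-1', 'b-2']
print(ejection_seconds(NodeStats("b-1", 100, 90, ejections=2)))        # 90
```

Three of the four nodes exceed the error-rate threshold here, but the 50% cap keeps two nodes in service, which is exactly the overload protection the removal ratio is meant to provide.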
With outlier removal, EDAS users are no longer woken up in the middle of the night to restart a machine because of a single abnormal node. Get a good night's sleep first; the business is not affected. When you wake up, the machine is still there with the scene preserved. Analyze it or restart it directly, the choice is yours.
Why are EDAS users confident about their services?
- Service query is clear at a glance
The well-known ZooKeeper component has no service query interface. The Eureka and Nacos registries provide a web console, but it only shows the IP and port of a service.
With EDAS service query, users can see not only which services an application has registered and on which IP and port, but also the methods and parameter types each service contains, along with which other applications and nodes subscribe to it.
However complex the organization and however many microservice modules there are, EDAS users can see exactly which services are being called. That lets them sort out service dependencies and assess impact with confidence.
- Precise control over service call permissions
As the business develops, services start to need permission control. For example, an application in the coupon department exposes both a coupon query interface and a coupon issuing interface. By default every application in the company may call the coupon query interface, but only some applications of the customer service and operations departments may call the coupon issuing interface.
As shown in the figure below, EDAS users can manage the permissions of their own services. Taking Dubbo as an example, the configuration in the figure means that the additemtocart method of the com.alibaba.edas.demo.echoservice service published by cartservice may only be called by frontend.
Besides authentication rules for a specific interface, service authentication also supports rules for an entire application, as well as authentication by caller IP.
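The three kinds of rules can be pictured with a small sketch. Everything in it is an assumption made for illustration, including the rule structure, the precedence (a method-level rule wins over the application-wide rule), and the default of allowing calls when no rule is configured. Only the service, method, and application names come from the example above.

```python
from dataclasses import dataclass, field


@dataclass
class AuthRule:
    allowed_apps: set = field(default_factory=set)
    allowed_ips: set = field(default_factory=set)

    def permits(self, caller_app, caller_ip):
        return caller_app in self.allowed_apps or caller_ip in self.allowed_ips


def is_allowed(method_rules, app_rule, service, method, caller_app, caller_ip):
    """Check a call against the method-level rule, else the application-wide rule."""
    rule = method_rules.get((service, method), app_rule)
    if rule is None:
        return True  # assumed default: no rule configured, every caller may call
    return rule.permits(caller_app, caller_ip)


SERVICE = "com.alibaba.edas.demo.echoservice"
method_rules = {
    (SERVICE, "additemtocart"): AuthRule(allowed_apps={"frontend"}),
}

# frontend may call the protected method, another (hypothetical) app may not.
print(is_allowed(method_rules, None, SERVICE, "additemtocart", "frontend", "10.0.0.5"))  # True
print(is_allowed(method_rules, None, SERVICE, "additemtocart", "checkout", "10.0.0.6"))  # False

# An application-wide rule that authenticates by caller IP.
ip_rule = AuthRule(allowed_ips={"10.0.0.6"})
print(is_allowed({}, ip_rule, SERVICE, "echo", "checkout", "10.0.0.6"))  # True
```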
Precise permission management gives you better control over microservice calls and helps keep the business compliant and the data secure.
The cost of EDAS microservice governance is really low
The cost of using EDAS microservice governance could hardly be lower. You don't need to modify any code or configuration. Deploy the application and you get the complete EDAS microservice governance capability.
As long as your application is based on a Spring Cloud or Dubbo version from the last five years, you can use the full EDAS microservice governance capability directly. Come and try it!
The Alibaba Cloud cloud-native microservice product R&D team is hiring. We are looking for like-minded people to build better microservice products, make application development easier and application operations more stable, and keep businesses online at all times.
Besides EDAS and MSE (Microservices Engine), we also work on ARMS (Application Real-Time Monitoring Service), ACM (Application Configuration Management), SAE (Serverless App Engine), and other cloud products. We look forward to hearing from you.
Contact information: Yizhang [email protected]
Xiao Jing (alias: Yizhan), technical expert at Alibaba Cloud Intelligence and member of the Spring Cloud Alibaba PMC. Mainly responsible for the research and development of Alibaba Cloud microservice products, with a focus on microservices, cloud native, and related technical directions.