Preface
As Guazi's business has grown, the scale of its systems has expanded steadily. Hundreds of Dubbo applications and thousands of Dubbo instances now run on Guazi's private cloud. Each department's business has developed rapidly, Dubbo versions were not unified along the way, and each department uses Dubbo in its own way. With the construction of a second data center, the need for a unified Dubbo version has become increasingly urgent. A few months ago, the company had a Dubbo-related production incident, which became the trigger for a company-wide upgrade to a version based on Dubbo 2.7.3.
Below, I start with this production incident, then describe the Dubbo version upgrade we carried out during this period, and finally outline our plan for running Dubbo across multiple data centers.
1、 Fixing the provider's failure to re-register because the ephemeral node was not deleted in time
Incident background
In the production environment, all of Guazi's business lines share one ZooKeeper cluster as the Dubbo registry. In September 2019, a switch in the data center failed, causing several minutes of network instability for the ZooKeeper cluster. After the cluster recovered, Dubbo providers should normally have re-registered with ZooKeeper quickly. However, a small number of providers stayed unregistered for a long time, and registration resumed only after the applications were manually restarted.
Investigation process
First, we surveyed the distribution of Dubbo versions across services and found that the problem occurred on most Dubbo versions, while the proportion of affected services was relatively low. We did not find a related issue on GitHub. We therefore inferred that the problem had not yet been fixed and occurs only occasionally under network instability.
Next, we cross-checked the application logs, the ZooKeeper logs, and the Dubbo code. The application logs showed that the provider re-registered immediately after the application reconnected to ZooKeeper, and nothing was logged after that. The ZooKeeper logs showed that the registered node was deleted and never recreated. In the corresponding Dubbo code, these logs are consistent with only two possibilities: doRegister(url) inside FailbackRegistry.register(url) either returned successfully, or its thread hung.
public void register(URL url) {
    super.register(url);
    failedRegistered.remove(url);
    failedUnregistered.remove(url);
    try {
        // Sending a registration request to the server side
        doRegister(url);
    } catch (Exception e) {
        Throwable t = e;
        // If the startup detection is opened, the Exception is thrown directly.
        boolean check = getUrl().getParameter(Constants.CHECK_KEY, true)
                && url.getParameter(Constants.CHECK_KEY, true)
                && !Constants.CONSUMER_PROTOCOL.equals(url.getProtocol());
        boolean skipFailback = t instanceof SkipFailbackWrapperException;
        if (check || skipFailback) {
            if (skipFailback) {
                t = t.getCause();
            }
            throw new IllegalStateException("Failed to register " + url + " to registry " + getUrl().getAddress() + ", cause: " + t.getMessage(), t);
        } else {
            logger.error("Failed to register " + url + ", waiting for retry, cause: " + t.getMessage(), t);
        }
        // Record a failed registration request to a failed list, retry regularly
        failedRegistered.add(url);
    }
}
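The retry-on-failure idea in this method can be sketched in isolation. The following is a simplified illustration, not Dubbo's implementation: the names SimpleFailbackRegistry, RegisterAction, retry() and pending() are hypothetical, URLs are plain strings, and the check flag and exception rethrow are left out. In a real registry a timer would call retry() periodically; here it is called by hand.

```java
import java.util.LinkedHashSet;
import java.util.Set;
import java.util.concurrent.ConcurrentHashMap;

class FailbackDemo {
    public static void main(String[] args) {
        final boolean[] registryUp = {false};
        SimpleFailbackRegistry registry = new SimpleFailbackRegistry(url -> {
            if (!registryUp[0]) {
                throw new IllegalStateException("registry unreachable");
            }
        });

        registry.register("dubbo://10.0.0.1:20880/com.example.DemoService");
        System.out.println("pending after failure: " + registry.pending().size()); // 1

        registryUp[0] = true;
        registry.retry();
        System.out.println("pending after retry: " + registry.pending().size());   // 0
    }
}

/** Hypothetical stand-in for the registry-specific doRegister(url) call. */
interface RegisterAction {
    void doRegister(String url) throws Exception;
}

/** Simplified failback registry: failed registrations are remembered and retried. */
class SimpleFailbackRegistry {
    private final Set<String> failedRegistered = ConcurrentHashMap.newKeySet();
    private final RegisterAction action;

    SimpleFailbackRegistry(RegisterAction action) {
        this.action = action;
    }

    void register(String url) {
        failedRegistered.remove(url);
        try {
            action.doRegister(url);
        } catch (Exception e) {
            // Only a thrown exception puts the URL on the retry list.
            failedRegistered.add(url);
        }
    }

    /** One retry pass over the failed URLs. */
    void retry() {
        for (String url : failedRegistered) {
            try {
                action.doRegister(url);
                failedRegistered.remove(url);
            } catch (Exception e) {
                // Still failing: keep the URL for the next pass.
            }
        }
    }

    /** Snapshot of the URLs that are still waiting for a retry. */
    Set<String> pending() {
        return new LinkedHashSet<>(failedRegistered);
    }
}
```

The point relevant to the investigation is that a doRegister call that returns normally is never retried, even if no node was actually created.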
Before continuing the investigation, a few concepts are worth reviewing. Dubbo uses Curator as its ZooKeeper client by default, and Curator maintains its connection to ZooKeeper through a session. When Curator reconnects to ZooKeeper, it keeps using the original session if that session has not expired; if the session has expired, it creates a new session and reconnects. Ephemeral nodes are bound to a session: after the session expires, the ephemeral nodes created under it are deleted.
Continuing to trace doRegister(url), we found that CuratorZookeeperClient.createEphemeral(path) contains logic that catches NodeExistsException: when creating an ephemeral node, if the node already exists, the creation is considered successful. At first glance this logic looks fine, and it works well in the following two common scenarios:
- The session has not expired. When the ephemeral node is created, the original node still exists and does not need to be recreated.
- The session has expired. By the time the ephemeral node is created, ZooKeeper has already deleted the original node, so the creation succeeds.
public void createEphemeral(String path) {
    try {
        client.create().withMode(CreateMode.EPHEMERAL).forPath(path);
    } catch (NodeExistsException e) {
        // An existing node is silently treated as a successful creation
    } catch (Exception e) {
        throw new IllegalStateException(e.getMessage(), e);
    }
}
However, there is an extreme scenario: ZooKeeper's session expiration and the deletion of that session's ephemeral nodes are not atomic. In other words, when the client learns that its session has expired, ZooKeeper may not yet have deleted the ephemeral nodes belonging to that session. If Dubbo creates the ephemeral node at this point, it finds that the original node still exists and does not create it again. ZooKeeper then deletes the old ephemeral node, and Dubbo believes the re-registration succeeded although no node remains. This is exactly the problem we hit in production.
At this point the root cause was identified. We then discussed it with the Dubbo community and learned that engineers at Kaola had run into the same problem, which confirmed our diagnosis.
Reproducing and fixing the problem
Once the cause was identified, we tried to reproduce it locally. A ZooKeeper session that has expired while its ephemeral node has not yet been deleted is hard to simulate directly, so we modified the ZooKeeper source code and added a sleep between session expiration and ephemeral node deletion. This indirectly simulated the extreme scenario and reproduced the problem locally.
During troubleshooting, we found that older versions of Kafka had hit a similar problem when using ZooKeeper. Following Kafka's fix, we settled on the fix for Dubbo: when creating the ephemeral node, catch NodeExistsException and, if the session ID of the existing ephemeral node differs from the session ID of the current client, delete and recreate the node. After fixing and verifying this internally, we submitted an issue and a PR to the community.
Similar Kafka issue: https://issues.apache.org/jira/browse/KAFKA-1387
Dubbo registration recovery issue: https://github.com/apache/dubbo/issues/5125
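The race and the session-aware fix described above can be illustrated with a small in-memory model. This is a sketch under stated assumptions, not the code from the Dubbo PR: FakeZooKeeper is a hypothetical stand-in for the server that only tracks which session owns each ephemeral path, cleanUpExpiredSession models the delayed server-side deletion, and a create call returning false stands in for NodeExistsException.

```java
import java.util.HashMap;
import java.util.Map;

class EphemeralRaceDemo {
    public static void main(String[] args) {
        String path = "/dubbo/com.example.DemoService/providers/node";
        long oldSession = 1L;
        long newSession = 2L;

        // Original behaviour: the stale node is accepted, then removed by late cleanup.
        FakeZooKeeper zk = new FakeZooKeeper();
        zk.create(path, oldSession);
        EphemeralRegistration.createNaive(zk, path, newSession);
        zk.cleanUpExpiredSession(oldSession);
        System.out.println("naive: registered = " + zk.exists(path));         // false

        // Fixed behaviour: the stale node is replaced by one owned by the new session.
        zk = new FakeZooKeeper();
        zk.create(path, oldSession);
        EphemeralRegistration.createSessionAware(zk, path, newSession);
        zk.cleanUpExpiredSession(oldSession);
        System.out.println("session-aware: registered = " + zk.exists(path)); // true
    }
}

/** Hypothetical in-memory model: maps each ephemeral path to the session that owns it. */
class FakeZooKeeper {
    private final Map<String, Long> ephemeralOwners = new HashMap<>();

    /** Returns false if the node already exists (a real client would see NodeExistsException). */
    boolean create(String path, long sessionId) {
        return ephemeralOwners.putIfAbsent(path, sessionId) == null;
    }

    /** Owner session of the node, or null if the node does not exist. */
    Long ownerOf(String path) {
        return ephemeralOwners.get(path);
    }

    void delete(String path) {
        ephemeralOwners.remove(path);
    }

    boolean exists(String path) {
        return ephemeralOwners.containsKey(path);
    }

    /** Delayed server-side cleanup: removes every node still owned by the expired session. */
    void cleanUpExpiredSession(long sessionId) {
        ephemeralOwners.values().removeIf(owner -> owner == sessionId);
    }
}

class EphemeralRegistration {
    /** Original behaviour: an existing node counts as a successful creation. */
    static void createNaive(FakeZooKeeper zk, String path, long sessionId) {
        zk.create(path, sessionId);
    }

    /** Fixed behaviour: a node owned by a different session is deleted and recreated. */
    static void createSessionAware(FakeZooKeeper zk, String path, long sessionId) {
        if (zk.create(path, sessionId)) {
            return;
        }
        Long owner = zk.ownerOf(path);
        if (owner == null || owner != sessionId) {
            zk.delete(path);
            zk.create(path, sessionId);
        }
    }
}
```

When the existing node already belongs to the current session, the session-aware variant leaves it untouched, so the common non-expired case behaves as before.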
2、 Guazi's Dubbo upgrade process
The fix for the problem above was settled, but we obviously could not apply it to every Dubbo version in use. After asking the community which Dubbo version it recommended, we decided to develop an internal version based on Dubbo 2.7.3 that fixes this problem, and to take the opportunity to drive a unified Dubbo upgrade across the company.
Why unify the Dubbo version
- With a unified Dubbo version, we can fix Dubbo problems internally (such as the registration recovery failure described above).
- Guazi is currently building its second data center, and some Dubbo services are gradually migrating to it. A unified Dubbo version also paves the way for running Dubbo across multiple data centers.
- It makes unified management and control of Dubbo services easier in the future.
- The current direction of the Dubbo community, such as support for gRPC and cloud native, matches what our company needs from Dubbo at this stage.
Why Dubbo 2.7.3
- We learned that Ctrip had already worked closely with the Dubbo community before us, had fully upgraded to community version 2.7.3, and was helping the community fix some compatibility problems in 2.7.3. We are grateful to them for running into the pitfalls ahead of us.
- Although Dubbo 2.7.3 was the latest version at the time, it had already been out for two months. Judging from the issues reported to the community, Dubbo 2.7.3 is much better than earlier versions in terms of compatibility.
- We also consulted members of the Dubbo community, who recommended 2.7.3 as the upgrade target.
Positioning of the internal version
The internal Dubbo version, developed on top of community Dubbo 2.7.3, is a transitional version. Its purpose is to fix the provider re-registration problem seen in production, along with some compatibility problems in community Dubbo 2.7.3. In the long run, Guazi's Dubbo should follow the community version rather than grow its own internal features. Therefore, every fix we make in the internal version is also contributed back to the community, so that we can later upgrade to a newer community version.
Compatibility verification and upgrade process
After asking members of the Dubbo community about their experience with version upgrades, we started the Dubbo upgrade in late September.
- Preliminary compatibility verification
  First, we compiled the compatibility cases that needed to be verified and checked how the Dubbo versions widely used in the company interoperate with Dubbo 2.7.3. The result: Dubbo 2.7.3 is compatible with the other Dubbo versions, with the exception of Dubbox. Dubbox is not compatible with Dubbo 2.7.3 because it changed the Dubbo protocol.
- Compatibility verification in production
  After the preliminary verification, we worked with the business lines to pick some less critical projects and further verified the compatibility of Dubbo 2.7.3 with other versions in the production environment. The compatibility problems found along the way were fixed in the internal version.
- Rolling out the upgrade across the company
  In early October, after completing the compatibility verification, we started to push the Dubbo upgrade in each business line. By early December, 30% of Dubbo services had been upgraded. According to the schedule, the unified upgrade is expected to be complete by the end of March 2020.
Summary of compatibility issues
Overall, the rollout of Dubbo 2.7.3 went fairly smoothly, although we did run into some compatibility problems:
- Creating a ZooKeeper node reports insufficient permissions
  The ZooKeeper username and password are configured in the Dubbo configuration file, but creating a ZooKeeper node throws KeeperErrorCode = NoAuth. This corresponds to two separate compatibility problems:
  - issue: https://github.com/apache/dubbo/issues/5076
    When no configuration center is configured, Dubbo uses the registry as the configuration center by default. The username and password are dropped when the configuration center is initialized from the registry's configuration, which causes this problem.
  - issue: https://github.com/apache/dubbo/issues/4991
    When connecting to ZooKeeper, Dubbo reuses a previously established connection based on the ZooKeeper address. When multiple registries use the same address but different permissions, the NoAuth error appears.
  Following the community's PRs, we fixed this in the internal version.
- Curator version compatibility
  - Dubbo 2.7.3 is not compatible with lower Curator versions, so we upgraded Curator to 4.2.0 by default:
    <dependency>
        <groupId>org.apache.curator</groupId>
        <artifactId>curator-framework</artifactId>
        <version>4.2.0</version>
    </dependency>
    <dependency>
        <groupId>org.apache.curator</groupId>
        <artifactId>curator-recipes</artifactId>
        <version>4.2.0</version>
    </dependency>
  - Elastic-Job-Lite, a distributed scheduling framework, depends heavily on a lower Curator version that is incompatible with the one Dubbo 2.7.3 uses, which blocks the Dubbo upgrade. Since Elastic-Job-Lite has not been maintained for a long time, some business lines plan to replace it with another scheduling framework.
- Compatibility between OpenFeign and Dubbo
  issue: https://github.com/apache/dubbo/issues/3990
  Dubbo's ServiceBean listens for Spring's ContextRefreshedEvent to expose services. OpenFeign triggers ContextRefreshedEvent early, before ServiceBean has finished initializing, which causes an exception at application startup.
  Following the community's PR, we fixed this in the internal version.
- RpcException compatibility
  Consumers on lower Dubbo versions do not recognize org.apache.dubbo.rpc.RpcException thrown by the provider. Therefore, while such consumers exist, it is not recommended to change the provider's com.alibaba.dubbo.rpc.RpcException to org.apache.dubbo.rpc.RpcException.
- QoS port conflicts
  Dubbo 2.7.3 enables the QoS function by default. When Dubbo services co-located on the same physical machine were upgraded, some of them failed because the QoS port was already in use. The problem went away after the QoS function was turned off.
- Custom extension compatibility
  The business lines have few custom Dubbo extensions, so we have not hit any difficult compatibility problems with them so far. Most problems are caused by the package rename, and the business lines fix them on their own.
- SkyWalking agent compatibility
  SkyWalking is widely used for distributed tracing in our projects. Because the plugin in SkyWalking agent 6.0 does not support Dubbo 2.7, we upgraded the SkyWalking agent to 6.1.
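To illustrate the NoAuth problem caused by reusing ZooKeeper connections by address, here is a minimal sketch of the two caching strategies. It is an assumption-laden illustration, not Dubbo's code and not necessarily what the community PR does: ZkClientHandle, ZkClientCache and both lookup methods are hypothetical names, and a "connection" is reduced to a record of the credentials it was opened with.

```java
import java.util.HashMap;
import java.util.Map;

class ZkClientCacheDemo {
    public static void main(String[] args) {
        ZkClientCache cache = new ZkClientCache();

        // Two registries share one address but use different credentials.
        ZkClientHandle a = cache.connectByAddress("zk.example:2181", "orders");
        ZkClientHandle b = cache.connectByAddress("zk.example:2181", "payments");
        System.out.println(a == b);     // true: the second registry reuses the first client
        System.out.println(b.username); // orders

        ZkClientHandle c = cache.connectByAddressAndUser("zk.example:2181", "orders");
        ZkClientHandle d = cache.connectByAddressAndUser("zk.example:2181", "payments");
        System.out.println(c == d);     // false
        System.out.println(d.username); // payments
    }
}

/** Hypothetical client handle that remembers the credentials it was opened with. */
class ZkClientHandle {
    final String address;
    final String username;

    ZkClientHandle(String address, String username) {
        this.address = address;
        this.username = username;
    }
}

class ZkClientCache {
    private final Map<String, ZkClientHandle> byAddress = new HashMap<>();
    private final Map<String, ZkClientHandle> byAddressAndUser = new HashMap<>();

    /** Problematic lookup: registries sharing an address share one client, whatever their credentials. */
    ZkClientHandle connectByAddress(String address, String username) {
        return byAddress.computeIfAbsent(address, a -> new ZkClientHandle(a, username));
    }

    /** Alternative lookup: the credentials are part of the cache key. */
    ZkClientHandle connectByAddressAndUser(String address, String username) {
        String key = username + "@" + address;
        return byAddressAndUser.computeIfAbsent(key, k -> new ZkClientHandle(address, username));
    }
}
```

With the address-only key, the second registry operates with the first registry's credentials, which is the situation in which a permission check on the server can fail.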
3、 Dubbo multi-data-center scheme
Guazi is currently building its second data center, and running Dubbo across multiple data centers is an important topic in that work. With a unified Dubbo version in place, the related research and development can proceed more smoothly.
Preliminary plan
We asked the Dubbo community for suggestions and, taking the current state of Guazi's cloud platform into account, settled on a preliminary multi-data-center scheme for Dubbo:
- Each data center runs its own independent ZooKeeper cluster, and no information is synchronized between clusters. This avoids cross-data-center latency and data synchronization problems within a ZooKeeper cluster.
- A Dubbo service registers only with the ZooKeeper cluster of its own data center, but subscribes to the ZooKeeper clusters of both data centers.
- Routing prefers providers in the same data center, to reduce the unnecessary network latency of cross-data-center calls.
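The register-locally, subscribe-everywhere rule in this plan can be modelled with a few lines of Java. This is only a conceptual sketch: DataCenterRegistries and its methods are hypothetical names, and each ZooKeeper cluster is reduced to an in-memory map from service name to provider addresses.

```java
import java.util.Collections;
import java.util.HashMap;
import java.util.LinkedHashMap;
import java.util.LinkedHashSet;
import java.util.Map;
import java.util.Set;

class MultiRegistryDemo {
    public static void main(String[] args) {
        DataCenterRegistries registries = new DataCenterRegistries("dc1", "dc2");
        registries.register("dc1", "com.example.DemoService", "10.0.1.1:20880");
        registries.register("dc2", "com.example.DemoService", "10.0.2.1:20880");

        // Each registry only knows the providers of its own data center.
        System.out.println(registries.providersIn("dc1", "com.example.DemoService"));
        // A consumer sees the merged view of both registries.
        System.out.println(registries.subscribe("com.example.DemoService"));
    }
}

/** Hypothetical model: one independent registry per data center, no synchronization between them. */
class DataCenterRegistries {
    // data center name -> (service name -> provider addresses)
    private final Map<String, Map<String, Set<String>>> registries = new LinkedHashMap<>();

    DataCenterRegistries(String... dataCenters) {
        for (String dc : dataCenters) {
            registries.put(dc, new HashMap<>());
        }
    }

    /** A provider registers only with the registry of its own data center. */
    void register(String localDataCenter, String service, String address) {
        Map<String, Set<String>> registry = registries.get(localDataCenter);
        if (registry == null) {
            throw new IllegalArgumentException("unknown data center: " + localDataCenter);
        }
        registry.computeIfAbsent(service, s -> new LinkedHashSet<>()).add(address);
    }

    /** A consumer subscribes to every registry and merges the results. */
    Set<String> subscribe(String service) {
        Set<String> merged = new LinkedHashSet<>();
        for (Map<String, Set<String>> registry : registries.values()) {
            merged.addAll(registry.getOrDefault(service, Collections.emptySet()));
        }
        return merged;
    }

    /** Providers of a service as seen by a single data center's registry. */
    Set<String> providersIn(String dataCenter, String service) {
        Map<String, Set<String>> registry = registries.getOrDefault(dataCenter, Collections.emptyMap());
        return registry.getOrDefault(service, Collections.emptySet());
    }
}
```

Because the merged view contains providers from both data centers, the consumer still needs a routing rule to prefer the local ones, which is the third point of the plan.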
Preferring calls within the same data center
Implementing same-data-center priority calls in Dubbo is relatively simple. The logic is as follows:
- The Guazi cloud platform injects the data center flag into the container's environment variables by default.
- When a provider exposes a service, it reads the data center flag from the environment variables and appends it to the URL of the service being exposed.
- When a consumer calls a provider, it reads the data center flag from its own environment variables and, according to the routing policy, prefers providers that carry the same flag.
Based on the logic above, we implemented a simple version of Dubbo routing through environment variables and submitted a PR to the community.
PR for Dubbo routing through environment variables: https://github.com/apache/dubbo/pull/5348
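The routing rule described above can be sketched as a plain filter with a fallback. This is an illustrative sketch rather than the code in the PR: ProviderEndpoint and SameDataCenterRouter are hypothetical names, the environment variable name DATA_CENTER is an assumption, and falling back to all providers when none match locally is one reasonable policy, not a confirmed detail of the original implementation.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

class SameDataCenterRoutingDemo {
    public static void main(String[] args) {
        List<ProviderEndpoint> providers = Arrays.asList(
                new ProviderEndpoint("10.0.1.1:20880", "dc1"),
                new ProviderEndpoint("10.0.2.1:20880", "dc2"));

        // DATA_CENTER is an assumed variable name; the platform's real name may differ.
        String consumerDataCenter = System.getenv("DATA_CENTER");
        for (ProviderEndpoint p : SameDataCenterRouter.route(providers, consumerDataCenter)) {
            System.out.println(p.address + " (" + p.dataCenter + ")");
        }
    }
}

/** Hypothetical provider descriptor: an address plus the data center flag carried in its URL. */
class ProviderEndpoint {
    final String address;
    final String dataCenter;

    ProviderEndpoint(String address, String dataCenter) {
        this.address = address;
        this.dataCenter = dataCenter;
    }
}

class SameDataCenterRouter {
    /**
     * Returns the providers that carry the consumer's data center flag,
     * or all providers when the flag is unset or matches none of them.
     */
    static List<ProviderEndpoint> route(List<ProviderEndpoint> providers, String consumerDataCenter) {
        if (consumerDataCenter == null || consumerDataCenter.isEmpty()) {
            return providers;
        }
        List<ProviderEndpoint> local = new ArrayList<>();
        for (ProviderEndpoint p : providers) {
            if (consumerDataCenter.equals(p.dataCenter)) {
                local.add(p);
            }
        }
        return local.isEmpty() ? providers : local;
    }
}
```

The fallback keeps a service callable when its local providers are all down, at the cost of a cross-data-center call.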
About the author: Li Jintao works on infrastructure at Guazi Used Cars. He is currently mainly responsible for the upgrade and rollout of the Dubbo version and for promoting SkyWalking within the company.
Author: Li Jintao
This article is original content from Alibaba Cloud and may not be reproduced without permission.