Guazi Used Cars: Thinking and Practice on the Dubbo Version Upgrade and Multi-Data-Center Plan

Time: 2020-10-21

Preface

As Guazi's business has grown, its systems have grown with it. Hundreds of Dubbo applications and thousands of Dubbo instances now run on Guazi's private cloud. Because each department's business developed quickly, Dubbo versions were never unified, and each department uses Dubbo in its own way. With the construction of a second data center, the need for a unified Dubbo version has become increasingly urgent. A few months ago, the company had a Dubbo-related production incident, which became the trigger for unifying on a version based on Dubbo 2.7.3.

Below, I start with this production incident, then describe the Dubbo version upgrade we carried out during this period, and finally outline our plan for running Dubbo across multiple data centers.

1. Fixing the provider's failure to re-register because the ephemeral node was not deleted in time

Incident background

In production, all Guazi business lines share a single ZooKeeper cluster as the Dubbo registry. In September 2019, a switch in the data center failed, causing several minutes of network fluctuation for the ZooKeeper cluster. Normally, once the ZooKeeper cluster recovers, Dubbo providers should quickly re-register with ZooKeeper. However, a small number of providers did not re-register for a long time, and their registration was restored only after the applications were manually restarted.

Investigation process

First, we surveyed the distribution of Dubbo versions among our services and found that the problem existed in most Dubbo versions, although only a small proportion of services were affected. We found no related issue on GitHub. We therefore inferred that the problem had not yet been fixed and occurred only occasionally under network fluctuation.

Next, we cross-checked the application logs, the ZooKeeper logs, and the Dubbo code. In the application logs, the provider re-registered immediately after the application reconnected to ZooKeeper, and nothing further was logged. In the ZooKeeper logs, the registered node was deleted and never recreated. In the Dubbo code, the logs can only be explained if doRegister(url), called from FailbackRegistry.register(url), either completed successfully or left the thread blocked.

    public void register(URL url) {
        super.register(url);
        failedRegistered.remove(url);
        failedUnregistered.remove(url);
        try {
            // Sending a registration request to the server side
            doRegister(url);
        } catch (Exception e) {
            Throwable t = e;

            // If the startup detection is opened, the Exception is thrown directly.
            boolean check = getUrl().getParameter(Constants.CHECK_KEY, true)
                    && url.getParameter(Constants.CHECK_KEY, true)
                    && !Constants.CONSUMER_PROTOCOL.equals(url.getProtocol());
            boolean skipFailback = t instanceof SkipFailbackWrapperException;
            if (check || skipFailback) {
                if (skipFailback) {
                    t = t.getCause();
                }
                throw new IllegalStateException("Failed to register " + url + " to registry " + getUrl().getAddress() + ", cause: " + t.getMessage(), t);
            } else {
                logger.error("Failed to register " + url + ", waiting for retry, cause: " + t.getMessage(), t);
            }

            // Record a failed registration request to a failed list, retry regularly
            failedRegistered.add(url);
        }
    }

Before continuing with the investigation, a few concepts: Dubbo uses Curator as its ZooKeeper client by default, and Curator maintains its connection to ZooKeeper through a session. When Curator reconnects to ZooKeeper, it keeps using the original session if that session has not expired; if the session has expired, it creates a new session and reconnects. Ephemeral nodes are bound to a session. After a session expires, the ephemeral nodes created under it are deleted.

Digging further into doRegister(url), we found that CuratorZookeeperClient.createEphemeral(path) catches NodeExistsException: if the ephemeral node already exists when it is being created, the creation is treated as successful. At first glance this logic looks fine, and it works well in the following two common scenarios:

  1. The session has not expired. When the ephemeral node is created, the original node still exists and does not need to be recreated.
  2. The session has expired. By the time the ephemeral node is created, ZooKeeper has already deleted the original node, so the creation succeeds.

    public void createEphemeral(String path) {
        try {
            client.create().withMode(CreateMode.EPHEMERAL).forPath(path);
        } catch (NodeExistsException e) {
            // The node already exists: treated as a successful creation
        } catch (Exception e) {
            throw new IllegalStateException(e.getMessage(), e);
        }
    }

There is, however, an edge case: session expiration and deletion of the session's ephemeral nodes are not atomic in ZooKeeper. In other words, when the client learns that its session has expired, the ephemeral nodes belonging to that session may not yet have been deleted by ZooKeeper. At that moment Dubbo tries to create the ephemeral node, finds that the original node still exists, and does not create it again. ZooKeeper then deletes the old ephemeral node, so Dubbo believes re-registration succeeded when in fact it did not. This is exactly the problem we hit in production.
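
The ordering described here can be replayed with a small, self-contained Java sketch. It does not use the real ZooKeeper or Curator API: ToyZooKeeper, RegistrationRaceDemo, the node path, and the session IDs are hypothetical names made up for illustration, and an in-memory map stands in for the server.

```java
import java.util.HashMap;
import java.util.Map;

/** Toy in-memory stand-in for a ZooKeeper server; hypothetical, not the real API. */
class ToyZooKeeper {
    /** Ephemeral node path -> id of the session that owns it. */
    private final Map<String, Long> owners = new HashMap<>();

    /** Returns false when the node already exists (the NodeExists case). */
    boolean createEphemeral(String path, long sessionId) {
        return owners.putIfAbsent(path, sessionId) == null;
    }

    boolean exists(String path) {
        return owners.containsKey(path);
    }

    /** Server-side cleanup, which may run some time after the session has expired. */
    void deleteNodesOfSession(long sessionId) {
        owners.values().removeIf(owner -> owner == sessionId);
    }
}

class RegistrationRaceDemo {
    /** Mirrors the old behavior: "already exists" is silently treated as success. */
    static void registerIgnoringExisting(ToyZooKeeper zk, String path, long sessionId) {
        zk.createEphemeral(path, sessionId); // result ignored, like the empty catch block
    }

    /** Replays the problematic ordering and reports whether the provider is still registered. */
    static boolean replay() {
        ToyZooKeeper zk = new ToyZooKeeper();
        String path = "/dubbo/com.example.DemoService/providers/10.0.0.1:20880";
        long oldSession = 1L;
        long newSession = 2L;

        registerIgnoringExisting(zk, path, oldSession); // initial registration
        // Network fluctuation: oldSession expires and the client reconnects with newSession.
        registerIgnoringExisting(zk, path, newSession); // old node still present, nothing created
        zk.deleteNodesOfSession(oldSession);            // delayed cleanup removes the stale node
        return zk.exists(path);
    }

    public static void main(String[] args) {
        System.out.println("provider still registered: " + replay());
    }
}
```

In this toy model the provider ends up unregistered even though every registration call appeared to succeed, which matches what the application and ZooKeeper logs showed.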

At this point the root cause had been identified. We then discussed it with the Dubbo community and learned that engineers at Kaola had encountered the same problem, which confirmed our diagnosis.

Reproducing and fixing the problem

With the cause identified, we tried to reproduce the problem locally. Because it is hard to directly simulate a ZooKeeper session that has expired while its ephemeral nodes have not yet been deleted, we modified the ZooKeeper source code to add a sleep between session expiration and ephemeral node deletion. This indirectly simulated the edge case and reproduced the problem locally.

During troubleshooting, we found that older versions of Kafka had hit a similar problem when using ZooKeeper. Following Kafka's fix, we settled on the fix for Dubbo: when creating an ephemeral node, catch NodeExistsException, and if the session ID that owns the existing ephemeral node differs from the current client's session ID, delete the node and recreate it. After fixing and verifying this internally, we submitted an issue and a PR to the community.
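
A minimal sketch of this delete-and-recreate logic, written against an in-memory stand-in instead of the real Curator client. ToyRegistry, EphemeralFixSketch, and all paths and session IDs are hypothetical; this is not the code from the actual PR. With a real client, the owning session would presumably be read from the node's Stat (getEphemeralOwner) and compared with the client's current session ID.

```java
import java.util.HashMap;
import java.util.Map;

/** In-memory stand-in for the registry; hypothetical, not the ZooKeeper or Curator API. */
class ToyRegistry {
    static class NodeExistsException extends Exception {}

    /** Ephemeral node path -> id of the owning session. */
    private final Map<String, Long> owners = new HashMap<>();

    void create(String path, long sessionId) throws NodeExistsException {
        if (owners.putIfAbsent(path, sessionId) != null) {
            throw new NodeExistsException();
        }
    }

    /** Owning session of the node, or null if the node does not exist. */
    Long ownerOf(String path) {
        return owners.get(path);
    }

    void delete(String path) {
        owners.remove(path);
    }

    /** Delayed server-side cleanup of an expired session's ephemeral nodes. */
    void deleteNodesOfSession(long sessionId) {
        owners.values().removeIf(owner -> owner == sessionId);
    }
}

class EphemeralFixSketch {
    /** Creates the node; a leftover node owned by another session is deleted and recreated. */
    static void createEphemeral(ToyRegistry zk, String path, long currentSession) {
        try {
            zk.create(path, currentSession);
        } catch (ToyRegistry.NodeExistsException e) {
            Long owner = zk.ownerOf(path);
            if (owner != null && owner == currentSession) {
                return; // our own live node: nothing to do
            }
            // Leftover from an expired session (or it vanished meanwhile): rebuild it.
            zk.delete(path);
            try {
                zk.create(path, currentSession);
            } catch (ToyRegistry.NodeExistsException again) {
                throw new IllegalStateException("could not recreate " + path, again);
            }
        }
    }

    /** Replays the edge case and returns the session that owns the node at the end. */
    static Long replay() {
        ToyRegistry zk = new ToyRegistry();
        String path = "/dubbo/com.example.DemoService/providers/10.0.0.1:20880";
        createEphemeral(zk, path, 1L); // initial registration under session 1
        createEphemeral(zk, path, 2L); // session 1 expired, but its node is still present
        zk.deleteNodesOfSession(1L);   // delayed cleanup no longer removes our node
        return zk.ownerOf(path);
    }

    public static void main(String[] args) {
        System.out.println("node owner after cleanup: " + replay());
    }
}
```

Because the recreated node belongs to the new session, the delayed cleanup of the old session leaves it alone.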

Similar Kafka issue: https://issues.apache.org/jira/browse/KAFKA-1387

Dubbo registration recovery issue: https://github.com/apache/dubbo/issues/5125

2. Guazi's Dubbo upgrade process

The fix for the above problem had been decided, but we obviously could not patch every Dubbo version in use. After asking the community which Dubbo version it recommended, we decided to develop an internal version based on Dubbo 2.7.3 to fix this problem, and to take the opportunity to push a company-wide unified upgrade of Dubbo.

Why unify the Dubbo version

  1. With a unified Dubbo version, we can fix certain Dubbo problems internally (such as the registration recovery failure described above).
  2. Guazi is currently building its second data center, and some Dubbo services are gradually migrating to it. A unified Dubbo version also paves the way for running Dubbo across multiple data centers.
  3. It helps us manage and control Dubbo services uniformly in the future.
  4. The Dubbo community's current direction, such as support for gRPC and cloud native, matches what our company needs from Dubbo at this stage.

Why Dubbo 2.7.3

  1. We learned that Ctrip had already been working closely with the Dubbo community, had fully upgraded to community version 2.7.3, and was helping the community fix some compatibility problems in 2.7.3. Thanks to Ctrip for hitting the pitfalls ahead of us.
  2. Although Dubbo 2.7.3 was the latest version at the time, it had already been out for two months. Judging from the issues reported to the community, its compatibility was much better than that of the previous versions.
  3. We also consulted members of the Dubbo community, who recommended upgrading to 2.7.3.

Positioning of the internal version

The internal Dubbo version, developed from community Dubbo 2.7.3, is a transitional version. Its purpose is to fix the problem of online providers failing to restore their registration, along with some compatibility problems in community Dubbo 2.7.3. Ultimately, Guazi's Dubbo should follow the community version instead of growing its own internal features. Therefore, every fix we make in the internal version is also synchronized to the community, so that we can later upgrade to a newer community release.

Compatibility verification and upgrade process

We started the Dubbo version upgrade in late September, after consulting members of the Dubbo community about their upgrade experience.

  1. Preliminary compatibility verification
    First, we compiled a set of compatibility cases to verify and tested Dubbo 2.7.3 against the Dubbo versions most widely used in the company. The tests showed that Dubbo 2.7.3 is compatible with all of them except Dubbox. Dubbox is incompatible with Dubbo 2.7.3 because it changed the Dubbo protocol.
  2. Production environment compatibility verification
    After the preliminary verification, we worked with the business lines to pick some less critical projects and further verify, in production, the compatibility of Dubbo 2.7.3 with other versions. The compatibility issues found were fixed in the internal version.
  3. Promoting the company-wide Dubbo upgrade
    In early October, after completing the compatibility verification, we began promoting the Dubbo upgrade across the business lines. By early December, 30% of Dubbo services had been upgraded. According to the schedule, the unified upgrade is expected to be completed by the end of March 2020.

Summary of compatibility issues

Overall, promoting the upgrade to Dubbo 2.7.3 has gone fairly smoothly, although we did run into some compatibility problems:

  • No-permission error when creating a ZooKeeper node
    The ZooKeeper username and password were configured in the Dubbo configuration file, but KeeperErrorCode = NoAuth was thrown when the ZooKeeper node was created. This corresponds to two compatibility problems:

    • issue: https://github.com/apache/dubbo/issues/5076
      When no configuration center is configured, Dubbo uses the registry as the configuration center by default. The problem arises because the username and password are dropped when the configuration center's configuration is initialized from the registry's configuration.
    • issue: https://github.com/apache/dubbo/issues/4991
      When connecting to ZooKeeper, Dubbo reuses a previously established connection based on the ZooKeeper address. When multiple registries use the same address but have different permissions, the NoAuth error appears.

    Referring to the community's PRs, we fixed both problems in the internal version.

  • Curator version compatibility

    • Dubbo 2.7.3 is not compatible with older Curator versions, so we upgraded the default Curator version to 4.2.0:
    <dependency>
        <groupId>org.apache.curator</groupId>
        <artifactId>curator-framework</artifactId>
        <version>4.2.0</version>
    </dependency>
    <dependency>
        <groupId>org.apache.curator</groupId>
        <artifactId>curator-recipes</artifactId>
        <version>4.2.0</version>
    </dependency>
    • Elastic-Job-Lite, a distributed scheduling framework, depends heavily on an older Curator version that is incompatible with the Curator version used by Dubbo 2.7.3, which blocks the Dubbo upgrade. Since Elastic-Job-Lite has not been maintained for a long time, some business lines plan to replace it with another scheduling framework.
  • Compatibility between OpenFeign and Dubbo
    issue: https://github.com/apache/dubbo/issues/3990
    Dubbo's ServiceBean listens for Spring's ContextRefreshedEvent to expose services. OpenFeign triggers ContextRefreshedEvent early, before ServiceBean has finished initializing, which causes an exception at application startup.
    Referring to the community's PR, we fixed this issue in the internal version.

  • RpcException compatibility
    Consumers on lower Dubbo versions do not recognize org.apache.dubbo.rpc.RpcException. Therefore, while such consumers remain, it is not recommended to change the provider's com.alibaba.dubbo.rpc.RpcException to org.apache.dubbo.rpc.RpcException.
  • QoS port conflicts
    Dubbo 2.7.3 enables QoS by default, which caused QoS port conflicts when upgrading Dubbo services on physical machines where several services are co-located. Turning off the QoS function resolved the problem.
  • Custom extension compatibility
    The business lines have few custom Dubbo extensions, so we have not hit any difficult problems here. Most problems are caused by the package change, and the business lines fix them themselves.
  • SkyWalking agent compatibility
    Our projects generally use SkyWalking for distributed tracing. Because the plugins in SkyWalking agent 6.0 do not support Dubbo 2.7, we upgraded the SkyWalking agent to 6.1.
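
To make the connection-reuse problem from the NoAuth item above (issue 4991) more concrete, here is a small Java sketch. It is only an illustration under assumptions: ToyConnection, ConnectionCache, the address, and the usernames are hypothetical, and keying the cache by credentials is one possible remedy, not necessarily what the community PR does.

```java
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

/** Hypothetical stand-in for a ZooKeeper connection that remembers its credentials. */
class ToyConnection {
    final String address;
    final String username;

    ToyConnection(String address, String username) {
        this.address = address;
        this.username = username;
    }
}

class ConnectionCache {
    private final Map<String, ToyConnection> connections = new ConcurrentHashMap<>();
    private final boolean keyIncludesCredentials;

    ConnectionCache(boolean keyIncludesCredentials) {
        this.keyIncludesCredentials = keyIncludesCredentials;
    }

    /** Returns the cached connection for the key, opening a new one only on a cache miss. */
    ToyConnection connect(String address, String username) {
        String key = keyIncludesCredentials ? username + "@" + address : address;
        return connections.computeIfAbsent(key, k -> new ToyConnection(address, username));
    }
}

class ConnectionReuseDemo {
    /** Username of the connection handed to a second registry on the same address. */
    static String secondRegistryUser(boolean keyIncludesCredentials) {
        ConnectionCache cache = new ConnectionCache(keyIncludesCredentials);
        cache.connect("zk.example.internal:2181", "team-a");
        return cache.connect("zk.example.internal:2181", "team-b").username;
    }

    public static void main(String[] args) {
        System.out.println(secondRegistryUser(false)); // address-only key
        System.out.println(secondRegistryUser(true));  // address plus credentials key
    }
}
```

With an address-only key, the second registry silently receives a connection opened with the first registry's credentials, which is the kind of mismatch that would surface as NoAuth.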

3. Dubbo multi-data-center plan

Guazi is currently building its second data center, and running Dubbo across multiple data centers is an important part of that work. With the Dubbo version unified, we can carry out the related research and development much more smoothly.

Preliminary plan

We asked the Dubbo community for suggestions and, taking into account the current state of the Guazi cloud platform, settled on a preliminary multi-data-center plan for Dubbo.

  1. Deploy an independent ZooKeeper cluster in each data center, with no synchronization between clusters. This way the ZooKeeper clusters face no cross-data-center latency or data synchronization problems.
  2. When registering, a Dubbo service registers only with the ZooKeeper cluster in its own data center; when subscribing, it subscribes to the ZooKeeper clusters of both data centers.
  3. Implement routing logic that prefers calls within the same data center, to reduce the unnecessary network latency of cross-data-center calls.
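
The first two points can be sketched in Java as follows. This is a toy model under assumptions, not Dubbo's registry API: ToyDcRegistry, MultiDcNode, the data center names, and the addresses are hypothetical, and each in-memory map stands in for one data center's ZooKeeper cluster.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

/** In-memory stand-in for one data center's independent ZooKeeper cluster; hypothetical. */
class ToyDcRegistry {
    private final Map<String, List<String>> providers = new HashMap<>();

    void register(String service, String address) {
        providers.computeIfAbsent(service, s -> new ArrayList<>()).add(address);
    }

    List<String> lookup(String service) {
        return providers.getOrDefault(service, new ArrayList<>());
    }
}

/** One Dubbo instance's view: it registers locally and subscribes to every data center. */
class MultiDcNode {
    private final String localDc;
    private final Map<String, ToyDcRegistry> registries;

    MultiDcNode(String localDc, Map<String, ToyDcRegistry> registries) {
        this.localDc = localDc;
        this.registries = registries;
    }

    /** Registration goes only to the registry of the local data center. */
    void register(String service, String address) {
        registries.get(localDc).register(service, address);
    }

    /** Subscription merges the provider lists of all data centers. */
    List<String> subscribe(String service) {
        List<String> all = new ArrayList<>();
        for (ToyDcRegistry registry : registries.values()) {
            all.addAll(registry.lookup(service));
        }
        return all;
    }
}

class MultiDcDemo {
    public static void main(String[] args) {
        Map<String, ToyDcRegistry> registries = new LinkedHashMap<>();
        registries.put("dc-a", new ToyDcRegistry());
        registries.put("dc-b", new ToyDcRegistry());

        new MultiDcNode("dc-a", registries).register("DemoService", "10.0.0.1:20880");
        new MultiDcNode("dc-b", registries).register("DemoService", "10.1.0.1:20880");

        // A consumer in dc-b sees the providers of both data centers.
        System.out.println(new MultiDcNode("dc-b", registries).subscribe("DemoService"));
    }
}
```

Since the clusters never synchronize, nothing is written across data centers; only consumers read from both, and the merged list is what the routing step then narrows down.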

Same-data-center priority calls

Implementing same-data-center priority calls in Dubbo is relatively simple. The logic is as follows:

  1. By default, the Guazi cloud platform injects the data center's identifier into the container's environment variables.
  2. When a provider exposes a service, it reads the data center identifier from the environment variables and appends it to the URL of the exposed service.
  3. When a consumer calls a provider, it reads the data center identifier from its environment variables and, according to the routing policy, prefers providers with the same identifier.
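
The routing step can be sketched in Java roughly as follows. This is an illustration under assumptions, not the code of the actual PR: TaggedProvider and SameDcRouter are hypothetical names, "DC_TAG" is a made-up environment variable name (the article does not name the real one), and falling back to all providers when no same-data-center provider exists is an assumed reading of "prefers".

```java
import java.util.Arrays;
import java.util.List;
import java.util.stream.Collectors;

/** Simplified provider entry: address plus the data center tag appended to its URL. */
class TaggedProvider {
    final String address;
    final String dcTag;

    TaggedProvider(String address, String dcTag) {
        this.address = address;
        this.dcTag = dcTag;
    }
}

class SameDcRouter {
    /** Prefers providers whose tag matches localTag; otherwise returns all providers. */
    static List<TaggedProvider> route(List<TaggedProvider> providers, String localTag) {
        if (localTag == null || localTag.isEmpty()) {
            return providers; // no tag injected: no preference
        }
        List<TaggedProvider> sameDc = providers.stream()
                .filter(p -> localTag.equals(p.dcTag))
                .collect(Collectors.toList());
        // Assumed fallback: cross-data-center calls are allowed when nothing local exists.
        return sameDc.isEmpty() ? providers : sameDc;
    }

    public static void main(String[] args) {
        // "DC_TAG" is a hypothetical variable name used only for this sketch.
        String localTag = System.getenv("DC_TAG");
        List<TaggedProvider> providers = Arrays.asList(
                new TaggedProvider("10.0.0.1:20880", "dc-a"),
                new TaggedProvider("10.1.0.1:20880", "dc-b"));
        for (TaggedProvider p : route(providers, localTag)) {
            System.out.println(p.address);
        }
    }
}
```

Keeping the fallback means a data center without a local provider for some service can still reach the remote one, at the cost of cross-data-center latency.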

Based on the above logic, we implemented a simple feature for Dubbo routing through environment variables and submitted a PR to the community.

PR for Dubbo routing through environment variables: https://github.com/apache/dubbo/pull/5348

About the author: Li Jintao works on infrastructure at Guazi Used Cars. He is currently mainly responsible for the Dubbo version upgrade and rollout, and for promoting SkyWalking within the company.


Author: Li Jintao


This article is Alibaba Cloud content and may not be reproduced without permission.