How can complex systems be upgraded without downtime while maintaining stability? You have to consider the following points



In the Internet industry, upgrading online services is routine. According to our statistics, Xianyu engineers shipped more than 1,000 releases in the past quarter, updating more than one million lines of code in total.

Some of these releases change only a few lines of code, while others migrate and upgrade an entire cluster. However large the change, we must keep online services available and make the change imperceptible to users. This article uses the migration and upgrade of the Xianyu search service as an example to introduce the technical approach behind it.

Basic architecture of the Xianyu search service

The underlying search service of Xianyu consists of search planner, query planner, rank service, and the search engine Heaven Ask 3 (HA3). The calling relationship between them is shown in the figure below:

[Figure: calling relationship between search planner, query planner, rank service, and the search engine]

The whole search service is composed of several independent microservices. The microservices are isolated from each other and provide their functionality through predefined interfaces. All of them are ultimately aggregated behind search planner, which exposes a unified and complete search capability.

On top of the underlying search service sit the business logic layer and the access gateway layer; their architecture is not described here. A user’s search request first passes through the gateway layer to the logic layer for processing, and is then sent to the underlying search service. This request chain spans dozens of clusters, the call depth reaches double digits, and hundreds of servers may be involved along the way.

For such a complex system, the upgrade obviously cannot be done in one step. The good news is that the reasonable decoupling between the microservices makes the work much easier: it prevents a change in one place from rippling through the whole system, so we can handle the upgrade category by category.

  • Note 1: search planner is the gateway layer of the search service, built on a development framework that is functional, service-oriented, visual, and parallel.
  • Note 2: the main job of query planner is to understand the user’s input and then algorithmically optimize the search terms so that better search results are returned.
  • Note 3: rank service is a real-time scoring and ranking service. It scores the candidate results recalled by the search engine according to multi-dimensional features. The higher the score, the more likely the item is to appear near the top of the search results.
  • Note 4: Heaven Ask 3 (HA3) is a stable, efficient, and powerful search engine developed by Alibaba. It supports search for Alibaba Group’s core businesses, including Taobao and Tmall.

Maintaining compatibility

Before upgrading, we first need to confirm whether the upgraded service remains forward and backward compatible. Maintaining compatibility reduces both the workload and the risk of failures caused by the upgrade.

To avoid incompatibilities caused by upgrades, we follow a few development principles:

  • Remote procedure call (RPC) interfaces must ignore unknown parameters and tolerate missing ones.
  • Before deleting an existing parameter, confirm with all dependent parties. Instead of removing the parameter outright, first mark it as deprecated.
  • When using parameters, distinguish between a default value and a missing value.
  • If an interface change cannot be made compatibly, create a new interface to replace the old one. Do not break the compatibility of the old interface.
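A minimal sketch of the first and third principles, assuming a request arrives as a plain dictionary. The field names (`keyword`, `page_size`, `category`), the default of 20, and the `parse_request` helper are hypothetical illustrations, not part of the Xianyu codebase.

```python
from dataclasses import dataclass
from typing import Any, Mapping, Optional

_MISSING = object()        # sentinel meaning "the caller did not send this field"
DEFAULT_PAGE_SIZE = 20     # hypothetical default


@dataclass
class SearchRequest:
    keyword: str
    page_size: int           # effective value after applying the default
    page_size_given: bool    # True only if the caller sent page_size explicitly
    category: Optional[str]  # None when the caller omitted it


def parse_request(payload: Mapping[str, Any]) -> SearchRequest:
    """Read only the fields this version knows about.

    Keys that are not read here (for example fields added by a newer caller)
    are silently ignored, and absent fields fall back to defaults.
    """
    raw_page_size = payload.get("page_size", _MISSING)
    given = raw_page_size is not _MISSING
    return SearchRequest(
        keyword=payload.get("keyword", ""),
        page_size=int(raw_page_size) if given else DEFAULT_PAGE_SIZE,
        page_size_given=given,
        category=payload.get("category"),
    )
```

Keeping `page_size_given` separate from `page_size` lets the callee tell "the caller asked for 20" apart from "the caller said nothing", which matters if the default ever changes.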

When upgrading, start with the services that have no external dependencies. Once the services being depended on are upgraded, upgrade the services that depend on them. After determining the upgrade order of each service, we can choose an upgrade scheme according to the actual situation of each service.
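The ordering rule can be sketched as a topological sort. The dependency map below is a hypothetical simplification of the search path described earlier (the real call graph is not reproduced here), and `upgrade_waves` is an illustrative helper built on the standard `graphlib` module (Python 3.9+).

```python
from graphlib import TopologicalSorter

# Hypothetical call graph: each service maps to the services it depends on.
DEPENDS_ON = {
    "gateway": {"business_logic"},
    "business_logic": {"search_planner"},
    "search_planner": {"query_planner", "rank_service", "search_engine"},
    "query_planner": set(),
    "rank_service": set(),
    "search_engine": set(),
}


def upgrade_waves(depends_on):
    """Group services into waves; every wave depends only on earlier waves."""
    sorter = TopologicalSorter(depends_on)
    sorter.prepare()  # raises graphlib.CycleError on circular dependencies
    waves = []
    while sorter.is_active():
        ready = sorted(sorter.get_ready())  # services whose dependencies are done
        waves.append(ready)
        sorter.done(*ready)
    return waves


WAVES = upgrade_waves(DEPENDS_ON)
```

Services in the same wave do not depend on each other, so under these assumptions they could be upgraded in parallel.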

Stateless service upgrade

Turning to the upgrade process itself, we first look at the parts of the search path that are designed as stateless services, such as the Java microservices that handle business logic and search planner, which handles query logic. Their common feature is that once a request has been processed, the resources associated with it are released. Different requests have no interdependencies or ordering requirements, and the machine nodes within the same stateless service are fully equivalent.

These characteristics make it easy to scale stateless services out and in dynamically. Therefore, provided compatibility is ensured, their upgrade process is fairly generic and simple:

  1. Determine the number of batches from the minimum service availability that must be maintained.
  2. Select a batch of containers to be updated and stop serving from them.
  3. Upgrade the containers in the batch and update their images.
  4. Wait for this batch of containers to resume service, then continue with the next batch.
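The four steps above can be sketched as follows. This is a simplified illustration, not the actual release tooling: `upgrade` and `is_healthy` are hypothetical callbacks standing in for whatever the container platform provides, and availability is expressed as the fraction of containers that must stay in service.

```python
import math


def plan_batches(containers, min_available_ratio):
    """Split containers into batches so min_available_ratio of them stay up."""
    if not 0 <= min_available_ratio < 1:
        raise ValueError("min_available_ratio must be in [0, 1)")
    total = len(containers)
    # Largest number of containers that may be out of service at once.
    max_down = total - math.ceil(total * min_available_ratio)
    if max_down < 1:
        raise ValueError("too few containers to upgrade in place at this availability")
    return [containers[i:i + max_down] for i in range(0, total, max_down)]


def rolling_upgrade(containers, min_available_ratio, upgrade, is_healthy):
    """Upgrade batch by batch, stopping the rollout if a batch does not recover."""
    for batch in plan_batches(containers, min_available_ratio):
        for container in batch:
            upgrade(container)  # stop serving, swap the image, restart
        if not all(is_healthy(c) for c in batch):
            raise RuntimeError(f"batch {batch} did not recover; stopping the rollout")
```

With 8 containers and a minimum availability of 75%, at most 2 containers are down at a time, so the rollout takes 4 batches.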

Generally speaking, a service can be made stateless by storing its state in a message queue, cache, database, or other external middleware. The advantages are obvious: no additional machine resources are needed during an upgrade, upgrades are fast, and the cost of change is low, so frequent iterative updates are supported. However, this design adds overhead to reading and updating state, so it may not suit some performance-sensitive scenarios.

Stateful service upgrade

Next we turn to the stateful part. The difficulty with upgrading stateful services is that the storage, recovery, and transfer of state are usually designed by each service for its own situation (or not designed at all), so there is no generic upgrade procedure. We can list a few relatively common options for upgrading stateful services:

  • The access-layer gateway supports hot reloading (as nginx does), and state is kept isolated inside the access layer. This suits scenarios where state must be held for a long time.
  • New requests are gradually switched to the new service, and the old service is destroyed after it finishes processing its in-flight requests. This suits scenarios where state is held for a short time (such as game services and real-time audio and video communication services).
  • Create a new copy of the service, keep old and new consistent through dual writes, and gradually replace the old service with the new one.

In the Xianyu search architecture, although the search engine itself serves requests statelessly, it maintains various kinds of state for index partitions and incremental update progress. The upgrade scheme we finally adopted is as follows:

  1. Use the new version image to create a completely independent new engine.
  2. Perform a full data synchronization between the old and new engines.
  3. Send incremental data to both the old and new engines at the same time.
  4. Bring the new engine online and gradually increase its share of traffic.
  5. Take the old engine offline once it no longer receives traffic.
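Steps 2 and 3 can be illustrated with a toy model. The `Engine`, `full_sync`, `DualWriter`, and `in_sync` names are hypothetical, and an in-memory dictionary stands in for a real index; the sketch also ignores updates that arrive while the full synchronization is still running, which a real system would have to order or replay.

```python
class Engine:
    """Toy stand-in for a search engine cluster: doc_id -> document."""

    def __init__(self, name):
        self.name = name
        self.index = {}

    def apply(self, doc_id, doc):
        if doc is None:
            self.index.pop(doc_id, None)  # None means "delete this document"
        else:
            self.index[doc_id] = doc


def full_sync(snapshot, engine):
    """Step 2: load a full snapshot of the data into the new engine."""
    for doc_id, doc in snapshot.items():
        engine.apply(doc_id, doc)


class DualWriter:
    """Step 3: fan every incremental update out to all engines."""

    def __init__(self, *engines):
        self.engines = engines

    def write(self, doc_id, doc):
        for engine in self.engines:
            engine.apply(doc_id, doc)


def in_sync(old, new):
    """Consistency check to run before shifting any traffic (step 4)."""
    return old.index == new.index
```

Traffic would only start moving to the new engine once a check like `in_sync` passes, since from that point both engines must answer queries identically.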

Compared with upgrading a stateless service, this method doubles the machine resources needed during the upgrade and requires a complex, tedious round of service configuration every time. If the service itself is not stateless, traffic-switching logic must also be implemented in code to ensure that requests from the same user land on the same cluster. The overall upgrade cost is high, so the method only suits services that are updated very infrequently. If a service is updated frequently, a cheaper upgrade scheme should be designed and implemented according to the service’s actual situation.
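One common way to keep a user on the same cluster while the traffic share grows is to hash the user ID into a fixed number of buckets. The sketch below assumes string user IDs and 100 buckets; `bucket` and `choose_cluster` are hypothetical helpers, and the article does not say which scheme Xianyu actually used.

```python
import hashlib


def bucket(user_id: str, buckets: int = 100) -> int:
    """Map a user to a stable bucket, identical on every machine and restart."""
    digest = hashlib.sha256(user_id.encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big") % buckets


def choose_cluster(user_id: str, new_percent: int) -> str:
    """Send users in the first new_percent buckets to the new cluster."""
    return "new" if bucket(user_id) < new_percent else "old"
```

Because a user's bucket never changes, raising `new_percent` only moves additional users to the new cluster; users who were already on it stay there.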

Service discovery

During an upgrade, the service discovery mechanism plays an important role. It provides the following functions:

  • Ensuring distributed consistency
  • Graceful service startup and shutdown
  • Load balancing
  • Traffic control and request degradation
  • Preferential routing within the same data center
  • Cross-data-center disaster recovery scheduling

Service discovery is the master valve for traffic control. A mature and stable service discovery mechanism not only avoids the jitter in request success rate that a release can cause, but also provides a basis for fast rollback when something goes wrong.
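To make the "valve" idea concrete, here is a toy in-memory registry with weighted load balancing and a graceful upgrade helper. `Registry` and `graceful_upgrade` are hypothetical names; a production service discovery system would add health checks, consistency guarantees, and data-center awareness that this sketch leaves out.

```python
import random


class Registry:
    """Toy service registry: instance name -> routing weight."""

    def __init__(self):
        self._weights = {}

    def register(self, instance, weight=100):
        self._weights[instance] = weight

    def weight_of(self, instance):
        return self._weights[instance]

    def set_weight(self, instance, weight):
        if instance not in self._weights:
            raise KeyError(instance)
        self._weights[instance] = weight

    def pick(self, rng=random):
        """Weighted random choice among instances with a positive weight."""
        live = {i: w for i, w in self._weights.items() if w > 0}
        if not live:
            raise LookupError("no instance available")
        return rng.choices(list(live), weights=list(live.values()), k=1)[0]


def graceful_upgrade(registry, instance, upgrade):
    """Drain an instance before touching it, then restore its weight."""
    previous = registry.weight_of(instance)
    registry.set_weight(instance, 0)  # stop routing new requests to it first
    upgrade(instance)
    registry.set_weight(instance, previous)
```

Setting the weight to 0 before the upgrade is what keeps the request success rate from dipping: callers stop choosing the instance before it stops serving.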

Risk prevention and control

Upgrading, mounting, and switching traffic for each cluster on the search path in dependency order is undoubtedly a high-risk operation, and any carelessness may cause an online failure. Therefore, we organized the upgrade process around Alibaba’s three principles of safe production:

  • Monitorable: key metrics on key paths must have monitoring coverage in advance, for example total request count, request success rate, and request response time, so that major problems can be detected promptly through the metrics.
  • Gray-releasable: no change may be published online without a gray release. For stateless services, we usually complete the gray release by adjusting service discovery weights or the proportion of machines. For cases where traffic cannot be sampled randomly, we designed a mechanism that batches by user.
  • Rollback-capable: the change system provides a general one-click rollback capability, but it is not the fastest way. In many cases, before making a change, we prepare to remount or remove the affected machine or cluster in service discovery. The time from detecting a problem to recovery is then typically a matter of seconds.
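The three principles combine naturally into a staged rollout loop. In the sketch below, `set_new_traffic_percent` and `success_rate` are hypothetical hooks into service discovery and monitoring, and the 99.9% threshold is an illustrative value rather than a figure from the article.

```python
def gray_release(steps, set_new_traffic_percent, success_rate,
                 min_success_rate=0.999):
    """Raise the new version's traffic share step by step.

    Returns True if every step stayed healthy, or False after rolling back.
    """
    for percent in steps:
        set_new_traffic_percent(percent)      # gray release: widen gradually
        if success_rate() < min_success_rate: # monitoring: check the key metric
            set_new_traffic_percent(0)        # rollback: cut traffic immediately
            return False
    return True
```

A call such as `gray_release([1, 5, 20, 50, 100], set_percent, read_success_rate)` would stop and return to 0% at the first unhealthy step. A real rollout would also wait between steps long enough for the metric to reflect the new traffic share.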


To sum up, the principles and process for upgrading a complex system without downtime are as follows:

  1. Decouple and isolate services so that the scope and impact of a single upgrade stay controllable.
  2. Determine the upgrade order of services from compatibility and dependencies.
  3. Choose the upgrade method according to whether the service is stateless.
  4. Prepare monitoring and rollback plans in advance, and roll out the upgrade through gray release.

The whole upgrade took two months. During that time, we kept the change imperceptible to users and online services running stably, and we also made sure that the normal development work of the algorithm team and the other engineering teams working alongside us was not affected.

In the actual implementation, we also ran into many detail-level problems. For example, when creating the new service we did not estimate the budget properly in advance, so we had to keep borrowing budget during the upgrade. As another example, the latency introduced by multi-region active-active deployment forces services to stay unitized, which made traffic control during the upgrade considerably harder. These exposed problems also guide us in continuing to improve the architecture and the upgrade scheme.

Author: Xianyu Technology
Reprinted from the official account