What should be considered when implementing a gray release system?

Time:2020-10-27

To work out what a gray release system has to do, it helps to start from the definition and process of gray release itself: clarify its purpose, then identify the steps in the process that tooling can support. From there, the requirements for a release system become clear.

The purpose of gray release is to upgrade an application smoothly from an old version to a new one. During the upgrade, part of the user traffic is selected according to a release strategy and routed to the new version ahead of everyone else. Feedback from these users is collected, together with the logs, performance, stability and other indicators of the new-version instances, and used to review the new version. Based on that review, the publisher decides whether to keep increasing the new version's instances and traffic share until the upgrade is complete, or to roll back to the old version if problems are found. The corresponding gray release flow chart is as follows:

[Figure: gray release flow chart]

With this definition and process in mind, the problems a gray release system needs to address become clear.

1. Release strategy customization

In a gray release, the new version of an application is usually deployed in several stages, with the number of instances increasing step by step. For example, a gray release may be divided into three stages in which the number of new-version instances grows from 10 to 50 to 100. This keeps the application as a whole running stably. At the end of each stage, the publisher can observe and review how the new version is behaving and decide, based on that stage's results, whether to keep adding new-version instances or to roll back when problems are found. To make the release process more automated, a gray release system should also support moving from one stage to the next automatically. Some users will still want a manual review between stages, and the platform needs to support that as well. A gray release system therefore needs to support customizable multi-stage release strategies.
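As a rough illustration of the multi-stage strategy described above, here is a minimal Python sketch. The names `Stage` and `run_release`, and the `deploy`, `healthy` and `approved` callbacks, are hypothetical and stand in for whatever a real platform provides; this is not how any particular product implements it.

```python
from dataclasses import dataclass
from typing import Callable, List


@dataclass
class Stage:
    new_instances: int      # target number of new-version instances in this stage
    manual_approval: bool   # True: wait for a human review before moving on


def run_release(stages: List[Stage],
                deploy: Callable[[int], None],
                healthy: Callable[[], bool],
                approved: Callable[[int], bool]) -> str:
    """Walk through the stages in order and stop at the first failed check."""
    for index, stage in enumerate(stages):
        deploy(stage.new_instances)
        if not healthy():
            return f"rollback at stage {index + 1}"
        if stage.manual_approval and not approved(index):
            return f"paused at stage {index + 1}"
    return "completed"


# Hypothetical three-stage plan matching the 10 / 50 / 100 example in the text.
plan = [Stage(10, manual_approval=True), Stage(50, False), Stage(100, False)]
deployed: List[int] = []
result = run_release(plan, deploy=deployed.append,
                     healthy=lambda: True, approved=lambda index: True)
print(result, deployed)
```

Setting `manual_approval` per stage lets the same plan mix automatic transitions with manual review, which is the flexibility the text asks for.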

2. Traffic ratio

During a gray release, if the load balancer at the traffic entrance simply distributes requests evenly across instances, the share of traffic handled by each application version equals its share of instances. In some scenarios this limits how users can configure traffic. For example, users with limited resources may want a small number of new-version instances to handle a larger share of traffic. A gray release platform therefore also needs to support configuring the traffic ratio for the new version. Combined with the customizable release strategy described in the previous point, this lets users control the traffic sent to the new version more precisely. For example, the gray release function in Netease lightboat products works together with service mesh technology to precisely control the traffic ratio of each application version.
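The difference between instance-based and weight-based traffic splitting can be sketched as follows. This is an assumption-laden illustration in Python: `instance_ratio` and `pick_version` are hypothetical helpers, and hashing the user ID into 100 buckets is just one simple way to apply a weight. It does not describe how a service mesh or the lightboat platform routes traffic.

```python
import hashlib


def instance_ratio(old_instances: int, new_instances: int) -> float:
    """Share of traffic the new version gets when load is balanced per instance."""
    return new_instances / (old_instances + new_instances)


def pick_version(user_id: str, new_weight_percent: int) -> str:
    """Route a user by an explicit weight, independent of instance counts.

    Hashing the user ID keeps each user on the same version across requests.
    """
    digest = hashlib.sha256(user_id.encode("utf-8")).digest()
    bucket = int.from_bytes(digest[:4], "big") % 100   # stable bucket in 0..99
    return "new" if bucket < new_weight_percent else "old"


# By instance count, 10 new instances next to 90 old ones receive 10% of traffic.
print(instance_ratio(90, 10))

# With an explicit weight, the same 10 instances can be given roughly 30%.
users = [f"user-{i}" for i in range(10_000)]
share = sum(pick_version(u, 30) == "new" for u in users) / len(users)
print(round(share, 2))
```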

3. Logs and monitoring

At each stage of a gray release, the publisher has to decide, based on how the new version is running at that moment, whether to continue the upgrade or roll back immediately. The gray release system should give users as many indicators and as much reference data as possible to support that decision, for example by letting them view the runtime logs of deployed instances and by providing monitoring data such as CPU utilization, memory utilization and network interface traffic. This data is the basis for judging the functionality and stability of the new version.
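One simple way to present such monitoring data is a side-by-side summary per version. The Python sketch below averages each metric across the instances of a version; the function `summarize`, the metric names and the sample readings are all made up for illustration.

```python
from statistics import mean
from typing import Dict, List

Sample = Dict[str, float]   # one instance's readings, e.g. {"cpu_percent": 35.0}


def summarize(samples_by_version: Dict[str, List[Sample]]) -> Dict[str, Sample]:
    """Average each metric across the instances of every version."""
    summary: Dict[str, Sample] = {}
    for version, samples in samples_by_version.items():
        names = sorted({name for sample in samples for name in sample})
        summary[version] = {
            name: mean(s[name] for s in samples if name in s) for name in names
        }
    return summary


# Hypothetical readings: two old-version instances and one new-version instance.
readings = {
    "old": [{"cpu_percent": 30.0, "memory_percent": 50.0},
            {"cpu_percent": 34.0, "memory_percent": 54.0}],
    "new": [{"cpu_percent": 45.0, "memory_percent": 60.0}],
}
for version, metrics in summarize(readings).items():
    print(version, metrics)
```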

4. Fast rollback

For a deployment system, any online upgrade of an application needs a fast rollback capability, so that the old stable version can be restored promptly and losses contained if problems occur. Concretely, rollback means taking the new-version instances offline or deleting them, re-creating the old-version instances, and switching traffic back to the old version.
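The rollback steps above can be sketched against a simple in-memory state. `AppState` and `rollback` are hypothetical names, and the ordering used here (restore old capacity, switch traffic, then remove the new instances) is one reasonable choice made for this sketch rather than something the text prescribes.

```python
from dataclasses import dataclass, field
from typing import Dict


@dataclass
class AppState:
    instances: Dict[str, int] = field(default_factory=dict)        # version -> instance count
    traffic_percent: Dict[str, int] = field(default_factory=dict)  # version -> traffic share


def rollback(state: AppState, old: str, new: str, old_target: int) -> AppState:
    """Apply the rollback steps from the text to an in-memory state."""
    # Re-create old-version instances first so capacity exists before traffic moves.
    state.instances[old] = old_target
    # Switch all traffic back to the old version.
    state.traffic_percent = {old: 100, new: 0}
    # Take the new-version instances offline.
    state.instances[new] = 0
    return state


# Hypothetical mid-release state: half the instances and traffic are on v2.
state = AppState(instances={"v1": 50, "v2": 50},
                 traffic_percent={"v1": 50, "v2": 50})
rollback(state, old="v1", new="v2", old_target=100)
print(state.instances, state.traffic_percent)
```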

5. Alarm function

The release system needs to be responsible for the whole release process. While working with users, I have also heard feedback that the old and new versions sometimes coexist for a long time during a gray release, and that users would like a prompt alarm for gray releases that remain unfinished. For example, after a new version of a mobile app goes online, it may need to run for a while so that user feedback on the new features can be gathered. In that situation, it helps if the release system reminds the user in time that the current gray release has not been completed, along with information about the release process and the old and new versions involved. The release system also needs to raise timely alarms on monitoring indicators. For example, if the new version causes CPU or memory utilization to rise after it goes online, the publisher should be notified promptly so that they can deal with it.
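Both kinds of alarm described above, an unfinished gray release and a rising monitoring indicator, can be expressed as simple checks. In this Python sketch the function `release_alarms`, the thresholds and the metric names are assumptions chosen for illustration only.

```python
from datetime import datetime, timedelta
from typing import Dict, List


def release_alarms(started_at: datetime,
                   now: datetime,
                   max_duration: timedelta,
                   old_metrics: Dict[str, float],
                   new_metrics: Dict[str, float],
                   max_increase: float) -> List[str]:
    """Return alarm messages for a gray release that is still in progress."""
    alarms = []
    # Alarm 1: old and new versions have coexisted for longer than allowed.
    if now - started_at > max_duration:
        alarms.append("gray release unfinished for longer than allowed")
    # Alarm 2: a monitored metric rose noticeably on the new version.
    for name, old_value in old_metrics.items():
        new_value = new_metrics.get(name)
        if new_value is not None and new_value - old_value > max_increase:
            alarms.append(f"{name} increased from {old_value} to {new_value}")
    return alarms


# Hypothetical release started a week ago with a three-day limit.
alarms = release_alarms(
    started_at=datetime(2020, 10, 20),
    now=datetime(2020, 10, 27),
    max_duration=timedelta(days=3),
    old_metrics={"cpu_percent": 40.0, "memory_percent": 55.0},
    new_metrics={"cpu_percent": 70.0, "memory_percent": 57.0},
    max_increase=20.0,
)
for message in alarms:
    print(message)
```

In practice the messages would be sent to a notification channel instead of printed, and the thresholds would be configured per application.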

Based on Netease cloud's years of experience in designing and developing Devops products, these five points are indispensable for a gray release system. Netease lightboat Devops products currently implement gray release for both hosts and containers according to these requirements. When users run a gray release on the lightboat platform, they can customize the proportion of instances and the traffic ratio of the new and old versions at each stage. At the end of each stage, the system can either enter the next stage automatically or wait for a manual review at key points. Once a problem is found, users can roll back quickly. The system also integrates application log and monitoring data viewing, alarm notification, application version management, product management and other functions, forming closed-loop management of application releases.