Eyewitness account | A complete record of Kaola Haigou's cloud native journey of more than a year

Time:2022-6-24

By Zhang Hongxiao (alias: Fu Jian), senior technical expert, Alibaba New Retail
Source | Alibaba Cloud Native official account

Preface

Kaola Haigou's cloud migration started in October 2019. At that time the only goal was to complete the migration quickly: with less than four months available, the Kaola team thought only about how to finish the job as fast as possible, and cloud native was the most suitable approach we found.

Practice Course

This article covers Kaola Haigou's practice in two stages: the third stage, cloud product access, and the fourth stage, the upgrade of the operations and R&D model.

Cloud product access

1. Definition of cloud native products

Cloud native is essentially a technical system and a methodology. With the development of container technology, continuous delivery and orchestration systems, and driven by the open source community and the idea of distributed microservices, moving applications to the cloud has become an irreversible trend. Truly moving to the cloud changes not only infrastructure and platforms but also the applications themselves. Cloud native applications are new applications built on the characteristics of the cloud at every stage (architecture design, development model, operation and maintenance), oriented toward open source and standardization.

Cloud native technology helps organizations build and run elastic, scalable applications in new dynamic environments such as public, private and hybrid clouds. According to the CNCF definition, representative cloud native technologies include containers, service mesh, microservices, immutable infrastructure and declarative APIs. Alibaba Cloud provides message queue products such as Message Queue for RocketMQ and Message Queue for Kafka, and cloud native middleware products such as Application Real-Time Monitoring Service (ARMS), Microservices Engine (MSE), Application High Availability Service (AHAS), Performance Testing Service (PTS) and Function Compute (FC). These laid a solid foundation for Kaola Haigou to evolve from traditional applications to cloud native applications.

2. How our thinking evolved

We have experienced three stages in the process of accessing cloud products.

1) Stage 1: cloud products are good and powerful, and onboarding is highly efficient

This stage ran mainly from October 2019 to March 2020. The products we accessed then were databases, Redis and ASI. With many existing users they were stable overall and basically fully compatible with their open source counterparts, and the migration tools and surrounding tooling were fairly complete. The migration was therefore very smooth: in most cases changing a few points was enough.

2) Stage 2: cloud products are so rich that you can find everything you want

We used to maintain many components ourselves, but as the number of connected instances and the volume of reads and writes grew, downtime occurred from time to time. We had heard that Microservices Engine (MSE) was easy to use: it provides one-stop microservice capabilities, including hosting of the components microservices depend on and non-intrusive microservice governance, making microservices faster to build, more stable and cheaper to run. We approached the MSE team, who assured us that this would not be a problem. After the product went live, those problems did not recur.

There are many more examples like this. Our feeling at the time was that only by using cloud native products systematically can you really appreciate their value.

3) Stage 3: running-in and adaptation

As Kaola Haigou began to connect to the group's business platform and its supply chain began to integrate with the group, we pushed our cloud migration further. There were challenges along the way, but after overcoming many difficulties we completed each transformation on schedule and got through several major promotions smoothly. Cloud native products have supported the growth of Kaola Haigou's business well.

3. Access process

1) Access policy

Because the cloud products differ in capability from the products Kaola Haigou built itself, we established a complete product evaluation and pilot access mechanism to ensure orderly access and that functions could be migrated. The smooth operation of this mechanism is what kept us stable overall, with no major failures, while the entire foundation was being changed.

Our whole guarantee process is shown in the following figure:

2) Permission scheme

The first problem in accessing cloud products is how to manage cloud accounts and permissions on cloud resources. Alibaba Cloud provides RAM as the service for managing user identities and resource access permissions. So how should RAM accounts be associated with employee identities?

  • Apply for one sub-account per product, shared by all users?
  • Apply for a RAM sub-account per person and manage resource permissions for each person separately?
  • Apply for a sub-account per application, and link employees to that sub-account's resource permissions through their application permissions?

Kaola Haigou has hundreds of people, and schemes 2 and 3 both carry high costs for managing sub-account life cycles and resource permissions. So when we first used these middleware cloud products we took the simple route and adopted scheme 1: apply for one sub-account and share it.

The problem is that the resource permissions are too coarse-grained. With SchedulerX, for example, anyone who logs in to the console can operate all tasks of all applications, which is very dangerous for production safety. So, for application security, our first requirement for the middleware cloud products was the ability to authorize resources at application granularity based on RAM.

Kaola Haigou users are not aware of the RAM account when they log in to the cloud console. Based on STS (Security Token Service), a RAM capability, we wrapped a simple layer that grants temporary authorization when jumping to the cloud console. When the STS token is generated, the current user is obtained from BUC, and an additional permission policy is generated and attached to restrict which cloud resources (applications) the user may operate. The login page is as follows:

SchedulerX is also based on STS: it associates the session with the employee identity through RoleSessionName to manage permissions. Of course, this is only a temporary solution that solves some of Kaola Haigou's problems. The final solution still depends on a global approach, which we will discuss later.
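To make the idea of an additional permission policy concrete, here is a minimal sketch of how such a policy could be generated per user. It is not Kaola's implementation: the account ID, the service name `example-service`, the `app/<name>` resource path and the permission lookup table are all hypothetical placeholders. Only the general shape of a RAM policy document (Version, Statement, Effect, Action, Resource) is taken as given.

```python
import json

# Hypothetical placeholders: the account ID, service name and "app/<name>"
# resource path are illustrative, not the real identifiers of any cloud product.
ACCOUNT_ID = "0000000000000000"
SERVICE = "example-service"


def build_session_policy(app_names):
    """Build a RAM-style policy document limited to the given applications."""
    resources = [
        f"acs:{SERVICE}:*:{ACCOUNT_ID}:app/{name}" for name in sorted(app_names)
    ]
    return {
        "Version": "1",
        "Statement": [
            {"Effect": "Allow", "Action": [f"{SERVICE}:*"], "Resource": resources}
        ],
    }


def policy_for_user(user, app_permissions):
    """Return the JSON policy to attach when requesting a temporary token."""
    apps = app_permissions.get(user)
    if not apps:
        raise PermissionError(f"{user} has no application permissions")
    return json.dumps(build_session_policy(apps))


# Hypothetical employee-to-application permissions, looked up once the
# current user has been identified.
permissions = {"alice": {"order-center", "cart"}}
print(policy_for_user("alice", permissions))
```

In a real integration the resulting JSON would presumably be supplied as the extra policy when the temporary token is requested, so that the shared role is narrowed to the applications the employee owns.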

3) Message scheme

Migration destination:

Kaola Haigou's message system is built on Kafka and RabbitMQ, on top of which we developed our own transactional message center and delayed message products to meet the business's varied messaging needs. After evaluating Alibaba Cloud's Message Queue for RocketMQ, we found that it is fully compatible with and supports Kaola Haigou's existing message system, offers sufficient performance and stability guarantees, and additionally supports message tracing and message query, which is friendlier for business use.

Implementation process:

The migration involved hundreds of Kaola Haigou projects, so it was impossible to schedule and transform them all at one time. We therefore drew up a migration plan spanning several months, tailored to Kaola Haigou's scenarios. We developed an SDK that implements message double writing and topic mapping and supports several scenarios unique to Kaola Haigou, such as stress test messages. Business developers did not need to invest much effort: upgrading the SDK and adding a few lines of configuration was enough to enable double writing.

  • Stage 1: transform all businesses to double-write messages.
  • Stage 2: transform all businesses to double-read messages.
  • Stage 3: wrap up the message migration; business parties switch to single-write mode. At this point Kaola Haigou's original message system was completely retired.
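The double-write and topic-mapping behavior described above can be sketched roughly as follows. This is an illustration only, not the SDK itself: the class name, the `double_write` switch and the two sender callables are hypothetical stand-ins for the legacy and cloud message clients.

```python
from typing import Callable, Dict, List, Tuple

Sender = Callable[[str, bytes], None]


class DoubleWriteProducer:
    """Send each message to the legacy system and to the cloud queue."""

    def __init__(self, legacy_send: Sender, cloud_send: Sender,
                 topic_mapping: Dict[str, str], double_write: bool = True):
        self.legacy_send = legacy_send
        self.cloud_send = cloud_send
        self.topic_mapping = topic_mapping  # legacy topic -> cloud topic
        self.double_write = double_write    # set to False in the final stage

    def send(self, topic: str, body: bytes) -> None:
        # Map the legacy topic name to its cloud counterpart, if one is configured.
        cloud_topic = self.topic_mapping.get(topic, topic)
        if self.double_write:
            self.legacy_send(topic, body)
        self.cloud_send(cloud_topic, body)


# In-memory stand-ins for the two message systems.
legacy_log: List[Tuple[str, bytes]] = []
cloud_log: List[Tuple[str, bytes]] = []

producer = DoubleWriteProducer(
    legacy_send=lambda t, b: legacy_log.append((t, b)),
    cloud_send=lambda t, b: cloud_log.append((t, b)),
    topic_mapping={"order.created": "ORDER_CREATED"},
)
producer.send("order.created", b"order-1")   # stages 1 and 2: written to both
producer.double_write = False
producer.send("order.created", b"order-2")   # stage 3: cloud only
```

While consumers read from both systems, they would presumably need to deduplicate messages (for example by a business key); the article does not describe how that was handled.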

4) RPC scheme

RPC mainly involves the RPC framework and the service registry. Kaola Haigou used the RPC framework dubbok (an internal branch of Dubbo) plus Nvwa (Kaola's self-developed registry), while the group uses HSF plus ConfigServer.

Because the business needed to interoperate with the group's microservices early on, and because HSF is compatible with the Dubbo protocol, the Alibaba Cloud EDAS team provided us with a Dubbo extension for the ConfigServer (CS) registry. After adding the extension package, Kaola applications register with CS and subscribe from CS, and can then call the group's HSF applications, and be called by them, quickly and easily.

Next, we began to use Dubbo 3.0, with HSF 3.0 restructured on the Dubbo kernel. After the upgrade, the original Kaola Dubbo applications have all the features of HSF and interoperate seamlessly with group services. As a new SDK, however, it was bound to face major challenges in functionality and performance. Early on we ran a month of functional testing of the SDK in Kaola Haigou scenarios and resolved nearly 40 functional problems. During stress testing we also resolved performance problems in call latency, registry push and caching. Kaola Haigou's Dubbo registry extension also had to support Dubbo 3.0. In the end the whole setup was validated at scale during Double 11.

We also adopted dual registration and dual subscription, which lays the foundation for migrating off and retiring Kaola's self-developed registry. Once an application is upgraded, it can be changed to connect only to the CS connection string, and Nvwa can then be taken offline. Kaola Haigou has also moved to the cloud native product Microservices Engine (MSE). Special thanks go to the Alibaba Cloud MSE team for supporting the Dubbo-related functions of Kaola's original governance platform.
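The dual registration and dual subscription pattern can be illustrated with a small in-memory model. This is a sketch under assumptions, not Dubbo's or HSF's API: `InMemoryRegistry` and `DualRegistry` are hypothetical classes, and the two registry instances merely stand in for Nvwa and ConfigServer.

```python
class InMemoryRegistry:
    """A toy service registry: service name -> set of provider addresses."""

    def __init__(self, name):
        self.name = name
        self._providers = {}

    def register(self, service, address):
        self._providers.setdefault(service, set()).add(address)

    def lookup(self, service):
        return set(self._providers.get(service, set()))


class DualRegistry:
    """Register to every registry and merge the providers found in all of them."""

    def __init__(self, registries):
        self.registries = list(registries)

    def register(self, service, address):
        for registry in self.registries:
            registry.register(service, address)

    def lookup(self, service):
        merged = set()
        for registry in self.registries:
            merged |= registry.lookup(service)
        return sorted(merged)


nvwa = InMemoryRegistry("nvwa")  # stand-in for the self-developed registry
cs = InMemoryRegistry("cs")      # stand-in for ConfigServer

# An application that has not been upgraded yet registers only with the old registry.
nvwa.register("ItemService", "10.0.0.1:20880")

# An upgraded application registers with both and subscribes from both.
dual = DualRegistry([nvwa, cs])
dual.register("ItemService", "10.0.0.2:20880")
print(dual.lookup("ItemService"))  # providers from both registries

# Once every application is upgraded, the old registry can be dropped.
cs_only = DualRegistry([cs])
print(cs_only.lookup("ItemService"))
```

The point of the pattern is that callers see providers from both registries during the transition, so applications can be upgraded one at a time without a coordinated cutover.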

5) SchedulerX scheme

Challenge

Comparing the cloud SchedulerX scheduled task platform with Kaola Haigou's ksschedule scheduled task platform, we found that SchedulerX is architecturally an upgraded version of ksschedule. Besides basic scheduled and sharded scheduling, it also supports larger-scale task scheduling. The biggest difficulty in the migration was how to migrate and synchronize Kaola Haigou's 13,000+ scheduled tasks: done by hand, each task would need its code modified and its configuration entered on the platform, at enormous labor cost.

Migration scheme

  • A self-developed synchronization tool synchronizes the 13,000+ scheduled tasks and their alarm information, sparing business developers a huge amount of manual work.
  • The self-developed Kaola Haigou cloud native control platform synchronizes the permission information of scheduled tasks to keep data secure after migration.

6) Environment isolation scheme

In microservice scenarios, environment governance is a big problem. The point of environment isolation is to make the most of test environment resources and improve the efficiency of testing requirements. Kaola originally developed its own environment routing logic based on Dubbo's routing strategy. The idea is a trunk environment plus project environments: only the applications changed by a requirement are deployed in the project environment. Traffic carrying the project tag is routed to the project environment first; if an application is not deployed there, the services and resources of the trunk environment are reused. The stability of the trunk environment and the routing of project environments are therefore the top priorities of test environment governance.
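The routing rule just described (prefer the project environment, fall back to trunk) fits in a few lines. The function below is a minimal sketch, not Dubbo's router interface; the `(address, env_tag)` provider tuples and the tag names are assumptions made for illustration.

```python
TRUNK = "trunk"


def route(providers, project_tag=None):
    """Pick provider addresses for a request.

    providers: list of (address, env_tag) tuples.
    project_tag: the project environment tag carried by the request, if any.
    """
    if project_tag:
        tagged = [addr for addr, env in providers if env == project_tag]
        if tagged:
            # The application is deployed in the project environment: use it.
            return tagged
    # Otherwise reuse the trunk environment.
    return [addr for addr, env in providers if env == TRUNK]


order_providers = [("10.0.1.1", TRUNK), ("10.0.2.1", "project-a")]
stock_providers = [("10.0.1.2", TRUNK)]

print(route(order_providers, "project-a"))  # project-a deploys the order service
print(route(stock_providers, "project-a"))  # stock is unchanged: falls back to trunk
print(route(order_providers))               # untagged traffic stays on trunk
```

Because every untouched service resolves to trunk, a broken trunk deployment affects all project environments at once, which is why trunk stability matters so much in this model.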

After migrating to Alibaba Cloud, we found that Alibaba Cloud has a similar scheme, based on SCM routing, that achieves the same effect, as shown in the following figure:

However, SCM does not support Kaola Haigou's RPC framework dubbok or its message framework. Thanks to the excellent plug-in mechanism of ARMS, we used code enhancement to repackage HSF's SCM plug-in as a plug-in for dubbok, giving dubbok the capabilities of the Aone SCM solution. By combining JVM parameters with the release platform, and after thorough early testing and alignment with QA and development, we switched to the group's SCM scheme within one week. Since then, Kaola Haigou has basically carried out iterative requirement development in the trunk environment plus project environment model.

7) High availability component scheme

AHAS rate limiting:

Rate limiting has three key points. First, access: instrumentation points must be embedded in application code or basic components so that metrics can be collected and rate limiting applied. Second, rate limiting capabilities, and rule configuration and distribution. Third, monitoring and alerting.

For users, AHAS and Kaola Haigou's original rate limiting component (NFC) are basically the same. Both offer annotations, explicit API calls, Dubbo filters, HTTP filters and other access methods, so migration only requires replacing the corresponding API. Because the component API is simple, the access cost is low. AHAS also provides Java agent access, which requires no code changes.

In terms of capability, AHAS is more complete than Kaola's original component, offering system-load-based protection and circuit breaking with degradation. We needed cluster rate limiting, and the AHAS team delivered the feature before 618 for us to use. For monitoring and alerting, it provides real-time second-level monitoring, top-N interface display and other well-rounded functions, as well as flow control alerts pushed automatically through DingTalk.
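As a rough illustration of the three key points (an instrumentation point, a rule, and a metric to monitor), here is a minimal fixed-window QPS limiter with a decorator acting as the embedded point. It is a generic sketch, not the AHAS or NFC API: `QpsLimiter`, `rate_limited` and the fallback convention are hypothetical names chosen for this example.

```python
import time
from functools import wraps


class QpsLimiter:
    """Fixed-window limiter: at most max_qps calls per one-second window."""

    def __init__(self, max_qps, clock=time.monotonic):
        self.max_qps = max_qps   # the rule; a real system would push this from a console
        self.clock = clock
        self.window_start = None
        self.count = 0
        self.blocked = 0         # a metric that monitoring could collect

    def try_acquire(self):
        now = self.clock()
        if self.window_start is None or now - self.window_start >= 1.0:
            # Start a new one-second window.
            self.window_start = now
            self.count = 0
        if self.count < self.max_qps:
            self.count += 1
            return True
        self.blocked += 1
        return False


def rate_limited(limiter, fallback):
    """Decorator that acts as the instrumentation point around a method."""
    def decorator(func):
        @wraps(func)
        def wrapper(*args, **kwargs):
            if limiter.try_acquire():
                return func(*args, **kwargs)
            return fallback(*args, **kwargs)
        return wrapper
    return decorator


limiter = QpsLimiter(max_qps=2)


@rate_limited(limiter, fallback=lambda item_id: "busy, try again later")
def query_stock(item_id):
    return f"stock for item {item_id}"
```

A Java agent achieves the same wrapping without the decorator by injecting the check through bytecode enhancement, which is why that access mode needs no code changes.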

AHAS fault drill

Kaola Haigou applications are deployed on ASI. AHAS Chaos used the operator capability provided by Kubernetes to complete access with no impact perceived by the business, and took part successfully in the group's 527 joint drill.

8) Full-link stress testing transformation scheme

Kaola already had a shadow scheme for full-link stress testing. Its core has two parts:

  • Propagation of the stress test marker along the full link
  • Traffic interception to implement shadow routing, service mocking, etc.

The first step of the migration was to access Application Real-Time Monitoring Service (ARMS). The second step was to access Performance Testing Service (PTS), with support for ARMS and Kaola components, taking over Kaola's original shadow routing logic.

Both ARMS and PTS use a Java agent to instrument basic components through bytecode enhancement. The advantages are low access cost and little impact perceived by the business. In the end we completed the full-link stress testing transformation successfully.
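The two parts of the shadow scheme, marker propagation and shadow routing, can be sketched as below. This is an illustration under assumptions rather than the ARMS/PTS mechanism: the header name `x-stress-test`, the `shadow_` table prefix and the function names are hypothetical.

```python
import contextvars

# Marker carried along the call chain; False for normal traffic.
STRESS_TEST = contextvars.ContextVar("stress_test", default=False)

STRESS_HEADER = "x-stress-test"  # hypothetical header name
SHADOW_PREFIX = "shadow_"        # hypothetical shadow table prefix


def resolve_table(table):
    """Route stress test traffic to the shadow table, normal traffic to the real one."""
    return SHADOW_PREFIX + table if STRESS_TEST.get() else table


def outgoing_headers():
    """Propagate the marker to downstream calls."""
    return {STRESS_HEADER: "1"} if STRESS_TEST.get() else {}


def handle_request(headers, table):
    """Entry point: read the marker, then run the business logic under it."""
    token = STRESS_TEST.set(headers.get(STRESS_HEADER) == "1")
    try:
        return resolve_table(table), outgoing_headers()
    finally:
        # Restore the previous value so the marker never leaks across requests.
        STRESS_TEST.reset(token)


print(handle_request({"x-stress-test": "1"}, "orders"))
print(handle_request({}, "orders"))
```

In an agent-based setup the equivalents of `handle_request` and `resolve_table` are injected into the web, RPC and data access components, so business code never sees the marker.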

9) Same-city active-active scheme

After Kaola Haigou moved into the group's data centers, self-built components, cloud products and group components coexisted for a period of time. Based on that situation, we designed our own active-active and SPE solutions.

Online normal state

Same-data-center-first routing, based on DNS and VIPServer, supports both everyday random traffic and traffic isolation to a single data center.

State under single-data-center stress testing

Infrastructure as code (IaC)

1. What is IaC

Infrastructure as code (IaC) is a way of using new technologies to build and manage dynamic infrastructure. It treats infrastructure, tools and services, and the management of infrastructure itself, as a software system, and applies software engineering practices to manage changes to that system in a structured and safe way.

My understanding is that by managing the software's operating environment, dependencies and code consistently (changes, versions, etc.) and providing a BaaS-like means of decoupling, software is no longer bound to a specific environment and can be quickly copied to and run in any environment.

2. Practice content

1) Build deployment system

Building on Kaola's original application DevOps system and the concepts of IaC and GitOps, we transformed application build, deployment, configuration loading and daily operation and maintenance on top of AppStack and IaC. All build, deployment and static application configuration was migrated into the application's Git source repository. With all application configuration managed in Git, configuration version history is clearer than before, and the versions of the application source code, build configuration, container configuration and static configuration are effectively kept consistent.

2) Lightweight container

Taking this cloud native transformation as an opportunity, we aligned Kaola's original container image system with the group standard. The biggest change was switching the startup user from appops to admin.

We also introduced lightweight containers. The isolation provided by the container layer is one of the foundations of cloud native and a major selling point. Kaola Haigou has completed the switch to lightweight containers, dividing each pod into an application container, an operation and maintenance container and custom containers. The whole deployment has become lighter and easier to control.

The deployment form after transformation is shown in the figure below.

3) CPU-share

The mode in the figure above is CPU-set: the container is bound to certain CPUs and uses only those CPUs at run time. On a normal host this mode is the most efficient because it reduces CPU switching. Kaola Haigou's deployments have all switched to CPU-share mode: within the same NUMA node, a container can use all CPUs of that node, while its total CPU time slices cannot exceed the configured limit. As long as the node has idle CPUs, contention stays moderate and running stability improves greatly.

Finally, in the peak stress test verification of DPCA, the CPU of DPCA stayed relatively stable below 55%, which ensured the stability of the overall service while making full use of resources.
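The statement that total CPU time slices cannot exceed the limit can be made concrete with the arithmetic Linux uses for CPU bandwidth control. The sketch below assumes the limit is enforced the way standard Kubernetes does it on cgroup v1 (CFS quota and shares); the article does not say how ASI implements it, so treat this as an illustration of the general mechanism only.

```python
CFS_PERIOD_US = 100_000  # default CFS period: 100 ms


def cfs_quota_us(cpu_limit_cores, period_us=CFS_PERIOD_US):
    """CPU time (microseconds) the container may use per period, across all CPUs."""
    return int(cpu_limit_cores * period_us)


def cpu_shares(cpu_request_cores):
    """Relative weight used when CPUs are contended (1024 per requested core)."""
    return max(2, int(cpu_request_cores * 1024))


# A container limited to 4 cores may run on any CPU of its node, but it gets
# at most 400 ms of CPU time in every 100 ms period, summed over those CPUs.
print(cfs_quota_us(4))
print(cpu_shares(4))
```

Unlike CPU-set, nothing here pins the container to particular CPUs: the quota caps total usage, and the scheduler is free to place threads on whichever CPUs are idle.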

4) Image and configuration separation

Image and configuration separation means separating the application container image from the configuration the application depends on (static configuration and release configuration). The purpose is to reuse application images as much as possible, reduce the number of image builds, and improve build and deployment efficiency. In addition, after migrating to AppStack, static configuration is rolled back automatically when application code is rolled back. The business no longer has to roll back static configuration manually in the static configuration center, which greatly reduces the risk of rollbacks.

Once image and configuration are separated, an image can be deployed to any environment without depending on that environment's configuration. Our release process can thus shift from change-oriented to artifact-oriented: the image that goes online is the very image that was tested.
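A tiny model shows what the separation buys: one immutable image reference promoted across environments, with only the injected configuration changing. The `Release` type, the image reference and the configuration keys are hypothetical and exist only for this sketch.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Release:
    image: str        # built once, never rebuilt per environment
    environment: str
    config: dict      # injected at deploy time from the versioned configuration


def deploy(image, environment, configs_by_env):
    """Combine an existing image with the configuration of the target environment."""
    return Release(image=image, environment=environment,
                   config=dict(configs_by_env[environment]))


# Hypothetical per-environment configuration, versioned alongside the code.
configs = {
    "test": {"db.url": "jdbc:mysql://test-db/app", "log.level": "DEBUG"},
    "prod": {"db.url": "jdbc:mysql://prod-db/app", "log.level": "INFO"},
}

image = "registry.example.com/app:abc123"  # hypothetical image reference
tested = deploy(image, "test", configs)
online = deploy(image, "prod", configs)

# The artifact that goes online is byte-for-byte the one that was tested.
print(tested.image == online.image)
```

Because the configuration lives with the code, rolling back to an earlier commit yields both the earlier image reference and the earlier configuration, which is the automatic rollback behavior described above.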

3. Implementation strategy

1) Automation

The heavy work in IaC migration is configuration migration, environment migration and overall standardization. Improving migration efficiency greatly speeds up the IaC migration and also improves how business developers feel about the process.

  • Build and release configuration was stored on Kaola's old deployment platform, and static configuration in the self-developed configuration center. We first connected the old deployment platform with Kaola's configuration center and the group's GitLab code repository. Then, following the standardized service CUE template, the configurations from the old deployment platform and the configuration center are generated automatically and written directly into the business code. This completes the IaC configuration migration automatically, saving the business a great deal of migration time.
  • We developed a set of APIs for cloud native environments that can automatically create, modify and delete cloud native environments and cloud native pipelines, which also improves business access efficiency.

After the automatic IaC migration function was launched, each application needed only about one minute to migrate its configurations and create its cloud native environment and pipeline, with no business involvement required. Once this configuration mapping and rebuilding is done, the application only needs to be built and released, and any startup failures caused by compatibility problems fixed. The IaC migration is then complete, at a low overall cost.

2) Access support

Unlike a middleware upgrade, IaC access changes an application's entire release and deployment system, and at this stage AppStack's stability was not especially high. We therefore chose closed-door access in a project room with technical support throughout. This ensured business problems were solved immediately, which improved business engagement and satisfaction, and let us collect problems immediately to optimize the access process. For example, early on the business had to create pipelines manually; later we created the corresponding pipelines automatically through the API for each business to be migrated.

Business migration to IaC went through two stages, and we used a different support model in each to achieve stable and fast access.

Before Double 11

  • One member of the project team is stationed in the project room for support
  • From Monday to Friday, developers from different departments come to the meeting room each day to focus on migration
  • Training on the relevant knowledge is held every morning, and applications are switched in the afternoon and evening

After Double 11

  • Three members of the project team are stationed in the project room for support
  • Only designated departments migrate each week, and each department sends designated people to complete all of that week's migration work
  • Training is held every Monday morning

The main difference is that early on the platform was less stable and business R&D was less familiar with it, so access was cautious and had more of a verify-then-promote character. Once things became relatively stable, access proceeded as a broad rollout.

Achievements

1. No major failures

Kaola Haigou's cloud native transformation ran for a long time. Whether during big promotions such as 618 and Double 11 or regular promotions such as the monthly member day, and thanks to the full cooperation of the project team, the transformation caused no major failures.

2. Satisfactory integration results

  • Resolved the differences between Kaola Haigou's and the group's application deployment: Kaola is fully compatible with the group's current model and aligned with the group's technical system at the deployment level.
  • Resolved the differences between Kaola Haigou's internal calls and group calls.
  • Completed the SPE and active-active construction, further aligning the disaster recovery system with the group.

3. Efficiency improvement and cost saving

  • Migrated stateful containers, cutting deployment time by 100 seconds per batch and fixing startup failures caused by IP changes.
  • Configuration and code are strongly bound, so static configuration no longer needs a separate rollback when code is rolled back.
  • Scaling from daily capacity to big-promotion capacity went from expanding each application separately to expanding the whole site to the baseline level in 0.5 person-days.
  • Reduced the number of servers by 250.

4. Improved cloud product functions

  • Improved the ease of use and stability of cloud products and broadened the scenarios covered by cloud middleware products.
  • Pushed forward solutions to production safety, account and other issues in the cloud native process.

Mesh is one of the future development directions

Pushing technology down into infrastructure is the general trend of Internet development. In the era of microservices, the service mesh came into being. Introducing a mesh proxy brings some performance loss and resource overhead, as well as the cost of operating and managing mesh instances, but it shields developers from much of the complexity of distributed systems, letting them return to the business and focus on real value:

  1. Focus on business logic and shield the complexity of distributed system communication through mesh (load balancing, service discovery, authentication and authorization, monitoring and tracking, flow control, etc.).
  2. Language independent, services can be written in any language.
  3. The infrastructure is decoupled and transparent to applications. Mesh components can be upgraded separately, and the infrastructure can be upgraded iteratively faster.

Over the past year, Kaola Haigou has been steadily carrying out its cloud native transformation. Although we met many challenges along the way, we never doubted the direction, and each problem solved brought more business value. For this year's Double 11, the cloud native upgrade helped Kaola cut 250 servers and produced a complete IaaS + PaaS practice for landing on the cloud. Kaola's R&D efficiency on the cloud has also improved greatly. For example, using Alibaba Cloud's live streaming service, Kaola quickly built its overseas live streaming service from 0 to 1. New features such as “tree climbing TV” and “like community” have also been launched one after another.

As the cloud native transformation continues, its dividends are becoming more and more significant. I believe that as the business is further decoupled from infrastructure, the day will come when the business is independent of infrastructure: business R&D will only need to care about its own business and will no longer have to worry about the operating environment, greatly improving the efficiency of operations and R&D.

About the author

Zhang Hongxiao (alias: Fu Jian) is a senior technical expert at Alibaba New Retail, with 10 years of experience in development, testing, and operation and maintenance, and rich experience in infrastructure and R&D efficiency. Zhang actively embraces cloud native and advocates sustainable, fast and high-quality software delivery.

If you want to learn about the open source counterparts of the cloud products Kaola used for Double 11, please join us at the Spring Cloud Alibaba Meetup in Guangzhou on January 30 to learn how Perfect Diary and Huya use the SCA suite to empower their business and put microservices into practice.