Cloud native transformation practice based on serverless


Introduction: What is the next-generation technology architecture, and how should enterprises make the change? Many Internet companies face this question. Cloud native architecture is the best answer, because it upgrades both the cloud computing service model and Internet architecture as a whole, and is profoundly changing the IT foundation of the entire business world.

Author: Ji Yuan, Solutions Architect at Alibaba Cloud

What is cloud native architecture

Looking back over the past decade, digital transformation has driven the continuous integration and reconstruction of technological innovation and business elements. It can be said that it is no longer the business model that determines the technology architecture, but the technology architecture that determines the business model of an enterprise. Therefore, both industry giants and small, medium and micro enterprises are facing unknown opportunities and challenges brought by digital transformation. The opportunity is the innovation of business model, and the challenge comes from the change of the overall technical architecture.

What is the next-generation technology architecture, and how should enterprises make the change? Many Internet companies face this question. Cloud native architecture is the best answer, because it upgrades both the cloud computing service model and Internet architecture as a whole, and is profoundly changing the IT foundation of the entire business world.

Although the concept of cloud native has been around for a long time, many people still do not understand what it is. From a technical point of view, cloud native architecture is a set of architecture principles and design patterns based on cloud native technology. It aims to strip as much non-business code as possible out of cloud applications, so that cloud facilities take over the many non-functional features that applications used to implement themselves (such as elasticity, resilience, security, observability and grayscale release). The business is then no longer troubled by interruptions caused by non-functional problems, and the application becomes lightweight, agile and highly automated. In short, it helps enterprises iterate on business functions faster, lets the system withstand traffic surges of every magnitude, and lowers the cost of building the system.

The difference between traditional architecture and cloud native architecture

The figure above shows three parts of the code: business code, third-party software, and code that handles non-functional features. "Business code" is the code that implements the business logic. "Third-party software" is every third-party library that the business code depends on, including business libraries and basic libraries. "Code that handles non-functional features" is the code that provides high availability, security, observability and other non-functional capabilities.

Of these three parts, only the business code really brings value to the business; the other two are just appendages. However, as software scale, the number of business modules, the number of deployment environments and the complexity of distribution grow, building software becomes more and more complex and demands more and more skill from developers. Compared with traditional architecture, cloud native architecture takes a big step forward: it moves a large number of non-functional features out of the business code and into IaaS and PaaS. This narrows what business developers need to focus on technically, and improves the non-functional capabilities of applications through the professionalism of cloud services.

This is the core idea of cloud native architecture.

Why cloud native architecture

Now that we know what cloud native architecture is, the next question is why today's Internet enterprises need it. Looking at the SaaS market, its scale was 36 billion yuan in 2019 and it maintained a considerable upward trend in 2020. It is expected to exceed 100 billion yuan in 2022.

The development of China's enterprise SaaS industry can be roughly divided into four stages. Before 2015, the Chinese market and the vast majority of Chinese enterprises lacked a basic understanding of what SaaS is; traditional software based on private deployment was still the mainstream, and the enterprise SaaS market was only just emerging. From 2015, as cloud computing technology matured further, China's enterprise SaaS industry entered a stage of rapid growth, and this slow-moving track gradually became known to the public.

Today, against the backdrop of the epidemic and the economic and social environment, Internet enterprises have begun to seek new business models, and some SaaS enterprises seized the opportunity and responded quickly, doubling their business. For example:

  • Catering SaaS vendors help offline restaurants build mini program ordering systems for contactless ordering.
  • ERP vendors in e-commerce retail help enterprises set up membership management systems.
  • Marketing SaaS vendors help enterprises market online and reach customers remotely through traffic platforms.

Therefore, now that "how to survive" has become a hot topic, the ability to respond quickly is the core competitive advantage, and SaaS enterprises need to meet new market demand in time. This exposes an inherent weakness of Chinese SaaS enterprises: in the rush to occupy the market, many blindly followed trends and copied foreign products. To make up for this, SaaS vendors need to adjust the direction of their products and services quickly according to market demand. Business functions need to diversify and business systems need new branches, which brings greater technical challenges.

In addition to the pressure from the market, SaaS enterprises have many pain points of their own:

  • Most SaaS products offer only so-called general capabilities, or simply imitate foreign products. Once they enter scenarios with strong industry-specific requirements, they cannot meet the demand and are labeled "semi-finished products", so market acceptance falls short of expectations.
  • SaaS products with a single module and similar functions easily fall into price competition. In the end they can only buy network effects at a loss, and eventually they suffer for it.
  • The market asks for too much customization, which leaves SaaS enterprises without the energy to polish their products and turns a SaaS company into a project company.

To solve these problems, SaaS enterprises must focus on the business. A good SaaS product faces stringent requirements on business channels, the competitive landscape, user experience and many other aspects. Every role, from marketing and operations through product management to R&D and operations and maintenance, should focus on the business, serve the industry, analyze it in depth, and respond quickly to the market. Stable product quality is a must. This in turn demands faster technical iteration: release cadence goes from weekly to hourly, the number of releases grows from dozens per month to hundreds per day, and business interruption is unacceptable.

Another reason why Internet companies need cloud native transformation is that China has reached its Lewis turning point. The Lewis turning point is the point at which a labor surplus turns into a shortage: during industrialization, surplus rural labor gradually moves into non-agricultural industries until the surplus shrinks and finally runs out. To put it bluntly, China's demographic dividend is fading and enterprises' labor costs keep rising. Together with the impact of the epidemic in 2020, cost has become an increasingly important consideration for enterprises. The features of SaaS products, such as subscription payment, general applicability and low deployment cost, make them a new way for enterprises to reduce costs and increase efficiency. This is a market opportunity for SaaS enterprises, and SaaS enterprises themselves also need to reduce costs and increase efficiency. The better they do so internally, the more competitive their products are in the market.

The solutions to the problems above line up with the core capabilities of cloud native architecture:

  • Cloud native takes over third-party software and non-functional capabilities, which frees up the energy of R&D and operations and maintenance staff and lets them focus on the business.
  • The horizontal scalability, high availability, robustness and SLA of the system are covered by the cloud native architecture, which addresses the stability problems that SaaS products can least afford.
  • Some self-built components are migrated to the cloud native architecture, and traditional deployment methods and resources are converted to cloud native ones. Adopting GitOps further optimizes resource cost and labor cost.

How to implement cloud native architecture

Before talking about how to implement cloud native architecture, let's take a look at the cloud native architecture maturity model (SESORA):

Cloud native architecture maturity model

The maturity model has six evaluation dimensions and four maturity levels. I will explain how to implement cloud native architecture along five of these dimensions: automation, serverless, elasticity, observability and resilience.

Traditional architecture

The figure above shows a fairly traditional deployment architecture for a Java + Spring Cloud application on the service side. Apart from SLB, almost all components are deployed on ECS. Now let's look at how to transform it into a cloud native architecture.


Serverless

I won't repeat what serverless means in this article; please refer to this article for an introduction. Deploying services on an ECS cluster has two obvious weaknesses:

  • High operation and maintenance cost: you have to maintain the state and high availability of every ECS instance yourself.
  • Lack of elasticity: before every big promotion you need to purchase ECS instances in advance, expand the number of service nodes, and release them afterwards. This only works for activities at a fixed time; irregular traffic spikes cannot be handled this way.

So first of all, we need to make service deployment serverless. We can choose Serverless App Engine (SAE) as the platform for publishing and deploying service applications. SAE is an application-oriented serverless PaaS platform. It frees users from operating and maintaining IaaS, lets them use resources on demand and pay as they go, lowers the threshold for making service applications cloud native, and supports multiple languages and high elasticity.


Open the SAE console. First, we create a namespace, which logically isolates the network and resources of the applications under SAE. Usually, we use namespaces to separate the development, test, pre-release and production environments.

Create app

After creating a namespace, we can enter the application list and select a different namespace to see the application under it or create an application

Select the corresponding namespace and create the application:

  • Application name: the service name, entered by the user.
  • VPC configuration:
    • Automatic configuration: the VPC, vSwitch and security group are configured automatically for the user. All of these components are newly created.
    • Custom configuration: the user selects the namespace, VPC, vSwitch and security group. Generally we choose custom configuration, because the VPC of our service must be the same as that of our other cloud resources so that the network is connected. Note that when the first application is created under a new namespace, the namespace is bound to the selected VPC. When applications are created later, the VPC of the namespace cannot be changed on this page. If you need to change it, do so in the namespace details.
  • Number of application instances: the number of application (service) nodes. Set it on demand; it is not final, because the number of nodes can be scaled out or in manually or automatically.
  • vCPU / memory: the CPU and memory specification the application needs at runtime. This is the specification of a single instance. For example, if 2C4G is selected and there are two instances, the total is 4C8G.
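Because the specification applies to each instance, total capacity is the per-instance specification multiplied by the instance count. A minimal Python sketch of that arithmetic (the function name is hypothetical and not part of SAE):

```python
def total_resources(vcpu_per_instance, memory_gib_per_instance, instances):
    """Return (total vCPU, total memory in GiB) across all instances."""
    return (vcpu_per_instance * instances, memory_gib_per_instance * instances)


# Two instances of the 2C4G specification add up to 4C8G.
print(total_resources(2, 4, 2))  # (4, 8)
```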

After configuring the basic information, the next step is to configure the application deployment:

  • Technology stack language: Java supports image, WAR package and JAR package deployment, while other languages support image deployment. Taking a Java JAR package as an example:
    • Application running environment: the default standard Java application running environment.
    • Java environment: JDK 7 and JDK 8 are currently supported.
    • File upload mode: upload the JAR package manually, or configure an address from which the JAR package can be downloaded.
  • Version: a timestamp or manual input.
  • Start command setting: configure JVM parameters.
  • Environment variable setting: set variables in the container environment so that the container configuration can be changed flexibly after the application is deployed.
  • Host binding settings: bind host entries so that the application can reach other services by domain name.
  • Application health check settings: used to determine whether the container and the user's business are running normally.
  • Application life cycle management settings: life cycle scripts on the container side that run before the application starts and before the container shuts down, for example environment preparation and graceful offline.
  • Log collection service: integrates with the SLS log service to manage logs in one place.
  • Persistent storage: bind NAS.
  • Configuration management: inject configuration into the container by mounting configuration files.

After deploying the application from a JAR package, the newly created application appears under the corresponding namespace:

Click the app name to view the app details:

Bind SLB

Because ServiceA is the service that exposes external interfaces in this architecture, it needs to be bound to a public SLB to expose an IP address and balance load. Binding an SLB in SAE is very simple. The application access settings are on the details page:

When adding an SLB, you can either create a new one or select an existing one.

Service / configuration center

In a microservice architecture, a service center and a configuration center are essential. We usually use Nacos, Eureka or ZooKeeper. In a cloud native architecture, the service / configuration center can be chosen as follows, depending on the scenario:

For customers who currently use Nacos, the table above shows two options for the service / configuration center when moving to cloud native architecture:

  • If the transformation needs to be quick and the high availability requirements for the service / configuration center are modest, use the Nacos that comes with SAE directly. No code change is needed; just create the application in SAE. The configuration management in the SAE console is basically the same as the open source Nacos console in both interface and function.
  • If the service / configuration center must be highly available, use MSE Nacos. Its advantage is a dedicated cluster whose node specification and node count can be adjusted dynamically as needed. The only disadvantage is that the Nacos endpoint has to be changed, which is a small intrusion into the code.

For customers who currently use Eureka or ZooKeeper, we recommend using MSE Eureka or MSE ZooKeeper directly.

Here is a brief introduction to MSE. MSE (Microservice Engine) is a one-stop microservice platform for Spring Cloud and Dubbo, the mainstream open source microservice frameworks in the industry. It provides a governance center, a hosted registry and a hosted configuration center. Here we use the hosted registry and the hosted configuration center of MSE.

MSE has three core functions:

  • It supports the three major service centers, with flexible node specification and node count, both of which can be adjusted dynamically at runtime.
  • It logically isolates different environments through namespaces.
  • Configuration changes are pushed in real time and can be tracked.


Elasticity

The elasticity capability in the cloud native architecture maturity model is also provided by SAE. Thanks to the underlying implementation of serverless, the number of application instances (nodes) in SAE can scale out and in very quickly, within seconds.

Enter the instance deployment information on the application details page to see the specific instance of the application:

SAE provides two ways to expand and reduce the number of application instances, manual and automatic.

Manual scaling

Click the manual scaling button at the top right of the console, then select the number of instances to scale to:

While scaling is in progress, we can see the status of each instance change:

Auto scaling

Click the auto scaling button at the top right of the console to open the page for creating scaling rules. SAE auto scaling provides two kinds of policy: timed policies and metric policies.

The figure above shows a timed policy: at a specified time, the number of application instances is scaled out or in to a specified count. This policy suits scenarios where peak traffic arrives at well-known times. Take online education customers as an example: peak traffic usually starts at 8 p.m. and ends at 11 p.m. In this case, a timed policy scales the application out at around 7:30 p.m., and then gradually scales it back to normal after 11 p.m.
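The idea behind a timed policy can be sketched in a few lines of Python. This is only an illustration of the lookup logic, not how SAE implements it; the schedule layout, the instance counts and the function name are all assumptions made for the example.

```python
from datetime import time

# Hypothetical schedule entries: (start, end, target instance count).
SCHEDULE = [
    (time(19, 30), time(23, 0), 10),  # scale out ahead of the 8 p.m. peak
]
BASELINE_INSTANCES = 2  # assumed capacity outside any scheduled window


def target_instances(now, schedule=SCHEDULE, baseline=BASELINE_INSTANCES):
    """Return the instance count a timed policy would ask for at `now`."""
    for start, end, count in schedule:
        if start <= now < end:
            return count
    return baseline


print(target_instances(time(20, 0)))  # 10
print(target_instances(time(23, 30)))  # 2
```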

The figure above shows a metric policy. Four metrics are currently provided: CPU utilization, memory utilization, the QPS threshold of the application, and the average response time (RT) threshold of the application's interfaces. The four metrics can be combined. When any one of them reaches its threshold, a scale-out is triggered and application instances are added gradually. When the metrics fall below the thresholds, a scale-in is triggered. This policy suits scenarios where the time of peak traffic is not fixed, such as marketing campaigns and game operations.
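The decision rule described above (scale out when any metric reaches its threshold, scale in otherwise) can be sketched as follows. The threshold values, the step of one instance per evaluation and all names are assumptions for illustration; SAE's actual step size and cool-down behavior are not described in this article.

```python
from dataclasses import dataclass


@dataclass
class Thresholds:
    """Hypothetical threshold values; tune them per application."""
    cpu_percent: float = 70.0
    memory_percent: float = 70.0
    qps: float = 500.0
    avg_rt_ms: float = 200.0


METRIC_NAMES = ("cpu_percent", "memory_percent", "qps", "avg_rt_ms")


def next_instance_count(metrics, thresholds, current, min_count, max_count):
    """Add one instance if any metric reaches its threshold, else remove one."""
    breached = any(metrics[name] >= getattr(thresholds, name) for name in METRIC_NAMES)
    step = 1 if breached else -1
    # Clamp the result to the configured lower and upper bounds.
    return max(min_count, min(max_count, current + step))


calm = {"cpu_percent": 30, "memory_percent": 40, "qps": 100, "avg_rt_ms": 50}
busy = dict(calm, qps=600)  # only QPS is over its threshold
print(next_instance_count(busy, Thresholds(), 3, 2, 10))  # 4
print(next_instance_count(calm, Thresholds(), 3, 2, 10))  # 2
```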

Cost optimization

When it comes to elasticity, we tend to focus on how it lets the system absorb traffic spikes quickly and scale horizontally. In fact, because SAE offers extreme elasticity together with per-minute, pay-as-you-go billing, the overall resource cost is also optimized to a certain extent.
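A rough way to see the saving is to compare instance-minutes, the unit that per-minute billing charges for. The sketch below uses made-up numbers (a baseline of 2 instances, a peak of 10 instances lasting 210 minutes a day) and deliberately leaves out prices, which this article does not give.

```python
MINUTES_PER_DAY = 24 * 60


def fixed_instance_minutes(peak_instances):
    """Capacity sized for the peak and kept running all day."""
    return peak_instances * MINUTES_PER_DAY


def elastic_instance_minutes(baseline, peak_instances, peak_minutes):
    """Baseline capacity most of the day, peak capacity only while needed."""
    return baseline * (MINUTES_PER_DAY - peak_minutes) + peak_instances * peak_minutes


fixed = fixed_instance_minutes(10)
elastic = elastic_instance_minutes(2, 10, 210)
print(fixed, elastic)  # 14400 4560
```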


Observability

Observability on the application side has two dimensions. One is vertical metrics, such as the CPU, memory and disk metrics of the host, the CPU and memory metrics of the pod, and the full GC, heap memory and non-heap memory metrics of the JVM. The other is horizontal monitoring of the request call chain: calls from upstream services to downstream services, and from upstream interfaces to downstream interfaces.

When monitoring a microservice architecture, we usually need to look at it from three perspectives:

  • The overall health of the application.
  • The health of each application instance.
  • The health of each application interface.

SAE's application monitoring covers both dimensions and all three perspectives.

Application overall health

Go to the application details page and click Application monitoring / Application overview to see the overall state of the application:

  • Total request volume: shows at a glance whether the request volume is obviously abnormal, for example a sudden drop or a sharp rise.
  • Average response time: the average response time of all interfaces of the application, which most directly reflects the real health of the application.
  • Full GC: an important JVM metric, and a factor that directly affects system performance.
  • Slow SQL: automatically captures SQL statements that take more than 500 ms to execute.
  • Trend charts: help us understand how the application behaves over different periods.

Health status of application instance nodes

Go to the application details page and click Application monitoring / Application details to see information about each application node:

As the figure above shows, all instance nodes of the current application are listed on the left, with the average response time, number of requests, number of errors and number of exceptions for each node. If we sort the nodes by response time in descending order, the nodes at the top are the ones whose slow responses we need to investigate. The inspection dimensions on the right help us analyze and solve the problem.

For example, look at the JVM metrics:

Application interface health

Go to the application details page and click Application monitoring / Interface call to see information about each application interface:

Interface monitoring follows the same idea as instance node monitoring. All requested interfaces are listed on the left, with their response time, number of requests and number of errors. The inspection dimensions on the right help us analyze why an interface has a high RT.

For example, to view SQL call analysis:

Vertical metrics

In fact, the three perspectives above already cover most of the metrics, such as application health metrics, JVM metrics, SQL metrics and NoSQL metrics.

Horizontal call link

In many cases, metrics alone cannot pinpoint the root cause of a problem, because the problem involves calls between different services and between different interfaces. We then need to examine the call relationships between services and between interfaces, as well as the specific code involved. SAE provides this through its integration with ARMS monitoring. Request snapshots are listed under Application monitoring / Interface call / Interface snapshot, and the full call chain of a request can be viewed by its TraceId:

We can see a lot of information in the figure above:

  • The complete request chain at the service level, for example ikf (front end) -> ikf-web (Java service) -> ikf-blog (Java service) -> ikf-blog (Java service).
  • The status of each service. For example, a red dot in the status column indicates an exception in that part of the chain.
  • The name of the interface requested in each service.
  • The time each service spends on the request.

In addition to this explicit information, there is also some implicit information:

  • The upstream service ikf-web takes 2008 ms in total, but the downstream service ikf-blog takes 5052 ms, and both start timing at the same point. This shows that the call from ikf-web to ikf-blog is asynchronous.
  • Since the call from ikf-web to ikf-blog is asynchronous, yet ikf-web still takes as long as 2 seconds, the interface in the ikf-web service has a problem of its own.
  • Inside the ikf-blog service another internal interface is called, because ikf-blog appears twice in the call chain. This internal call is synchronous, and the problem occurs in the last interface call.
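The first inference can be written down as a simple check: a synchronous caller has to wait for its callee, so it cannot finish earlier than the callee does. The Python sketch below applies that rule to the two durations quoted above. The `Span` type and the start offset of 0 ms are assumptions for illustration and are not part of the ARMS data model.

```python
from dataclasses import dataclass


@dataclass
class Span:
    service: str
    start_ms: int      # offset from the start of the trace
    duration_ms: int


def looks_asynchronous(caller, callee):
    """True if the callee is still running after the caller has finished."""
    caller_end = caller.start_ms + caller.duration_ms
    callee_end = callee.start_ms + callee.duration_ms
    return callee_end > caller_end


web = Span("ikf-web", start_ms=0, duration_ms=2008)
blog = Span("ikf-blog", start_ms=0, duration_ms=5052)
print(looks_asynchronous(web, blog))  # True
```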

From the above information, we can narrow down and delineate the scope of the root cause of the problem. Then we can click the magnifying glass in the method stack column to view the specific code information of the method stack

From the method stack, we can get a lot of explicit information

  • The specific methods called.
  • The time consumed by each method.
  • Method.
  • The specific SQL statement of each database operation, and even the values bound to the SQL.

Besides the explicit information, there is also more important implicit information. For example, we can see that the method BlogController.findBlogByIsSelection(int isSelection) takes 5 s, but the database operation inside it takes very little time, only 1 ms. This indicates that the time is spent in the business code itself. After all, monitoring cannot reach into business code and does not capture information from it. At this point, we can locate the specific problem in two ways:

  • Since the method stack already tells us the specific service and method, open the IDE and analyze the code directly.
  • Check the thread profile next to the method stack tab, which is usually enough to identify the problem. In the case shown in the figure above, the thread profile reveals that the time is spent in java.lang.Thread.sleep( ):-2 [3000ms].


Resilience

I will discuss the resilience of cloud native architecture from three aspects: graceful online and offline, multi-AZ deployment, and rate limiting and degradation.

Graceful online and offline

A good product should respond quickly to users' feedback on its functions and capabilities, and to changes in market demand. Its functions therefore need to be iterated and optimized quickly. At the IT level, this requires a fast, efficient and high-quality release process that can publish services to the production environment at any time.

This raises a core problem: services are released frequently, yet the user experience must not suffer and user requests must not be dropped. Our deployment architecture therefore needs to support graceful online and offline.

Take the microservice architecture as an example. Open source products do offer capabilities and solutions that approximate graceful online and offline, but they are only approximations. When there are many service nodes, traffic is still lost. The open source solutions have several problems:

  • The registry is unreliable and its notifications are not timely.
  • Client polling is not real time, and clients cache the instance list.
  • In a custom image the application is not process 1 (PID 1), so the SIGTERM signal cannot be delivered to it.
  • There is no graceful offline scheme by default, so you need to add the actuator component.
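To make the SIGTERM point concrete, here is a minimal, language-neutral sketch in Python of what a process has to do once the signal does reach it: stop accepting new requests, then wait for in-flight requests to finish. The class and method names are hypothetical, and a real service would also deregister itself from the service center at this point.

```python
import signal
import threading


class GracefulServer:
    """Tracks in-flight requests so shutdown can wait for them to drain."""

    def __init__(self):
        self.accepting = True
        self.in_flight = 0
        self._lock = threading.Lock()
        self._idle = threading.Event()
        self._idle.set()

    def handle_sigterm(self, signum, frame):
        # Stop taking new work; requests already in progress may finish.
        self.accepting = False

    def start_request(self):
        with self._lock:
            if not self.accepting:
                return False  # rejected: the instance is going offline
            self.in_flight += 1
            self._idle.clear()
            return True

    def finish_request(self):
        with self._lock:
            self.in_flight -= 1
            if self.in_flight == 0:
                self._idle.set()

    def wait_until_drained(self, timeout_s):
        """Block until no request is in flight, or the timeout expires."""
        return self._idle.wait(timeout_s)


def install_sigterm_handler(server):
    # Must be called from the main thread of the process.
    signal.signal(signal.SIGTERM, server.handle_sigterm)
```

If the application is not PID 1 in its container, the handler above never runs, which is exactly the problem the third bullet describes.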

In the service / configuration center part of the Serverless section, I described the service centers of SAE and MSE. Whichever option is used, graceful online and offline has been further optimized.

As the figure above shows, an application deployed in SAE actively notifies the service center and the service consumers. Combined with the liveness check on the application instance and the readiness check on the application's business, this means that user access is not affected when the service is being deployed or goes down for other reasons.

Multi AZ deployment

Following the principle of not putting all the eggs in one basket, the nodes of an application should be distributed across different zones, so that the application as a whole is sufficiently available and robust. SAE supports configuring multiple switches (vSwitch), and each switch can be in a different zone. When application nodes are deployed or scaled out, they are started in a zone chosen at random from the available zones:

This avoids the situation where the whole system goes down with a single failed zone because everything was placed in that zone. It is also the most basic practice of same-city active-active deployment.
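The effect of random placement across zones can be illustrated with a short Python sketch. The zone names are hypothetical, and the code only models the idea of choosing a zone per instance; it does not reflect SAE's actual scheduler.

```python
import random


def place_instances(zones, count, rng):
    """Pick a zone at random for each new instance."""
    return [rng.choice(zones) for _ in range(count)]


def survivors(placement, failed_zone):
    """Instances that keep running after one zone fails."""
    return [zone for zone in placement if zone != failed_zone]


zones = ["zone-a", "zone-b", "zone-c"]  # hypothetical, one zone per vSwitch
placement = place_instances(zones, 6, random.Random(42))
remaining = survivors(placement, "zone-a")
print(placement)
print(len(remaining), "instances survive the loss of zone-a")
```

With a single zone, `survivors` would return an empty list whenever that zone fails, which is the outage that multi-AZ deployment is meant to prevent.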

Rate limiting and degradation

Rate limiting and degradation let the system sacrifice a limb in order to survive. When a sudden traffic spike arrives, the system limits traffic in time to avoid a complete breakdown. When traffic exceeds expectations, it cuts off non-core business in time and releases resources to support the core business.

For Java applications, SAE currently supports rate limiting and degradation as well. First, it captures and monitors the request metrics of all interfaces of the application:

Then we can set flow control, isolation and circuit breaking rules for each interface. For example, I set a flow control rule on the /checkHealth interface:

Once the QPS of the interface reaches 50 on a single instance, any further requests on that instance fail fast. For example, when we load test an application with six nodes, we can see the QPS of each node:

When the flow control rule is turned on, the effect of rate limiting is visible immediately:

QPS is held precisely at 50, and requests beyond 50 fail directly.
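The behavior described here (at most 50 requests per second per instance, the rest rejected immediately) can be sketched with a fixed-window counter in Python. This is one simple way to get that behavior and is an assumption for illustration; the article does not say which algorithm SAE uses.

```python
class FixedWindowLimiter:
    """Allows at most `max_qps` requests in each one-second window."""

    def __init__(self, max_qps):
        self.max_qps = max_qps
        self.window = None
        self.count = 0

    def allow(self, now_s):
        window = int(now_s)  # one-second bucket this request falls into
        if window != self.window:
            self.window = window
            self.count = 0
        if self.count >= self.max_qps:
            return False  # fail fast instead of queueing
        self.count += 1
        return True


limiter = FixedWindowLimiter(max_qps=50)
# 80 requests arriving within the same second.
results = [limiter.allow(i / 100) for i in range(80)]
print(results.count(True), results.count(False))  # 50 30
```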


Automation

For automation, I will focus on the CI/CD process. The sections above contain many screenshots of the SAE console, but in practice applications are certainly not released one by one through the console. Packaging and publishing are inevitably automated through a CI/CD pipeline.

SAE provides two ways to automate operation and maintenance in this respect.

Based on GitLab and Jenkins

Many enterprises currently build their CI/CD process on GitLab and Jenkins, so SAE supports integrating the release operation into this setup. At its core, SAE provides a Maven plug-in together with three configuration files. Using the information in these files, the plug-in publishes the packaged JAR / WAR or image to the corresponding SAE application.

  • toolkit_profile.yaml: configures the region ID, AccessKey ID and AccessKey secret.
  • toolkit_package.yaml: configures items such as the deployment package type, the deployment package address and the image address.
  • toolkit_deploy.yaml: configures items such as the ID of the SAE application, the ID of the namespace the application belongs to, the application name and the publishing method.

For more detailed configuration information, see the documentation.

Then, in the Jenkins job, set the following Maven goals and options:

clean package toolkit:deploy -Dtoolkit_profile=toolkit_profile.yaml -Dtoolkit_package=toolkit_package.yaml -Dtoolkit_deploy=toolkit_deploy.yaml 

In this way, the SAE deployment process can be easily integrated into a CI/CD solution based on GitLab and Jenkins.
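If the pipeline is scripted rather than configured in the Jenkins UI, the same invocation can be assembled programmatically. The Python sketch below only builds the argument list from the goals and `-D` options shown above; the helper name is hypothetical, and it assumes the `mvn` executable is on the PATH of the build agent.

```python
def build_deploy_command(profile, package, deploy):
    """Assemble the Maven invocation that packages and deploys to SAE."""
    return [
        "mvn", "clean", "package", "toolkit:deploy",
        f"-Dtoolkit_profile={profile}",
        f"-Dtoolkit_package={package}",
        f"-Dtoolkit_deploy={deploy}",
    ]


command = build_deploy_command(
    "toolkit_profile.yaml", "toolkit_package.yaml", "toolkit_deploy.yaml"
)
print(" ".join(command))
```

A build script could then pass `command` to `subprocess.run(command, check=True)`, and keep one set of YAML files per environment.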

Based on OpenAPI

Some enterprises develop their own operation and maintenance platform and let developers carry out releases on it themselves. For this scenario, SAE provides a rich set of OpenAPI operations, through which 90% of the capabilities of the SAE console can be integrated into the customer's own operation and maintenance platform. For detailed OpenAPI instructions, see the documentation.


After the system is rebuilt on SAE, the overall deployment architecture looks like this:

In the cloud native architecture maturity model (SESORA), the five dimensions I described are worth 15 points in total, and the SAE-based cloud native architecture scores 12 of them:

  • Serverless: 3 points
  • Elasticity: 3 points
  • Observability: 2 points
  • Resilience: 2 points
  • Automation: 2 points
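The total can be checked with a few lines of Python. The maximum of 3 points per dimension is inferred from 15 points spread over five dimensions and is an assumption, as are the dictionary keys.

```python
MAX_POINTS_PER_DIMENSION = 3  # assumed: 15 points over five dimensions

scores = {
    "serverless": 3,
    "elasticity": 3,
    "observability": 2,
    "resilience": 2,
    "automation": 2,
}

total = sum(scores.values())
maximum = MAX_POINTS_PER_DIMENSION * len(scores)
print(f"{total}/{maximum}")  # 12/15
```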

So what are you waiting for? The SAE solution is fast and simple to start with, to practice and to put into production, and it achieves a good level of cloud native architecture maturity. Let's practice it together. If you have any questions, you are welcome to join DingTalk group 35712134 to find the answers. See you there!
