Scenarios that must be considered when designing a stable microservice system

Date: 2022-08-02

Author: Shimian

Our production environment often runs into instability, for example:

  • An instantaneous traffic peak during the rush hour pushes the system past its maximum load; load soars, the system crashes, and users cannot place orders
  • A “dark horse” hot item breaks through the cache, the requests hammer the database and crowd out normal traffic
  • A caller is dragged down by an unstable downstream service, its thread pool fills up, and the whole call link gets stuck

These scenarios can have serious consequences. You may be asking: how do we keep user access uniform and smooth? How do we prevent the damage caused by traffic spikes or unstable services?

Introduction

The following two approaches are common solutions to traffic instability. They are also two capabilities we have to think about before designing a highly available system, and they are a key link in service traffic governance.

Flow control

Traffic is highly random and unpredictable. One second may be calm and the next may bring a flood peak (think of the Double 11 shopping festival). Every system and service has an upper limit on capacity. If sudden traffic exceeds what the system can bear, requests may fail to be processed, backlogged requests are handled slowly, CPU usage and load soar, and the system finally collapses. We therefore need to limit such bursts and keep the service from being overwhelmed while still handling as many requests as possible. This is flow (traffic) control.

Circuit breaking and degradation

A service usually calls other modules: another remote service, a database, or a third-party API. For example, a payment may require a remote call to the API provided by UnionPay, and querying the price of a product may require a database query. However, the stability of these dependencies is not guaranteed. If a dependency becomes unstable and its response time grows, the response time of the methods calling it grows too; threads pile up, the caller's own thread pool may eventually be exhausted, and the caller itself becomes unavailable.

Modern microservice architectures are distributed and consist of many services that call each other, forming complex call links. The problems above are amplified along such links: if one hop on a complex link becomes unstable, the effect can cascade layer by layer until the whole link is unavailable. We therefore need to apply circuit breaking and degradation to unstable, weakly depended-on services, temporarily cutting off the unstable calls so that a local problem does not snowball into an overall avalanche.

Q: Many readers ask: my service handles only a small request volume, so is flow-control protection unnecessary? My microservice architecture is fairly simple, so is there no need to introduce a circuit-breaker mechanism?

A: In fact, this has little to do with request volume or architectural complexity. Many times it is the failure of a very marginal service that affects the overall business and causes huge losses. We need a failure-oriented design mindset: plan capacity, sort out strong and weak dependencies, configure flow-control and degradation rules sensibly, and put protection in place in advance, rather than scramble to fix problems in production afterwards.

In flow-control, degradation, and fault-tolerance scenarios there are many ways to describe a governance scheme. Next, I will introduce OpenSergo, an open, general service-governance standard oriented towards distributed service architectures and covering the full heterogeneous service link. Let's see how OpenSergo defines the standard for flow control, degradation, and fault tolerance, what implementations of this standard exist, and what problems it can help us solve.

OpenSergo flow-control, degradation and fault-tolerance v1alpha1 standard

In OpenSergo, combining the scenario practice of Sentinel and other frameworks, we abstract the implementations of flow-control, degradation, and fault-tolerance scenarios into standard CRDs. A FaultToleranceRule can be thought of as consisting of the following three parts:

  • Target: which kind of requests the rule applies to
  • Strategy: the fault-tolerance or control strategy, such as flow control, circuit breaking, concurrency control, adaptive overload protection, outlier instance removal, etc.
  • FallbackAction: the fallback behavior after the rule is triggered, such as returning an error or a status code

Let's look at OpenSergo's concrete standard definitions for the commonly used flow-control and degradation scenarios, and at how they solve our problems.

First of all, as long as a microservice framework is adapted to OpenSergo, flow control, degradation, and other governance can be configured through the unified CRDs. Whether the service is written in Java or Go or sits behind a mesh, and whether the traffic is an HTTP request, an RPC call, or database SQL access, the same unified fault-tolerance governance rule CRD can be used to configure fault tolerance for every hop in the microservice architecture and keep our service links stable. Let's take a detailed look at OpenSergo's configuration in several concrete scenarios.

Flow control

The following example defines a cluster flow-control strategy: the cluster as a whole must not exceed 180 requests per second. Example CR YAML:

apiVersion: fault-tolerance.opensergo.io/v1alpha1
kind: RateLimitStrategy
metadata:
  name: rate-limit-foo
spec:
  metricType: RequestAmount
  limitMode: Global
  threshold: 180
  statDuration: "1s"

A CR as simple as this equips the system with flow-control capability, like an airbag for the application. Requests beyond the system's service capacity are rejected, and the specific handling logic can be customized (for example, returning specified content or redirecting to a page).
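For readers who manage rules in code, here is a minimal sketch of a roughly comparable rule using the open-source Sentinel Java API (assuming Sentinel 1.8.x; note that the CR above is cluster-wide via limitMode: Global, whereas this sketch only shows the single-machine analogue, since Sentinel cluster flow control additionally requires a token server; the resource name "foo" is illustrative):

import java.util.Collections;

import com.alibaba.csp.sentinel.slots.block.RuleConstant;
import com.alibaba.csp.sentinel.slots.block.flow.FlowRule;
import com.alibaba.csp.sentinel.slots.block.flow.FlowRuleManager;

public class RateLimitRuleDemo {
    public static void main(String[] args) {
        // Limit the resource named "foo" to 180 requests per second (QPS mode).
        FlowRule rule = new FlowRule("foo");
        rule.setGrade(RuleConstant.FLOW_GRADE_QPS);
        rule.setCount(180);
        // Load the rule into the in-memory rule manager; in production, rules are
        // usually pushed from a dynamic data source instead of being hard-coded.
        FlowRuleManager.loadRules(Collections.singletonList(rule));
    }
}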

Circuit-breaker protection

The following example defines a slow-call-ratio circuit-breaking strategy. Example CR YAML:

apiVersion: fault-tolerance.opensergo.io/v1alpha1
kind: CircuitBreakerStrategy
metadata:
  name: circuit-breaker-slow-foo
spec:
  strategy: SlowRequestRatio
  triggerRatio: '60%'
  statDuration: '30s'
  recoveryTimeout: '5s'
  minRequestAmount: 5
  slowConditions:
    maxAllowedRt: '500ms'

The meaning of this CR is: within a 30s statistics window, when the proportion of requests slower than 500ms reaches 60% and at least 5 requests have been made, the circuit breaker trips automatically; the recovery timeout is 5s.

Imagine the peak of a business period, when some downstream service provider hits a performance bottleneck and even affects the business. If we configure such a rule on non-critical service consumers, then once the slow-call ratio or error ratio over a period meets the condition, the circuit breaker trips automatically and subsequent calls directly return a mock result. This keeps the caller from being overwhelmed by the unstable service, gives the unstable downstream some "breathing" time, and keeps the whole business link running.
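As a point of reference, a minimal sketch of a comparable rule in the open-source Sentinel Java API (assuming Sentinel 1.8.x, which introduced the slow-call-ratio parameters; the resource name "foo" is illustrative):

import java.util.Collections;

import com.alibaba.csp.sentinel.slots.block.RuleConstant;
import com.alibaba.csp.sentinel.slots.block.degrade.DegradeRule;
import com.alibaba.csp.sentinel.slots.block.degrade.DegradeRuleManager;

public class SlowCallCircuitBreakerDemo {
    public static void main(String[] args) {
        DegradeRule rule = new DegradeRule("foo");
        rule.setGrade(RuleConstant.DEGRADE_GRADE_RT);  // slow-call-ratio strategy
        rule.setCount(500);                // calls slower than 500 ms count as slow
        rule.setSlowRatioThreshold(0.6);   // trip when >= 60% of calls are slow
        rule.setMinRequestAmount(5);       // require at least 5 calls in the window
        rule.setStatIntervalMs(30_000);    // 30 s statistics window
        rule.setTimeWindow(5);             // stay open for 5 s before half-open probing
        DegradeRuleManager.loadRules(Collections.singletonList(rule));
    }
}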

Implementations of the flow-control, degradation and fault-tolerance standard

Sentinel introduction

The following introduces Sentinel, a project that supports the OpenSergo flow-control, degradation and fault-tolerance standard.

Sentinel is an open-source flow-control component from Alibaba, oriented towards distributed service architectures. Taking traffic as the entry point, it helps developers keep microservices stable across multiple dimensions such as flow control, traffic shaping, circuit breaking and degradation, and adaptive system protection.

Sentinel's technical highlights (a minimal usage sketch in Java follows the list):

  • High extensibility: a basic core plus SPI extension points, so users can easily extend flow control, communication, monitoring, and other capabilities
  • Diverse flow-control strategies (by resource granularity, call relationship, flow-control metric, flow-control effect, and other dimensions), plus distributed cluster flow control
  • Hot-spot traffic detection and protection
  • Circuit breaking, degradation, and isolation of unstable services
  • Adaptive system protection at the global dimension, adjusting traffic in real time according to the system's load level
  • Coverage of API gateway scenarios, providing gateway flow control for Spring Cloud Gateway and Zuul
  • Traffic control for Envoy service-mesh clusters in cloud-native scenarios
  • Real-time monitoring and dynamic rule configuration and management
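To illustrate how a resource is protected in code, here is a minimal sketch using Sentinel's basic Java API (the resource name and business method are hypothetical; once a call is wrapped this way, flow-control and circuit-breaking rules for that resource take effect on it):

import com.alibaba.csp.sentinel.Entry;
import com.alibaba.csp.sentinel.SphU;
import com.alibaba.csp.sentinel.slots.block.BlockException;

public class PriceService {
    public String queryPrice(String itemId) {
        Entry entry = null;
        try {
            // "queryPrice" is the resource name that flow-control and
            // circuit-breaking rules are matched against.
            entry = SphU.entry("queryPrice");
            return doQueryPrice(itemId);   // the protected business logic
        } catch (BlockException ex) {
            // The call was rejected or circuit-broken by a rule;
            // degrade gracefully instead of failing hard.
            return "default-price";
        } finally {
            if (entry != null) {
                entry.exit();
            }
        }
    }

    private String doQueryPrice(String itemId) {
        return "100";  // placeholder for a real lookup
    }
}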

Some common usage scenarios:

  • In the service-provider scenario, we need to protect the provider itself from being overwhelmed by traffic peaks. Traffic is usually controlled according to the provider's service capability, or limited for specific callers. We can evaluate the capacity of core interfaces with prior load testing and configure QPS-mode rate limiting: when the number of requests per second exceeds the configured threshold, the excess requests are rejected automatically.
  • To avoid being dragged down by unstable services when calling them, we need to isolate and circuit-break unstable dependencies on the consumer side, using means such as semaphore isolation, degradation by exception ratio, and degradation by response time.
  • When a system has been running at a low water level for a long time and traffic suddenly surges, pushing the system straight to a high water level may crush it instantly. Sentinel's warm-up flow-control mode lets the admitted traffic increase slowly, ramping up to the threshold over a configured period instead of releasing everything at once, which gives a cold system time to warm up and avoids it being crushed (see the sketch after this list).
  • Sentinel's uniform-queuing mode "shaves the peaks and fills the valleys": request spikes are spread evenly over a period of time, keeping the system load within what it can handle while processing as many requests as possible.
  • Sentinel's gateway flow-control feature protects traffic at the gateway entrance and limits the call frequency of individual APIs.
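A minimal sketch of the warm-up and uniform-queuing modes mentioned above, expressed as Sentinel flow rules (assuming Sentinel's Java API; the resource names and thresholds are only illustrative):

import java.util.Arrays;

import com.alibaba.csp.sentinel.slots.block.RuleConstant;
import com.alibaba.csp.sentinel.slots.block.flow.FlowRule;
import com.alibaba.csp.sentinel.slots.block.flow.FlowRuleManager;

public class TrafficShapingRulesDemo {
    public static void main(String[] args) {
        // Warm-up mode: the admitted QPS climbs gradually to 1000 over 60 seconds,
        // so a cold system is not hit with the full traffic at once.
        FlowRule warmUpRule = new FlowRule("coreApi");
        warmUpRule.setGrade(RuleConstant.FLOW_GRADE_QPS);
        warmUpRule.setCount(1000);
        warmUpRule.setControlBehavior(RuleConstant.CONTROL_BEHAVIOR_WARM_UP);
        warmUpRule.setWarmUpPeriodSec(60);

        // Uniform-queuing mode: requests pass at a steady 100 QPS; requests that
        // would have to queue longer than 2 seconds are rejected.
        FlowRule queueRule = new FlowRule("reportJob");
        queueRule.setGrade(RuleConstant.FLOW_GRADE_QPS);
        queueRule.setCount(100);
        queueRule.setControlBehavior(RuleConstant.CONTROL_BEHAVIOR_RATE_LIMITER);
        queueRule.setMaxQueueingTimeMs(2000);

        FlowRuleManager.loadRules(Arrays.asList(warmUpRule, queueRule));
    }
}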

Alibaba Cloud microservice solution

Alibaba Cloud provides MSE, an enterprise-grade product that fully complies with the OpenSergo microservice standard. The traffic-governance capability in the MSE Service Governance enterprise edition can be understood as a commercial version of Sentinel. We also briefly summarize how MSE traffic governance compares with the community solutions in flow-control, degradation, and fault-tolerance scenarios.

Next, I will demonstrate, based on MSE, how to protect our system through flow control and circuit breaking, so that we can calmly face uncertain traffic and a series of unstable scenarios.

  • Configure flow control rules

We can view the real-time monitoring of each interface on the monitoring details page.

We can click the "New protection rule" button in the upper right corner of the interface overview to add a flow-control rule:

We can configure a flow-control rule in the simplest QPS mode. For example, the rule above limits this interface to no more than 80 single-machine calls per second.

  • Monitor and view the flow control effect

After the rules are configured, wait a moment and the rate-limiting effect can be seen on the monitoring page:

The rejected traffic also returns an error message. MSE's framework instrumentation points come with default flow-control handling logic, such as returning 429 Too Many Requests when a web interface is rate-limited, or throwing an exception when the DAO layer is rate-limited. If you want to customize the flow-control handling logic of each layer more flexibly, you can do so through the SDK, as sketched below.
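Since MSE's traffic governance is based on Sentinel, one approximation (an assumption on my part, not MSE's official SDK contract) is to use open-source Sentinel's annotation support, which requires the sentinel-annotation-aspectj module and Spring AOP on the classpath; the class, resource name, and handler below are hypothetical:

import com.alibaba.csp.sentinel.annotation.SentinelResource;
import com.alibaba.csp.sentinel.slots.block.BlockException;

public class OrderController {
    // "createOrder" is the resource name that flow-control rules are matched against.
    @SentinelResource(value = "createOrder", blockHandler = "handleCreateOrderBlocked")
    public String createOrder(String itemId) {
        return "order created for " + itemId;
    }

    // Invoked instead of the original method when the resource is blocked by a rule.
    // A block handler must keep the same signature plus a trailing BlockException.
    public String handleCreateOrderBlocked(String itemId, BlockException ex) {
        return "The system is busy, please try again later.";
    }
}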

Summary

Flow control, degradation, and fault tolerance are scenarios we have to consider when designing a stable microservice system. If every system we design requires a great deal of effort to work out its own flow-control, degradation, and fault-tolerance design, that becomes a headache for every developer. Having built and encountered so many flow-control and degradation systems, can we distill general scenarios, best practices, design standards and specifications, or even reference implementations?

Starting from concrete scenarios, this article briefly introduced OpenSergo's flow-control and circuit-breaker protection standard, as well as the background and means of Sentinel traffic protection. Finally, through an example, it showed how to use the traffic-protection capability of MSE Service Governance to protect your application.

Click to view the live video:

https://yqh.aliyun.com/live/d…

The OpenSergo standard is currently only at v1alpha1, and there is clearly still a long way to go in the continued formulation and evolution of the OpenSergo service-governance standard. If you are interested in flow-control, degradation, and fault-tolerance scenarios, or in building microservice-governance standards, you are welcome to join us. We set standards and drive implementation in an open, transparent, and democratic way, and we rely on GitHub issues, Gitter, the mailing list, the community bi-weekly meeting, and other mechanisms to ensure that the standard and its implementations are built jointly through community collaboration. You are welcome to join the discussion through any of these channels.
