Noah’s adaptive flow control solution is built on an automatic control algorithm. It removes the pain point of manually configured rate limits being missed or going out of date, and greatly improves an application’s ability to withstand traffic surges. During the most recent Double 11, Noah protected a large number of business application systems, with a large-scale deployment of more than 15K containers. In terms of stability, it raised the upper limit of the traffic (QPS) a business can sustain by up to 20 times; it improved resource utilization by up to 100%; and it also improved user experience and operational efficiency. Noah has strengthened the stability foundation of Taoxi (and other BUs), has become a core capability for application stability assurance, and has advanced the industry’s practice of high availability / stability assurance for large-scale distributed online business systems.


As the business keeps growing, the number of applications, their dependency topology, and their complexity all increase, and predicting the traffic model effectively becomes harder. Uncertainty in both the system and the traffic causes capacity evaluations to be missed or to go out of date. These problems are most visible during large, complex events such as Double 11 and the Spring Festival Gala.

What “traffic oriented” means

Traffic is strongly affected by the business, and these changes are uncertain and frequent.

Dealing with this uncertainty requires the infrastructure to have a reliable, adaptive way to:

  • Accept and perceive change in real time

  • Protect itself against change through an adaptive, flexible system / architecture

  • Then apply elasticity, adjusting the underlying resources in real time

Only in this way can we have effective resistance and adaptability. Staying alive is the prerequisite. With it, we get the chance to drive the underlying resources (elasticity), and the business no longer needs to worry about the availability risk caused by this uncertainty.

Rethinking high availability

When discussing high availability, the industry mainly talks about how to respond to resource failures: coping with single-machine failure through application clusters, primary / standby setups, hot failover, and so on; and coping with the failure of a single data center or region through unitized architecture and multi-site active-active deployment.

From these resource failure problems and their countermeasures, we can see that the core of high availability is to keep the service from going down. More precisely, it is to greatly reduce the probability of service failure.

Today, with our application architecture, what are our availability pain points? What do you worry about when you develop your own application, or when you hear that some service is down? In terms of availability, I believe the most common answer is “the service was overwhelmed”. The typical problem scenario is “heavy traffic, sudden traffic”. For example: in live streaming, the host calls out to viewers and back-end traffic soars; in social media, a celebrity officially announces a marriage and triggers a surge of hot-spot traffic; in e-commerce operations, a flash sale produces a large traffic pulse; and in interactive gameplay, new interactive activities are released continuously, and a single new activity may significantly reduce the overall QPS the system can support. Anyone doing day-to-day application development and service assurance has probably felt these pain points deeply :)

In application architecture, the industry pays far less attention to traffic-oriented availability than to resource-oriented availability, which has rich and mature ideas and extensive, effective practice. We want to change this situation for traffic-oriented availability: face more of the problems, explore ideas, and promote practice.

What’s wrong with the traditional solution?

The current response to traffic-oriented availability is static rate limiting.

Traditional static rate limiting based on QPS has the following problems:

  • Traffic / requests

It depends on an accurate evaluation of the request model, that is, the request mix of the test traffic must match real traffic.
But hot-spot traffic breaks this assumption, for example hot users being visited frequently, or cold items. Some operations also lead the user journey into heavy logic branches.

It depends on an accurate estimate of the traffic volume.
However, we have to admit that traffic can never be estimated accurately, and when the estimate is wrong the system is overwhelmed (that is, overloaded to the point of refusing service).

  • Business code logic

It depends on the logic of the system itself and of its downstream dependencies staying unchanged after testing, with consistent performance.
But services are always evolving. Unless there is a network-wide change freeze, the rate-limiting threshold is out of date as soon as it goes online.

  • Resources

It depends on the performance of every machine being completely consistent and stable.

But with different machine models, processing capacity cannot be completely consistent.

Virtualization / containerization also has an impact, and you cannot control what your neighbors do.

  • Process

It depends on the evaluation process being carried out accurately, manually, and in advance.
But people are never fully reliable: as long as execution is manual, there will be omissions and errors.

In addition, long-tail / non-core applications are not guaranteed support, because human resources are always limited.

Traditional methods cannot solve the mismatch between traffic and capacity caused by outdated manual evaluation. We need a solution that can evaluate system capacity in real time and control traffic locally.
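To make the staleness problem concrete, here is a small worked example. It assumes a deliberately simplified capacity model in which sustainable QPS is the CPU time available per second divided by the CPU time per request. The function names and all numbers are hypothetical and are not taken from Noah or from any production system.

```python
def capacity_qps(cpu_cores: int, cpu_ms_per_request: float, target_utilization: float) -> float:
    """QPS one instance can sustain while staying at the target CPU utilization."""
    cpu_ms_available_per_second = cpu_cores * 1000.0 * target_utilization
    return cpu_ms_available_per_second / cpu_ms_per_request


def cpu_demand(admitted_qps: float, cpu_cores: int, cpu_ms_per_request: float) -> float:
    """CPU demand as a fraction of the machine (values above 1.0 mean overload)."""
    return admitted_qps * cpu_ms_per_request / (cpu_cores * 1000.0)


# Stress test before the event: 8 cores, 4 ms of CPU per request, 50% CPU target.
static_threshold = capacity_qps(cpu_cores=8, cpu_ms_per_request=4.0, target_utilization=0.5)

# Later a release adds a heavier code path: 10 ms of CPU per request.
# The static threshold is unchanged, so the limiter still admits 1000 QPS.
demand_after_release = cpu_demand(static_threshold, cpu_cores=8, cpu_ms_per_request=10.0)
safe_qps_after_release = capacity_qps(cpu_cores=8, cpu_ms_per_request=10.0, target_utilization=0.5)

print(static_threshold)        # 1000.0
print(demand_after_release)    # 1.25
print(safe_qps_after_release)  # 400.0
```

In this toy model, the threshold that was safe at test time admits 2.5 times more traffic than the instance can now handle at the same CPU target, and nothing in the static configuration reveals it.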

Noah adaptive flow control

To address system stability, Noah’s adaptive flow control solution departs from the static QPS-based rate limiting traditionally used in the industry. For the first time, it provides an adaptive flow control solution with an automatic control algorithm at its core, which solves the pain point of outdated rate-limiting configuration, greatly improves the application’s ability to withstand traffic surges, and greatly simplifies the related configuration work. At the same time, system resource utilization, user experience, and operations efficiency have all improved significantly.

In most cases, CPU utilization is the most direct signal of resource supply. Noah’s adaptive flow control solution takes automatic control of CPU resources as its core method, which has the following three advantages:

  • The stability control is accurate and effective, and highly interpretable.

  • No manual evaluation in advance is needed, so there are no outdated evaluations and no omissions or errors from manual evaluation.

  • It works in both synchronous and asynchronous scenarios.

Noah’s adaptive flow control solution automatically evaluates the sustainable QPS in real time, and uses adaptivity to handle the uncertainty of business traffic.
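The article does not describe Noah’s control algorithm itself, so the following is only a minimal sketch of the general idea of CPU-driven admission control, not Noah’s implementation. It assumes a simple damped multiplicative feedback rule: on every control tick the limiter scales the fraction of admitted requests by `target_cpu / measured_cpu`. The class `AdaptiveLimiter`, the function `simulate`, and the toy load model (CPU grows linearly with admitted QPS and saturates at 100%) are all hypothetical.

```python
import random


class AdaptiveLimiter:
    """Keeps CPU near a target by adjusting the fraction of requests admitted."""

    def __init__(self, target_cpu: float = 0.7, damping: float = 0.5, min_ratio: float = 0.001):
        self.target_cpu = target_cpu
        self.damping = damping        # 0..1, how far to move toward the new estimate per tick
        self.min_ratio = min_ratio    # never shed 100% of traffic
        self.admit_ratio = 1.0        # start fully open

    def update(self, measured_cpu: float) -> float:
        """Feed one CPU sample (0.0..1.0) and return the new admit ratio."""
        cpu = max(measured_cpu, 0.01)  # avoid dividing by zero when idle
        estimate = self.admit_ratio * self.target_cpu / cpu
        blended = (1.0 - self.damping) * self.admit_ratio + self.damping * estimate
        self.admit_ratio = min(1.0, max(self.min_ratio, blended))
        return self.admit_ratio

    def allow(self) -> bool:
        """Admit a request with probability admit_ratio."""
        return random.random() < self.admit_ratio


def simulate(limiter: AdaptiveLimiter, offered_qps: float, cpu_cost_per_qps: float, ticks: int) -> float:
    """Toy plant: CPU is proportional to admitted QPS and saturates at 1.0."""
    cpu = 0.0
    for _ in range(ticks):
        cpu = min(1.0, offered_qps * limiter.admit_ratio * cpu_cost_per_qps)
        limiter.update(cpu)
    return cpu


limiter = AdaptiveLimiter(target_cpu=0.7)
# Offered load is 20x what the machine can serve (1000 QPS saturates the CPU).
surge_cpu = simulate(limiter, offered_qps=20000, cpu_cost_per_qps=0.001, ticks=100)
surge_ratio = limiter.admit_ratio
# The surge ends; the limiter should reopen on its own within a few ticks.
calm_cpu = simulate(limiter, offered_qps=500, cpu_cost_per_qps=0.001, ticks=5)
```

In this toy model the limiter settles at the CPU target during the surge without ever being told a QPS threshold, and it returns to admitting all traffic a few ticks after the surge ends. A real system would also have to deal with noisy CPU samples, measurement delay, and request priorities, none of which are modeled here.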

Business adoption

As the core product of the Taobao application architecture upgrade (code name Tango: Taobao architecture next generation), Noah protected a large number of business application systems during the most recent Double 11 promotion, with a large-scale deployment of more than 15K containers (covering Taobao, Tmall, Juhuasuan, Hema, Maochao, Youku, and others). It improved system stability, improved resource utilization, and optimized experience and efficiency. It has strengthened the stability foundation of Taoxi (and other BUs), has become a core capability for application stability assurance, and has advanced the industry’s practice of high availability / stability assurance for large-scale distributed online business systems.

Noah’s adaptive flow control solution has been running in production for more than nine months. In real production traffic and in full-link stress tests, Noah has protected core business scenarios such as the main event venue, live streaming, and shopping guides; application systems kept running stably even with a 30% loss of capacity or under an extremely large traffic pulse of nearly 20 times.

The benefits of Noah’s adaptive flow control are as follows:

  • Availability improvement

It raises the QPS level at which the system would be overwhelmed, increasing the traffic the business can sustain by up to 20 times.

After a large traffic surge subsides, service recovers quickly, within 1 second.

Under heavy traffic pressure, capacity can be expanded in a single step by simply adding machines, with no need for urgent adjustment of rate limits.

  • User experience optimization

Under high load, the service success rate can be increased by up to 2.7 times, while response time stays at its normal level without degrading.

  • Cost optimization

Resource utilization increases by up to 100% (by removing the resource redundancy reserved for stability / uncertainty).

  • Efficiency improvement

Full-link stress tests and performance stress tests run more smoothly. There is no need to set rate-limiting thresholds manually, which avoids the long adjustment time that follows when a manual evaluation error causes the system to be overwhelmed.

The actual control effect of adaptive flow control: when traffic soars or the system is under heavy traffic pressure, CPU is held steadily at the threshold and service RT stays normal.

Future development of Noah

At present, Noah’s adaptive flow control solution protects a large number of business application systems, improving stability and resource utilization, optimizing experience and efficiency, and strengthening the stability foundation of Taoxi (and other BUs). It has become a core capability for application stability assurance, and has advanced the industry’s practice of high availability / stability assurance for large-scale distributed online business systems.

Some prospects for building this out more systematically in the future are as follows:

  • Extend adaptivity from rate limiting to other stability capabilities such as isolation / circuit breaking, for example:

Adaptive thread resource isolation

Adaptive service ratio

Adaptive service circuit breaking

  • Extend adaptive rate limiting from the single machine to the link level, especially the client traffic access layer

Cooperate with the access layer so that ingress traffic adaptively matches the processing capacity of the application

Guarantee high availability for traffic with certainty

  • Extend adaptive flow control to adaptive elastic capacity

Coordinate flow control with control of processing resources

Both flow control and resource control aim to make the traffic being processed match the resource capacity

  • Ensure the system is not overloaded, and improve the stability and success rate of business requests
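The point that flow control and resource control both aim to match traffic to capacity can be illustrated with a few lines of arithmetic. This is only a sketch of the relationship, assuming identical instances with a known per-instance capacity; the function `match_traffic_to_capacity` and its parameters are hypothetical and do not describe Noah’s planned design.

```python
import math


def match_traffic_to_capacity(offered_qps: float, instances: int, qps_per_instance: float):
    """Return the two ways to close a gap between offered traffic and capacity.

    admit_ratio:     fraction of traffic to admit now (flow control knob)
    extra_instances: instances to add so nothing is shed (resource control knob)
    """
    capacity = instances * qps_per_instance
    admit_ratio = 1.0 if offered_qps <= capacity else capacity / offered_qps
    needed = math.ceil(offered_qps / qps_per_instance)
    extra_instances = max(needed - instances, 0)
    return admit_ratio, extra_instances


# 5000 QPS offered to 4 instances that can each serve 1000 QPS.
admit_ratio, extra_instances = match_traffic_to_capacity(5000, instances=4, qps_per_instance=1000)
print(admit_ratio, extra_instances)  # 0.8 1
```

Flow control acts immediately by admitting 80% of the traffic, while adding one instance removes the need to shed anything once the new capacity is available, which is why coordinating the two is attractive.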

Author: Zheliang, Bafeng, Zebin

