How to keep services stable during a traffic surge

Time: 2021-7-28

Design of adaptive load shedding protection for services

Design purpose

  • Ensure that the system is not overwhelmed by excessive requests
  • Provide as much throughput as possible while keeping the system stable

Design considerations

  • How to measure system load
    • Whether the service runs in a virtual machine or in a container, the load has to be read from the cgroup statistics
    • 1000m represents 100% CPU; 800m is the recommended threshold for treating the system as highly loaded
  • Keep the overhead as small as possible, without significantly increasing RT (response time)
  • Failures of the DB or cache systems that the service depends on are not considered here; such problems are handled by the circuit breaker mechanism
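
To make the millicpu scale concrete, here is a minimal sketch of how CPU time consumed over a sampling interval could be converted into the 0 to 1000m range and compared with the 800m mark. The helper `cpuMillis`, its parameters, and the constant `highLoadThreshold` are hypothetical names for illustration; reading the actual usage and quota values from cgroup is left out.

```go
package main

import "fmt"

// highLoadThreshold is the high-load mark suggested in the text: 800m, i.e. 80% CPU.
const highLoadThreshold = 800

// cpuMillis converts the CPU time consumed during an interval into millicpu,
// relative to the CPU quota granted to the service (hypothetical helper).
//   usageNs:    CPU time used in the interval, in nanoseconds
//   elapsedNs:  wall-clock length of the interval, in nanoseconds
//   quotaCores: number of cores the service may use (e.g. 0.5 or 2)
func cpuMillis(usageNs, elapsedNs uint64, quotaCores float64) uint64 {
	if elapsedNs == 0 || quotaCores <= 0 {
		return 0
	}
	return uint64(float64(usageNs) * 1000 / (float64(elapsedNs) * quotaCores))
}

func main() {
	// 1.6s of CPU time over a 1s interval with a 2-core quota is 80% of the quota.
	load := cpuMillis(1_600_000_000, 1_000_000_000, 2)
	fmt.Println("load:", load, "high load:", load >= highLoadThreshold)
}
```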

Mechanism design

  • When calculating CPU load, use a moving average to smooth out the instability caused by CPU load jitter. See the references for details on the moving average

    • The moving average approximates the average of the previous N consecutive samples; N is determined by the hyperparameter beta (roughly N = 1/(1 - beta))
    • When the CPU load exceeds the specified threshold, the load shedding protection mechanism is triggered
  • A sliding time window records the QPS and RT (response time) of the most recent time window

    • The sliding window uses 50 buckets covering 5 seconds; each bucket holds the requests of one 100ms slot. Buckets are reused cyclically, with the newest data overwriting the oldest
    • When calculating maxQPS and minRT, the most recent bucket, which is still being filled, must be excluded: it may contain only a few requests, which could yield an unrepresentatively low minRT. With the 50-bucket setting above, only 49 buckets are therefore used when calculating maxQPS and minRT
  • Reject the request if both of the following conditions are met

    1. The current CPU load exceeds the preset threshold, or the last rejection happened no more than 1 second ago (cooling period). The cooling period prevents the pressure from being raised again immediately after the load drops, which would make the load oscillate back and forth

    2. averageFlying > max(1, QPS*minRT/1e3)

      • averageFlying = MovingAverage(flying)

      • When calculating the moving average of flying (the number of in-flight requests), the hyperparameter beta defaults to 0.9, which corresponds to averaging roughly the last ten flying values

      • There are three ways to update averageFlying:

        1. Update averageFlying once when a request arrives, shown as the orange curve in the figure
        2. Update averageFlying once when a request completes, shown as the green curve in the figure
        3. Update averageFlying both when a request arrives and when it completes

        We use the second one, which is better at preventing jitter, as shown in the figure:
        [Figure: comparison of flying update strategies]

      • QPS = maxPass * bucketsPerSecond

        • maxPass is the largest number of successful requests in any single valid bucket
        • bucketsPerSecond is the number of buckets per second
      • 1e3 stands for 1000 milliseconds, and minRT is also in milliseconds, so QPS * minRT / 1e3 gives the average number of concurrent requests at any point in time
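
The moving average and the bucketed window described above can be sketched as follows. This is a simplified illustration, not the go-zero implementation: the types `bucket` and `window`, the function `ema`, and the constant `defaultMinRT` are hypothetical names, buckets are advanced explicitly instead of by the clock, locking is omitted, and minRT is assumed to be the smallest per-bucket average response time.

```go
package main

import "fmt"

// defaultMinRT is an assumed fallback (in ms) used when no bucket has data yet.
const defaultMinRT = 1000.0

// ema updates an exponential moving average: beta weights the history and
// (1-beta) weights the new sample, covering roughly 1/(1-beta) samples.
func ema(prev, sample, beta float64) float64 {
	return prev*beta + sample*(1-beta)
}

// bucket aggregates the requests completed within one 100ms slot.
type bucket struct {
	pass    int64   // number of successfully completed requests
	totalRT float64 // sum of their response times in ms
}

// window is a ring of buckets; the newest slot overwrites the oldest.
type window struct {
	buckets []bucket
	current int // index of the bucket that is still being filled
}

func newWindow(size int) *window {
	return &window{buckets: make([]bucket, size)}
}

// advance moves to the next slot and clears it for reuse.
func (w *window) advance() {
	w.current = (w.current + 1) % len(w.buckets)
	w.buckets[w.current] = bucket{}
}

// add records one completed request in the current bucket.
func (w *window) add(rtMs float64) {
	b := &w.buckets[w.current]
	b.pass++
	b.totalRT += rtMs
}

// stats returns maxPass and minRT over all buckets except the current one,
// which is skipped because it has not been filled completely yet.
func (w *window) stats() (maxPass int64, minRT float64) {
	minRT = defaultMinRT
	for i, b := range w.buckets {
		if i == w.current || b.pass == 0 {
			continue
		}
		if b.pass > maxPass {
			maxPass = b.pass
		}
		if avg := b.totalRT / float64(b.pass); avg < minRT {
			minRT = avg
		}
	}
	return maxPass, minRT
}

func main() {
	w := newWindow(50) // 50 buckets x 100ms = 5s
	w.add(20)
	w.add(40)
	w.advance() // the first bucket is now complete
	w.add(10)   // lands in the unfinished bucket and is ignored by stats
	maxPass, minRT := w.stats()
	fmt.Println("maxPass:", maxPass, "minRT(ms):", minRT)

	cpu := 0.0
	for _, sample := range []float64{900, 900, 900} {
		cpu = ema(cpu, sample, 0.9) // a short spike moves the average only slowly
	}
	fmt.Printf("smoothed CPU: %.0fm\n", cpu)
}
```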

Use of load shedding

  • An optional configuration item has been added to the rest and zrpc frameworks
    • CpuThreshold: if set to a value greater than 0, the automatic load shedding mechanism of the service is enabled
  • If a request is dropped, the error log will contain the keyword dropreq

References

  • moving average
  • Sentinel adaptive current limiting

Project address

https://github.com/tal-tech/go-zero

Tal Technology