Rate limiting is widely used in many business scenarios: by limiting how frequently an API can be called, it prevents overuse and protects the API from accidental or malicious abuse. Recently, Chen Zhuo, a cloud system development engineer at Youpai Cloud, was invited to give an Open Talk class entitled "Practice of Cloud Gateway Rate Limiting", explaining in detail the commonly used algorithms and the implementation and configuration details based on the Nginx/OpenResty gateway. The following is an edited transcript of the live talk.
Gateway rate limiting is a defensive measure: public services need to protect themselves from overuse. There are three main benefits of rate limiting:
- Better user experience: users of a public service always share finite resources such as CPU. When one user overuses the API, intentionally or not, it inevitably affects other users.
- Better stability: CPU, memory, and other resources are limited, and excessive access inevitably affects service stability. Suppose there are four services, each able to carry 100 requests. When one of them receives more than 100 requests it may go down; the remaining three then absorb its traffic, exceed their own limits, and go down in turn, a cascade that makes the whole service unavailable.
- Lower cost: many services now run on the public cloud, where memory, CPU, and traffic all cost money, often billed pay-as-you-go. Excessive access generates unnecessary expense.
Several rate-limiting algorithms
First, four rate-limiting algorithms are introduced: leaky bucket, token bucket, fixed window, and sliding window. Many rate-limiting measures are based on these. Although the leaky bucket and token bucket look quite different intuitively, the two algorithms are very similar in their underlying implementation and achieve similar results. Fixed window and sliding window belong to another family; the sliding window is built on top of the fixed window.
As shown in the figure above, all user requests are put into the bucket, and when the bucket is full new requests are rejected. For example, with a limit of 10 requests per minute, one request leaks out of the bottom every 6 seconds. The characteristic of the leaky bucket algorithm is that requests pass at a constant rate, so traffic is shaped very evenly: even if 100 requests arrive at the same moment, they are not all passed at once but are released slowly at fixed intervals. This is very friendly to the back-end service when sudden traffic arrives.
With the token bucket, as the name implies, there are tokens in the bucket, and they are added at a fixed rate. If the limit is 10 requests per minute, 10 tokens are put into the bucket each minute. When a request comes in, it must first take a token from the bucket: if it gets one, the request is released; if the bucket is empty, the request is rejected.
Note that because tokens are released at a fixed rate, 10 tokens per minute, the requests that can pass are also limited to 10 per minute. Adding tokens one at a time at a constant pace, one every 6 seconds, produces the same final result as adding 10 tokens once per minute.
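The token-bucket logic described above can be sketched in a few lines of Python. The class and parameter names are illustrative, and the explicit `now` parameter exists only to make the example deterministic; this is not any particular library's API:

```python
import time

class TokenBucket:
    """Tokens are added at a fixed rate; a request passes only if a token is available."""

    def __init__(self, rate, capacity, now=None):
        self.rate = rate                      # tokens added per second
        self.capacity = capacity              # bucket size (also the burst ceiling)
        self.tokens = capacity                # start with a full bucket
        self.last = time.monotonic() if now is None else now

    def allow(self, now=None):
        now = time.monotonic() if now is None else now
        # Refill proportionally to elapsed time, capped at the bucket capacity.
        self.tokens = min(self.capacity, self.tokens + (now - self.last) * self.rate)
        self.last = now
        if self.tokens >= 1:
            self.tokens -= 1
            return True
        return False
```

With `rate=10/60` this matches the example above: adding one token every 6 seconds is equivalent to adding 10 per minute.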
Implementation of the leaky bucket algorithm
Since the token bucket and the leaky bucket have the same effect, this section focuses on the leaky bucket algorithm and its implementation. Assume the rate limit is three requests per minute, that is, one request released every 20 seconds. As shown in the figure, suppose the first request arrives at second 10. Since no request has arrived before, it is allowed to pass, and we record the last access time, the time at which this request passed.
Now another request arrives at second 20. Only 10 seconds have elapsed since second 10, which at one request per 20 seconds amounts to only 0.5 requests, so the request is rejected and the `last` value remains the time the previous request passed. Another request arrives at second 30: 30 minus 10 is 20 seconds, exactly the interval we allow, so this request passes and `last` is updated to 30 seconds.
This analysis shows that the leaky bucket limit is very strict. Even a request at second 29 cannot pass, because a full 20 seconds must elapse before a request is allowed. This can be a problem for the business: for example, with three requests allowed per minute, a user may need to send several of them within the first 10 seconds, which this algorithm would never allow, because the first and second requests must be 20 seconds apart. To make up for this defect, we introduce another parameter, burst, to allow short bursts of requests. As shown in the figure below, only 10 seconds have passed between second 30 and second 40; by the previous calculation only 0.5 requests would be allowed and the request should be rejected. But we allow it to use one more request in advance (burst = 1), and the result is 0.5 + 1 = 1.5 requests, so it passes.
Note that although the current time is second 40, we must update the `last` time to second 50. Because the quota was overdrawn into the next time period, it is as if a request were released in advance: the previous `last` was 30 seconds, and it is advanced by one full 20-second interval to 50 seconds. This is a common feature of such implementations; burst means allowing access ahead of time.
Another request arrives at second 45. Although bursting is allowed, the last access time is now 50 seconds, and the calculation yields less than one request, (45 − 50)/20 + 1 = 0.75, so the request is rejected and the `last` timestamp remains 50 seconds.
The core of the leaky bucket algorithm, then, is to save the time at which the last request passed. When a new request arrives, subtract that time from the current time to obtain the number of requests that may pass. If the request can pass, update the `last` time to the current time; if not, leave it unchanged. When we add the burst feature, allowing some requests to be admitted in advance, `last` may no longer be the actual time of the most recent request: it advances by one interval per admitted request and may even run ahead of the current time. In short, the core state consists of the two variables `last` and `burst`.
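The `last` + `burst` scheme above can be sketched in Python as follows, following the worked example (timestamps in seconds; the names are illustrative, not from any library):

```python
class LeakyBucket:
    """Leaky bucket tracked with a single timestamp, as described above."""

    def __init__(self, interval, burst=0):
        self.interval = interval  # seconds that must elapse between requests
        self.burst = burst        # extra requests that may be admitted in advance
        self.last = None          # (theoretical) time of the last admitted request

    def allow(self, now):
        if self.last is None:
            self.last = now       # the first request always passes
            return True
        # Convert elapsed time into a number of admissible requests.
        allowance = (now - self.last) / self.interval + self.burst
        if allowance < 1:
            return False
        # Advance `last` by one interval; with burst it may run ahead of `now`.
        self.last = max(now, self.last + self.interval)
        return True
```

Replaying the figure's numbers: requests at seconds 10, 20, 30 give pass / reject / pass without burst, and with `burst=1` a request at second 40 passes while `last` jumps to 50, so a request at second 45 is rejected.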
Open-source libraries for the leaky bucket / token bucket algorithms
There are many open-source libraries for leaky buckets and token buckets. The following are classics, with implementations and packages in various languages. Since I am mainly familiar with Lua and Golang, I will focus on those:
- openresty/lua-resty-limit-traffic (two variables)
Nginx itself uses the leaky bucket; have a look if you are interested (how to configure limits in Nginx is covered later). The lua-resty-limit-traffic library, part of the OpenResty ecosystem, implements the leaky bucket around the two variables described above, and likewise supports admitting a number of requests in advance (burst).
- uber-go/ratelimit (optional lock)
This is a rate limiter implemented in Go inside Uber. Unlike the lock-free Lua code above, this implementation adds an optional lock. In my opinion the lock is a good choice in high-concurrency scenarios: there is a get-and-set operation, and a lock is needed to guarantee accuracy. It is worth a look as well.
When configuring rate limits in Nginx, first consider the limiting dimension. For example, each user may be allowed only two requests per minute, a limit in the user dimension; limits can likewise be applied per IP or per host. There are also per-server limits: if a server can handle 10,000 requests per second, anything beyond 10,000 may be dropped.
All of these limiting dimensions can be implemented in Nginx, mainly via two modules: ngx_http_limit_req_module and ngx_http_limit_conn_module, which limit the number of requests and the number of connections respectively.
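For illustration, a minimal configuration using both modules might look like this; the zone names, sizes, and rates are made-up example values, not recommendations:

```nginx
# Allow 10 requests per second per client IP, with a burst of 5.
limit_req_zone $binary_remote_addr zone=req_per_ip:10m rate=10r/s;

# Allow at most 20 concurrent connections per client IP.
limit_conn_zone $binary_remote_addr zone=conn_per_ip:10m;

server {
    location /api/ {
        limit_req zone=req_per_ip burst=5 nodelay;
        limit_conn conn_per_ip 20;
    }
}
```

Keying the zones on $binary_remote_addr gives the per-IP dimension; keying on $host or a user identifier gives the other dimensions mentioned above.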
Fixed window is one of the easiest algorithms to understand. It is also very easy to apply in distributed rate-limiting scenarios, because it does not need locking.
The figure shows a time window; suppose we allow 50 requests per minute. The first request arrives at second 30 and 49 more arrive at second 40, so 50 requests have now come within the minute. Since the limit of 50 requests per minute has been reached, a further request at second 50 is rejected outright. Once the next minute starts, the counter resets, so even if 50 requests arrive all at once they are allowed.
This analysis shows that the fixed window algorithm is really very simple: the program only needs to store the number of requests in the current time window. As for lock-freedom, the counter is incremented with an atomic operation; after incrementing, check whether it exceeds 50 and reject the request if so, otherwise accept it. There is therefore no separate get-and-set step.
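A minimal sketch of this counter in Python (in a real multi-threaded or multi-process setting the increment would be an atomic operation; here a plain counter stands in for it, and `now` is an explicit parameter for determinism):

```python
class FixedWindow:
    """Counts requests per aligned time window; resets when the window rolls over."""

    def __init__(self, limit, window):
        self.limit = limit        # max requests per window
        self.window = window      # window length in seconds
        self.current = None       # index of the current window
        self.count = 0

    def allow(self, now):
        w = int(now // self.window)
        if w != self.current:     # a new window has started
            self.current = w
            self.count = 0
        self.count += 1           # an atomic increment in a real implementation
        return self.count <= self.limit
```

This follows the increment-then-check order described above: the counter is bumped first, then compared against the limit.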
Of course, this also has a disadvantage. As shown in the figure above, the range from 00:30 to 01:30 is also a 60-second span, yet it contains 100 requests, which violates the intent of our limit and produces a traffic spike. The algorithm can only guarantee no more than 50 requests within each fixed window; an arbitrary, non-aligned window may well see more than 50. To address this, the concept of the sliding window was proposed.
As the figure shows, suppose again 50 requests per minute. A one-minute sliding window means the minute immediately preceding the current time; for example, at 01:15 it covers 00:15 to 01:15.
We know there are 18 requests from 01:00 to 01:15, but how many were there between 00:15 and 01:00? What we do know is that there were 42 requests between 00:00 and 01:00, and the sliding window algorithm estimates by proportion. That minute is divided into two periods, the first 15 seconds and the last 45 seconds, and the number of requests from 00:15 to 01:00 is estimated by that ratio. Scaling like this is not exact, because only the total per window is recorded. The calculation gives

rate = 42 × ((60 − 15) / 60) + 18 = 42 × 0.75 + 18 = 49.5 requests

The estimated total is 49.5 requests, so the current request is rejected.
This shows that the sliding window keeps the number of passed requests close to the limit through proportional estimation. The inaccuracy can be reduced by shrinking the window time, for example from 1 minute down to 10 seconds, which lowers the probability of error; however, shrinking the window increases the storage cost considerably. Despite this shortcoming, the algorithm is used by many companies.
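The proportional estimate can be sketched in Python with just two counters per key, which is why the storage cost stays low (names are illustrative; the admit rule is that the estimate plus the incoming request must stay within the limit, matching the rejection of the 49.5 case above):

```python
class SlidingWindow:
    """Estimates the sliding-window rate from the previous and current fixed windows."""

    def __init__(self, limit, window):
        self.limit = limit
        self.window = window      # window length in seconds
        self.curr_win = None      # index of the current fixed window
        self.curr = 0             # requests counted in the current window
        self.prev = 0             # requests counted in the previous window

    def allow(self, now):
        w = int(now // self.window)
        if w != self.curr_win:
            # Roll over; the old current count becomes the previous count
            # only if the two fixed windows are adjacent.
            self.prev = self.curr if self.curr_win is not None and w == self.curr_win + 1 else 0
            self.curr_win = w
            self.curr = 0
        # Fraction of the previous window still covered by the sliding window.
        frac = 1 - (now % self.window) / self.window
        estimate = self.prev * frac + self.curr
        if estimate + 1 <= self.limit:   # would this request stay within the limit?
            self.curr += 1
            return True
        return False
```

Replaying the article's numbers (42 requests in the first minute, 18 in the first 15 seconds of the next) reproduces the 42 × 0.75 + 18 = 49.5 estimate and the resulting rejection.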
Is the sliding window accurate?
Cloudflare’s analysis of 400 million requests from 270000 different sources shows that:
- 0.003% of requests were incorrectly allowed or limited
- The average difference between the actual rate and the approximate rate is 6%
- Although some traffic was allowed slightly above the threshold (false positives), the actual rate came in less than 15% above the threshold rate
Many big companies use the sliding window algorithm. If, with a limit of 50 per minute, you can tolerate a window occasionally passing 40 or 60, this algorithm is perfectly feasible.
Fixed window / sliding window application
As mentioned just now, the window-based limiting algorithm does not need locking and can be driven by atomic operations, so the implementation is very simple.
- openresty/lua-resty-limit-traffic (atomic operation)
It is only about 100 lines of code and uses an atomic increment with no locking, which makes it friendly to multi-threaded and multi-process deployments at very low cost.
Kong also implements a sliding window, though with considerably more code; have a look if you are interested. It is also Lua code, and its sliding window implementation is relatively complete.
Distributed rate limit
Most of the time there is more than one gateway. With two machines, state must be synchronized; in the leaky bucket algorithm, for example, the `last` value must be shared. A database can serve as the synchronization backend, but DB synchronization only suits scenarios with a small request volume. When the request volume is large, use Redis, a fast in-memory store, which makes synchronization much quicker. Both of these approaches suit cases where the limit must be fairly precise. If precision is not critical and we only need to keep the service from being overwhelmed, I think local limiting is enough.
What is local limiting? Take the 50-requests-per-minute example: with two nodes, you can allot 25 requests to each node. This scheme works. If the traffic splits unevenly, say 10% to one node and 90% to the other, adjust the weights instead: 45 requests per minute for one node and 5 for the other. This avoids introducing DB middleware altogether.
For this distributed scenario, Apache APISIX (https://github.com/apache/inc…) is a good implementation: a library based on OpenResty that uses Redis directly for synchronization and implements the fixed window method. Another good one is goredis (https://github.com/rwz/redis-…), which implements the leaky bucket algorithm on top of a Golang library. The leaky bucket over Redis costs slightly more; in a distributed setting, the fixed window and sliding window are much cheaper.
Performance optimization of distributed rate limit
As mentioned above, every request must read from Redis or a DB; the fixed window, for instance, reads the count value, and decrementing it in Redis on every request adds latency and extra overhead. To solve this, some open-source and enterprise solutions advocate not synchronizing values in real time. With a limit of two requests per minute, when node1 receives a request it does not immediately decrement the counter in Redis; instead it waits and synchronizes periodically, for example once per second, thereby reducing the number of synchronizations.
But this brings a new problem. Suppose the limit allows two requests per second, and node1 and node2 each receive two requests within the same second. Since the second has not yet elapsed, each node counts only locally, so all four requests are let through. At the one-second mark, when the nodes go to Redis to decrement, they discover that four requests have been released. This can be compensated by allowing no requests at all through in the next second.
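The batched-synchronization idea can be sketched as follows. This is a toy model, not any product's implementation: a plain dict stands in for Redis, and calling `sync` stands in for the once-a-second timer:

```python
class Node:
    """A gateway node that counts locally and flushes to shared storage periodically."""

    def __init__(self, shared, limit):
        self.shared = shared   # shared store, e.g. Redis; here just {"count": n}
        self.limit = limit     # requests allowed per window
        self.local = 0         # requests admitted since the last sync

    def allow(self):
        # Decide from the last synced global count plus local traffic only;
        # other nodes' unsynced requests are invisible, hence the overshoot.
        if self.shared["count"] + self.local < self.limit:
            self.local += 1
            return True
        return False

    def sync(self):
        # Called periodically (e.g. once a second) instead of on every request.
        self.shared["count"] += self.local
        self.local = 0
```

With a limit of 2, two nodes can each admit 2 requests before the first sync, exactly the four-instead-of-two overshoot described above; only after syncing do both nodes start rejecting.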
Of course, whether this is acceptable depends on your tolerance. It is a valid solution, though few products implement it. Kong is one: a gateway built on OpenResty, it implements the timed synchronization just described and does not need to synchronize the count value in real time. Another is Cloudflare, which also uses sliding windows to address the performance problem, but its implementation is not open source.
- Leaky bucket and token bucket are classic algorithms; you can find an implementation in a language you are familiar with, and the limiting is quite accurate.
- Fixed window is the easiest to implement, lock-free, and well suited to distribution, but traffic spikes can occur. It fits scenarios where the limit need not be smooth; its limiting is fairly accurate.
- Sliding window is also easy to implement and suited to distribution, and it avoids the traffic-spike problem, but its limit has some deviation; to use it you must tolerate that deviation.
The above is the main content shared by Chen Zhuo in the Youpai Cloud Open Talk class. See the link below for the video and slides of the talk: