Flow control (rate limiting) is one of the ways to keep services highly available. In a microservice architecture in particular, limiting the flow to interfaces or resources effectively protects the availability and stability of services.
The flow control tool used in previous projects was mainly Guava's RateLimiter. RateLimiter is based on the token bucket algorithm; it is very simple to use but offers relatively few features.
Now we have a new choice: Sentinel, provided by Alibaba.
Sentinel is a flow control and circuit breaking middleware provided by Alibaba. Compared with RateLimiter, Sentinel provides much richer flow control and circuit breaking features. It supports configuring flow control and circuit breaking rules from a console, supports cluster flow control, and visualizes the corresponding service calls.
Many projects have already integrated Sentinel. This article analyzes Sentinel's flow control function in detail and does not go deep into Sentinel's other capabilities.
1、 Overall process
Let's first look at the overall process.
(Figure: Sentinel's overall workflow, quoted from the Sentinel website)
From the design pattern perspective, this is a typical chain of responsibility. After an external request comes in, it is processed by each node in the chain, and Sentinel's flow control and circuit breaking are implemented by these nodes.
From the algorithm perspective, Sentinel uses a sliding window algorithm for flow control. To understand the principle in depth, we have to start from the source code, so let's go straight to Sentinel's source.
2、 Source code reading
1. Source code entry point and overall process
To read the source code, first find the entry point. We often mark a method with @SentinelResource, and a method marked this way can be regarded as a Sentinel resource. So we take @SentinelResource as the entry point, find its aspect, and see what the aspect does after intercepting the call; that tells us how Sentinel works. Let's look directly at the aspect code for the @SentinelResource annotation.
You can clearly see how Sentinel behaves. After entering the @SentinelResource aspect, the SphU.entry method is executed, and it is in this method that the flow control and circuit breaking logic is applied to the intercepted method.
If circuit breaking or flow control is triggered, a BlockException is thrown. We can specify a blockHandler method to handle the BlockException. For business exceptions, we can also configure a fallback method to handle exceptions thrown by the intercepted method.
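The control flow just described can be sketched without any Sentinel dependency. The classes below (DemoBlockException, DemoGuard) are hypothetical stand-ins invented for illustration, not Sentinel's API, and the rule check is reduced to a precomputed flag. The sketch only shows how a blocked call is routed to a block handler and a business exception to a fallback.

```java
import java.util.function.Supplier;

// Hypothetical stand-in for Sentinel's BlockException (not the real class).
class DemoBlockException extends Exception {
    DemoBlockException(String message) { super(message); }
}

// Hypothetical guard that mimics what the aspect does around a protected method.
class DemoGuard {
    private final boolean blocked; // assumption: the rule check result is precomputed

    DemoGuard(boolean blocked) { this.blocked = blocked; }

    // Stand-in for the entry step: throws when a rule rejects the call.
    private void enter(String resource) throws DemoBlockException {
        if (blocked) {
            throw new DemoBlockException("blocked: " + resource);
        }
    }

    String invoke(String resource,
                  Supplier<String> business,
                  Supplier<String> blockHandler,
                  Supplier<String> fallback) {
        try {
            enter(resource);           // flow control and circuit breaking checks
            return business.get();     // the intercepted method
        } catch (DemoBlockException e) {
            return blockHandler.get(); // a rule was triggered
        } catch (RuntimeException e) {
            return fallback.get();     // the business method itself failed
        }
    }
}
```

A call such as `new DemoGuard(true).invoke("res", () -> "ok", () -> "blocked", () -> "fallback")` takes the block handler path, while a business method that throws takes the fallback path.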
Therefore, Sentinel's circuit breaking and flow control processing happens mainly in the SphU.entry method. The main processing logic is shown in the following source code.
It can be seen that, in the SphU.entry method, the process of Sentinel's flow control and circuit breaking can be summarized as follows:
- Get the Sentinel context (Context);
- Get the chain of responsibility for the resource;
- Generate the resource call credential (Entry);
- Execute each node in the chain of responsibility.
Next, Sentinel's working mechanism is described systematically around these steps.
2. Get the Sentinel context
Context, as the name suggests, is the context in which Sentinel's flow control is executed. It contains the Node and Entry information of the resource call.
Let's look at the characteristics of Context:
- A Context is held by a thread and bound to the current thread through a ThreadLocal.
- The content a Context contains
This leads to three important Sentinel concepts: Context, Node, and Entry. These three classes are the core classes of Sentinel and provide the resource call path, resource call statistics, and other information.
Context is the Sentinel context held by the current thread.
When Sentinel's logic is entered, the Context of the current thread is obtained first; if there is none, a new Context is created. When the task finishes, the Context of the current thread is cleared. A Context represents the context of a call link and runs through all the Entries in that link.
A Context maintains information such as the entrance node, the curNode of this call link, and the origin of the call. The name of the Context is the name of the call link entrance.
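The thread binding described above can be illustrated with a small sketch. DemoContext and DemoContextHolder are hypothetical names, not Sentinel classes, and the context carries only a name and an origin. The point is the get-or-create step on entry and the clearing of the ThreadLocal on exit.

```java
// Hypothetical, simplified context; not Sentinel's Context class.
class DemoContext {
    final String name;
    final String origin;

    DemoContext(String name, String origin) {
        this.name = name;
        this.origin = origin;
    }
}

class DemoContextHolder {
    private static final ThreadLocal<DemoContext> HOLDER = new ThreadLocal<>();

    // Return the context bound to this thread, creating one if absent.
    static DemoContext enter(String name, String origin) {
        DemoContext ctx = HOLDER.get();
        if (ctx == null) {
            ctx = new DemoContext(name, origin);
            HOLDER.set(ctx);
        }
        return ctx;
    }

    static DemoContext current() {
        return HOLDER.get();
    }

    // Clear the binding once the call link has finished.
    static void exit() {
        HOLDER.remove();
    }
}
```

Because the binding lives in a ThreadLocal, a second `enter` on the same thread returns the existing context, and another thread sees no context at all until it enters one itself.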
Node is a statistical wrapper for a resource marked with @SentinelResource.
The entrance node of the current thread's resource calls is recorded in the Context.
We can trace the resource calls through the childList of the entrance node. Each node corresponds to a resource marked with @SentinelResource and its statistics, such as passQps, blockQps, RT and so on.
Entry is the credential Sentinel uses to indicate whether a call has passed flow control. If the Entry is returned normally, you can access the downstream service protected by Sentinel; otherwise Sentinel throws a BlockException.
In addition, it stores some basic information about this execution of the entry() method, including the Context, the Node, and the chain of responsibility for the resource. After the resource call completes, the Entry is needed again to perform cleanup: exiting the chain of responsibility for the Entry, updating some Node statistics, and clearing the Context of the current thread.
3. Get the chain of responsibility for a resource marked with @SentinelResource
The chain of responsibility for a resource is where the flow control logic is implemented, using the typical chain of responsibility pattern.
Let's look at the composition of the default chain of responsibility.
The default processing nodes in the chain include NodeSelectorSlot, ClusterBuilderSlot, StatisticSlot, FlowSlot, DegradeSlot, and others. ProcessorSlotChain and all the slots in it implement the ProcessorSlot interface. The chain of responsibility pattern is used to run each node's processing logic and then call the next node.
Each node has its own function. We will see what these nodes do later.
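To make the pattern concrete, here is a minimal chain of responsibility in plain Java. DemoSlot, DemoNamedSlot and DemoSlotChain are hypothetical types written for this article; Sentinel's ProcessorSlot has a richer signature, and the slots here only record their names to show the order of execution.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical slot abstraction, much simpler than Sentinel's ProcessorSlot.
abstract class DemoSlot {
    private DemoSlot next;

    void setNext(DemoSlot next) { this.next = next; }

    // Each slot does its own work, then hands over to the next slot.
    abstract void entry(String resource, List<String> trace);

    protected void fireEntry(String resource, List<String> trace) {
        if (next != null) {
            next.entry(resource, trace);
        }
    }
}

// A slot that only records its name, to make the execution order visible.
class DemoNamedSlot extends DemoSlot {
    private final String name;

    DemoNamedSlot(String name) { this.name = name; }

    @Override
    void entry(String resource, List<String> trace) {
        trace.add(name);
        fireEntry(resource, trace);
    }
}

class DemoSlotChain {
    private DemoSlot first;
    private DemoSlot last;

    DemoSlotChain addLast(DemoSlot slot) {
        if (first == null) {
            first = slot;
        } else {
            last.setNext(slot);
        }
        last = slot;
        return this;
    }

    // Run the chain from the first slot and return the visited slot names.
    List<String> entry(String resource) {
        List<String> trace = new ArrayList<>();
        if (first != null) {
            first.entry(resource, trace);
        }
        return trace;
    }
}
```

Adding four named slots and calling `entry("res")` returns their names in insertion order, which is the same head-to-tail traversal the real chain performs.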
In addition, the same resource (a method marked with @SentinelResource) always uses the same chain of responsibility; in other words, each resource has its own separate chain. The logic for obtaining the chain can be seen in the source code: first look it up in the cache, and create a new one if none is cached for the resource.
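The "look up in the cache, create if absent" step can be sketched as follows. This is an illustration only: DemoChain and DemoChainRegistry are hypothetical, and using ConcurrentHashMap.computeIfAbsent is an assumption made for brevity, so Sentinel's own caching code may well be written differently.

```java
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

// Hypothetical chain type; only its identity matters for this sketch.
class DemoChain {
    final String resource;

    DemoChain(String resource) { this.resource = resource; }
}

class DemoChainRegistry {
    // One chain per resource name, created lazily on first use.
    private final Map<String, DemoChain> chains = new ConcurrentHashMap<>();

    DemoChain lookChain(String resource) {
        return chains.computeIfAbsent(resource, DemoChain::new);
    }
}
```

Repeated lookups for the same resource name return the same chain instance, while a different resource name gets its own chain.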
4. Generate the call credential Entry
The generated Entry is a CtEntry. Its constructor parameters include the resource wrapper, the chain of responsibility for the resource, and the Context of the current thread.
You can see that the new CtEntry records the chain of responsibility and the Context of the current resource, and updates the Context so that the Context's current entry is the new CtEntry itself. As you can see, CtEntry forms a doubly linked list, which builds the call link of Sentinel resources.
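The linking behaviour can be sketched with two small classes. LinkContext and LinkEntry are hypothetical and keep only the parent and child pointers plus the context's current entry; they are not CtEntry, which also carries the chain and the node.

```java
// Hypothetical call-link context holding only the current entry.
class LinkContext {
    LinkEntry curEntry;
}

// Hypothetical entry that links itself to its parent on creation.
class LinkEntry {
    final String resource;
    final LinkContext context;
    LinkEntry parent;
    LinkEntry child;

    LinkEntry(String resource, LinkContext context) {
        this.resource = resource;
        this.context = context;
        this.parent = context.curEntry; // the previous entry becomes the parent
        if (parent != null) {
            parent.child = this;        // link forward as well
        }
        context.curEntry = this;        // the new entry is now the current one
    }

    // Exiting restores the parent as the current entry of the context.
    void exit() {
        context.curEntry = parent;
        if (parent != null) {
            parent.child = null;
        }
    }
}
```

Creating an outer entry and then an inner entry on the same context yields outer.child == inner and inner.parent == outer; exiting in reverse order walks the context's current entry back up to null.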
5. Execution of the chain of responsibility
Next comes the execution of the chain of responsibility. Both the chain and its slots implement ProcessorSlot. The chain's entry method executes each slot in turn, so we will step into each slot. To stay focused, this article only studies the slots related to flow control.
5.1 NodeSelectorSlot: get the node for the current resource and build the node call tree
This slot is responsible for obtaining or creating the node for the current resource, which is used for the statistics of subsequent resource calls and for evaluating flow control and circuit breaking conditions. NodeSelectorSlot also builds the call link. Look at the source code:
The code style is familiar. We know that one resource corresponds to one chain of responsibility, and each chain contains a NodeSelectorSlot. The node cache map in NodeSelectorSlot is a non-static field, so the map is shared only within the current resource. Different resources have different NodeSelectorSlot instances and different node caches. The relationship between resources and node cache maps is shown in the figure below.
So the functions of NodeSelectorSlot are:
- When the chain for the resource is executed, obtain the node for the current Context, which represents the call statistics of the resource in that Context.
- Set the obtained node as the current node and append it under the previous node to form a tree-shaped call path (through the current entry in the Context).
- Trigger the execution of the next slot.
An interesting question: when NodeSelectorSlot gets the node for the resource, why does it use the Context name as the key instead of the resource name from @SentinelResource?
First, we know that one resource corresponds to one chain of responsibility, but the Context in which a resource is called may differ. If the resource name were used as the key, calls made in different Contexts would all get the same node. Keying by Context name lets the nodes of the same resource be distinguished by Context.
For example, Sentinel's protection can be applied not only through the @SentinelResource annotation, but also by introducing the related dependency (sentinel-dubbo-adapter) and using Dubbo's filter mechanism to protect Dubbo interfaces directly. Let's compare how @SentinelResource and the Dubbo filter generate the Context.
With @SentinelResource, the name of the generated Context is sentinel_default_context. The Context for all such resources has this name.
Dubbo filter mode
The name of the generated Context is the qualified name of the Dubbo interface or method.
If other @SentinelResource resources are called inside a call protected by the Dubbo filter, those resource calls will run under different Contexts.
So there is a case where requests come in through different Dubbo interfaces, and all of these interfaces call the same @SentinelResource method; the Context under which that resource executes is then different for each interface.
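The effect of keying by Context name can be shown with a small sketch. CtxNode and DemoNodeSelector are hypothetical types, and the context names used in the usage note are made up; the only thing borrowed from the text is that each resource has its own selector instance with a non-static map keyed by Context name.

```java
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

// Hypothetical per-context statistics node.
class CtxNode {
    final String resource;
    final String contextName;

    CtxNode(String resource, String contextName) {
        this.resource = resource;
        this.contextName = contextName;
    }
}

// One selector instance exists per resource, so its map is per resource too.
class DemoNodeSelector {
    private final String resource;
    // Non-static: keyed by context name, shared only within this resource.
    private final Map<String, CtxNode> nodesByContext = new ConcurrentHashMap<>();

    DemoNodeSelector(String resource) { this.resource = resource; }

    CtxNode select(String contextName) {
        return nodesByContext.computeIfAbsent(contextName, name -> new CtxNode(resource, name));
    }
}
```

With a selector for a resource named "queryOrder", `select("dubboInterfaceA")` and `select("dubboInterfaceB")` return two different nodes for the same resource, while selecting the same context name twice returns the same node.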
Another question: since a resource's statistics are split across different nodes by Context, how do we get the total statistics for the resource? This is where ClusterNode comes in; see ClusterBuilderSlot for details.
5.2 ClusterBuilderSlot: aggregate the nodes of the same resource across different Contexts
This slot is responsible for aggregating the nodes that the same resource has in different Contexts, for use in the subsequent flow control check.
As you can see, the ClusterNode is obtained using the resource name as the key. The ClusterNode becomes an attribute of the current node. Its main purpose is to aggregate the multiple nodes that one resource has under different Contexts. By default, the flow control check is based on the statistics in the ClusterNode.
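A rough sketch of this aggregation follows. AggNode, PerContextNode and DemoClusterRegistry are hypothetical names, and the statistics are reduced to a single pass counter. The sketch shows an aggregate looked up by resource name and per-context nodes that forward each update to it.

```java
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.atomic.LongAdder;

// Hypothetical aggregate shared by all contexts of one resource.
class AggNode {
    final LongAdder pass = new LongAdder();
}

// Hypothetical per-context node that forwards every update to the aggregate.
class PerContextNode {
    final LongAdder pass = new LongAdder();
    final AggNode cluster;

    PerContextNode(AggNode cluster) { this.cluster = cluster; }

    void addPass(int count) {
        pass.add(count);         // statistics for this context only
        cluster.pass.add(count); // statistics for the resource as a whole
    }
}

class DemoClusterRegistry {
    // Keyed by resource name, unlike the per-context map of the selector.
    private static final Map<String, AggNode> CLUSTER_NODES = new ConcurrentHashMap<>();

    static AggNode clusterNodeOf(String resource) {
        return CLUSTER_NODES.computeIfAbsent(resource, r -> new AggNode());
    }
}
```

Two per-context nodes built on the same aggregate keep separate pass counts, while the aggregate holds their sum, which is the number a resource-wide flow control check would read.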
5.3 StatisticSlot: resource call statistics
This slot is mainly responsible for computing and updating resource call statistics. Unlike the slots before and after it, StatisticSlot first triggers the execution of the next slot, and runs its own logic only after the following slots have finished.
This is easy to understand: as a statistics component, it can only record results after the circuit breaking and flow control checks are complete. Let's look at the specific statistics process.
The figure above describes StatisticSlot's statistics process. Depending on whether the following slots finish without an exception or throw a BlockException, it mainly updates the thread count, the number of passed requests, and the number of blocked requests. Both DefaultNode and ClusterNode inherit from StatisticNode, so node data updates end up in StatisticNode.
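The "run the following slots first, then record" order can be sketched like this. RejectedException, RuleCheck and DemoStatistics are hypothetical and stand in for BlockException, the later slots, and StatisticSlot respectively; only the pass and block counters are modelled.

```java
import java.util.concurrent.atomic.LongAdder;

// Hypothetical stand-in for a rejection raised by a later slot.
class RejectedException extends RuntimeException {
    RejectedException(String message) { super(message); }
}

// Hypothetical later slot (the rule check); returns normally or throws.
interface RuleCheck {
    void check(String resource);
}

class DemoStatistics {
    final LongAdder pass = new LongAdder();
    final LongAdder block = new LongAdder();
    private final RuleCheck next;

    DemoStatistics(RuleCheck next) { this.next = next; }

    void entry(String resource) {
        try {
            next.check(resource); // run the following slots first
            pass.increment();     // reached only if nothing rejected the call
        } catch (RejectedException e) {
            block.increment();    // record the rejection, then propagate it
            throw e;
        }
    }
}
```

The statistics are therefore always one step behind the decision: the rule check sees the counts of earlier calls, and the current call is counted only once its outcome is known.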
Referring to the block diagram of Sentinel's statistics, the general process of a node statistics update is as follows:
We start from the StatisticNode.addPassRequest() method and take passQps as an example to explore how StatisticNode updates the QPS count of passed requests.
The source code shows that the counters rollingCounterInSecond and rollingCounterInMinute are of type Metric, with time dimensions of seconds and minutes respectively. Both use ArrayMetric, an implementation class of Metric.
Tracing further from ArrayMetric:
The statistics are stored in ArrayMetric's data field, that is, a LeapArray<MetricBucket>.
LeapArray is an array of time windows. Its basic information includes: the time window length (windowLengthInMs, in ms), the sample count (sampleCount, that is, the number of time windows), the time interval (intervalInMs, in ms), and the time window array (array). The window length, sample count and interval are related as follows:
windowLengthInMs = intervalInMs / sampleCount
The intervalInMs used by rollingCounterInSecond is 1000 ms, that is, 1 s, with sampleCount = 2, so the window length is windowLengthInMs = 500 ms. The intervalInMs used by rollingCounterInMinute is 60 * 1000 ms, that is, 60 s, with sampleCount = 60, so windowLengthInMs = 1000 ms, that is, 1 s.
The type of the time window array is AtomicReferenceArray, an array whose elements are updated atomically. The element type is WindowWrap<MetricBucket>. WindowWrap wraps a time window and includes the window start time, the window length in ms, and the counter of the window (value, of type MetricBucket). The actual counting for a window is done by MetricBucket, and the counts are stored in MetricBucket's counters (of type LongAdder). The composition of the counting components is shown in the following figure:
Going back to the StatisticNode.addPassRequest() method, we take rollingCounterInSecond.addPass(count) as an example to explore how Sentinel counts with sliding windows.
5.3.1 Get the current time window
(1) Compute the array index for the current timestamp
long timeId = time / windowLength;
int idx = (int)(timeId % array.length());
Here time is the current time and windowLength is the time window length; for rollingCounterInSecond the window length is 500 ms. array.length() is the number of time windows per interval; for rollingCounterInSecond there are 2 windows per interval (1 s). timeId is the current time divided by the window length. Each time the clock advances by one window length, timeId increases by 1 and the time window slides forward.
(2) Calculate the window start time
window start time = current time (ms) - current time (ms) % time window length (ms)
The resulting window start time is an integer multiple of the time window length.
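The three formulas so far (window length, array index, window start) are easy to check with a few lines of Java. WindowMath is a hypothetical helper written for this article, and the timestamps in the usage note are arbitrary small numbers rather than real clock values.

```java
// Hypothetical helper that restates the formulas from the text.
class WindowMath {
    // windowLengthInMs = intervalInMs / sampleCount
    static int windowLengthInMs(int intervalInMs, int sampleCount) {
        return intervalInMs / sampleCount;
    }

    // idx = (time / windowLength) % array.length()
    static int indexOf(long timeMillis, int windowLengthInMs, int arrayLength) {
        long timeId = timeMillis / windowLengthInMs;
        return (int) (timeId % arrayLength);
    }

    // window start = time - time % windowLength
    static long windowStart(long timeMillis, int windowLengthInMs) {
        return timeMillis - timeMillis % windowLengthInMs;
    }
}
```

With the per-second settings (1000 ms interval, 2 samples) the window length is 500 ms. A timestamp of 1700 ms gives timeId 3, index 1 and window start 1500; a timestamp of 2100 ms gives timeId 4, index 0 and window start 2000, so the two array slots are reused alternately as time moves on.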
(3) Get the time window
First, get the time window from LeapArray's array using the array index.
- If the time window is empty, create a new time window (using CAS).
- If the time window is not empty and its start time equals the calculated start time, the current time falls in this window, and the window is returned directly.
- If the time window is not empty and its start time is earlier than the calculated start time, the window has expired (for example, a long time has passed since the window was last obtained). The window must be updated under a lock: its start time is set to the calculated start time and its counters are reset to 0.
- If the time window is not empty and its start time is later than the calculated start time, a new time window is created. Normally this branch is not reached, because it means the current time has fallen behind the window, that is, the stored window lies in the future, which is meaningless.
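The four branches above can be sketched as a small, self-contained class. DemoWindow and DemoLeapArray are hypothetical simplifications written from the description in this article: each window holds a single pass counter, and the timestamp is passed in as a parameter so the behaviour can be followed by hand. It is not a copy of Sentinel's LeapArray.

```java
import java.util.concurrent.atomic.AtomicReferenceArray;
import java.util.concurrent.atomic.LongAdder;
import java.util.concurrent.locks.ReentrantLock;

// Hypothetical window holder: start time plus a single pass counter.
class DemoWindow {
    volatile long start;
    final LongAdder pass = new LongAdder();

    DemoWindow(long start) { this.start = start; }
}

class DemoLeapArray {
    private final int windowLengthInMs;
    private final AtomicReferenceArray<DemoWindow> array;
    private final ReentrantLock updateLock = new ReentrantLock();

    DemoLeapArray(int sampleCount, int intervalInMs) {
        this.windowLengthInMs = intervalInMs / sampleCount;
        this.array = new AtomicReferenceArray<>(sampleCount);
    }

    DemoWindow currentWindow(long timeMillis) {
        int idx = (int) ((timeMillis / windowLengthInMs) % array.length());
        long windowStart = timeMillis - timeMillis % windowLengthInMs;
        while (true) {
            DemoWindow old = array.get(idx);
            if (old == null) {
                // Empty slot: publish a new window with CAS.
                DemoWindow created = new DemoWindow(windowStart);
                if (array.compareAndSet(idx, null, created)) {
                    return created;
                }
                Thread.yield(); // lost the race, retry
            } else if (old.start == windowStart) {
                return old; // the timestamp still falls inside this window
            } else if (old.start < windowStart) {
                // Expired window: reset it under a lock and reuse the slot.
                if (updateLock.tryLock()) {
                    try {
                        old.start = windowStart;
                        old.pass.reset();
                        return old;
                    } finally {
                        updateLock.unlock();
                    }
                }
                Thread.yield(); // another thread is resetting, retry
            } else {
                // Stored window lies in the future: return a detached window.
                return new DemoWindow(windowStart);
            }
        }
    }
}
```

With two 500 ms windows, timestamps 1700 and 1999 return the same window (start 1500), timestamp 2100 creates the window in the other slot (start 2000), and timestamp 2700 reuses the first slot after resetting it to start 2500 with a count of zero.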
5.3.2 Accumulate the counters in the time window
The time window counter is an array of LongAdder, which stores the number of passed requests, exception requests, blocked requests and other data, as shown in the figure below:
Among these, the pass count, block count and exception count are updated in StatisticSlot's entry method, while the success count and response time are updated when StatisticSlot's exit method runs. In effect, the corresponding counts are updated before and after the intercepted method executes. addPass, for example, accumulates on the first element of the counter array.
The element type of the counter array is LongAdder. LongAdder was added to java.util.concurrent (JUC) in JDK 8. It is a thread-safe counter that performs better than the Atomic* classes under high contention.
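A counter array of this shape might look like the sketch below. DemoEvent and DemoBucket are hypothetical; the set of event kinds and their order are assumptions chosen to match the text (pass first), not a copy of Sentinel's MetricBucket.

```java
import java.util.concurrent.atomic.LongAdder;

// Hypothetical event kinds; the ordinal is the index into the counter array.
enum DemoEvent { PASS, BLOCK, EXCEPTION, SUCCESS, RT }

class DemoBucket {
    private final LongAdder[] counters = new LongAdder[DemoEvent.values().length];

    DemoBucket() {
        for (int i = 0; i < counters.length; i++) {
            counters[i] = new LongAdder();
        }
    }

    // Accumulate n on the counter that belongs to the given event.
    void add(DemoEvent event, long n) {
        counters[event.ordinal()].add(n);
    }

    long get(DemoEvent event) {
        return counters[event.ordinal()].sum();
    }

    // Called when an expired window is reused.
    void reset() {
        for (LongAdder counter : counters) {
            counter.reset();
        }
    }
}
```

LongAdder spreads concurrent updates over several internal cells and only sums them on read, which suits a counter that is written on every request and read far less often.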
5.4 FlowSlot: flow control check
FlowSlot is the slot that evaluates the flow control conditions. The resource call statistics collected by StatisticSlot are used in FlowSlot's check.
Go directly to the core flow control logic in FlowRuleChecker:
The main steps are:
- Get the flow control rules for the resource;
- Check, rule by rule, whether the call should be limited.
If the call is limited, a FlowException is thrown. FlowException inherits from BlockException.
How does FlowSlot check whether the call should be limited?
By default, the node used for the flow control check is the ClusterNode of the current node, and the main flow control metric is QPS. Let's look at the key flow control code (DefaultController):
- Get the node's current QPS count;
- Determine whether the count would exceed the threshold after adding the new request;
- If it exceeds the threshold, return false, meaning the call is limited, and a FlowException is thrown afterwards. Otherwise return true, and the call is not limited.
We can see that the flow control check itself is very simple: it only needs to compare the QPS count. This is thanks to the statistics maintained by StatisticSlot.
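The check can be sketched in a few lines. DemoFlowRule, DemoFlowChecker and DemoFlowException are hypothetical stand-ins, and the current QPS is passed in as a plain number instead of being read from a node; the sketch only illustrates the "count plus new request versus threshold" comparison and the rule-by-rule loop.

```java
import java.util.List;

// Hypothetical stand-in for a flow rejection (not Sentinel's FlowException).
class DemoFlowException extends RuntimeException {
    DemoFlowException(String message) { super(message); }
}

// Hypothetical rule: a QPS threshold for one resource.
class DemoFlowRule {
    final double threshold;

    DemoFlowRule(double threshold) { this.threshold = threshold; }

    // Passes only if the recorded count plus the new request stays within the threshold.
    boolean canPass(long currentPassQps, int acquireCount) {
        return currentPassQps + acquireCount <= threshold;
    }
}

class DemoFlowChecker {
    // Check every rule; the first rule that rejects the call raises the exception.
    static void check(List<DemoFlowRule> rules, long currentPassQps, int acquireCount) {
        for (DemoFlowRule rule : rules) {
            if (!rule.canPass(currentPassQps, acquireCount)) {
                throw new DemoFlowException("QPS threshold " + rule.threshold + " exceeded");
            }
        }
    }
}
```

With a threshold of 10, a request arriving when 9 have already passed in the current interval is admitted, and a request arriving when 10 have passed is rejected.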
5.5 Summary of the chain of responsibility
After the explanation above, look at the figure below again. Is it much clearer now?
(quoted from the Sentinel website)
NodeSelectorSlot obtains the node for the resource and builds the node call tree, organizing the call links of @SentinelResource resources as a tree of nodes. ClusterBuilderSlot creates the ClusterNode for the current node, aggregating the nodes that the same resource has in different Contexts; the subsequent flow control check is based on this ClusterNode.
ClusterNode inherits from StatisticNode and records statistics about the processing of the corresponding resource. StatisticSlot updates the counts of resource calls, which are used in the subsequent flow control check. FlowSlot decides whether to limit the call according to the call counts of the node for the resource. At this point, the execution logic of Sentinel's chain of responsibility is complete.
6. Sentinel's cleanup work
Whether the execution succeeds, fails, or is blocked, the Entry.exit() method is executed. Let's look at this method.
- Check whether the Entry being exited is the current entry of the current Context;
- If it is not, the Entry itself is not exited; instead, the Context's current entry and all of its parent entries are exited, and an exception is thrown;
- If it is (the normal case), first exit all the slots of the chain of responsibility for the current Entry. In this step, StatisticSlot updates the node's success count and RT;
- Set the current entry of the Context to the parent of the exited Entry;
- If the parent of the exited Entry is null and the Context is the default Context, the default Context is exited automatically (the ThreadLocal is cleared);
- Clear the exited Entry's reference to the Context.
By reading Sentinel's source code, we can clearly understand Sentinel's flow control process. The source code reading above is summarized as follows:
- Context, Entry and Node are the core components of Sentinel; they hold the various pieces of information about resource calls;
- The chain of responsibility pattern is used to carry out Sentinel's statistics, circuit breaking, flow control and other operations;
- In the chain, NodeSelectorSlot selects the node for the current resource and builds the node call tree;
- In the chain, ClusterBuilderSlot builds the ClusterNode for the current node, which aggregates the nodes of the same resource across different Contexts;
- In the chain, StatisticSlot counts the current resource calls and updates the statistics of the node and its ClusterNode;
- In the chain, FlowSlot applies flow control according to the statistics of the ClusterNode (by default) of the current node;
- Resource call statistics (such as passQps) are collected with sliding time windows;
- After all the work is finished, the exit process runs, records the remaining statistics, and cleans up the Context.
By Sun Yi