In the last chapter, we covered the first point: how to obtain enough data. The next step is to build a mechanism for processing that information flexibly, so that it supplies the raw material for automatically analyzing and capturing risk events, which are then evaluated with the help of a rule engine.
Before we start, let's review the four aspects of business risk control:
1. Get enough data
2. Build an analysis platform flexible enough to analyze that data
3. Output risk events and use them to block the risk
4. Quantify the value of risk interception, and keep analyzing cases to optimize the strategy
For this step, as before, three things need to be considered:
1. Let analysts quickly query the original logs
Logs are not simply stored and forgotten. In risk control analysis, searching a long time span by IP, user name, device and other dimensions is a very frequent operation. There is also a need to search by specific conditions within certain types of logs, such as order logs or payment logs.
The main goal is to let analysts reconstruct a risk case quickly. For example, customer service reports a stolen account, and we need to find out from the logs what the user did during the period of the theft. If a query interface exists for this, it is obviously faster than asking the analyst to grep through a large number of files, and the barrier to learning is much lower.
If the logs have been standardized, they can also be translated into business language afterwards, turning obscure log fields into terms that ordinary employees understand. This also greatly speeds up log reading when analysts reconstruct a case.
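The query-by-dimension and business-language ideas above can be sketched in a few lines. This is only an illustration in Python: the log records, the field names (ts, type, user, ip, device) and the label table are all assumptions made up for the example, and a real platform would sit on a proper search or storage system rather than an in-memory dictionary.

```python
from collections import defaultdict

# Hypothetical standardized log records; field names are illustrative only.
LOGS = [
    {"ts": 1700000000, "type": "login", "user": "alice",
     "ip": "203.0.113.7", "device": "d-01"},
    {"ts": 1700000300, "type": "payment", "user": "alice",
     "ip": "198.51.100.9", "device": "d-77"},
    {"ts": 1700000400, "type": "login", "user": "bob",
     "ip": "203.0.113.7", "device": "d-02"},
]

# Dimensions analysts search by most often.
INDEXED_FIELDS = ("user", "ip", "device")


def build_index(logs):
    """Map (field, value) -> records, so a lookup avoids scanning every log."""
    index = defaultdict(list)
    for record in logs:
        for field in INDEXED_FIELDS:
            if field in record:
                index[(field, record[field])].append(record)
    return index


def query(index, field, value, start=None, end=None, log_type=None):
    """Return matching records in time order, with optional time and type filters."""
    hits = [
        r for r in index.get((field, value), [])
        if (start is None or r["ts"] >= start)
        and (end is None or r["ts"] <= end)
        and (log_type is None or r["type"] == log_type)
    ]
    return sorted(hits, key=lambda r: r["ts"])


# Hypothetical mapping from raw log types to business language.
TYPE_LABELS = {"login": "Logged in", "payment": "Made a payment"}


def describe(record):
    """Render one record as a sentence a non-engineer can read."""
    action = TYPE_LABELS.get(record["type"], record["type"])
    return f"{record['user']}: {action} from IP {record['ip']} on device {record['device']}"
```

With an index like this, reconstructing what a stolen account did becomes a call such as query(index, "user", "alice", start=..., end=...), and describe turns each hit into a line that customer service can read directly.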
2. Process information into variables and profiles, in real time or on a schedule
For example, when analyzing a stolen-account case, it is often necessary to compare the IP addresses used to log in during the theft with the IP addresses the user has commonly used in the past. Even if we can query the original logs quickly, filtering out all of a user's historical login IPs and checking what share of that history the suspicious IP accounts for is very time-consuming work.
As another example, when the risk control engine automatically determines whether the user's current login IP is a common one, running an aggregation query over the original logs every time is also a very "expensive" operation.
If you predefine these variables and compute them in advance, you save a great deal of time for both the rule engine and the people using it. The computation method differs according to the nature of the variable, and fortunately there is a simple standard for telling them apart: variables that are used frequently and are time-sensitive (such as access frequency and time intervals) are computed in real time, while variables that are used relatively infrequently and are not time-sensitive are computed on a schedule (such as a user's commonly used IPs and devices, where leaving out the most recent login records does little harm).
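To make the two kinds of variables concrete, here is a minimal Python sketch. The class and function names are invented for this example, and everything is kept in memory; a production system would more likely use a stream processor for the first kind and a scheduled batch job for the second.

```python
from collections import Counter, defaultdict, deque


class SlidingWindowCounter:
    """Real-time variable: events per key within the last `window` seconds."""

    def __init__(self, window):
        self.window = window
        self.events = defaultdict(deque)  # key -> timestamps, oldest first

    def add(self, key, ts):
        self.events[key].append(ts)

    def count(self, key, now):
        q = self.events[key]
        # Drop timestamps that have fallen out of the window.
        while q and q[0] <= now - self.window:
            q.popleft()
        return len(q)


def common_ips(login_history, top_n=3):
    """Scheduled variable: each user's most frequent login IPs.

    `login_history` is an iterable of (user, ip) pairs, e.g. the input
    of a nightly batch job.
    """
    per_user = defaultdict(Counter)
    for user, ip in login_history:
        per_user[user][ip] += 1
    return {
        user: [ip for ip, _ in counts.most_common(top_n)]
        for user, counts in per_user.items()
    }


def ip_share(login_history, user, ip):
    """Fraction of a user's historical logins that came from `ip`."""
    ips = [i for u, i in login_history if u == user]
    return ips.count(ip) / len(ips) if ips else 0.0
```

The counter is updated on every request and answers "how many logins from this IP in the last minute" immediately. The common-IP table can be rebuilt on a schedule, because a few hours of missing logins barely change the result.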
3. Choose a rule engine to run human-written policies automatically
An elegantly designed rule engine is the core module that turns analysts' experience, decisions and data into risk output. First of all, why do we need a rule engine instead of hard-coded logic?
The author has encountered this scenario countless times: a strategy goes live in the morning, and within an hour the attacker or fraudster is already trying to bypass it. If your risk control logic is hard-coded, congratulations: you get to go through the whole develop, test and release process again.
Quick response is the lifeline of security. I can't imagine anything more frustrating than being beaten up by an attacker for 48 hours and only then getting around to blocking them.
Therefore, the policy engine must decouple policy logic from business logic, so that defenders can configure rules flexibly, verify them in silent mode, have them take effect in real time, and adjust them at any time.
There are many open source frameworks of this kind, each with its own advantages and disadvantages. However, if you want to flatten the learning curve, you will have to wrap the framework in a layer of your own (this is a larger topic, so I will skip it for now).
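As a rough illustration of decoupling policy from code, the sketch below treats rules as plain data and supports a silent mode. It is not the design of any particular open source engine; the rule fields (var, op, value, action, silent) and the variable names are assumptions chosen for the example.

```python
import operator

# Comparison operators a rule is allowed to use.
OPS = {
    ">": operator.gt,
    ">=": operator.ge,
    "<": operator.lt,
    "==": operator.eq,
    "!=": operator.ne,
}

# Hypothetical rule set. In practice this would be loaded from a config
# store, so it can change without a code release.
RULES = [
    {"name": "too_many_logins", "var": "login_count_1m", "op": ">",
     "value": 10, "action": "block", "silent": False},
    {"name": "uncommon_ip", "var": "ip_is_common", "op": "==",
     "value": False, "action": "challenge", "silent": True},
]


def evaluate(rules, variables):
    """Return (enforced_actions, silent_hits) for one event.

    Rules in silent mode are recorded but never enforced, so a new policy
    can be observed on live traffic before it takes effect.
    """
    enforced, silent_hits = [], []
    for rule in rules:
        if rule["var"] not in variables:
            continue  # variable not available: skip rather than guess
        if OPS[rule["op"]](variables[rule["var"]], rule["value"]):
            if rule["silent"]:
                silent_hits.append(rule["name"])
            else:
                enforced.append(rule["action"])
    return enforced, silent_hits
```

Because the rules are data, tightening a threshold or switching a rule out of silent mode is a configuration change rather than a release. The business code only ever calls evaluate with the precomputed variables for the current request.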
Pitfalls to watch for:
1. Sharding can affect your strategy
To support concurrency and performance, we usually shard the data when computing variables on a cluster.
For example, sharding by IP distributes the data to different computing units according to IP. When reading results, the request goes to the one machine in the cluster that owns that IP, which greatly improves both concurrent processing and the speed of reading computed results.
Now, if I want to fetch data by user instead, I will find that one user's information under different IP addresses is stored on different servers. Sharding along a single dimension is therefore not enough, and this must be taken into account in the design.
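One possible way around this pitfall, sketched below in Python, is to write each event once per lookup dimension, so that both "by IP" and "by user" reads land on a single shard. This is only one option and it costs extra storage; the ShardedStore class and its in-memory shards are hypothetical stand-ins for real cluster nodes.

```python
import hashlib


def shard_for(key, num_shards):
    """Pick a shard from a stable hash of the key.

    Python's built-in hash() is salted per process for strings, so a
    digest is used to keep placement consistent across machines.
    """
    digest = hashlib.md5(key.encode("utf-8")).hexdigest()
    return int(digest, 16) % num_shards


class ShardedStore:
    """Stores each event once per dimension, so a lookup by any
    configured dimension reads from exactly one shard."""

    def __init__(self, num_shards, dimensions=("ip", "user")):
        self.num_shards = num_shards
        self.dimensions = dimensions
        self.shards = [dict() for _ in range(num_shards)]  # stand-ins for nodes

    def write(self, event):
        for dim in self.dimensions:
            key = f"{dim}:{event[dim]}"
            shard = self.shards[shard_for(key, self.num_shards)]
            shard.setdefault(key, []).append(event)

    def read(self, dim, value):
        key = f"{dim}:{value}"
        return self.shards[shard_for(key, self.num_shards)].get(key, [])
```

With IP-only sharding, read("user", ...) would have to fan out to every node. Here the write path pays for the second copy so that the rule engine's reads stay cheap.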
2. The variables used in a strategy should not be computed on the spot
Some simple policy engine designs compute variables on the spot, directly against the database. This greatly improves flexibility (new variables need no backfilling of historical data), but it badly hurts stability and response time. When business requests burst, the result can be downtime or an unresponsive service.
Keep in mind that business R&D teams are not especially sensitive to what security delivers. But if risk control makes the application unstable and causes trouble for them, it is only a matter of time before it gets abandoned. Therefore, we must do our best to compute variables in advance and set up a caching mechanism.
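A minimal sketch of the "compute in advance and cache" idea follows. The VariableCache class is hypothetical and lives in process memory; a shared cache service would be the more likely choice in production. The point it illustrates is that the request path only reads from the cache and falls back to a default, never to an aggregation query against the database.

```python
import time


class VariableCache:
    """In-memory cache of precomputed variables with a time-to-live."""

    def __init__(self, ttl_seconds, clock=time.monotonic):
        self.ttl = ttl_seconds
        self.clock = clock  # injectable so expiry can be tested
        self.data = {}  # (name, key) -> (value, expires_at)

    def put(self, name, key, value):
        """Called by the real-time or scheduled jobs that compute variables."""
        self.data[(name, key)] = (value, self.clock() + self.ttl)

    def get(self, name, key, default=None):
        """Called on the request path: return the cached value, or
        `default` if it is missing or expired. Never touches the database."""
        entry = self.data.get((name, key))
        if entry is None or entry[1] <= self.clock():
            return default
        return entry[0]
```

A rule that needs a user's common IPs would call cache.get("common_ips", user, default=[]) and treat a miss as "unknown", which keeps response time predictable even when traffic bursts.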
3. Fully understand the computing resources that risk analysis requires
It is no exaggeration to say that the real-time and near-real-time computation needed by a competent risk analysis exceeds the sum of all computation in the application itself, possibly several times over.
This is easy to understand. In a typical login scenario, for example, the main thing the business logic does is check whether the password matches the account. Risk control, however, needs to pull the full historical profile of the user logging in, and then decide whether to let the request through according to the risk control strategy. So when planning the resources for risk analysis, please do not be stingy: estimate the requirements at 5x or even 10x those of the business itself.
If information collection mainly tests the communication and coordination skills of the security product manager, designing the risk analysis function is more a test of the security product manager's logical thinking.
At this stage, the miscellaneous external communication and coordination is over, but making the most of the foundation laid earlier requires a very clear understanding of the risk analysis and decision-making process. There is also a good test of the result:
If the platform is poorly designed, only its designer can use it;
If it is well designed, you will find that the customer service staff and analysts who handle complaints are happy to use your analysis platform to solve their problems.
About the author
Liu Ming, co-founder and chief product technology officer of Ma’an Technology
He has more than 6 years of experience in risk control and product work. He previously worked at NetEase, where he was responsible for account system security for World of Warcraft in China. He now leads the risk control team of Huaan Internet business, providing customers with risk control services including the flagship products Warden and RED.Q.