Pitfalls on the road to building a risk control system 02 – risk analysis – experience sharing from a CPO


In the previous chapter, “Pitfalls on the road to building a risk control system 01 – information collection”, we covered the first point: how to get enough data. The next step is to build a mechanism that handles this information flexibly, provides the raw material for automatically analyzing and capturing risk events, and then analyzes those events with the help of a rule engine.

Before we start, let’s review the four main tasks of business risk control:

1. Get enough data
2. Build an analysis platform flexible enough to analyze the data
3. Output risk events to block the risk
4. Quantify the value of risk interception and keep analyzing cases to optimize the strategy

Next, there are three things to consider:

1、 Let analysts query the raw logs quickly

Logs should not simply be saved and left alone. In risk control analysis, searching over a long time span by dimensions such as IP, user name, or device is a very high-frequency operation, and so is searching within a specific type of log, for example querying the order log or payment log by specific conditions.

The main purpose is to let analysts quickly reconstruct risk cases. For example, when customer service passes on an account theft case, the analyst needs to find out from the logs what the user did during the period in which the account was stolen. If there is a query interface for this, it is obviously much faster than running grep over a large number of files, and the learning threshold is much lower.

If the logs have been standardized, a business-language translation can be layered on top, converting obscure log fields into terms that ordinary employees can understand. This also greatly speeds up analysts’ reading of the logs when they reconstruct a case.
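A minimal sketch of what such a query interface and translation layer might look like, assuming standardized log records held as Python dicts. The field names (`uid`, `ip`, `dev`, `evt`, `ts`), the labels, and the sample records are all hypothetical and made up for illustration; a real platform would sit on a search index or database rather than an in-memory list.

```python
from datetime import datetime

# Hypothetical mapping from raw log field names to business-language labels.
FIELD_LABELS = {
    "uid": "User name",
    "ip": "IP address",
    "dev": "Device ID",
    "evt": "Action",
    "ts": "Time",
}

# Hypothetical mapping from raw event codes to readable actions.
EVENT_LABELS = {
    "login": "Logged in",
    "pay": "Paid for an order",
    "pwd_chg": "Changed password",
}

# Made-up sample records, standing in for a standardized log store.
SAMPLE_LOGS = [
    {"ts": datetime(2024, 1, 1, 9, 0), "uid": "alice", "ip": "10.0.0.1", "evt": "login"},
    {"ts": datetime(2024, 1, 1, 9, 5), "uid": "alice", "ip": "10.0.0.1", "evt": "pwd_chg"},
    {"ts": datetime(2024, 1, 1, 9, 7), "uid": "bob", "ip": "10.0.0.2", "evt": "login"},
    {"ts": datetime(2024, 1, 2, 8, 0), "uid": "alice", "ip": "10.0.0.3", "evt": "pay"},
]


def query_logs(logs, start, end, **conditions):
    """Return records with start <= ts < end whose fields equal every condition."""
    return [
        rec for rec in logs
        if start <= rec["ts"] < end
        and all(rec.get(field) == value for field, value in conditions.items())
    ]


def to_business_language(record):
    """Translate one raw record into labels an ordinary employee can read."""
    readable = {}
    for field, value in record.items():
        label = FIELD_LABELS.get(field, field)  # unknown fields keep their raw name
        if field == "evt":
            value = EVENT_LABELS.get(value, value)
        elif field == "ts":
            value = value.strftime("%Y-%m-%d %H:%M:%S")
        readable[label] = value
    return readable


# What did "alice" do on the day the account was reported stolen?
hits = query_logs(SAMPLE_LOGS, datetime(2024, 1, 1), datetime(2024, 1, 2), uid="alice")
for hit in hits:
    print(to_business_language(hit))
```

Keeping the label tables as plain data means new log fields can be made readable without touching the query code.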

2、 Process messages into variables and profiles with real-time or scheduled computation

For example, when analyzing an account theft case, it is often necessary to compare the IP addresses used to log in during the theft period with the IP addresses the user has commonly used in the past. Even if the raw logs can now be queried quickly, filtering out all of a user’s historical login IPs and working out what share of that history the suspicious IP accounts for is very time-consuming work.

As another example, when the risk control engine automatically judges whether a user’s current login IP is a common IP, querying and aggregating the raw logs every time is also a very “expensive” operation.

If we predefine these variables and calculate them in advance, we save a lot of time for both the rule engine and the analysts. The calculation method depends on the properties of each variable, but fortunately there is a clear criterion: variables that are used frequently and are time-sensitive are computed in real time (such as access frequency or time interval); variables that are used relatively infrequently and are not time-sensitive are computed on a fixed schedule (such as the user’s commonly used IPs and devices, which are not greatly affected even if a short stretch of recent login records is missing).
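The two kinds of variable can be illustrated with a rough sketch: a sliding-window counter for a real-time variable (access frequency) and a batch aggregation for a scheduled variable (commonly used IPs). The class and function names, the 60-second window, and the `min_share` threshold are assumptions chosen for illustration, not something the article prescribes. The counter also assumes timestamps arrive in non-decreasing order.

```python
from collections import Counter, defaultdict, deque


class SlidingWindowCounter:
    """Real-time variable: events per key within the last `window_seconds`."""

    def __init__(self, window_seconds):
        self.window_seconds = window_seconds
        self._events = defaultdict(deque)  # key -> timestamps, oldest first

    def _evict(self, events, now):
        # Drop timestamps that have fallen out of the window (now - window, now].
        cutoff = now - self.window_seconds
        while events and events[0] <= cutoff:
            events.popleft()

    def record(self, key, timestamp):
        """Register one event and return the count currently inside the window."""
        events = self._events[key]
        events.append(timestamp)
        self._evict(events, timestamp)
        return len(events)

    def count(self, key, now):
        """Return the count inside the window without registering a new event."""
        events = self._events.get(key)
        if events is None:
            return 0
        self._evict(events, now)
        return len(events)


def common_ips(login_history, min_share=0.2):
    """Scheduled variable: per user, IPs with at least `min_share` of all logins.

    `login_history` is an iterable of (user, ip) pairs, e.g. one day's batch.
    """
    by_user = defaultdict(Counter)
    for user, ip in login_history:
        by_user[user][ip] += 1
    result = {}
    for user, counts in by_user.items():
        total = sum(counts.values())
        result[user] = {ip for ip, n in counts.items() if n / total >= min_share}
    return result


# Real-time: update on every request as it arrives.
frequency = SlidingWindowCounter(window_seconds=60)
frequency.record("203.0.113.7", timestamp=0)
frequency.record("203.0.113.7", timestamp=10)

# Scheduled: recompute from history in a periodic batch job.
history = [("alice", "10.0.0.1")] * 9 + [("alice", "10.0.0.2")]
print(common_ips(history))  # {'alice': {'10.0.0.1'}}
```

The rule engine then only reads the finished values, which is what keeps both kinds of lookup cheap at decision time.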

3、 Choose a rule engine to run human-defined policies automatically

An elegant rule engine is the core module that turns analysts’ experience-based decisions, together with the data, into risk output. First, why do we need a rule engine rather than hard-coded logic?

The author has encountered this scenario countless times: a strategy goes live in the morning, and within an hour the attacker or fraudster is already trying to bypass it. If your risk control logic is hard-coded, congratulations: you get to go through the development, test, and release process all over again.

Quick response is the lifeline of security. It is hard to imagine anything more frustrating than taking a beating for 48 hours before you can finally react and block the attack.

Therefore, the policy engine must decouple policy logic from business logic, so that defenders can flexibly configure rules, verify them online in silent mode before they take effect, and adjust them at any time.
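As a rough illustration of this decoupling, the sketch below keeps rules as data (here a JSON string, which could equally come from a config service) and evaluates them against precomputed variables. The rule schema, the variable names, and the `mode` field with its `silent` and `active` values are assumptions made for this example; it is not the API of any particular open source rule engine.

```python
import json
import operator

# Comparison operators a rule may reference by name.
OPERATORS = {
    "eq": operator.eq,
    "ne": operator.ne,
    "gt": operator.gt,
    "ge": operator.ge,
    "lt": operator.lt,
    "le": operator.le,
}

# Hypothetical rule configuration; editing it requires no code release.
RULES_JSON = """
[
  {"name": "login_burst", "variable": "logins_per_minute", "op": "gt", "value": 20,
   "action": "block", "mode": "active"},
  {"name": "uncommon_ip", "variable": "ip_is_common", "op": "eq", "value": false,
   "action": "challenge", "mode": "silent"}
]
"""


def evaluate(rules, variables):
    """Return (actions to enforce, names of silent rules that matched)."""
    enforced, silent_hits = [], []
    for rule in rules:
        if rule["variable"] not in variables:
            continue  # missing variable: the rule does not fire
        compare = OPERATORS[rule["op"]]
        if not compare(variables[rule["variable"]], rule["value"]):
            continue
        if rule["mode"] == "silent":
            silent_hits.append(rule["name"])  # recorded only; the request is unaffected
        else:
            enforced.append(rule["action"])
    return enforced, silent_hits


rules = json.loads(RULES_JSON)
print(evaluate(rules, {"logins_per_minute": 50, "ip_is_common": False}))
# (['block'], ['uncommon_ip'])
```

A silent rule produces hit records that analysts can review for false positives; switching its `mode` to `active` is then a configuration change rather than a release.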

There are many open source frameworks of this kind, each with its own advantages and disadvantages. However, if we want to flatten the learning curve, we have to add a wrapper layer of our own (this is another big topic, which we will skip for now).

Pitfalls:

1. Sharding will affect your strategy

To support concurrency and performance, we usually use sharding when computing variables on a cluster.

Sharding distributes data to different computing units by IP, and a read goes to a particular machine in the cluster according to the IP to fetch the data, which greatly improves the capacity for concurrent reads of the computed results.

Now, if I want to fetch the data by user, I will find that one user’s information under different IP addresses is stored on different servers. Sharding on a single key is therefore clearly not enough, and this must be kept in mind.
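One possible way around this, sketched below under stated assumptions, is to write a copy of each record per query dimension and route each copy by that dimension's own key. The `ShardedStore` class, the four-shard toy cluster, and the use of CRC32 as the routing hash are illustrative choices, not the author's implementation. The cost is extra storage and write traffic for every added dimension.

```python
import zlib
from collections import defaultdict

NUM_SHARDS = 4  # toy cluster size, chosen arbitrarily


def shard_for(key):
    """Map a key to a shard with a stable hash (built-in hash() of str is salted per process)."""
    return zlib.crc32(key.encode("utf-8")) % NUM_SHARDS


class ShardedStore:
    """Toy cluster: each shard maps (dimension, key) to a list of records."""

    def __init__(self, dimensions):
        self.dimensions = tuple(dimensions)
        self.shards = [defaultdict(list) for _ in range(NUM_SHARDS)]

    def write(self, record):
        # One copy per dimension, each routed by that dimension's value.
        for dim in self.dimensions:
            key = record[dim]
            self.shards[shard_for(key)][(dim, key)].append(record)

    def read(self, dim, key):
        """Fetch all records for one key from the single shard that owns it."""
        if dim not in self.dimensions:
            raise KeyError(f"store is not sharded by {dim!r}")
        return list(self.shards[shard_for(key)].get((dim, key), []))


store = ShardedStore(dimensions=("ip", "user"))
for ip in ("198.51.100.1", "198.51.100.2", "198.51.100.3"):
    store.write({"user": "alice", "ip": ip, "event": "login"})

# All of alice's records come back from one shard, whatever IPs she used.
print(len(store.read("user", "alice")))  # 3
```

A store created with `dimensions=("ip",)` only would raise `KeyError` on `read("user", "alice")`, which is the situation described above: the data exists, but it is scattered by IP and cannot be fetched with a single per-user lookup.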

2. The variables used in the strategy should not be calculated on the spot

Some policy engines are designed so that simple variables are calculated in the database on the spot. Although this greatly improves flexibility (new variables do not require backfilling historical data), it seriously hurts stability and response time. When business requests spike in particular, the engine may hardly respond at all.

Keep in mind that business R&D teams are not very sensitive to security outcomes, but if risk control makes the application unstable and causes trouble for people, it will sooner or later be abandoned. So variables must be calculated in advance as much as possible, and a caching mechanism must be in place.
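A caching layer for precomputed variables could look roughly like the sketch below. The `VariableCache` class, the five-minute time-to-live, and the choice to return a default on a miss (instead of falling back to a live database query on the request path) are assumptions for illustration; a production system would more likely use a shared cache service than a per-process dict.

```python
import time


class VariableCache:
    """In-memory cache of precomputed variables with a time-to-live per entry."""

    def __init__(self, ttl_seconds, clock=time.monotonic):
        self.ttl_seconds = ttl_seconds
        self._clock = clock  # injectable so tests can control time
        self._entries = {}   # (user, variable) -> (value, expires_at)

    def put(self, user, variable, value):
        """Store a value produced by the real-time or scheduled computation."""
        self._entries[(user, variable)] = (value, self._clock() + self.ttl_seconds)

    def get(self, user, variable, default=None):
        """Return the cached value, or `default` if it is missing or expired.

        The request path never queries the database here, so a traffic spike
        cannot turn into a spike of expensive aggregations.
        """
        entry = self._entries.get((user, variable))
        if entry is None:
            return default
        value, expires_at = entry
        if self._clock() >= expires_at:
            del self._entries[(user, variable)]
            return default
        return value


cache = VariableCache(ttl_seconds=300)
cache.put("alice", "common_ips", {"10.0.0.1"})

# At decision time the rule engine only does a dictionary lookup.
is_common = "10.0.0.1" in cache.get("alice", "common_ips", default=set())
print(is_common)  # True
```

How a rule should treat a missing variable (skip the rule, or treat the request as higher risk) is a policy decision the sketch leaves open.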

3. Fully understand the computing resources to be used in risk analysis

It is no exaggeration to say that the real-time and near-real-time computation needed for competent risk analysis exceeds the total computation of the application itself, possibly several times over.

This is easy to understand. In a typical login scenario, the main business logic is to check whether the account and password match, whereas risk control has to pull up the logged-in user’s entire history, go through all of it, and then decide whether to let the login through according to the risk control strategy. So when planning resources for risk analysis, do not be stingy: estimate the resource demand of risk analysis at 5x or even 10x that of the business itself.

If information collection mainly tests a security product manager’s communication and coordination skills, designing the risk analysis function is more a test of their logical thinking.

At this stage, the miscellaneous external communication and coordination is over, but making the most of the foundation laid earlier requires a very clear understanding of the risk analysis and decision-making process. Here is a good test of that:

If the analysis platform is poorly designed, only its designer can use it;

If it is well designed, you will find that customer service staff and analysts who handle complaints are happy to use your analysis platform to solve their problems.


Author introduction

Liu Ming, co-founder and chief product technology officer of Yao’an Technology
More than 6 years of experience in risk control and products. He previously worked at NetEase, where he was responsible for the security of the World of Warcraft account system in China. He now leads the risk control team for Yao’an’s Internet business, providing customers with risk control services including the flagship products Warden and RED.Q.