Alibaba Cloud Realtime Compute: case & solution summary
For an Internet product, typical risk control scenarios include registration risk control, login risk control, transaction risk control, promotion (activity) risk control, and so on. The best risk control prevents trouble before it happens, so of the three possible intervention points (before, during, and after an event), pre-event warning and in-event control work best.
This requires the risk control system to operate in real time.
This article introduces a real-time risk control solution.
1. Overall structure
Risk control is a product of business scenarios; the risk control system serves the business system directly. Related to it are the punishment system and the analysis system. Their relationships and roles are as follows:
- The business system, usually an app plus a backend or a web application, is the carrier of the Internet business; risk is triggered from the business system;
- The risk control system supports the business system: based on the data or tracking events sent from the business system, it judges whether the current user or event is risky;
- The punishment system, which the business system calls according to the risk control system's verdict, restricts or punishes risky users or events, e.g. adding a CAPTCHA, limiting logins, or forbidding orders;
- The analysis system supports the risk control system by measuring its performance from the data. For example, a sudden drop in a strategy's interception rate may mean that strategy has failed; if promotional items sell out unusually fast, the overall promotion strategy may have a problem. This system should also help operators and analysts discover new strategies;
Among these, the risk control system and the analysis system are the focus of this article. To simplify the discussion, we assume the following business scenario:
- E-commerce business;
The scope of risk control includes:
- Registration: fake registrations;
- Login: illegal account logins (e.g. credential stuffing);
- Transactions: theft of customer balances;
- Promotions: abuse of preferential offers by "wool-pullers" (freeloaders);
- Risk control implementation plan: in-event control, with the goal of intercepting abnormal events;
2. Risk control system
A risk control system can follow two technical routes: rules and models. Rules are simple, intuitive, interpretable, and flexible, so they have long been the mainstay of risk control systems. Their disadvantage is that they are easy to break: once guessed by the black market (professional fraud groups), they fail. Real-world risk control systems therefore often add a model-based link to increase robustness. For reasons of space, we focus only on a rule-based architecture; it also fully supports model-based risk control should the need arise.
A rule is a condition for judging an event. We assume several rules for registration, login, transactions, and promotions, for example:
- The user name is inconsistent with the name on the ID card;
- An IP has registered more than 10 accounts in the last hour;
- An account has logged in more than 5 times in the last 3 minutes;
- A group of accounts has recently bought more than 100 discounted items;
- An account has claimed more than 3 coupons in the last 3 minutes;
Rules can be combined into rule groups. For the sake of simplicity, we only discuss rules here.
A rule actually consists of three parts:
- The fact: the subject and attribute to be judged, such as the account and its login count, or the IP and its registration count, in the rules above;
- The condition: the judgment logic, e.g. a fact's attribute being greater than some indicator;
- The threshold: the basis of the judgment, such as the critical login count or the critical number of registered accounts;
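To illustrate, the three parts might be modeled as follows. This is only a minimal Python sketch, not part of the original design; the class and field names are my own:

```python
from dataclasses import dataclass
import operator

# Comparison operators a condition may use
OPS = {">": operator.gt, ">=": operator.ge, "<": operator.lt,
       "<=": operator.le, "==": operator.eq}

@dataclass
class Rule:
    fact: str         # subject/attribute to judge, e.g. "ip.register_count_1h"
    condition: str    # judgment logic, e.g. ">"
    threshold: float  # critical value, e.g. 10

    def hit(self, facts: dict) -> bool:
        """Return True if the rule fires (i.e. the event is risky)."""
        return OPS[self.condition](facts[self.fact], self.threshold)

# "An IP may register no more than 10 accounts in the last hour"
rule = Rule(fact="ip.register_count_1h", condition=">", threshold=10)
print(rule.hit({"ip.register_count_1h": 15}))  # True: 15 > 10, intercept
```

Representing the fact, condition, and threshold as data (rather than code) is what makes the dynamic adjustment discussed next possible.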
Rules can be written by operations experts based on experience, or mined by data analysts from historical data. But because rules can be guessed, and thus defeated, in the arms race against fraudsters, they invariably require dynamic adjustment.
Based on the above discussion, we design a risk control system scheme as follows:
The system has three data flows:
- The real-time risk control data flow, marked by the red line and called synchronously: the core risk control path;
- The quasi-real-time indicator data flow, marked by the blue line and written asynchronously: it prepares indicator data for the real-time risk control part;
- The quasi-real-time / offline analysis data flow, marked by the green line and written asynchronously: it supplies data for analyzing the risk control system's effectiveness;
This section first introduces the first two parts, and the analysis system is described in the next section.
2.1 Real-time risk control
Real time risk control is the core of the whole system, which is synchronously called by the business system to complete the corresponding risk control judgment.
As mentioned above, rules are usually written by people and need dynamic adjustment, so we separate the risk control judgment part from the rule management part. The rule management console serves operations staff, who perform the following tasks:
- Scenario management: decide whether a scenario enforces risk control; e.g. a promotion scenario can be switched off after the promotion ends;
- Black and white lists: lists compiled manually or by programs, used for direct filtering;
- Rule management: adding, deleting, or modifying rules, e.g. adding an IP check for logins or a frequency check for orders;
- Threshold management: managing the indicator thresholds. If the rule is that one IP may register no more than 10 accounts in the last hour, then 1 (hour) and 10 (accounts) are the thresholds;
With the management console covered, the logic of the rule judgment part becomes clear: pre-filtering, fact data preparation, and rule judgment.
2.1.1 Pre-filtering
After certain events are triggered (registration, login, placing an order, joining a promotion, etc.), the business system calls the risk control system synchronously, passing the relevant context such as the IP address and the event identifier. The rule judgment part first decides, based on the management console's configuration, whether to judge at all; if so, it filters against the black and white lists and, if the event passes, enters the next phase.
This part of the logic is very simple.
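Simple enough, in fact, to sketch in a few lines. The scenario names, lists, and return values below are illustrative assumptions, not from the original design:

```python
# Hypothetical pre-filter: a scenario switch plus black/white lists.
SCENARIO_ENABLED = {"register": True, "activity": False}  # promotion ended
BLACKLIST = {"1.2.3.4"}
WHITELIST = {"10.0.0.1"}

def pre_filter(scenario: str, ip: str) -> str:
    """Return 'skip', 'deny', or 'continue' (go on to rule judgment)."""
    if not SCENARIO_ENABLED.get(scenario, False):
        return "skip"      # risk control switched off for this scenario
    if ip in BLACKLIST:
        return "deny"      # filtered directly by the blacklist
    if ip in WHITELIST:
        return "skip"      # trusted, no further judgment needed
    return "continue"      # passes pre-filtering, enter the next phase

print(pre_filter("register", "8.8.8.8"))  # continue
```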
2.1.2 Real-time data preparation
Before making a judgment, the system must prepare some factual data, such as:
- In the registration scenario, if the rule is that a single IP may register no more than 10 accounts in the last hour, the system looks up, by IP address in Redis/HBase, the number of accounts that IP registered in the last hour, e.g. 15;
- In the login scenario, if the rule is that a single account may log in no more than 5 times in the last 3 minutes, the system looks up in Redis/HBase the account's login count over the last 3 minutes, e.g. 8;
How this data gets into Redis/HBase is introduced in Section 2.2, the quasi-real-time data flow.
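The lookup step can be sketched as follows. Here a plain dict stands in for Redis/HBase, and the key format is my own assumption:

```python
# A dict standing in for Redis/HBase; the key format is assumed.
counter_store = {
    "reg:ip:203.0.113.7:1h": 15,  # accounts registered from this IP, last hour
    "login:acct:u42:3m": 8,       # logins for this account, last 3 minutes
}

def fetch_fact(key: str) -> int:
    """Fetch a pre-aggregated indicator; a missing key counts as 0."""
    return counter_store.get(key, 0)

print(fetch_fact("reg:ip:203.0.113.7:1h"))  # 15 -> exceeds the threshold of 10
```

The essential point is that the risk control path only *reads* pre-aggregated counters; it never computes them from raw events.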
2.1.3 Rule judgment
After obtaining the fact data, the system judges it against the rules and thresholds, returns the result, and the process is over.
The whole flow is logically clear; what we usually call the "rule engine" works in this part. Generally there are two ways to implement it:
- Using a mature rule engine such as Drools. It integrates very well with the Java environment and supports many features, but it is cumbersome to use and has a steep learning curve (see the Drools article in the reference list below);
- Building on a dynamic language such as Groovy, which I will not elaborate on here (see the Groovy article in the reference list below);
Both schemes support dynamic updating of rules.
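To make "dynamic updating" concrete, here is a minimal sketch of the idea: rules live in a shared table that the management console can modify while judgments continue, with no restart. Real systems would use Drools or Groovy as above; all names here are illustrative:

```python
import threading

# rule id -> (fact key, threshold); a stand-in for the rule store
rules = {"reg_ip_1h": ("ip_register_count_1h", 10)}
lock = threading.Lock()

def update_threshold(rule_id: str, new_threshold: float) -> None:
    """Called by the management console: takes effect without a restart."""
    with lock:
        fact, _ = rules[rule_id]
        rules[rule_id] = (fact, new_threshold)

def judge(rule_id: str, facts: dict) -> bool:
    """Re-read the rule on every call, so updates apply immediately."""
    with lock:
        fact, threshold = rules[rule_id]
    return facts.get(fact, 0) > threshold

print(judge("reg_ip_1h", {"ip_register_count_1h": 12}))  # True: 12 > 10
update_threshold("reg_ip_1h", 20)                        # operations relaxes it
print(judge("reg_ip_1h", {"ip_register_count_1h": 12}))  # False: 12 <= 20
```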
2.2 Quasi-real-time data flow
This part is background logic: it serves the risk control system by preparing the fact data.
Separating data preparation from logical judgment is a decision made for performance and scalability.
As mentioned above, rule judgment requires indicators about the facts, such as the login count in the last hour or the number of accounts registered in the last hour. These indicators usually span a time window and involve state or aggregation; they are hard to compute from raw data inside the real-time risk control path, because the rule engine is typically stateless and does not record previous results.
At the same time, the raw data involved is very large, since the raw records of user activity must be transmitted and computed over, so this part is usually handled by a streaming big data system. Here we choose Flink, currently the undisputed leader in stream computing; in both performance and functionality it handles this job very well.
This part of the data flow is very simple:
- The business system sends its tracking (buried-point) data to Kafka;
- Flink subscribes to Kafka and completes the aggregation at atomic granularity;
Note: having Flink aggregate only at atomic granularity is tied to the dynamic adjustment of rules. For example, in the registration scenario, operations staff may, depending on observed effect, judge an IP's registered-account count over the last hour, the last three hours, or the last five hours; that is, the N in "last N hours" is adjusted dynamically. In that case Flink should only compute per-hour account counts, and the judgment step should read the last three or five one-hour counts according to the rule, aggregate them, and then judge. The reason lies in Flink's operating model: a job keeps running after submission, and changing its logic means stopping the job, modifying the code, and restarting, which is quite troublesome. Restarting also raises the question of whether Flink's intermediate state can be reused. So if Flink aggregated the N-hour totals directly, all of the above would have to be repeated every time N changed, sometimes with data backfilling as well, which is very cumbersome.
- Flink writes the aggregated indicator results to Redis or HBase for the real-time risk control system to query. Either store works; choose according to your scenario.
By separating data computation from logical judgment and introducing Flink, our risk control system can cope with a very large number of users.
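The "atomic granularity" idea from the note above can be sketched without Flink at all; here a dict stands in for the per-hour counters the stream job would maintain, and the function names are my own:

```python
from collections import defaultdict

# (ip, hour) -> registrations in that hour; what the stream job maintains
hour_buckets = defaultdict(int)

def record_registration(ip: str, hour: int) -> None:
    """What the Flink job would do: increment one atomic per-hour bucket."""
    hour_buckets[(ip, hour)] += 1

def count_last_n_hours(ip: str, now_hour: int, n: int) -> int:
    """What the rule side does: aggregate N atomic buckets at read time,
    so N can change dynamically without restarting the stream job."""
    return sum(hour_buckets[(ip, h)]
               for h in range(now_hour - n + 1, now_hour + 1))

# Simulate three hours of registrations from one IP: 4, 6, then 2
for h, c in [(7, 4), (8, 6), (9, 2)]:
    for _ in range(c):
        record_registration("203.0.113.7", h)

print(count_last_n_hours("203.0.113.7", 9, 1))  # 2
print(count_last_n_hours("203.0.113.7", 9, 3))  # 12: N changed, no job restart
```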
3. Analysis system
What we have so far is a complete risk control system from a static viewpoint, but it is lacking from a dynamic viewpoint; the lack is not in functionality but in evolution. That is, looking at a risk control system dynamically, we need at least two more things: a way to measure the system's overall effect, and a basis for upgrading the system's rules and logic.
In measuring the overall effect, we need to:
- Judge whether a rule has failed, e.g. its interception rate suddenly decreases;
- Judge whether a rule is redundant, e.g. it has never intercepted any event;
- Judge whether the rules have loopholes, e.g. after running a promotion or issuing vouchers, the benefits are claimed but the expected effect is not achieved;
In terms of providing rules / logic upgrade basis for the system, we need to:
- Discover global patterns. For example, someone's spending on electronics suddenly increasing 100-fold looks like a problem on its own, but viewed globally many people show the same behavior; it turns out Apple has released a new product;
- Recognize abnormal combinations of behaviors. Each behavior alone is normal, but the combination is not: buying a kitchen knife, a bus ticket, a rope, or gasoline is each normal on its own, but doing all of them within a short time is not;
- Group identification. For example, graph analysis finds a group of related accounts, and all of the group's accounts are labeled, preventing the situation where each account behaves normally but the group as a whole is abusing promotions.
This is the positioning of the analysis system. Some of its work is deterministic and some exploratory. To do this work, the system needs as much supporting data as possible, such as:
- Business system data: the business's tracking data, recording detailed user, transaction, or promotion activity;
- Risk control interception data: the risk control system's own tracking data; e.g. a user being intercepted by a certain rule while in a state with certain characteristics is itself an event record;
This is a typical big data analysis scenario, and the architecture is fairly flexible; what follows is only a suggestion.
Relatively speaking, this system is the most open-ended: besides fixed indicator analysis, machine learning and data analysis techniques can be applied to discover new rules or patterns. For reasons of space, this is not expanded on here.
References:
1. From the Drools rule engine to risk control and anti-money-laundering
2. A rule script engine based on Groovy
3. A rules-based risk control system
4. Practice of strict risk control at NetEase
5. Framework design and practice of the NetEase Kaola rule engine platform
6. An open-source Java risk control system
Author: Fu Kong
This is original content from the Yunqi Community and may not be reproduced without permission.