The implementation of a large-scale real-time risk control system based on Flink in Alibaba

Time: 2022-11-23

This article was compiled by community volunteer Zou Zhiye from a talk given by Li Jialin (Feng Yuan), real-time computing product manager at Alibaba Cloud, at the Flink Summit (CSDN Cloud Native Series) on July 5. The main contents are:

  1. Build a risk control system based on Flink
  2. Alibaba's risk control in practice
  3. Difficulties in large-scale risk control technology
  4. Alibaba Cloud FY23 Risk Control Evolution Plan


Flink currently serves essentially all business units (BUs) of the group. At the peak of Double Eleven it processes 4 billion records per second, running more than 30,000 computing jobs on more than 1 million cores in total. It covers almost all concrete businesses within the group, for example the data middle platform, AI middle platform, risk control middle platform, real-time operations and maintenance, and search and recommendation.


1. Build a risk control system based on Flink

Risk control is a broad topic that involves rule engines, NoSQL databases, CEP, and more. This chapter covers some basic concepts. On the big data side, we divide risk control into a 3 × 2 matrix:

  • 2 means that risk control is either rule-based or algorithm/model-based;
  • 3 means that there are three types of risk control: pre-event, in-event, and post-event.

1.1 Three types of risk control business


For in-event and post-event risk control, the client perceives the result asynchronously; for pre-event risk control, the client perceives it synchronously.

Pre-event risk control deserves a brief explanation. It stores the trained model or the precomputed data in a database such as Redis or MongoDB, and then serves requests in one of two ways:

  • One way is to run a rule engine such as Siddhi, Groovy, or Drools on the client side, which fetches data directly from Redis or MongoDB and returns the result;
  • The other way is based on Kubeflow KFServing, which returns results from the trained algorithms and models when the client sends a request.

Generally speaking, both methods have a latency of about 200 milliseconds, so they can be served as synchronous RPC or HTTP requests.

Big data scenarios built on Flink are asynchronous risk control requests. Their latency is very low, usually one or two seconds. If ultra-low latency is pursued, this can be regarded as a form of in-event risk control, in which the decision-making process is handled by machines.

One very common approach is to use Flink SQL for metric threshold statistics and Flink CEP for behavior sequence rule analysis. Another is TensorFlow on Flink: the algorithm is described in TensorFlow, and Flink executes the TensorFlow computation.

1.2 Flink is the best choice for rule-based risk control

At present, Flink is the best choice for risk control within Alibaba Group, for three main reasons:

  • Event-driven
  • Millisecond-level latency
  • Stream-batch unification


1.3 Three elements of rule-based risk control

Rule-based risk control has three elements, and everything discussed later revolves around them:

  • Facts: the risk control events, which may come from business parties or from log instrumentation, and are the input of the entire risk control system;
  • Rules: usually defined by the business side, a rule states the business goal to be met;
  • Threshold: describes the severity level corresponding to a rule.


1.4 Enhancing Flink's rule expressiveness

In Flink, rules fall into two types: stateless rules and stateful rules. Stateful rules are the core of Flink-based risk control:

  • Stateless rules: mainly used for data ETL. In one scenario, the risk control action is triggered when a field value of an event is greater than X. In another, the downstream of the Flink job is model-based or algorithm-based risk control, so no rule judgment is needed on the Flink side; instead, Flink vectorizes and normalizes the data (for example with multi-stream joins and CASE WHEN expressions), turns it into 0/1 vectors, and pushes it to the downstream TensorFlow for prediction.
  • Stateful rules:

    • Statistical rules: rules computed from statistical analysis. For example, if the number of visits within 5 minutes exceeds 100, risk control is triggered;
    • Sequential rules: in an event sequence, an event affects the events before and after it. For example, the three events click, add to cart, and delete form a continuous behavior sequence that is special: it may indicate malicious lowering of a merchant's product rating, even though none of the three events is a risk control event on its own. Alibaba Cloud's Realtime Compute for Apache Flink has improved its sequence rule capabilities to safeguard e-commerce transaction scenarios on the cloud and within the group;
    • Hybrid rules: a combination of statistical and sequential rules.
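
The three rule types above can be sketched in plain Python, without Flink, to make the difference in state explicit. This is a minimal illustration only: the class names, the event fields, and the strict "no other events in between" matching are assumptions made for the example, not the behavior of Flink SQL or Flink CEP.

```python
from collections import defaultdict, deque


def stateless_rule(event, threshold):
    """Stateless: the decision depends on the current event alone."""
    return event["amount"] > threshold


class StatisticalRule:
    """Stateful: fires when a user has more than max_visits visits in the last window_s seconds."""

    def __init__(self, window_s=300, max_visits=100):
        self.window_s = window_s
        self.max_visits = max_visits
        self.visits = defaultdict(deque)  # user -> timestamps still inside the window

    def on_event(self, user, ts):
        q = self.visits[user]
        q.append(ts)
        # Evict timestamps that have fallen out of the sliding window.
        while q and q[0] <= ts - self.window_s:
            q.popleft()
        return len(q) > self.max_visits


class SequentialRule:
    """Stateful: fires when a user's events match the pattern consecutively and in order.

    Assumes the pattern steps are distinct, so a simple restart on mismatch is enough.
    """

    def __init__(self, pattern=("click", "add_to_cart", "delete")):
        self.pattern = pattern
        self.progress = defaultdict(int)  # user -> number of pattern steps matched so far

    def on_event(self, user, action):
        i = self.progress[user]
        if action == self.pattern[i]:
            i += 1
        else:
            # Restart; the current event may itself begin a new match.
            i = 1 if action == self.pattern[0] else 0
        if i == len(self.pattern):
            self.progress[user] = 0
            return True
        self.progress[user] = i
        return False
```

A hybrid rule would combine the two stateful classes, for example by counting how often the sequential rule fires for a user inside a time window.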


2. Alibaba's risk control in practice

This chapter describes how Alibaba satisfies the three elements of risk control mentioned above from an engineering perspective.


From an overall technical perspective, the system is currently divided into three modules: perception, processing, and insight:

  • Perception: the goal is to perceive all anomalies and find problems early, for example by capturing data whose distribution differs from the common data distribution and outputting a list of such anomalies. As another example, helmet sales rose in a certain year because of a change in riding policy, and the click-through rate and conversion rate of related products rose with them. Such a situation needs to be captured in time, because it is normal behavior rather than cheating;
  • Processing: how the rules are enforced. There are now three lines of defense: hourly, real-time, and offline. Compared with matching a single policy as before, accuracy is higher after correlation and integration, for example by correlating a user's continuous behavior over a recent period to make a comprehensive judgment;
  • Insight: the goal is to find risk behaviors that are not currently perceived and cannot be described directly by rules. For example, samples need to be represented in a highly abstract way so that features can be found in those dimensions to identify new kinds of anomalies.

2.1 Phase 1: SQL real-time association & real-time statistics

In this phase, risk control is evaluated with a SQL-based system that uses simple SQL for real-time joins and statistics. For example, with the aggregation SUM(amount) > 50, the rule is SUM(amount) and the threshold for that rule is 50. Suppose that rules with the four thresholds 10, 20, 50, and 100 run online at the same time. Because a single Flink SQL job can execute only one rule, four Flink jobs must be requested for these four thresholds. The advantage is that the development logic is simple and job isolation is high; the disadvantage is that computing resources are heavily wasted.
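
The cost of this phase can be seen in a small simulation. The sketch below is plain Python rather than Flink SQL; `run_job` and the sample events are made up for illustration. Each "job" repeats the same SUM(amount) aggregation and differs only in the threshold it applies.

```python
def run_job(events, threshold):
    """One 'job': aggregates SUM(amount) per user and applies a single fixed threshold."""
    totals = {}
    for user, amount in events:
        totals[user] = totals.get(user, 0) + amount
    return {user for user, total in totals.items() if total > threshold}


events = [("a", 30), ("b", 5), ("a", 40), ("c", 120), ("b", 10)]
thresholds = [10, 20, 50, 100]

# Phase 1: four thresholds mean four jobs, each re-reading and re-aggregating the same events.
alerts = {t: run_job(events, t) for t in thresholds}
```

The aggregation work is identical in all four runs; only the final comparison changes, which is the waste the later phases remove.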


2.2 Phase 2: Broadcast Stream

The main problem with the risk control rules in Phase 1 is that the rules and thresholds are immutable. The Flink community already offers some solutions, such as one based on BroadcastStream. In the figure below, the Transaction Source handles event ingestion and the Rule Source is a BroadcastStream; when a new threshold arrives, it is broadcast to every operator through the BroadcastStream.

[Figure: Transaction Source and Rule Source (BroadcastStream) feeding the operators]

For example, a rule may flag a risk control object that makes more than 10 consecutive visits within one minute, but during 618 or Double 11 the threshold may need to be raised to 20 or 30 before the online systems downstream of the risk control system are notified.

In Phase 1 there are only two options: the first is to keep all the jobs running online; the second is to stop a Flink job at some point and start a new job with the new threshold.

With BroadcastStream, rule thresholds can be distributed to the running job, so online thresholds can be modified directly without restarting the job.
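
The broadcast idea can be sketched without Flink as follows. This is a simplified simulation, not the Flink broadcast state API: `ThresholdOperator`, `run`, and the tuple-based stream format are hypothetical, and the one-minute window is omitted to keep the example short.

```python
import zlib


class ThresholdOperator:
    """One parallel instance of the rule operator."""

    def __init__(self, threshold):
        self.threshold = threshold  # plays the role of broadcast state
        self.counts = {}            # per-user visit counts (keyed state)

    def on_rule(self, new_threshold):
        self.threshold = new_threshold

    def on_event(self, user):
        self.counts[user] = self.counts.get(user, 0) + 1
        return self.counts[user] > self.threshold


def run(stream, parallelism=2, initial_threshold=10):
    """Processes a merged stream of ("event", user) and ("rule", threshold) items."""
    operators = [ThresholdOperator(initial_threshold) for _ in range(parallelism)]
    alerts = []
    for kind, payload in stream:
        if kind == "rule":
            # Broadcast: every parallel instance receives the new threshold.
            for op in operators:
                op.on_rule(payload)
        else:
            # Keyed routing: each user is always handled by the same instance.
            op = operators[zlib.crc32(payload.encode("utf-8")) % parallelism]
            if op.on_event(payload):
                alerts.append(payload)
    return alerts
```

With the default threshold, the eleventh visit of a user raises an alert; if a `("rule", 20)` item arrives first, the same eleven visits raise none, and no operator had to be restarted.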

2.3 Phase 3: Dynamic CEP

The main problem in Phase 2 is that only metric thresholds can be updated. Although this is a great convenience for business systems, it is still hard to satisfy upper-level businesses. There are two main demands: combining with CEP to perceive behavior sequences, and, once CEP is used, still being able to dynamically modify the thresholds and even the rules themselves.

In Phase 3, Alibaba Cloud Flink built a high-level abstraction around CEP that decouples CEP rules from CEP execution nodes. The rules can be stored in external third-party storage such as RDS or Hologres; after the CEP job is published, it loads the CEP rules from the database and replaces them dynamically, which makes the job more expressive.

The job also becomes more flexible. For example, to watch certain behaviors in a particular app and update the metric threshold for those behaviors, you can update the CEP rules through the third-party storage rather than through Flink itself.

Another advantage is that the rules can be exposed to upper-level business parties, so that the business side actually writes the risk control rules and the platform becomes a true rule center. This is the benefit of the dynamic CEP capability. The dynamic CEP capability is integrated in the latest version of Alibaba Cloud's service, and Alibaba Cloud's fully managed Flink service greatly shortens the development cycle of risk control scenarios.
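
The decoupling of rules from the execution node can be illustrated with a small sketch. It does not use Alibaba Cloud's dynamic CEP interface, which the talk does not detail; `RuleStore` is a hypothetical in-memory stand-in for an external rule table, and `DynamicMatcher` is an assumed, simplified matcher that checks the rule version before handling each event.

```python
class RuleStore:
    """Stand-in for an external table (for example in RDS or Hologres) holding versioned rules."""

    def __init__(self):
        self.version = 0
        self.pattern = ()

    def publish(self, pattern):
        self.version += 1
        self.pattern = tuple(pattern)


class DynamicMatcher:
    """Matches consecutive event sequences; reloads the pattern when the store version changes.

    Assumes the steps of a pattern are distinct.
    """

    def __init__(self, store):
        self.store = store
        self.version = -1
        self.pattern = ()
        self.progress = {}  # user -> number of pattern steps matched so far

    def _refresh(self):
        if self.store.version != self.version:
            self.version = self.store.version
            self.pattern = self.store.pattern
            self.progress.clear()  # partial matches of the old rule are discarded

    def on_event(self, user, action):
        self._refresh()
        if not self.pattern:
            return False
        i = self.progress.get(user, 0)
        if action == self.pattern[i]:
            i += 1
        else:
            i = 1 if action == self.pattern[0] else 0
        if i == len(self.pattern):
            self.progress[user] = 0
            return True
        self.progress[user] = i
        return False
```

Publishing a new pattern to the store changes what the running matcher detects from the next event onward, which mirrors the idea that business parties edit rules in storage while the job keeps running.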


2.4 Phase 4: Shared computing

Going one step further from Phase 3, Alibaba Cloud implemented a “shared computing” solution. In this solution, CEP rules can be described entirely on a modeling platform, which is exposed to upper-level customers or business parties as a friendly rule description platform where rules are assembled by drag-and-drop or other means; the event source on which to run the rules is then selected in the scheduling engine. For example, if two models both serve the Taobao app, they can run on the same Flink CEP job over the same facts, so that the business side, the execution layer, and the engine layer are fully decoupled. Alibaba Cloud's shared computing solution is now very mature and has been put into practice by many customers.


2.5 Phase 5: Separation of business development and platform construction

Among the three parties (engine side, platform side, and business side), Phase 4 decouples the engine side from the platform side, but the business side remains tightly bound. The platform and the business still work in a Party A and Party B (client and contractor) relationship: the business side owns the business rules, and the platform side takes the business team's risk control requirements and develops the risk control rules. However, the platform team's headcount is usually limited, while business teams keep growing as the business develops.

At this point, the business side can abstract some basic concepts itself, distill common business specifications, and assemble them into a friendly DSL, and then submit jobs through Alibaba Cloud's fully decoupled Open API.

Because nearly 100 BUs in the group need to be supported at the same time, customized support for each BU is not possible. We can only open up the engine's capabilities as much as possible and let the business side submit jobs to the platform through its DSL wrapper, so that only a single middle platform is exposed to customers.


3. Difficulties in large-scale risk control technology

This chapter mainly introduces some technical difficulties of large-scale risk control, and how Alibaba Cloud breaks through these technical difficulties in fully managed Flink commercial products.

3.1 Fine-grained resource adjustment

In stream computing systems, the data source is usually not the bottleneck node. The upstream data reading node has no computing logic and therefore no performance problem; the downstream data processing node is the performance bottleneck of the whole job.

Because Flink jobs are allocated resources by slot, the source node and the worker node have the same parallelism by default. In this case we want to adjust the parallelism of the source node and the CEP worker node separately. For example, in the figure below, the parallelism of a job's CEP worker node can reach 2000 while the source node needs a parallelism of only 2, which greatly improves the performance of the CEP node.

[Figure: a job whose CEP worker node runs at parallelism 2000 while the source node runs at parallelism 2]

In addition, the memory and CPU resources of the TaskManagers (TMs) that host the CEP worker nodes can be sized separately. In open source Flink, TMs are homogeneous as a whole, which means the source node and the worker node have exactly the same specifications. From a resource-saving perspective, in a real production environment the source node does not need as much memory and CPU as the CEP node; a small amount of CPU and memory is enough for it to ingest data.

Alibaba Cloud's fully managed Flink lets source nodes and CEP nodes run on heterogeneous TMs: the TMs of CEP worker nodes get significantly more resources than those of source nodes, so CEP execution becomes more efficient. Taking the optimizations from fine-grained resource adjustment into account, the fully managed cloud service can save 20% of the cost compared with self-built Flink in an IDC.

3.2 Stream-batch unification & Adaptive Batch Scheduler

If the stream engine and the batch engine do not share the same execution model, their data definitions often turn out to be inconsistent. The cause is that stream rules are hard to describe fully in terms of batch rules; for example, Flink may have a special UDF for which the Spark engine has no counterpart. When the data definitions are inconsistent, deciding which side's definition to trust becomes a very important issue.

With Flink's stream-batch unification, CEP rules described in stream mode can be run again in batch mode with the same definitions and produce the same results, so there is no need to develop separate batch-mode CEP jobs.


On top of this, Alibaba implemented an adaptive Batch Scheduler. The daily output of CEP rules is not necessarily balanced. For example, if today's behavior sequences contain no abnormal behavior, only a small amount of data flows downstream. An elastic cluster is reserved for the batch analysis: when CEP produces few results, the downstream batch analysis needs only very few resources, and the parallelism of each batch analysis worker node does not even need to be specified up front. The worker node can adjust its batch-mode parallelism automatically according to the upstream data output and the task load, achieving truly elastic batch analysis. This is a unique advantage of the Batch Scheduler in Alibaba Cloud's stream-batch unified Flink.
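
The core decision described above, deriving parallelism from the amount of upstream output instead of fixing it in advance, can be written down in a few lines. This is a sketch of the idea only; the function name, the parameters, and the default bounds are assumptions, not the scheduler's actual configuration.

```python
import math


def decide_parallelism(upstream_bytes, bytes_per_task, min_parallelism=1, max_parallelism=128):
    """Chooses a batch parallelism so that each task handles roughly bytes_per_task of input."""
    wanted = math.ceil(upstream_bytes / bytes_per_task)
    # Clamp to the allowed range so quiet days use few tasks and busy days stay bounded.
    return max(min_parallelism, min(max_parallelism, wanted))
```

On a day with almost no CEP output the function returns the minimum parallelism, and on a day with heavy output it grows up to the configured maximum.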

3.3 Merged reads reduce the pressure on the public layer

This problem comes from practice. The current development model is basically built on the data middle platform, for example a real-time data warehouse. In a real-time data warehouse there may not be many data sources, but the number of DWD tables in the middle layer grows, the middle layer may evolve into many DWS tables, and these may further evolve into many data marts used by different departments. In this case the read pressure on a single table becomes very high.

Usually multiple source tables are joined (widened) with each other to form the DWD layer, so from the perspective of a single source table, multiple DWD tables depend on it. The DWD layer is in turn consumed by jobs in different business domains to form DWS tables. For this situation, Alibaba implemented source-based merging: the DWD table needs to be read only once, and Flink processes it into the DWS tables of multiple business domains, which greatly reduces the execution pressure on the public layer.
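
The effect of merging reads can be shown with a toy comparison. The sketch below is plain Python with hypothetical names (`CountingSource`, `merged_read`, `separate_reads`); each DWS table is modeled as a simple sum grouped by a key, which is an assumption made to keep the example small.

```python
class CountingSource:
    """A DWD table that counts how many times it is scanned."""

    def __init__(self, rows):
        self.rows = rows
        self.scans = 0

    def scan(self):
        self.scans += 1
        yield from self.rows


def merged_read(source, specs):
    """specs maps a DWS table name to (key_fn, value_fn). One scan feeds every DWS aggregate."""
    tables = {name: {} for name in specs}
    for row in source.scan():
        for name, (key_fn, value_fn) in specs.items():
            key = key_fn(row)
            tables[name][key] = tables[name].get(key, 0) + value_fn(row)
    return tables


def separate_reads(source, specs):
    """Baseline: every DWS table is built by its own job with its own scan of the DWD table."""
    tables = {}
    for name, spec in specs.items():
        tables.update(merged_read(source, {name: spec}))
    return tables
```

Both functions produce the same DWS tables, but `separate_reads` scans the DWD table once per DWS table while `merged_read` scans it once in total.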


3.4 State backend with a KV separation design

When a CEP node runs, it reads local data at very large scale, especially when computing over behavior sequences, because it needs to cache all previous data or the behavior sequence within a certain period.

The major problem here is that the state backend (such as RocksDB) incurs a very large performance overhead, which in turn affects the performance of the CEP node. Alibaba has implemented a state backend with a KV separation design, and Alibaba Cloud Flink uses Gemini as the state backend by default. In CEP scenarios, the measured performance improves by at least 100%.
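
The talk does not describe Gemini's internals. As a generic illustration of what key-value separation means, the sketch below keeps a small index of keys apart from an append-only log of values, so that reorganizing the index never has to move large values. The class and its layout are assumptions for this example and should not be read as Gemini's design.

```python
class KVSeparatedStore:
    """Toy store that keeps keys in a small index and values in a separate append-only log."""

    def __init__(self):
        self.value_log = bytearray()  # large values are appended here and never rewritten
        self.index = {}               # key -> (offset, length); small and cheap to reorganize

    def put(self, key, value):
        offset = len(self.value_log)
        self.value_log += value
        self.index[key] = (offset, len(value))

    def get(self, key):
        offset, length = self.index[key]
        return bytes(self.value_log[offset:offset + length])
```

An overwrite only appends the new value and updates one index entry; reclaiming the space of stale values would need a separate garbage collection step, which is left out here.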


3.5 Partitioned loading of dimension data

In many cases, risk control analysis relies on historical behavior. Historical behavior data is generally stored in Hive or ODPS tables, and such a table may be at the terabyte scale. By default, open source Flink has to load this very large dimension table on every dimension table node, which is unrealistic. Alibaba Cloud partitions the in-memory data based on the shuffle, so that each dimension table node loads only the data belonging to its own shuffle partition.
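
The partitioned loading described above can be sketched as follows. This is a plain Python simulation under stated assumptions: the CRC32-based `partition_of` function, `load_partition`, and `lookup_join` are hypothetical names, and the real shuffle and table loading in Alibaba Cloud Flink are not shown in the talk.

```python
import zlib


def partition_of(key, num_partitions):
    """Deterministic hash partitioning shared by the event shuffle and the dimension loader."""
    return zlib.crc32(str(key).encode("utf-8")) % num_partitions


def load_partition(dim_rows, subtask, num_partitions):
    """Each dimension table node caches only the rows whose key hashes to its own partition."""
    return {k: v for k, v in dim_rows if partition_of(k, num_partitions) == subtask}


def lookup_join(events, dim_rows, num_partitions):
    """Joins events against the dimension table; returns the joined rows and per-node cache sizes."""
    caches = [load_partition(dim_rows, i, num_partitions) for i in range(num_partitions)]
    joined = []
    for key, payload in events:
        # The event is shuffled to the node that owns its key, so the lookup is always local.
        cache = caches[partition_of(key, num_partitions)]
        joined.append((key, payload, cache.get(key)))
    return joined, [len(c) for c in caches]
```

Because events and dimension rows use the same partition function, every lookup hits the node that holds the matching row, and the cache sizes across nodes add up to one copy of the table instead of one copy per node.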


4. Alibaba Cloud Flink FY23 Risk Control Evolution Plan

For Alibaba Cloud as a whole, the FY23 evolution plan includes the following:

  • Enhanced expressiveness
  • Enhanced observability
  • Enhanced execution capability
  • Enhanced performance

You are welcome to try the cloud products, share your feedback, and make progress together with us.
