Machine learning algorithm platform and scenarios based on Flink

Time: 2020-9-27

Author: Gao Min (Wu He), senior technical expert of Alibaba

1. Preface

With the Internet's "demographic dividend" exhausted, the conversion rates and overall effectiveness of machine learning platforms and recommendation systems built on "T+1" or offline computing are flattening out. In the post-epidemic era, new social patterns and economic forms will inevitably give rise to new business models, and online traffic and the application scenarios around it are growing explosively. Conventional offline systems and offline machine learning platforms can no longer keep up with business development. Once the demographic dividend is gone, business systems built on big data and AI platforms need to look to the time dimension: extracting value from time through real-time business systems has become a mainstream trend, and online machine learning platforms based on stream computing engines will receive more and more attention. A near-real-time or real-time recommendation system based on incremental models can capture the rapidly changing needs of target users, so that recommendations are accurate and can be monetized. Real-time recommendation has also spread from its earliest e-commerce scenarios to social, online education, gaming and broader online scenarios.

This article introduces two products from Alibaba Cloud's big data and AI product family, real-time computing Flink and the PAI Alink machine learning algorithm platform, and shows how the combination is applied in real-time recommendation scenarios (e-commerce, gaming and online education solutions), real-time scorecard scenarios (financial, security and marketing risk control solutions) and anomaly detection scenarios (industrial and other industrial Internet fields).

2. Introduction to the real-time computing engine and machine learning algorithm platform

2.1 Alibaba Cloud real-time computing Flink

As the commercial product from the founding team of Apache Flink, Alibaba Cloud real-time computing Flink makes real-time big data processing and real-time business possible for enterprises by offering extremely low-latency data processing compared with the traditional micro-batch mode. Its commercial, unified development and management platform, together with mature, near-standard SQL and metadata management capabilities, greatly improves the development efficiency of business staff and data analysts. SQL and UDFs cover more than 80% of business scenarios. The enterprise-class state backend, Gemini, greatly improves IO efficiency, and the overall execution engine performs more than three times better than the open source version.

The new serverless, fully managed real-time computing Flink service is built on Alibaba Cloud Kubernetes and uses a new hard multi-tenancy design: network-layer isolation based on VPC, compute-layer isolation based on Alibaba Cloud secure containers, and storage-layer isolation based on elastic cloud disks. User-level masters and a super master provide multi-tenant isolation even under extreme resource elasticity. Fine-grained, load-based elastic scaling improves resource utilization and reduces overall TCO. This new generation of serverless real-time computing Flink gives online machine learning algorithm platforms a solid real-time ("time") foundation.

2.2 Alibaba Cloud PAI Alink machine learning algorithm platform

Compared with Spark ML, Alink offers a more comprehensive set of algorithms, better performance, broader scenario coverage (both stream and batch) and better localization (Chinese word segmentation), which makes it a first choice for quickly building an online machine learning system.

3. Machine learning scenarios based on real-time computing Flink

3.1 Real-time recommendation scenario

From e-commerce sites that react to each click and page view, to social media feeds that recommend content based on what a user is reading, to game platforms that push content according to player behavior in real time, the real-time recommendation system has become the core of online business systems.

The Alibaba Cloud PAI Alink algorithm platform provides stream and batch algorithms for recall (ALS, FM, DeepWalk, etc.), feature encoding (one-hot, multi-hot, GBDT, etc.), ranking (LR, FFM, etc.) and online learning (OnlineFM and FTRL), and can build the whole pipeline. Combined with the ability of Alibaba Cloud real-time computing Flink to join massive numbers of samples in real time, it lets you quickly build an end-to-end recommendation system that integrates offline and online processing.

A base model is first built through feature engineering and batch training. An incremental model is then produced through real-time sample joining and streaming algorithms (OnlineFM and FTRL). Finally, the unified model serves the overall predictions, so the recommendation results improve in a more timely and dynamic way.
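
To make the batch-then-incremental idea concrete, here is a minimal sketch of per-coordinate FTRL-Proximal for logistic regression in plain Python. It is not Alink's implementation or API: the class name `FtrlProximal`, the `warm_start` parameter (a way to seed the online learner with weights from a batch-trained model) and the feature names are all assumptions made for illustration.

```python
import math
from collections import defaultdict


class FtrlProximal:
    """Per-coordinate FTRL-Proximal for logistic regression on sparse features."""

    def __init__(self, alpha=0.1, beta=1.0, l1=0.0, l2=0.0, warm_start=None):
        self.alpha, self.beta, self.l1, self.l2 = alpha, beta, l1, l2
        self.z = defaultdict(float)  # accumulated adjusted gradients
        self.n = defaultdict(float)  # accumulated squared gradients
        # Assumption: seed z so the initial weights equal the batch-trained ones.
        for key, w0 in (warm_start or {}).items():
            if w0 != 0.0:
                self.z[key] = -w0 * (beta / alpha + l2) - math.copysign(l1, w0)

    def weight(self, key):
        # Closed-form weight for one coordinate; L1 keeps small coordinates at zero.
        z, n = self.z.get(key, 0.0), self.n.get(key, 0.0)
        if abs(z) <= self.l1:
            return 0.0
        shrunk = z - math.copysign(self.l1, z)
        return -shrunk / ((self.beta + math.sqrt(n)) / self.alpha + self.l2)

    def predict(self, x):
        # x maps feature name -> value; returns the predicted probability of label 1.
        margin = sum(self.weight(k) * v for k, v in x.items())
        margin = max(-35.0, min(35.0, margin))  # avoid overflow in exp
        return 1.0 / (1.0 + math.exp(-margin))

    def update(self, x, y):
        # One incremental step on a single (features, label) sample, y in {0, 1}.
        p = self.predict(x)
        for k, v in x.items():
            g = (p - y) * v  # gradient of the log loss for this coordinate
            w = self.weight(k)
            sigma = (math.sqrt(self.n[k] + g * g) - math.sqrt(self.n[k])) / self.alpha
            self.z[k] += g - sigma * w
            self.n[k] += g * g
        return p


# Hypothetical usage: start from a batch-trained weight, then learn from a stream.
model = FtrlProximal(warm_start={"category=shoes": 0.5})
for _ in range(200):
    model.update({"category=shoes": 1.0}, 1)
    model.update({"category=books": 1.0}, 0)
print(model.predict({"category=shoes": 1.0}), model.predict({"category=books": 1.0}))
```

Each call to `update` touches only the coordinates present in the sample, which is what makes this style of algorithm practical on a sample stream.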

3.2 Scorecard scenario

The Alibaba Cloud real-time computing Flink and PAI Alink product portfolio helps customers quickly build real-time financial risk control solutions. Scorecards are widely used in finance: whether an accurate scorecard model can be built determines whether payment, lending, insurance, wealth management, credit and other businesses can operate safely. Scorecards are most common in credit evaluation, such as credit card risk assessment and loan approval, and are also used for general scoring, such as customer quality scores and credit scores. Financial scenarios must be traceable, auditable and interpretable, and the scorecard model is highly interpretable. For example, take a user who is 27 years old, male, married, holds a bachelor's degree and has a monthly income of 10,000. According to the scorecard, the user's score is: score = 223 (base score) + 8 (age score) + 4 (gender score) + 8 (marital status score) + 8 (education score) + 13 (monthly income score) = 264.
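
The arithmetic in this example can be written out as a short Python sketch. The point values are the ones quoted in the example; the dictionary layout and the function name `scorecard_score` are illustrative assumptions, not part of any product API.

```python
BASE_SCORE = 223

# Points awarded to the example user for each feature (values from the example).
feature_points = {
    "age = 27": 8,
    "gender = male": 4,
    "marital status = married": 8,
    "education = bachelor": 8,
    "monthly income = 10000": 13,
}


def scorecard_score(base, points):
    """A scorecard score is the base score plus the points of every feature."""
    return base + sum(points.values())


total = scorecard_score(BASE_SCORE, feature_points)

# Printing the breakdown shows why the model is easy to audit and explain.
for feature, pts in feature_points.items():
    print(f"{feature:28s} +{pts}")
print(f"{'total':28s} {total}")  # total is 264
```

Because the score is a plain sum, every point in the final result can be traced back to one feature value.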

The Alibaba Cloud real-time computing Flink and PAI Alink product portfolio provides an advanced scorecard solution with four components: binning, which discretizes each feature into bins as required; scorecard training, which generates the scoring model; sample stability, which measures how stable the samples are through PSI and other indicators; and model evaluation, which evaluates the binary classification model. The solution supports models with many feature dimensions and modeling on large-scale samples.
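
As an illustration of the binning and sample stability steps, the following sketch computes the population stability index (PSI) of one feature in plain Python. It assumes quantile bins derived from the baseline sample and a small floor `eps` for empty bins; the function names are hypothetical and this is not how Alink exposes the calculation.

```python
import math
from bisect import bisect_right


def quantile_edges(values, n_bins):
    """Interior cut points so each bin holds roughly the same share of the baseline."""
    ordered = sorted(values)
    return [ordered[len(ordered) * i // n_bins] for i in range(1, n_bins)]


def bin_shares(values, edges, eps=1e-6):
    """Share of values falling in each bin; eps avoids log(0) for empty bins."""
    counts = [0] * (len(edges) + 1)
    for v in values:
        counts[bisect_right(edges, v)] += 1
    return [max(c / len(values), eps) for c in counts]


def psi(expected, actual, n_bins=10):
    """PSI = sum over bins of (actual - expected) * ln(actual / expected)."""
    edges = quantile_edges(expected, n_bins)
    e = bin_shares(expected, edges)
    a = bin_shares(actual, edges)
    return sum((ai - ei) * math.log(ai / ei) for ei, ai in zip(e, a))


# Hypothetical data: a training-time sample and a shifted production sample.
baseline = list(range(100))
shifted = [v + 30 for v in baseline]
print(psi(baseline, baseline))  # identical distributions give 0.0
print(psi(baseline, shifted))   # a shifted distribution gives a positive value
```

Every term of the sum is non-negative, so PSI grows as the production distribution drifts away from the one the scorecard was trained on.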

3.3 Anomaly detection scenario

Anomaly detection and time series analysis are common, widely used scenarios, especially in industry. The Alibaba Cloud real-time computing Flink and PAI Alink product portfolio helps customers quickly build anomaly detection solutions. Flink's performance combined with Alink's rich algorithm library lets data analysts and application developers handle data processing, feature engineering, model training and prediction end to end. For anomaly detection, Alink supports two core scenarios: time series anomaly detection and anomaly set detection.

For time series anomaly detection, Alink offers a complete range of algorithms, unified batch and stream processing, strong performance, parallel computation and ease of use. Depending on the scenario, the algorithms fall into two types: time series prediction and time series decomposition.

  • Time series prediction algorithms suit streaming data and respond immediately.
  • Time series decomposition algorithms suit full data sets and can mine useful information from the complete data.

Alink also provides the time series prediction and time series decomposition algorithms as standalone components that users can call on their own.
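
The prediction-based approach for streaming data can be sketched as follows: forecast the next point, compare it with the observed value, and flag large residuals. This minimal Python example uses an exponentially weighted moving average as the forecast. The technique is swapped in for illustration and is not taken from Alink; the class name `EwmaDetector` and its parameters are assumptions.

```python
import math


class EwmaDetector:
    """Flags points that deviate strongly from a one-step EWMA forecast."""

    def __init__(self, alpha=0.3, threshold=3.0, warmup=5):
        self.alpha = alpha          # smoothing factor of the forecast
        self.threshold = threshold  # allowed residual, in standard deviations
        self.warmup = warmup        # points to observe before flagging anything
        self.mean = None
        self.var = 0.0
        self.seen = 0

    def observe(self, x):
        """Return True if x is anomalous; normal points update the state."""
        if self.mean is None:
            self.mean, self.seen = x, 1
            return False
        residual = x - self.mean  # the current EWMA is the forecast for x
        std = math.sqrt(self.var)
        is_anomaly = (self.seen >= self.warmup and std > 0
                      and abs(residual) > self.threshold * std)
        if not is_anomaly:
            # Exponentially weighted updates of mean and variance.
            self.mean += self.alpha * residual
            self.var = (1 - self.alpha) * (self.var + self.alpha * residual ** 2)
        self.seen += 1
        return is_anomaly


# Hypothetical sensor readings with one spike.
series = [10.0, 10.2, 9.9, 10.1, 10.0, 9.9, 10.1, 50.0, 10.0]
detector = EwmaDetector()
flags = [detector.observe(x) for x in series]
print([i for i, flag in enumerate(flags) if flag])  # [7]
```

Anomalous points are deliberately kept out of the running statistics so that a single spike does not distort the baseline; the trade-off is that a genuine level shift keeps being flagged until the detector is reset.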

Anomaly set detection is one of the core demands of risk control scenarios. Alink anomaly set detection has the following advantages:

  • Very large graph support: handles graph data with hundreds of millions of edges
  • Online update: local anomaly detection can run at any time from known anomalous seeds
  • Fast operation: only the local graph is processed, which saves computing resources

Anomaly set detection is needed in risk areas such as embezzlement, fraud, cheating, merchant risk and loan arbitrage. The detection is based on radyk. Given the connection relationships and the known bad nodes as input, the algorithm analyzes the whole graph and captures other malicious users, which reduces risk during business operation, protects business security and avoids potentially major losses.
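
The general idea of starting from known bad nodes and working outward over the connection graph can be sketched in plain Python. This is a deliberately simple neighbourhood-expansion heuristic chosen for illustration, not the algorithm Alink uses; the function names, the `min_bad_share` threshold and the sample edges are all assumptions.

```python
from collections import defaultdict


def build_graph(edges):
    """Undirected adjacency sets built from (node, node) connection pairs."""
    graph = defaultdict(set)
    for a, b in edges:
        graph[a].add(b)
        graph[b].add(a)
    return graph


def expand_from_seeds(graph, seeds, max_rounds=3, min_bad_share=0.5):
    """Flag nodes whose neighbours are mostly already flagged, round by round."""
    flagged = set(seeds)
    for _ in range(max_rounds):
        # Only neighbours of flagged nodes are examined, i.e. the local graph.
        candidates = {nb for node in flagged for nb in graph.get(node, ())} - flagged
        newly = {c for c in candidates
                 if len(graph[c] & flagged) / len(graph[c]) >= min_bad_share}
        if not newly:
            break
        flagged |= newly
    return flagged - set(seeds)


# Hypothetical connections (shared devices, transfers, ...) and two known bad accounts.
edges = [("bad1", "u1"), ("bad2", "u1"), ("u1", "u2"), ("bad1", "u2"),
         ("u2", "u3"), ("u3", "u4"), ("u3", "u5"), ("u3", "u6")]
graph = build_graph(edges)
suspects = expand_from_seeds(graph, {"bad1", "bad2"})
print(sorted(suspects))  # ['u1', 'u2']
```

Here `u1` is flagged in the first round because two of its three neighbours are seeds, `u2` follows in the second round, and `u3` stays clean because only one of its four neighbours is flagged. The rest of the graph is never visited, which mirrors the local-graph advantage listed above.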

4. Postscript

After this introduction, you may want to try the Alibaba Cloud real-time computing Flink and PAI product portfolio. You can quickly activate fully managed real-time computing Flink and try the latest serverless service. Direct link to real-time computing Flink: https://www.aliyun.com/product/bigdata/sc

By creating an Alibaba Cloud E-MapReduce Dataflow cluster, you can quickly build a PAI Alink algorithm platform on top of Alibaba Cloud real-time computing Flink. Direct link to PAI Alink: https://www.aliyun.com/product/emapreduce

This article is original content from Alibaba Cloud and may not be reproduced without permission.