Introduction: By combining the Flink real-time stream computing framework with its business scenarios, Weibo has done a great deal of platform and service work, and has also made many optimizations for development efficiency and stability. Development efficiency is improved through modular design and platform-based development.
Cao Fuqiang, senior system engineer and head of data computing at Weibo's Machine Learning R&D Center, introduced the application of Flink real-time computing at Weibo. The contents include:
1. Introduction to Weibo
2. Introduction to the data computing platform
3. Typical applications of Flink in the data computing platform
1、 Introduction to Weibo
This time, we will share the application of Flink real-time computing at Weibo. Weibo is a leading social media platform in China, with 241 million daily active users and 550 million monthly active users, of which more than 94% are on mobile.
2、 Introduction to the data computing platform
1. Overview of the data computing platform
The following figure shows the architecture of the data computing platform.
- The bottom layer is scheduling. Based on Kubernetes and YARN, we deploy Flink and Storm for real-time data processing, and SQL services for offline processing.
- On top of the clusters, we deployed Weibo's AI platform to manage jobs, data, resources, samples, and so on.
- On top of the platform, we built a number of services that support the various business teams in a service-oriented way.
- The real-time computing services mainly include data synchronization, content deduplication, multimodal content understanding, real-time feature generation, real-time sample joining and streaming model training, all of which are closely tied to the business. In addition, Flink and Storm real-time computing are supported as general-purpose basic computing frameworks.
- On the offline side, we built an SQL computing service combining Hive SQL and Spark SQL. It currently supports the vast majority of business teams within Weibo.
Data output is provided externally on the basis of the data warehouse and feature engineering. Overall, we have nearly 1,000 real-time computing jobs running online, more than 5,000 offline jobs, and more than 3 PB of data processed every day.
2. Data computing
The following two figures show data computing: one is real-time computing and the other is offline computing.
- Real-time computing mainly includes real-time feature generation, multimedia feature generation and real-time sample generation, which are closely tied to the business. It also provides basic Flink and Storm real-time computing.
- Offline computing mainly consists of SQL computing: ad hoc queries, data generation, data queries and table management. Table management refers to managing the data warehouse, including table metadata, table permissions, and the lineage between upstream and downstream tables.
3. Real-time features
As shown in the figure below, we built a real-time feature generation service based on Flink and Storm. Overall it is divided into job details, input sources, feature generation, output, and resource configuration. Users develop the feature-generation UDF against a predefined interface; other functions such as input and feature writing are provided automatically by the platform and only need to be configured on the page. The platform also automatically provides monitoring of input data sources, job exceptions, feature writes and feature reads.
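To make the division of labor concrete, here is a hypothetical sketch of such a predefined UDF interface: the platform owns the input source and the feature writing, and the user implements only the feature-generation logic. All class, method and field names below are illustrative assumptions, not the platform's real API.

```python
# Hypothetical UDF interface: the platform calls extract() once per event
# and writes the returned feature dict to the feature store.

class FeatureUDF:
    """Interface the user implements for feature generation."""
    def extract(self, event: dict) -> dict:
        raise NotImplementedError

class ClickCountUDF(FeatureUDF):
    """Example UDF: maintain a per-user click count as a real-time feature."""
    def __init__(self):
        self.counts = {}                     # user_id -> clicks seen so far

    def extract(self, event):
        if event.get("action") == "click":
            uid = event["user_id"]
            self.counts[uid] = self.counts.get(uid, 0) + 1
        return {"user_id": event["user_id"],
                "click_count": self.counts.get(event["user_id"], 0)}

# The platform would wire the UDF between its input source and feature store:
udf = ClickCountUDF()
features = [udf.extract(e) for e in [
    {"user_id": "u1", "action": "click"},
    {"user_id": "u1", "action": "click"},
    {"user_id": "u2", "action": "view"},
]]
```

With this shape, everything outside `extract()` (sources, sinks, monitoring) can stay page-configurable, which is what lets the platform generate the surrounding job automatically.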
4. Stream-batch unification
The following describes our stream-batch unification based on Flink SQL. First, metadata is unified: real-time logs and offline logs are unified through the metadata management platform. After unification, when users submit jobs, a unified scheduling layer dispatches them to different clusters according to job type, job characteristics and current cluster load.
The computing engines currently supported by the scheduling layer are mainly Hive SQL, Spark SQL and Flink SQL. Hive and Spark SQL are mainly used for batch computing, while Flink SQL is used for mixed stream-batch workloads. The results are output to the data warehouse and provided to the business teams. There are roughly four key points in stream-batch unification:
- First, unified stream and batch code, which improves development efficiency.
- Second, unified stream and batch metadata, managed centrally to keep metadata consistent.
- Third, stream and batch programs running together, which saves resources.
- Fourth, unified stream and batch scheduling, which improves cluster utilization.
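The scheduling layer described above can be sketched as a simple routing function: pick an engine by job type, then pick the least-loaded cluster running that engine. The routing table, cluster names and load numbers below are assumptions for illustration, not Weibo's actual scheduling policy.

```python
# Toy model of the unified scheduling layer: job type -> engine -> cluster.

ENGINE_BY_TYPE = {
    "batch": "SparkSQL",          # pure batch jobs
    "ad_hoc": "HiveSQL",          # interactive queries
    "stream_batch": "FlinkSQL",   # mixed stream/batch jobs
}

def schedule(job_type: str, cluster_load: dict) -> str:
    """Return the least-loaded cluster of the engine that handles job_type."""
    engine = ENGINE_BY_TYPE[job_type]
    clusters = cluster_load[engine]          # {cluster name: load fraction}
    return min(clusters, key=clusters.get)   # pick the lowest current load

load = {"FlinkSQL": {"flink-a": 0.7, "flink-b": 0.3},
        "SparkSQL": {"spark-a": 0.5},
        "HiveSQL": {"hive-a": 0.2}}
target = schedule("stream_batch", load)      # routed to the idler Flink cluster
```

A real scheduler would also weigh job characteristics (state size, parallelism, SLA), but the two-step routing shape is the same.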
5. Data warehouse
- For the offline data warehouse, we divide the data into three layers: the original logs, the middle layer, and the data service layer. In the middle is the unified metadata, and below it the real-time data warehouse.
- For the real-time data warehouse, we run a streaming ETL over the original logs with Flink SQL, then write the final results to the data service layer through a streaming summary. We also store the data in various real-time stores such as Elasticsearch, HBase, Redis and ClickHouse, from which it can be queried externally or computed on further. In other words, the real-time data warehouse is mainly built to solve the long generation cycle of offline features, while Flink SQL solves the long development cycle of streaming jobs. One key point is the unified metadata management of the offline and real-time data warehouses.
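The layered flow just described (original logs, streaming ETL, streaming summary, data service layer) can be collapsed into a miniature sketch. In the real platform each step is a Flink SQL job and the results land in stores like Redis or HBase; here everything is plain Python with made-up log fields, purely for illustration.

```python
import json

# original log layer -> (ETL) -> middle layer -> (summary) -> service layer

raw_logs = [
    '{"uid":"u1","action":"click","ts":1}',
    '{"uid":"u1","action":"click","ts":2}',
    'not-json',                                # dirty record, dropped by ETL
]

def etl(lines):
    """Original log layer -> middle layer: parse and drop dirty records."""
    for line in lines:
        try:
            yield json.loads(line)
        except json.JSONDecodeError:
            continue

def summarize(events):
    """Middle layer -> data service layer: per-user click counts."""
    out = {}
    for e in events:
        out[e["uid"]] = out.get(e["uid"], 0) + (e["action"] == "click")
    return out

service_layer = summarize(etl(raw_logs))       # would land in Redis/HBase/ES
```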
3、 Typical applications of Flink in the data computing platform
1. Streaming machine learning
First, several characteristics of streaming machine learning. Its biggest characteristic is real-time performance, which has two parts: real-time features and real-time models.
- Real-time features feed user behavior back in time and describe users at a finer granularity.
- Real-time models are trained in real time on online samples, reflecting online changes of the objects in time.
■ Characteristics of Weibo's streaming machine learning:
- The sample scale is large; real-time samples currently reach millions of QPS.
- The model scale is large; in terms of training parameters, the framework supports training at the hundred-billion level.
- Job stability requirements are high.
- Sample real-time requirements are high.
- Model real-time requirements are high.
- The platform must serve many business requirements.
■ Streaming machine learning has several difficult problems:
- One is the full link: the end-to-end pipeline is long. A streaming machine learning process runs from log collection through feature generation, sample generation and model training to the final service launch. The pipeline is very long, and a problem in any link affects the final user experience. We therefore deployed a fairly complete full-link monitoring system for each link, with rich monitoring metrics.
- The other is the large data scale, including massive user logs, samples and models. We investigated the commonly used real-time computing frameworks and finally chose Flink to solve this problem.
■ The machine learning process:
- The first part is offline training. We take the offline logs, generate samples offline, read the samples with Flink, and do offline training. After training, the resulting parameters are saved in the offline parameter server. This result serves as the base model of the model service for real-time cold start.
- The second part is the streaming learning process. We pull real-time logs, such as Weibo post content and interaction logs, generate samples from them with Flink, and do real-time training. After training, the parameters are saved in a real-time parameter server and then periodically synchronized to the parameter server used for online serving.
- Finally, the model service pulls the model's parameters from the parameter service, fetches user or material features, and scores the user- and material-related features and behaviors with the model. The ranking service then retrieves the scores and applies recommendation strategies to select the materials it considers most suitable for the user and feed them back. The client produces new interaction behavior, the user sends new online requests and generates new logs, so the whole streaming learning process is a closed loop.
In addition, comparing offline and streaming training:
- Offline sample and model update latency is at the day or hour level, while streaming is at the hour or minute level.
- The computational pressure of offline model training is relatively concentrated, while that of real-time training is relatively spread out.
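The contrast above between concentrated and spread-out compute can be seen in a toy example: batch training recomputes a gradient over all samples at once, while streaming training applies one small update per arriving sample. The model here (a single weight fit to y = 2x by gradient descent) is purely an illustrative assumption.

```python
# Batch vs. streaming update patterns on a one-weight linear model.

def batch_train(samples, lr=0.1, epochs=50):
    """Concentrated compute: full-batch gradient steps over all samples."""
    w = 0.0
    for _ in range(epochs):
        grad = sum(2 * (w * x - y) * x for x, y in samples) / len(samples)
        w -= lr * grad
    return w

def streaming_update(w, sample, lr=0.1):
    """Spread-out compute: one gradient step per incoming sample."""
    x, y = sample
    return w - lr * 2 * (w * x - y) * x

samples = [(1.0, 2.0), (2.0, 4.0), (3.0, 6.0)]
w_batch = batch_train(samples)

w_stream = 0.0
for _ in range(50):                 # samples "arriving" over and over
    for s in samples:
        w_stream = streaming_update(w_stream, s)
```

Both converge to the same weight; the difference is when the compute happens, which is exactly the operational trade-off described above.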
Here is a brief history of our streaming machine learning samples. In October 2018, we launched the first streaming sample job using Storm and external Redis storage. In May 2019, we adopted the new real-time computing framework Flink, using a union + timer scheme instead of window computing to join multiple data streams. In October 2019, a XX sample job went live, with single-job QPS reaching hundreds of thousands. In April of this year, the sample generation process was platformized; by June, the platform had iterated to support persisting samples, including the sample library and improved sample monitoring metrics.
Sample generation in streaming machine learning is essentially joining multiple data streams on the same key. For example, suppose we have three data streams. After data cleaning, the results are stored as <K, V> pairs, where K is the key to aggregate on and V is the value required in the sample. After a union of the streams, a keyBy aggregation is performed, and the aggregated data is stored in the in-memory value state. As shown in the figure below:
- If K1 does not exist, register a timer and save it in the state.
- If K1 exists, take it out of the state, update it, and save it back. When its timer expires, the data is output and cleared from the state.
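The union + timer join above can be simulated in a single process without Flink: events from several streams share a key K, their values are merged into per-key state, and a timer flushes each key after a fixed interval. Timer handling below is simplified to explicit timestamps passed by the caller; it is a sketch of the mechanism, not Flink's actual API.

```python
# Single-process simulation of the union + keyBy + value-state + timer join.

class TimerJoin:
    def __init__(self, window):
        self.window = window          # how long a key waits before output
        self.state = {}               # K -> merged V (the "value state")
        self.timers = {}              # K -> expiry time
        self.output = []

    def process(self, now, key, value: dict):
        if key not in self.state:     # first time we see K: register timer
            self.state[key] = dict(value)
            self.timers[key] = now + self.window
        else:                         # K exists: update and save back
            self.state[key].update(value)
        self.fire(now)

    def fire(self, now):
        """Expired timers emit the joined sample and clear the state."""
        for key in [k for k, t in self.timers.items() if t <= now]:
            self.output.append((key, self.state.pop(key)))
            del self.timers[key]

# Events from the (already cleaned) streams arrive in <K, V> form:
join = TimerJoin(window=10)
join.process(0, "k1", {"expose": 1})
join.process(3, "k1", {"click": 1})     # joined into the same sample
join.process(5, "k2", {"expose": 1})
join.process(12, "k3", {"expose": 1})   # time 12 > 10: k1's timer fires
```

In Flink this corresponds roughly to a `KeyedProcessFunction` with `ValueState` and `registerEventTimeTimer`, which is what the union + timer scheme replaces window computing with.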
■ Sample platform
We platformized the whole sample joining process, dividing it into five modules: input, data cleaning, sample joining, sample formatting and output. With platform-based development, users only need to care about the business logic. Users develop:
- The data cleaning logic for the input data.
- The data formatting logic before sample output.
The rest can be configured in the UI, including:
- The time window for sample joining.
- The aggregation of fields within the window.
Resources are reviewed and configured by the platform side. In addition, the whole platform provides basic monitoring, including input data monitoring, sample metric monitoring, job exception monitoring and sample output monitoring.
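The five-module split can be sketched as a pipeline where the platform owns input, joining and output, and the user supplies only the cleaning and formatting callbacks. Function and field names below are illustrative assumptions.

```python
# input -> user clean() -> join by key -> user fmt() -> output

def run_pipeline(records, clean, join_key, fmt):
    """Platform-owned pipeline; `clean` and `fmt` are the user's two hooks."""
    joined = {}
    for rec in records:
        cleaned = clean(rec)                 # user-defined cleaning logic
        if cleaned is None:
            continue                         # record dropped by cleaning
        k = cleaned[join_key]
        joined.setdefault(k, {}).update(cleaned)
    return [fmt(v) for v in joined.values()]  # user-defined formatting

samples = run_pipeline(
    [{"uid": "u1", "expose": 1}, {"uid": "u1", "click": 1}, {"uid": None}],
    clean=lambda r: r if r.get("uid") else None,
    join_key="uid",
    fmt=lambda v: {"label": int("click" in v), **v},
)
```

Everything else on the page (window length, field aggregation, resources, monitoring) stays configuration, which is what keeps user development down to these two hooks.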
■ Sample UI of a streaming machine learning project
The figure below shows a streaming machine learning project. On the left is the job configuration for sample generation, and on the right is the sample library, which is mainly used for sample management and display, including sample descriptions, permissions, sample sharing, and so on.
■ Applications of streaming machine learning
Finally, the results of applying streaming machine learning. We currently support real-time sample joining at millions of QPS, and streaming model training with hundreds of models training simultaneously; model freshness supports hourly or minute-level updates. Streaming learning supports disaster recovery throughout, with automatic full-link monitoring. Recently we have been working on streaming deep learning, to increase the expressive power of real-time models, and on reinforcement learning, to explore new application scenarios.
2. Multimodal content understanding
Multimodality is the ability, or technology, to process and understand multimodal information using machine learning methods. On Weibo this mainly covers pictures, video, audio and text.
- Pictures: object recognition, tagging, OCR, face detection, celebrity recognition, attractiveness scoring and smart cropping.
- Video: copyright detection and logo recognition.
- Audio: speech-to-text and audio tagging.
- Text: word segmentation, text timeliness and text classification tags.
For example, when we first did video classification, we only used the frames extracted from the video, i.e., pictures. In a second round of optimization, we added the audio and the blog post associated with the video, fusing audio, pictures and text multimodally to generate the video's classification tags more accurately.
The following figure shows the platform architecture of multimodal content understanding. The middle part is Flink real-time computing, which receives picture, video and blog streams in real time and calls the underlying deep learning model services through model plug-ins. The calls return content features, which we store in feature engineering and provide to the various business teams through the data center. Throughout operation, the full link is monitored with alarms so that abnormal conditions get an immediate response, and the platform automatically provides log collection, metric statistics, case tracking and other functions. The middle part uses ZooKeeper for service discovery, solving the problem of synchronizing service state between real-time computing and the deep learning model services; besides state synchronization, there are also load balancing strategies. At the bottom, a data reconciliation system further improves the success rate of data processing.
The UI of multimodal content understanding mainly includes job information, input source information, model information, output information and resource configuration. Configuration-based development improves development efficiency. The platform then automatically generates model-call monitoring metrics, including call success rate and latency; when a job is submitted, a metric statistics job is generated automatically.
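The service discovery plus load balancing just mentioned can be sketched as follows. In production, live model-service instances would appear and disappear as ephemeral ZooKeeper nodes watched by the Flink side; here the registry is a plain dict and the balancing policy is least-connections. All names and addresses are illustrative assumptions.

```python
# Toy service registry with least-connections load balancing.

class ServiceRegistry:
    def __init__(self):
        self.instances = {}                 # address -> in-flight requests

    def register(self, addr):               # instance comes online
        self.instances[addr] = 0

    def deregister(self, addr):             # instance goes away
        self.instances.pop(addr, None)

    def pick(self):
        """Least-connections choice over currently live instances."""
        addr = min(self.instances, key=self.instances.get)
        self.instances[addr] += 1
        return addr

reg = ServiceRegistry()
reg.register("model-a:8500")
reg.register("model-b:8500")
first = reg.pick()                          # both idle: one is chosen
second = reg.pick()                         # the other is now less loaded
```

Because registration is driven by the discovery layer, the computing side never needs a static list of model servers: instances that crash simply drop out of the registry and stop receiving calls.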
3. Content deduplication service
In recommendation scenarios, repeatedly pushing the same content to users greatly harms the user experience. Based on this consideration, we built a content deduplication service platform combining the Flink real-time stream computing platform, a distributed vector retrieval system and deep learning model services. It features low latency, high stability and high recall, and currently supports multiple business teams with 99.9+% stability.
The following figure shows the architecture of the content deduplication service. At the bottom is multimedia model training, which is offline: we obtain sample data, process it and store the samples in the sample library. When model training is needed, samples are pulled from the sample library and trained, and the results are saved to the model library.
The main models used here are vector generation models, including picture, text and video vectors.
Once a trained model is verified, it is saved to the model library, which stores basic information about the model, including its runtime environment and version. The model then needs to be deployed online; during deployment, the model is pulled from the model library along with the technical environment it needs to run.
After the model is deployed, we read materials from the material library in real time with Flink and call the multimedia prediction service to generate the corresponding vectors. These vectors are saved in the Weiss library, a vector recall and retrieval system developed in-house at Weibo. Once stored in Weiss, each material goes through a vector recall step to recall a batch of similar materials. In the fine comparison stage, a strategy is applied over all recalled results to select the most similar one, which is then aggregated with the current material under a single content ID. Finally, when the business uses the data, deduplication is done through the content ID corresponding to each material.
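The recall + fine-comparison flow can be sketched minimally: each new material's vector is matched against stored vectors (a tiny in-memory dict standing in for the Weiss retrieval system); if the best match clears a threshold, the material is aggregated under the existing content ID, otherwise it gets a fresh one. The vectors, threshold and ID scheme below are illustrative assumptions.

```python
import math

def cosine(a, b):
    """Cosine similarity between two vectors."""
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(x * x for x in b))
    return sum(x * y for x, y in zip(a, b)) / (na * nb)

def assign_content_id(vector, index, threshold=0.95):
    """Recall the most similar stored vector; reuse its content ID if close."""
    best_id, best_sim = None, -1.0
    for cid, vec in index.items():          # brute-force "recall" stand-in
        sim = cosine(vector, vec)
        if sim > best_sim:
            best_id, best_sim = cid, sim
    if best_sim >= threshold:
        return best_id                      # duplicate: same content ID
    cid = f"content-{len(index)}"           # new material: new content ID
    index[cid] = vector
    return cid

index = {}
a = assign_content_id([1.0, 0.0], index)
b = assign_content_id([0.99, 0.05], index)  # near-duplicate of the first
c = assign_content_id([0.0, 1.0], index)    # distinct material
```

A real system would replace the brute-force loop with approximate nearest-neighbor recall and would run the fine comparison only over the recalled batch, but the content-ID aggregation logic is the same.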
There are three main business scenarios for content deduplication:
- First, video copyright support: pirated video recognition, with 99.99% stability and a 99.99% piracy recognition rate.
- Second, site-wide Weibo video deduplication, applied in recommendation scenarios, with 99.99% stability and second-level processing latency.
- Third, deduplication of recommended stream materials, with 99% stability, second-level processing latency and 90% accuracy.
By combining the Flink real-time stream computing framework with business scenarios, we have done a great deal of platform and service work, and made many optimizations for development efficiency and stability. Development efficiency is improved through modular design and platform-based development. The real-time data computing platform is now equipped with full-link monitoring, data metric statistics and a debug case tracking (log replay) system. In addition, Flink SQL is used for stream-batch unification. These are some of the new changes Flink has brought us, and we will continue to explore Flink's broader application space at Weibo.
This article is original content from Alibaba Cloud and may not be reproduced without permission.