Introduction: By combining the Flink real-time stream computing framework with its business scenarios, Weibo has done a great deal of work on the platform and service side, and has made many optimizations for development efficiency and stability. Development efficiency is improved through modular design and platform-based development.
Cao Fuqiang, senior system engineer and head of data computing at the Weibo Machine Learning R&D Center, introduces the application of Flink real-time computing at Weibo. The contents include:
1. Introduction to Weibo
2. Introduction to the data computing platform
3. Typical applications of Flink on the data computing platform
1、 Introduction to Weibo
This talk shares the application of Flink real-time computing at Weibo. Weibo is China's leading social media platform, with 241 million daily active users and 550 million monthly active users, more than 94% of whom are on mobile.
2、 Introduction to the data computing platform
1. Overview of the data computing platform
The following figure shows the architecture of the data computing platform.
- At the bottom is scheduling. On top of Kubernetes and YARN, we deploy Flink and Storm clusters for real-time data processing, and SQL services for offline processing.
- On top of the clusters, we deploy Weibo's AI platform, which manages jobs, data, resources, and samples.
- On top of the platform, we build a number of services that support the various business teams in a service-oriented way.
1. The real-time computing services mainly include data synchronization, content deduplication, multimodal content understanding, real-time feature generation, real-time sample joining, and streaming model training, all closely tied to the business. In addition, it supports Flink and Storm real-time computing as general-purpose basic computing frameworks.
2. For the offline part, a SQL computing service built on Hive SQL and Spark SQL supports the vast majority of business teams at Weibo.
- Data output is based on the data warehouse and feature engineering, which provide the data to consumers. Overall, we currently run nearly 1,000 real-time computing jobs and more than 5,000 offline jobs online, processing more than 3 PB of data every day.
2. Data computing
The following two figures show data computing: one for real-time computing, the other for offline computing.
- Real-time computing mainly includes real-time feature generation, multimedia feature generation, and real-time sample generation, which are closely tied to the business. In addition, it provides basic Flink and Storm real-time computing.
- Offline computing mainly consists of SQL computing: SQL ad hoc queries, data generation, data queries, and table management. Table management covers the data warehouse, including table metadata, table permissions, and the lineage between a table's upstream and downstream.
3. Real-time features
As shown in the figure below, we built a real-time feature generation service on Flink and Storm. Overall it is divided into job details, input sources, feature generation, output, and resource configuration. Users develop the feature-generation UDF against an interface we pre-define; everything else, such as reading the input and writing the features, is provided automatically by the platform and only needs to be configured on the page. In addition, the platform automatically generates monitoring for the input data sources, job exceptions, feature writes, and feature reads.
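The platform's actual UDF interface is not shown in the talk, but the division of labor it describes can be sketched as follows: the user implements only the feature-extraction hook, while input parsing and feature writing stay in the platform. All class, method, and field names below are illustrative assumptions, not Weibo's real API.

```python
class FeatureUDF:
    """Hypothetical base class a user implements; the platform handles
    reading the input stream and writing the resulting features."""
    def extract(self, event: dict) -> dict:
        raise NotImplementedError

class UserClickUDF(FeatureUDF):
    """Example user code: turn a raw interaction log record into features."""
    def extract(self, event: dict) -> dict:
        return {
            "key": event["uid"],                            # feature key (user id)
            "click": 1 if event["action"] == "click" else 0,
            "ts": event["ts"],                              # event time
        }

udf = UserClickUDF()
feature = udf.extract({"uid": "u1", "action": "click", "ts": 1000})
```

In this scheme, everything outside `extract` (the input source, the feature store, the monitoring) would be wired up by the page configuration the text describes.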
4. Unified batch and stream processing
Next, we introduce unified batch-stream processing based on Flink SQL. First of all, we unify the metadata, bringing real-time logs and offline logs together through the metadata management platform. With unified metadata, jobs submitted by users go through a unified scheduling layer, which dispatches each job to a cluster according to the job type, the job's characteristics, and the current cluster load.
At present, the computing engines supported by the scheduling layer are mainly Hive SQL, Spark SQL, and Flink SQL. Hive SQL and Spark SQL are mainly used for batch computing, while Flink SQL handles both batch and streaming. All results are written to the data warehouse for business use. There are four key points in batch-stream unification:
- First, unified batch and stream code, which improves development efficiency.
- Second, unified batch and stream metadata, managed centrally to guarantee metadata consistency.
- Third, mixed deployment of batch and streaming jobs, which saves resources.
- Fourth, unified batch and stream scheduling, which improves cluster utilization.
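The routing decision made by the scheduling layer can be sketched roughly as below. The engine names match the talk, but the routing rules and the load-based tiebreak are invented for the example; the real scheduler also considers job characteristics and cluster state not modeled here.

```python
def choose_engine(needs_streaming: bool, cluster_load: dict) -> str:
    """Illustrative routing: streaming (and unified batch-stream) jobs go to
    Flink SQL; pure batch jobs go to the less loaded batch engine."""
    if needs_streaming:
        return "FlinkSQL"
    # pick the batch engine whose cluster currently reports the lower load
    return min(("HiveSQL", "SparkSQL"), key=lambda e: cluster_load.get(e, 0.0))

engine = choose_engine(needs_streaming=False,
                       cluster_load={"HiveSQL": 0.8, "SparkSQL": 0.3})
```

Because all three engines read and write the same unified metadata, a job routed to any of them sees the same tables, which is what makes this kind of load-driven dispatch safe.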
5. Data warehouse
- For the offline warehouse, we divide the data into three layers: raw logs, an intermediate layer, and a data service layer. In the middle sits the unified metadata, and at the bottom is the real-time data warehouse.
- For the real-time data warehouse, we use Flink SQL to run streaming ETL over the raw logs and write the final results to the data service layer through streaming aggregation. At the same time, we store the results in various real-time storage systems such as Elasticsearch, HBase, Redis, and ClickHouse, which serve external data queries and also support further computation. In short, the real-time data warehouse was built mainly to solve the long generation cycle of offline features, and Flink SQL was adopted to shorten the long development cycle of streaming jobs. One of the key points is the joint metadata management of the offline and real-time data warehouses.
3、 Typical applications of Flink on the data computing platform
1. Streaming machine learning
Let us first introduce the characteristics of streaming machine learning. Its biggest characteristic is real-timeliness, which has two aspects: real-time features and real-time models.
- Real-time features give more timely feedback on user behavior and describe users at a finer granularity.
- Real-time models are trained from online samples so that they reflect online changes of the modeled objects in time.
Characteristics of Weibo's streaming machine learning:
- Large sample scale: real-time samples currently reach millions of QPS.
- Large model scale: in terms of training parameters, the framework supports models at the hundred-billion level.
- High requirements on job stability.
- High real-time requirements on samples.
- High real-time requirements on models.
- Many business demands on the platform.
There are several hard problems in streaming machine learning:
- One is the full link: the end-to-end pipeline is long. A streaming machine learning process runs from log collection, to feature generation, to sample generation, to model training, and finally to serving online. A problem in any link affects the final user experience, so we deployed a fairly complete full-link monitoring system with rich monitoring metrics for every stage.
- The other is the large data scale, covering massive user logs, samples, and model parameters. We investigated the common real-time computing frameworks and finally chose Flink to solve this problem.
Machine learning process:
- The first part is offline training. We take the offline logs, generate samples offline, read the samples with Flink, and run offline training. The resulting parameters are saved in the offline parameter server; this result serves as the base model for the model service and is used for real-time cold starts.
- Then comes the real-time streaming machine learning flow. We pull real-time logs, such as Weibo content and interaction logs, generate samples from them with Flink, and train in real time. After training, the parameters are saved in a real-time parameter server and then synchronized regularly from the real-time parameter server to the parameter server that backs the online model service.
- Finally, the model service pulls the model's parameters from the parameter service and fetches the user and material features for recommendation. The model scores the features and behaviors of the user and the materials; the ranking service then takes the scores, applies some recommendation strategies, selects the materials it considers the best fit for the user, and serves them. The user's subsequent interactions send out new online requests and generate new logs, so the whole streaming learning process is a closed loop.
- Offline sample and model updates are at day or hour level, while streaming sample and model updates are at hour or minute level.
- The computational pressure of offline model training is concentrated, while the real-time computational pressure is spread out.
Here is a brief history of our streaming machine learning samples. In October 2018, we launched the first streaming sample job based on Storm and external Redis storage. In May 2019, we adopted the new real-time computing framework Flink and used a union + timer scheme instead of window computation to implement the join of multiple data streams. In October 2019, a XX sample job went online, with a single job reaching several hundred thousand QPS. In April of this year, the sample generation process was platformized, and by June the platform had gone through an iteration to support sample persistence, including the sample library and various sample monitoring metrics.
Sample generation in streaming machine learning is essentially the joining of multiple data streams on the same key. For example, suppose we have three data streams. After data cleaning, each record is stored as a key-value pair: K is the key to aggregate on, and V holds the values the sample needs. After a union of the streams, a keyBy aggregation is performed, and the aggregated data is kept in an in-memory value state. As shown in the figure below:
- If K1 does not exist in the state, register a timer and save the record into the state.
- If K1 exists, take it out of the state, update it, and save it back. When the timer fires, the data is emitted and cleared from the state.
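The union + timer join above can be simulated in plain Python. In production this would be a Flink keyed process function with real keyed state and registered timers; the class below only models that logic with a dict and explicit timestamps, and all its names are illustrative.

```python
class SampleJoiner:
    """Plain-Python simulation of the keyed-state + timer join: merge records
    from several streams that share a key, emit when the timer expires."""
    def __init__(self, window_ms: int):
        self.window_ms = window_ms
        self.state = {}    # key -> merged values (Flink: keyed ValueState)
        self.timers = {}   # key -> expiry timestamp (Flink: registered timer)
        self.output = []   # emitted joined samples

    def process(self, key, value: dict, ts: int):
        if key not in self.state:
            self.state[key] = dict(value)            # first stream to arrive
            self.timers[key] = ts + self.window_ms   # register the timer
        else:
            self.state[key].update(value)            # merge later streams
        self._fire(ts)

    def _fire(self, now: int):
        # emit and clear every key whose timer has expired
        for key in [k for k, t in self.timers.items() if t <= now]:
            self.output.append((key, self.state.pop(key)))
            del self.timers[key]

joiner = SampleJoiner(window_ms=100)
joiner.process("k1", {"click": 1}, ts=0)     # stream A arrives first
joiner.process("k1", {"expose": 1}, ts=50)   # stream B, same key -> merged
joiner.process("k2", {"click": 0}, ts=200)   # by now k1's timer has expired
```

After the third call, `k1` has been emitted as one joined sample while `k2` is still waiting in state, which is exactly the behavior the two bullet points describe.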
We platformized the whole sample-joining process, dividing it into five modules: input, data cleaning, sample joining, sample formatting, and output. With platform-based development, users only need to care about the business logic. Users develop:
- The data cleaning logic for each input stream.
- The data formatting logic applied before the samples are output.
The rest can be configured on the UI:
- The time window for sample joining.
- How fields are aggregated within a window.
Resources are reviewed and configured by the platform. In addition, the platform provides basic monitoring, covering input data, sample metrics, job exceptions, and sample output.
Sample UI of the streaming machine learning project
The figure below shows the streaming machine learning project's UI. On the left is the job configuration for sample generation, and on the right is the sample library. The sample library mainly handles sample management and display, including sample descriptions, permissions, and sample sharing.
Application of streaming machine learning
Finally, the application effects of streaming machine learning. At present, we support real-time sample joining at millions of QPS, streaming model training with hundreds of models trained simultaneously, and hour-level or minute-level real-time model updates. The whole streaming learning pipeline has disaster recovery, and full-link automatic monitoring is supported. Recently we have been working on streaming deep learning to increase the expressive power of real-time models, and on reinforcement learning to explore new application scenarios.
2. Multimodal content understanding
Multimodality refers to the capability, or the technology, of processing and understanding multimodal information with machine learning methods. At Weibo this mainly covers images, video, audio, and text.
- Images include object recognition, tagging, OCR, face and celebrity recognition, attractiveness scoring, and smart cropping.
- Video includes copyright detection and logo recognition.
- Audio includes speech-to-text and audio tagging.
- Text mainly includes word segmentation, timeliness, and classification tags.
For example, when we first did video classification, we only used the frames extracted from the video, that is, the images. In a second round of optimization, we added the audio and the text of the post the video belongs to, fusing the audio, image, and text modalities to produce more accurate classification tags for the video.
The following figure shows the platform architecture of multimodal content understanding. The middle part is Flink real-time computing, which receives the image, video, and post streams in real time and calls the underlying deep learning model services through model plugins. The services return the content features, which we store in feature engineering and provide to all business teams through the data center. Throughout a job's operation, the whole link is monitored with alerts so that exceptions get an immediate response, and the platform automatically provides log collection, metric statistics, and case tracing. In the middle, ZooKeeper is used for service discovery to keep the service state synchronized between the real-time computing jobs and the deep learning model services; on top of state synchronization, there are also load balancing strategies. At the bottom, a data reconciliation system further improves the success rate of data processing.
The UI of multimodal content understanding mainly includes job information, input source information, model information, output information, and resource configuration. Configuration-based development improves development efficiency. Monitoring metrics for model calls, including the success rate and latency of each call, are generated automatically: when a job is submitted, a companion job for metric statistics is generated automatically as well.
3. Content deduplication service
In recommendation scenarios, repeatedly pushing the same content to users hurts the user experience. With this in mind, we combined the Flink real-time stream computing platform, a distributed vector retrieval system, and deep learning model services to build a content deduplication service platform with low latency, high stability, and high recall. It currently supports multiple business teams with 99.9% stability.
The following figure shows the architecture of the content deduplication service. At the bottom is multimedia model training, which is done offline. We collect sample data, process it, and save the samples in the sample library. When model training is needed, samples are pulled from the sample library, the model is trained, and the training results are saved to the model library.
The main models used here are vector generation models, covering image vectors, text vectors, and video vectors.
After a trained model is verified, it is saved to the model library, which stores the model's basic information, including its runtime environment and version. To deploy a model online, the model is pulled from the model library along with the knowledge of the environment it runs in.
After the model is deployed, Flink reads materials from the material library in real time and calls the multimedia prediction service to generate the vectors for these materials. The vectors are saved to Weiss, a vector recall and retrieval system developed in-house at Weibo. Once a material's vector is stored in Weiss, vector recall is run for the material to recall a batch of similar materials. A strategy is then applied to the recall results to pick the most similar one, and the current material is aggregated with it under a single content ID. Finally, when a business uses the service, it works with the content ID corresponding to the material.
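The last step, picking the most similar recalled material and reusing its content ID, can be sketched as follows. The similarity measure (cosine), the threshold, and the function names are assumptions for illustration; Weiss's actual recall strategy is not described in the talk.

```python
import math

def cosine(a, b):
    """Cosine similarity between two vectors of equal length."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0

def assign_content_id(vec, recalled, threshold=0.9):
    """recalled: list of (content_id, vector) pairs returned by vector recall.
    Returns an existing content ID if a near-duplicate is found, else None."""
    best_id, best_sim = None, 0.0
    for cid, cvec in recalled:
        sim = cosine(vec, cvec)
        if sim > best_sim:
            best_id, best_sim = cid, sim
    if best_sim >= threshold:
        return best_id   # duplicate: aggregate under the existing content ID
    return None          # new content: the caller allocates a fresh ID

cid = assign_content_id([1.0, 0.0], [("c1", [0.99, 0.1]), ("c2", [0.0, 1.0])])
```

Grouping near-duplicates under one content ID is what lets downstream businesses deduplicate by a simple ID comparison instead of re-running vector retrieval themselves.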
There are three main business scenarios for content deduplication:
- First, video copyright support (pirated video recognition), with 99.99% stability and a 99.99% piracy identification rate.
- Second, site-wide Weibo video deduplication for recommendation scenarios, with 99.99% stability and second-level processing latency.
- Third, recommendation-feed material deduplication, with 99% stability, second-level processing latency, and 90% accuracy.
By combining the Flink real-time stream computing framework with business scenarios, we have done a great deal of work on the platform and service side and made many optimizations for development efficiency and stability. Development efficiency is improved through modular design and platform-based development. At present, the real-time data computing platform has its own full-link monitoring, data metric statistics, and debug case tracing (log review) systems. In addition, Flink SQL has found applications in batch-stream unification. These are some of the new changes Flink has brought us, and we will continue to explore more of Flink's application space at Weibo.