Introduction: This article introduces the application of ULTRON, Bilibili's machine learning workflow platform, in multiple machine learning scenarios at Bilibili.
Speaker: Zhang Yang, Senior Development Engineer at Bilibili
Introduction: The full machine learning pipeline, from data reporting to feature computation, model training, online deployment, and final effect evaluation, is very long. At Bilibili, multiple teams each built their own machine learning pipelines to meet their needs, which made it hard to guarantee engineering efficiency and data quality. Therefore, based on the Flink community's AIFlow project, we built a complete, standardized machine learning workflow platform, which has accelerated the construction of machine learning pipelines and improved data timeliness and accuracy across multiple scenarios. This talk introduces the application of ULTRON, Bilibili's machine learning workflow platform, in multiple machine learning scenarios at Bilibili.
1. Real-time machine learning
2. How Flink is used in machine learning at Bilibili
3. Building the machine learning workflow platform
4. Future plans
Welcome to like Flink and give it a star~
1. Real-time machine learning
First, let's talk about real-time machine learning, which has three main aspects:
- The first is real-time samples. In traditional machine learning, samples are all T+1: today's model uses yesterday's training data, and every morning the model is retrained with the previous day's full-day data;
- The second is real-time features. Features used to be mostly T+1, which leads to somewhat inaccurate recommendations. For example, I watched a lot of new videos today, but what I was recommended was still content like what I watched yesterday or even earlier;
- The third is real-time model training. Once we have real-time samples and real-time features, model training can also move to real-time online training, which brings more timely recommendation results.
Traditional offline pipeline
The figure above shows a traditional offline pipeline. First, the app or the server generates logs, and the data lands on HDFS through the data pipeline. Then, at T+1, feature generation and model training are run every day. The generated features are written into the feature store, which may be Redis or some other KV storage, and are then served to the online inference service.
Shortcomings of the traditional offline pipeline
What is wrong with it?
- First, the timeliness of T+1 features and models is very low; it is hard to update them with high freshness;
- Second, the whole model training and feature production process consumes a full day of data every day, so training and feature production take a very long time and demand a great deal of cluster computing power.
Real-time pipeline
In the figure above, the red crosses mark what was removed after optimizing the pipeline to real time. Once data is reported, it goes directly into real-time Kafka through the pipeline, and then real-time features and real-time samples are generated. The feature results are written to the feature store, and sample generation also needs to read some features back from the feature store.
After the samples are generated, we train on them directly in real time. The long path on the right has been removed, but we kept offline feature production, because some special features still need offline computation, for example those that are particularly complex, hard to compute in real time, or that have no real-time requirement.
2. How Flink is used in machine learning at Bilibili
Let's look at how we do real-time samples, real-time features, and real-time effect evaluation.
- The first is real-time samples. Flink currently hosts the sample data production for all of Bilibili's recommendation businesses;
- The second is real-time features. Quite a few features are now computed in real time with Flink, with high timeliness. Many features are produced by combining offline and real-time computation: historical data is computed offline, fresh data is computed in real time with Flink, and the feature is read by splicing the two together. However, these two sets of computing logic sometimes cannot be reused, so we are also trying to unify stream and batch with Flink: all features are defined once in Flink, and depending on business needs they are computed in real time or offline, with Flink as the underlying compute engine for both;
- The third is real-time effect evaluation. We use Flink plus an OLAP engine to connect the whole real-time computation and real-time analysis pipeline for final model effect evaluation.
Real-time sample generation
The figure above shows how real-time samples are currently generated for the whole recommendation pipeline. After log data lands in Kafka, we first run a Flink label-join job that splices clicks and impressions together. The result lands in Kafka again and is followed by another Flink job that joins in features. The feature join splices in multiple features, some from the public domain and some private to the business team, and the features come from many sources, both offline and real-time. Once all the features are assembled, an instance, i.e. a sample record, is generated and sent to Kafka for downstream model training.
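As an illustration of the label-join step above, here is a minimal Python sketch, not the actual Flink operator: impressions are buffered keyed by instance id, a click arriving within the window turns the impression into a positive sample, and expired impressions are emitted as negatives. The window length and field names are assumptions.

```python
class LabelJoiner:
    """Toy model of an impression/click label join (illustrative only):
    impressions wait for a matching click within a time window; matched
    pairs become positive samples, expired impressions become negatives."""

    def __init__(self, window_seconds=600):
        self.window = window_seconds
        self.pending = {}  # instance_id -> impression timestamp

    def on_impression(self, instance_id, ts):
        self.pending[instance_id] = ts

    def on_click(self, instance_id, ts):
        shown_ts = self.pending.pop(instance_id, None)
        if shown_ts is not None and ts - shown_ts <= self.window:
            return {"instance_id": instance_id, "label": 1}
        return None  # a click without a matching impression is dropped

    def flush_expired(self, now):
        """Emit negative samples for impressions whose window has passed."""
        out = []
        for iid, ts in list(self.pending.items()):
            if now - ts > self.window:
                del self.pending[iid]
                out.append({"instance_id": iid, "label": 0})
        return out
```

A production label join additionally has to handle late and duplicate events; Flink's timers and state give exactly this buffering-plus-timeout pattern at scale.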
Real-time feature generation
The figure above shows how real-time features are generated. This is a complex feature whose computation involves five jobs: the first is an offline job, followed by four Flink jobs. The feature produced after this series of complex computations lands in Kafka and is written into the feature store, where it is used for online inference or real-time training.
Real-time effect evaluation
The figure above shows real-time effect evaluation. A core metric of a recommendation algorithm is CTR, the click-through rate. After the label join, CTR can be computed. Besides feeding the next sample-generation step, a copy of the data is routed to ClickHouse; with a reporting system on top, the effect can be seen in near real time. The data itself carries experiment tags, so experiments can be distinguished by tag in ClickHouse to see each experiment's effect.
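The per-experiment CTR computation described above can be sketched as a small aggregation. This mirrors the kind of query the reporting system would run against ClickHouse; the field names `exp_tag` and `label` are assumptions for illustration.

```python
from collections import defaultdict

def ctr_by_experiment(samples):
    """Compute click-through rate per experiment tag from labeled samples
    (label 1 = clicked, 0 = shown but not clicked). Illustrative sketch."""
    shows = defaultdict(int)
    clicks = defaultdict(int)
    for s in samples:
        shows[s["exp_tag"]] += 1
        clicks[s["exp_tag"]] += s["label"]
    return {tag: clicks[tag] / shows[tag] for tag in shows}
```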
3. Building the machine learning workflow platform
- The whole machine learning pipeline, covering sample generation, feature generation, training, prediction, and effect evaluation, requires configuring and developing many tasks for each part. Launching one model ultimately spans many tasks, and the pipeline is very long.
- It is hard for new algorithm engineers to grasp the full picture of this complex pipeline, so the learning cost is very high.
- A change anywhere in the pipeline can affect everything else, so it is very easy to break.
- The computing layer uses multiple engines, with batch and streaming mixed. It is hard to keep the semantics of the two consistent, and hard to develop two sets of the same logic without any gap between them.
- The bar for doing things in real time is also high, requiring strong real-time and offline engineering skills. Many small business teams find it hard to manage without platform support.
The figure above shows the general process of taking a model from data preparation to training, involving seven or eight nodes. Can we complete all of these operations on one platform? And why Flink? Because our team's real-time computing platform is based on Flink, and we also see Flink's potential in stream-batch unification and its future path in real-time model training and deployment.
AIFlow is an open-source machine learning workflow platform from Alibaba's Flink ecosystem team, focused on standardizing the process and the whole machine learning pipeline. After getting in touch with them around August and September last year, we adopted the system, built and improved it together with them, and gradually began landing it at Bilibili. It abstracts the whole machine learning process into example, transform, train, validation, and inference stages. In its architecture, the core scheduling capability is support for mixed stream-batch dependencies, and the metadata layer supports model management, which makes iterative model updates very convenient. On this basis we built our machine learning workflow platform.
Next, let's talk about the platform's features:
- The first is defining workflows in Python. In the AI field, Python is still the most common choice, and we also looked at outside practice; Netflix, for example, also defines its machine learning workflows in Python.
- The second is support for mixed dependencies between stream and batch tasks. A complete pipeline can include both real-time and offline processes, and stream and batch tasks can depend on each other through signals.
- The third is support for one-click cloning of a whole experiment. From the raw logs to the final experiment, we want to clone the entire pipeline with one click and quickly spin up a new experimental pipeline.
- The fourth is some performance optimizations, including support for resource sharing.
- The fifth is support for feature backfill with unified stream-batch computation. Cold-starting many features requires computing over a long stretch of historical data. Writing a separate set of offline feature computation logic just for cold start is very expensive, and it is hard to align its results with the real-time feature computation. So we support backfilling offline features directly on the real-time pipeline.
The figure above shows the basic architecture, with business at the top and engines at the bottom. Quite a few engines are supported: Flink, Spark, Hive, Kafka, HBase, and Redis, covering both compute and storage. The whole workflow platform is designed with AIFlow as the middle workflow-management layer and Flink as the core compute engine.
The whole workflow is described in Python. Users only need to define compute nodes and resource nodes, as well as the dependencies between them. The syntax is somewhat like that of the scheduling framework Airflow.
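To make this concrete, here is a hypothetical sketch of a Python workflow definition in the spirit of AIFlow and Airflow. The class and method names (`Workflow`, `node`, `depends_on`) are illustrative, not the platform's real API; the point is only the shape: nodes plus explicit dependencies, all declared in Python.

```python
class Node:
    def __init__(self, name, kind):
        self.name = name
        self.kind = kind          # "stream" or "batch"
        self.upstream = []

    def depends_on(self, *others):
        self.upstream.extend(others)
        return self

class Workflow:
    def __init__(self, name):
        self.name = name
        self.nodes = []

    def node(self, name, kind):
        n = Node(name, kind)
        self.nodes.append(n)
        return n

# A simplified recommendation pipeline: a batch backfill feeds the
# streaming feature job, samples join against features, and training
# consumes samples.
wf = Workflow("rcmd_experiment")
backfill = wf.node("feature_backfill", kind="batch")
features = wf.node("feature_gen", kind="stream").depends_on(backfill)
samples = wf.node("sample_gen", kind="stream").depends_on(features)
train = wf.node("train", kind="stream").depends_on(samples)
```

A real AIFlow definition carries much more (job configs, execution contexts, model metadata), but the declaration style is similar.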
There are four kinds of stream-batch dependencies: stream to batch, stream to stream, batch to stream, and batch to batch. These basically cover all our current business needs.
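One way the signal mechanism behind these mixed dependencies could work is sketched below. This is an illustration of the dependency semantics, not ULTRON's actual scheduler, and the signal names are made up.

```python
# Illustrative signal-based scheduler: a downstream task (batch or stream)
# starts when its upstream emits the signal it is waiting on.

class SignalScheduler:
    def __init__(self):
        self.waiting = {}    # signal name -> tasks waiting on it
        self.started = []

    def register(self, task, on_signal):
        self.waiting.setdefault(on_signal, []).append(task)

    def emit(self, signal):
        for task in self.waiting.pop(signal, []):
            self.started.append(task)

sched = SignalScheduler()
# batch -> stream: start the feature stream once the batch backfill is done
sched.register("feature_stream", on_signal="backfill_done")
# stream -> batch: run a daily batch check once the stream's watermark
# passes the day boundary
sched.register("daily_check", on_signal="watermark_day_passed")
sched.emit("backfill_done")
```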
Resource sharing is mainly for performance, because a machine learning pipeline is often very long. For example, I may change only five or six nodes in the diagram just shown. When I want to restart the whole experiment and clone the whole graph, only some of the nodes in the middle actually change, and the upstream nodes can share their data.
In the technical implementation, after cloning, state tracking is performed on the shared nodes.
Real-time training
The figure above shows the real-time training process. Feature crossing is a very common problem; it occurs when the progress of multiple computing tasks is inconsistent. On the workflow platform, we can define dependencies between nodes; once nodes depend on each other, their processing progress is synchronized, so that, roughly speaking, the fast waits for the slow and feature crossing is avoided. In Flink, we use watermarks to define processing progress.
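The "fast waits for slow" synchronization can be sketched as follows: the sample task buffers records until the feature task's watermark covers their event time, so samples never join against features computed from the future. This is a simplified model of the mechanism, not Flink's actual watermark API.

```python
class ProgressSync:
    """Hold back the faster task until the slower one catches up, using a
    watermark as the shared notion of progress (illustrative sketch)."""

    def __init__(self):
        self.feature_watermark = 0
        self.buffer = []      # (event_time, sample) waiting for features
        self.emitted = []

    def on_feature_watermark(self, wm):
        self.feature_watermark = wm
        self._drain()

    def on_sample(self, event_time, sample):
        self.buffer.append((event_time, sample))
        self.buffer.sort()    # keep samples ordered by event time
        self._drain()

    def _drain(self):
        # Release only samples whose event time the feature task has passed.
        while self.buffer and self.buffer[0][0] <= self.feature_watermark:
            self.emitted.append(self.buffer.pop(0)[1])
```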
The figure above shows the feature backfill process. We use the real-time pipeline directly to backfill historical data. Offline and real-time data differ in many ways, so there are many problems to solve; Spark is also used here for now, and we will switch that part to Flink later.
Problems with feature backfill
Feature backfill has several major problems:
- The first is how to guarantee data order. Real-time data implicitly arrives in order: it is processed right after it is produced, so it naturally carries a rough ordering. Offline data on HDFS does not. HDFS has partitions, and the data within a partition is completely unordered, yet a large number of computations in real business depend on ordering. Solving the disorder of offline data is a big problem.
- The second is how to guarantee consistency between feature and sample versions. For example, there are two pipelines, feature production and sample production, and sample production depends on feature production. How do we keep their versions consistent, without crossing?
- The third is how to keep the computation logic consistent between the real-time pipeline and the backfill pipeline. This one we actually do not need to worry about, because we backfill offline data directly on the real-time pipeline.
- The fourth is performance: how to compute a large amount of historical data quickly.
Here are the solutions to the first and second problems:
- For the first problem, to get ordered data we "Kafka-ify" the offline HDFS data. Instead of actually pouring it into Kafka, we simulate Kafka's data layout: partitioned, and ordered within each partition. We process the HDFS data into a similar layout, with logical partitions that are ordered internally, and we extended Flink's HDFS source to read this simulated layout. The simulation computation is currently done with Spark, and we will switch it to Flink later.
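The logical-partition layout described in this first solution can be sketched as follows, assuming each record carries a `key` and an event time `ts` (both field names are illustrative): hash-partition by key so that each key always lands in the same logical partition, then sort within each partition by time, mimicking Kafka's partitioned, in-partition-ordered layout.

```python
import zlib

def to_logical_partitions(records, num_partitions=4):
    """Sketch of 'Kafka-ifying' unordered HDFS data: hash-partition records
    by key, then sort each logical partition by event time."""
    parts = [[] for _ in range(num_partitions)]
    for rec in records:
        # crc32 gives a stable hash, so a key always maps to one partition
        idx = zlib.crc32(rec["key"].encode()) % num_partitions
        parts[idx].append(rec)
    for p in parts:
        p.sort(key=lambda r: r["ts"])   # ordered within the partition
    return parts
```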
The second problem is solved in two parts:
- For the real-time feature part, the solution relies on HBase storage, since HBase supports queries by version. After a feature is computed, it is written into HBase with its version; when a sample is generated, the corresponding version is looked up in HBase, the version usually being the data time.
- For the offline feature part, there is no need to recompute, since everything is already in offline HDFS storage; but HDFS does not support point queries, so we turn this part into a KV store. For performance, we also do asynchronous preloading.
The asynchronous preloading process is shown in the figure.
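The version-consistent lookup described for the real-time feature part can be sketched with a toy versioned store: each key holds multiple versions keyed by data time, and sample generation fetches the newest version at or before the sample's event time. HBase's real versioned reads work along these lines, but the API here is invented for illustration.

```python
class VersionedFeatureStore:
    """Toy model of HBase-style versioned feature reads (illustrative)."""

    def __init__(self):
        self.data = {}   # key -> list of (version_ts, value), sorted by ts

    def put(self, key, version_ts, value):
        versions = self.data.setdefault(key, [])
        versions.append((version_ts, value))
        versions.sort(key=lambda kv: kv[0])

    def get(self, key, at_ts):
        """Return the newest version whose data time is <= at_ts, so a
        sample never reads a feature computed from future data."""
        best = None
        for ts, value in self.data.get(key, []):
            if ts <= at_ts:
                best = value
            else:
                break
        return best
```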
4. Future plans
Finally, let me introduce our future plans.
- One is data quality assurance. The whole pipeline is getting longer and longer, possibly 10 or 20 nodes, so how do we quickly locate the problem when something in the pipeline goes wrong? Here we want to do DQC on the node set: for each node, custom data quality validation rules can be defined, and the data is bypassed to a unified DQC center that runs the rules and raises alerts.
- The second is exactly-once semantics across the whole pipeline. How to guarantee accuracy and consistency between workflow nodes is not yet worked out.
- Third, we will add model training and deployment nodes to the workflow. Training and deployment can connect to other platforms, or use the model training and model serving supported by Flink itself.
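The per-node DQC idea in the first plan above could look roughly like the sketch below; the rule names, fields, and thresholds are invented for illustration.

```python
def null_rate_below(field, threshold):
    """Build a DQC rule: the null rate of `field` must not exceed
    `threshold`. Returns True when the rows pass the check."""
    def rule(rows):
        if not rows:
            return True
        nulls = sum(1 for r in rows if r.get(field) is None)
        return (nulls / len(rows)) <= threshold
    return rule

def run_dqc(node_name, rows, rules):
    """Run every registered rule against bypassed data from one node and
    collect alert messages for the rules that fail."""
    alerts = []
    for name, rule in rules.items():
        if not rule(rows):
            alerts.append(f"{node_name}: rule '{name}' failed")
    return alerts
```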
Speaker bio: Zhang Yang joined Bilibili in 2017 and works on big data.
Copyright notice: This article was contributed by a real-name-registered user of the Alibaba Cloud developer community, and the copyright belongs to the original author. The Alibaba Cloud developer community does not own the copyright and does not bear the corresponding legal liability; see the community's user service agreement and intellectual property protection guidelines for the specific rules.