Author: Chen WuChao (Zhong Zhuo), technical expert at Alibaba
Deep learning plays an increasingly important role in modern society. It is now widely used in fields such as personalized recommendation, product search, face recognition, machine translation, and autonomous driving.
As applications of deep learning diversify, many excellent computing frameworks have emerged; TensorFlow, PyTorch, and MXNet are among the most widely used. When deep learning is applied to real business problems, it usually has to be combined with a data processing framework: training data must be processed into training samples before model training, and metrics on the processed data must be monitored during model prediction. Data processing and model training then run on different computing engines, which makes the system harder to use.
This article shows how to use a single engine to cover the whole machine learning process. Let’s first look at a typical machine learning workflow. As shown in the figure, the whole process includes feature engineering, model training, and offline or online prediction.
Every stage of this process, whether feature engineering, model training, or model prediction, generates logs. A data processing engine such as Flink analyzes these logs and feeds the results into feature engineering. TensorFlow, a deep learning computing engine, is then used for model training and model prediction. Once the model is trained, online scoring is done with TensorFlow Serving.
Although this process works, it has some problems:
- A single machine learning project needs two computing engines, Flink and TensorFlow, for feature engineering, model training, and model prediction, so deployment is more complex.
- TensorFlow's distributed support is not friendly enough: the IP address and port number of every machine must be specified when the job runs. In production, however, jobs often run on a scheduling system such as YARN, which allocates IP addresses and port numbers dynamically.
- TensorFlow lacks an automatic failover mechanism.
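To make the second problem concrete, the sketch below builds the kind of static cluster description that distributed TensorFlow reads from the `TF_CONFIG` environment variable. The helper name `build_tf_config`, the host names, and the ports are made-up placeholders for illustration. The point is only that every address has to be known before any process starts, which is awkward when a scheduling system assigns machines at run time.

```python
import json


def build_tf_config(workers, ps, task_type, task_index):
    """Return the JSON string a TensorFlow process expects in TF_CONFIG."""
    cluster = {"worker": list(workers), "ps": list(ps)}
    if not 0 <= task_index < len(cluster[task_type]):
        raise ValueError("task_index out of range for role " + task_type)
    return json.dumps({
        "cluster": cluster,
        "task": {"type": task_type, "index": task_index},
    })


# Placeholder addresses (assumptions): a scheduler would only decide these at run time.
WORKERS = ["host-a:2222", "host-b:2222", "host-c:2222"]
PS = ["host-d:2222", "host-e:2222"]

# Each process needs the full address list plus its own position in it.
config = build_tf_config(WORKERS, PS, "worker", 1)
print(config)
```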
To solve these problems, we combine Flink and TensorFlow and run TensorFlow programs on a Flink cluster. The overall process is as follows:
Feature engineering is implemented in Flink. For model training and near-real-time model prediction, the TensorFlow computing engine runs on the Flink cluster. In this way a single computing engine, Flink, supports both model training and model prediction, which is simpler to deploy and saves resources.
Introduction to Flink computing
Flink is an open source distributed computing engine for big data. In Flink, all computation is abstracted into operators. As shown in the figure above, a node that reads data is called a source operator, and a node that outputs data is called a sink operator. Between source and sink, a variety of Flink operators process the data. The computing topology in the figure above contains three sources and two sinks.
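The operator idea can be illustrated with a toy pipeline in plain Python. This is not the Flink API; the function names `source`, `map_operator`, and `sink` are chosen here only to mirror the terms used above, and the sample records are invented.

```python
from typing import Callable, Iterable, Iterator, List


def source(records: Iterable[str]) -> Iterator[str]:
    """Source operator: emits the records it reads from its input."""
    yield from records


def map_operator(fn: Callable[[str], str], upstream: Iterator[str]) -> Iterator[str]:
    """Intermediate operator: transforms each record coming from upstream."""
    for record in upstream:
        yield fn(record)


def sink(upstream: Iterator[str], out: List[str]) -> None:
    """Sink operator: writes every record it receives to the output."""
    out.extend(upstream)


# One source, one processing operator, one sink.
results: List[str] = []
sink(map_operator(str.upper, source(["click", "view"])), results)
print(results)
```

A real topology, like the one in the figure, can have several sources and sinks connected through many such operators.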
Machine learning distributed topology
The distributed running topology of machine learning is shown in the following figure:
In a machine learning cluster, nodes are usually organized into groups. As shown in the figure above, a group of nodes can be workers (which run the algorithm) or PS (parameter servers, which update the parameters).
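As a rough illustration of this division of labor, the toy below has workers compute gradients on their own data shard while a single parameter server applies the updates. It runs in one process with plain Python; the class and function names, the linear model, the data, and the learning rate are all assumptions made for the example, not part of any framework.

```python
class ParameterServer:
    """Holds the model parameters and applies gradient updates."""

    def __init__(self, weights, learning_rate):
        self.weights = list(weights)
        self.learning_rate = learning_rate

    def pull(self):
        """Return a copy of the current parameters for a worker."""
        return list(self.weights)

    def push(self, gradients):
        """Apply one gradient-descent step with the gradients from a worker."""
        self.weights = [w - self.learning_rate * g
                        for w, g in zip(self.weights, gradients)]


def worker_gradients(weights, shard):
    """Worker step: gradient of the mean squared error of y = w * x + b on one shard."""
    w, b = weights
    n = len(shard)
    grad_w = sum(2 * (w * x + b - y) * x for x, y in shard) / n
    grad_b = sum(2 * (w * x + b - y) for x, y in shard) / n
    return [grad_w, grad_b]


# Samples of y = 2x + 1, split into one shard per worker.
SHARDS = [[(0.0, 1.0), (1.0, 3.0)], [(2.0, 5.0), (3.0, 7.0)]]

ps = ParameterServer([0.0, 0.0], learning_rate=0.05)
for _ in range(1000):
    for shard in SHARDS:  # each worker takes a turn
        ps.push(worker_gradients(ps.pull(), shard))
print([round(v, 3) for v in ps.weights])
```

In a real cluster the workers and PS are separate processes on different machines, and the pull and push calls go over the network.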
How can Flink’s operator structure be combined with the node and application manager roles of machine learning? The following sections explain the abstractions of Flink AI Extended in detail.
Flink AI Extended abstraction
First, the machine learning cluster is abstracted into a module named ML framework. A second module, ML operator, is also provided. Together, these two modules connect Flink to the machine learning cluster and can support different computing engines, including TensorFlow.
As shown in the figure below:
In the Flink runtime environment, the ML framework and ML operator modules are abstracted to connect Flink with other computing engines.
The ML framework is divided into two roles.
- The application manager role (hereinafter AM) is responsible for managing the life cycle of all nodes.
- The node role is responsible for running the machine learning algorithm program.
The application manager and the node can be abstracted further. In the application manager, the state machine is made extensible so that different types of jobs can be supported.
Each deep learning engine can define its own state machine. For the node, a runner interface is abstracted so that users can customize the algorithm program for different deep learning engines.
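One way to picture these two extension points is sketched below: a table-driven state machine whose transitions are supplied per job type, and an abstract runner that each engine implements. All names here (`StateMachine`, `Runner`, `EchoRunner`, the states, and the events) are hypothetical and chosen for illustration; they are not taken from the Flink AI Extended code base.

```python
from abc import ABC, abstractmethod


class Runner(ABC):
    """Hypothetical runner interface: one implementation per deep learning engine."""

    @abstractmethod
    def run(self, role: str, index: int) -> str:
        """Run the algorithm program for one node and return a status message."""


class EchoRunner(Runner):
    """Trivial runner that only reports which node it was started for."""

    def run(self, role: str, index: int) -> str:
        return f"{role}:{index} finished"


class StateMachine:
    """Table-driven state machine; each job type supplies its own transitions."""

    def __init__(self, initial, transitions):
        self.state = initial
        self.transitions = dict(transitions)

    def fire(self, event):
        """Move to the next state, rejecting events not allowed in the current state."""
        key = (self.state, event)
        if key not in self.transitions:
            raise ValueError(f"event {event!r} not allowed in state {self.state!r}")
        self.state = self.transitions[key]
        return self.state


# Assumed transition table for a batch training job.
BATCH_JOB = {
    ("INIT", "all_nodes_registered"): "RUNNING",
    ("RUNNING", "all_nodes_finished"): "FINISHED",
    ("RUNNING", "node_failed"): "FAILED",
}

am = StateMachine("INIT", BATCH_JOB)
am.fire("all_nodes_registered")
print(EchoRunner().run("worker", 0), am.fire("all_nodes_finished"))
```

Supporting another job type then means passing a different transition table, and supporting another engine means writing another `Runner` subclass.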
The ML operator module provides two interfaces:
- addAMRole: adds an application manager role to a Flink job. As shown in the figure above, the application manager is the management node of the machine learning cluster.
- addRole: adds a group of machine learning nodes.
Using the interfaces provided by the ML operator, we can implement the application manager role and three groups of nodes as Flink operators. The three groups are called role A, role B, and role C, and together they form a machine learning cluster, as shown in the code above. Flink operators correspond one-to-one with the nodes of the machine learning job.
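The one-to-one mapping between operator instances and machine learning nodes can be modeled with a small toy. `ToyJob` and its methods are hypothetical stand-ins written for this illustration, not the library's API, and the node counts per role are invented.

```python
from dataclasses import dataclass, field
from typing import List, Tuple


@dataclass
class ToyJob:
    """Toy stand-in for a Flink job; not the Flink AI Extended API."""

    operators: List[Tuple[str, int]] = field(default_factory=list)

    def add_am_role(self) -> "ToyJob":
        # The application manager is a single management node.
        self.operators.append(("am", 1))
        return self

    def add_role(self, name: str, node_count: int) -> "ToyJob":
        # One operator per role; its parallelism equals the number of nodes.
        if node_count < 1:
            raise ValueError("a role needs at least one node")
        self.operators.append((name, node_count))
        return self

    def nodes(self) -> List[str]:
        # One machine learning node per parallel operator instance.
        return [f"{name}-{i}"
                for name, count in self.operators
                for i in range(count)]


job = (ToyJob()
       .add_am_role()
       .add_role("role_a", 2)
       .add_role("role_b", 1)
       .add_role("role_c", 3))
print(job.nodes())
```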
A machine learning node runs inside a Flink operator, and the two need to exchange data. The principle is shown in the following figure:
The Flink operator is a Java process, while the machine learning node is generally a Python process. The Java and Python processes exchange data through shared memory.
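The article does not describe the layout of the shared memory region, so the following is only a Python-side sketch of the general idea, using the standard `multiprocessing.shared_memory` module (Python 3.8+). The length-prefixed record format and the helper names are assumptions; here both the writer and the reader run in one Python process, whereas the real bridge crosses the Java/Python boundary.

```python
import struct
from multiprocessing import shared_memory

HEADER = struct.Struct("<I")  # assumed format: 4-byte little-endian payload length


def write_record(shm, payload: bytes) -> None:
    """Producer side: store one length-prefixed record at the start of the block."""
    if HEADER.size + len(payload) > shm.size:
        raise ValueError("payload too large for the shared block")
    HEADER.pack_into(shm.buf, 0, len(payload))
    shm.buf[HEADER.size:HEADER.size + len(payload)] = payload


def read_record(name: str) -> bytes:
    """Consumer side: attach to the block by name and copy the record out."""
    shm = shared_memory.SharedMemory(name=name)
    try:
        (length,) = HEADER.unpack_from(shm.buf, 0)
        return bytes(shm.buf[HEADER.size:HEADER.size + length])
    finally:
        shm.close()


block = shared_memory.SharedMemory(create=True, size=1024)
try:
    write_record(block, b"user=42,item=7")
    received = read_record(block.name)
finally:
    block.close()
    block.unlink()  # release the block once both sides are done
print(received)
```

Because both sides map the same memory, the payload is not sent through a socket or pipe; the consumer only needs the block's name to attach to it.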
TensorFlow On Flink
TensorFlow distributed operation
Distributed TensorFlow training is generally divided into worker and PS roles. Workers are responsible for the machine learning computation, and PS are responsible for parameter updates. Here’s how TensorFlow runs in a Flink cluster.
TensorFlow batch training operation mode
In batch mode, the sample data can be stored on HDFS. In the Flink job, the TensorFlow worker role is started as a source operator. As shown in the figure above, if the worker role has three nodes, the parallelism of the worker source is set to 3. Similarly, there are two PS nodes, so the parallelism of the PS source is set to 2. The application manager does not exchange data with the other roles, so it is a separate node and the parallelism of its source is always 1. In this way, three workers and two PS nodes are started in the Flink job. The workers and PS communicate through TensorFlow's own gRPC channels, not through Flink's communication mechanism.
TensorFlow stream training operation mode
As shown in the figure above, two source operators feed a join operator, which merges the two data streams into one; a user-defined processing node then generates the sample data. In stream mode, the worker role is implemented through a UDTF or a flatMap operator.
There are three TensorFlow worker nodes, so the parallelism of the corresponding flatMap or UDTF operator is also 3. Since the PS role does not read data, it is implemented through a Flink source operator.
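The stream topology can be imitated in plain Python as a join followed by a flatMap-style worker. This is a toy under stated assumptions: the record layout, the batch size, and the message the worker emits are invented, and `WorkerFlatMap` only stands in for a worker wrapped in a Flink operator. It does show one property of flatMap that suits a training worker: an input record may produce zero outputs or several.

```python
def join_by_key(left, right):
    """Inner-join two lists of (key, value) records into (key, left_value, right_value)."""
    right_index = dict(right)
    return [(key, value, right_index[key])
            for key, value in left if key in right_index]


class WorkerFlatMap:
    """Stand-in for a worker wrapped in a flatMap: zero or more outputs per input."""

    def __init__(self, batch_size):
        self.batch_size = batch_size
        self.buffer = []

    def flat_map(self, sample):
        # Buffer samples and emit one result only when a mini-batch is complete.
        self.buffer.append(sample)
        if len(self.buffer) < self.batch_size:
            return []
        batch, self.buffer = self.buffer, []
        return [f"trained on {len(batch)} samples"]


# Two upstream sources: features and labels keyed by user id (made-up data).
features = [("u1", 0.3), ("u2", 0.9), ("u3", 0.1)]
labels = [("u1", 1), ("u2", 0)]

worker = WorkerFlatMap(batch_size=2)
outputs = []
for sample in join_by_key(features, labels):
    outputs.extend(worker.flat_map(sample))
print(outputs)
```

With three worker nodes, Flink would run three parallel instances of such an operator, each wrapping one worker.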
Next, let’s look at how to support real-time prediction once the model has been trained.
Prediction using Python
The process of using Python for prediction is shown in the figure. If the TensorFlow model was trained in distributed mode and is too large to fit on a single machine, which is common in recommendation and search scenarios, then real-time prediction works on the same principle as real-time training. The only difference is an additional model-loading step.
For prediction, all parameters are loaded into the PS by reading the model. The upstream data is then processed in the same way as in training and flows into the worker role. The predicted score is written back to the Flink operator and sent to the downstream operator.
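A minimal sketch of this flow, assuming a linear model whose parameters are spread round-robin over the PS nodes, might look as follows. The function names, the sharding scheme, and the model itself are illustrative assumptions; a real distributed TensorFlow model places its variables according to its own cluster configuration.

```python
def load_into_ps(model, num_ps):
    """Model-loading step: distribute saved parameters over the PS shards."""
    shards = [dict() for _ in range(num_ps)]
    for i, name in enumerate(sorted(model)):
        shards[i % num_ps][name] = model[name]
    return shards


def lookup(shards, name):
    """Fetch one parameter from whichever PS shard holds it."""
    for shard in shards:
        if name in shard:
            return shard[name]
    raise KeyError(name)


def predict(shards, features):
    """Worker step: linear score, the sum of weight * feature value."""
    return sum(lookup(shards, name) * value for name, value in features.items())


# Made-up trained model and one incoming sample.
MODEL = {"age": 0.5, "clicks": 2.0, "bias": -1.0}
ps_shards = load_into_ps(MODEL, num_ps=2)
score = predict(ps_shards, {"age": 2.0, "clicks": 1.0, "bias": 1.0})
print(score)
```

The score is what the worker would hand back to the Flink operator for the downstream operators to consume.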
Prediction using Java
As shown in the figure, PS nodes are not needed when the model fits on a single machine. A single worker can hold the whole model for prediction, especially when the model is exported in TensorFlow's SavedModel format. Because the SavedModel format contains all the computation logic, inputs, and outputs needed for deep learning prediction, prediction can run without any Python code.
There is another way to predict. The source, join, and UDTF operators process the data into a format that the prediction model can recognize. The trained model can then be loaded into memory directly through the TensorFlow Java API inside the Java process. In this case the PS role is not needed, and the worker role is a Java process rather than a Python process, so we can predict directly in the Java process and send the results to the downstream Flink operators.
In this article, we explained the principles of Flink AI Extended and how Flink works with TensorFlow to train models and make predictions. I hope this article helps you use Flink AI Extended to support model training and model prediction through Flink jobs.