Online Prediction of Deep Learning Based on TensorFlow Serving


1. Preface

As deep learning continues to advance in image processing, natural language, advertising click-through rate (CTR) prediction, and other fields, many teams have begun exploring how to apply it at the business level. In advertising CTR prediction, new models keep emerging: Wide and Deep [1], Deep & Cross Network [2], DeepFM [3], and xDeepFM [4], many of which are covered in detail in Meituan's deep learning blog posts. However, taking an offline model online raises new problems: whether the offline model's performance can meet online requirements, how to integrate model prediction into the existing engineering system, and so on. Only with an accurate understanding of the deep learning framework can we deploy deep learning online in a way that is compatible with the existing engineering system and meets online performance requirements.

This article first introduces the business scenario and offline training process of the User Growth team on the Meituan platform. It then focuses on the whole process of deploying the Wide and Deep (WDL) model online with TensorFlow Serving and on how we optimized the performance of the online service. We hope it offers some useful ideas.

2. Business Scenario and Offline Process

2.1 Business Scenario

In the fine-ranking stage of advertising, at most a few hundred ads are recalled for each user. Based on the user's features and the features of each ad, the model predicts the user's click-through rate for each ad, and the ads are then ranked. Because the ad exchange imposes a timeout on DSPs, the average response time of our ranking module must be kept within 10 ms. In addition, the Meituan DSP has to take part in real-time bidding based on the estimated click-through rate, so model prediction must be fast.

2.2 Offline Training

For offline data, we use Spark to generate TFRecord files, the native data format of TensorFlow [5], to speed up data reading.

For the model, we use the classic Wide and Deep model. The features cover the user, scene, and item dimensions. The Wide part has more than 80 feature inputs and the Deep part has more than 60. After the Embedding layer, the input is about 600-dimensional, followed by three fully connected layers of width 256. The model has about 350,000 parameters in total, and the exported model file is about 11 MB.

For offline training, we use TensorFlow's distributed framework with synchronous updates plus Backup Workers [6]. This addresses both the delayed (stale) updates of asynchronous training and the slowness of purely synchronous updates, where every step waits for the slowest worker.
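
To make the effect of Backup Workers concrete, here is a minimal sketch in plain Python. It is not the training code used here: the per-worker times are invented, and `sync_step_time` is a hypothetical helper. It only illustrates that a synchronous step with one spare worker finishes once the required number of gradients has arrived, so a single straggler no longer sets the pace.

```python
def sync_step_time(worker_times, num_required):
    """Duration of one synchronous step: wait for the num_required fastest workers."""
    return sorted(worker_times)[num_required - 1]

# Hypothetical time (ms) each worker needs to produce its gradient in one step.
# The fifth worker is a straggler.
times_ms = [102, 98, 105, 100, 480, 101]

# Plain synchronous update: 5 workers, the step waits for all 5.
plain_sync = sync_step_time(times_ms[:5], num_required=5)

# Synchronous update with 1 backup worker: 6 workers, the step waits for the fastest 5.
with_backup = sync_step_time(times_ms, num_required=5)

print("plain sync step:", plain_sync, "ms")
print("with one backup worker:", with_backup, "ms")
```

The gradient from the slowest worker is simply dropped for that step, which is the trade-off this scheme accepts in exchange for a shorter step time.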

For parameter placement across the distributed parameter servers (PS), we use the Greedy Load Balancing method, which assigns each parameter according to its estimated size, instead of the Round Robin (modulo) assignment. This balances the load across the PS nodes.
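
The difference between the two placement methods can be sketched in a few lines of plain Python. The variable names and sizes below are made up for illustration, and the functions are simplified stand-ins rather than TensorFlow's implementation: Round Robin ignores size, while the greedy method places each variable on the currently least-loaded PS.

```python
def round_robin(var_sizes, num_ps):
    """Assign the i-th variable to PS (i mod num_ps), ignoring its size."""
    loads = [0] * num_ps
    for i, (_, size) in enumerate(var_sizes):
        loads[i % num_ps] += size
    return loads

def greedy(var_sizes, num_ps):
    """Assign each variable, in creation order, to the currently least-loaded PS."""
    loads = [0] * num_ps
    for _, size in var_sizes:
        target = loads.index(min(loads))
        loads[target] += size
    return loads

# Hypothetical variables as (name, number of parameters).
variables = [
    ("embedding_user", 120_000),
    ("wide_bias", 1),
    ("embedding_ad", 90_000),
    ("dense_1_bias", 256),
    ("dense_1_kernel", 153_600),
    ("dense_2_bias", 256),
]

rr_loads = round_robin(variables, num_ps=2)
greedy_loads = greedy(variables, num_ps=2)
print("round robin:", rr_loads)
print("greedy:     ", greedy_loads)
```

With this particular ordering, Round Robin happens to send every large variable to the same PS, while the greedy placement spreads them out. The imbalance depends on the order and sizes of the variables, so the example shows the failure mode rather than a typical ratio.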

As for computing devices, we found that training is faster with CPUs alone than with GPUs. Although a GPU may speed up the computation itself, it adds the cost of transferring data between CPU and GPU. When the model's computation is not very heavy, the CPU performs better.

We also use the high-level Estimator API to encapsulate data reading, distributed training, model validation, and model export for TensorFlow Serving.
The main benefits of using Estimator are:

  1. It is easy to switch between single-machine and distributed training, and little code needs to change across devices (CPU, GPU, TPU).
  2. Estimator's structure is clear, which makes it easier for developers to communicate with each other.
  3. Beginners can directly use pre-built Estimator models: DNN model, XGBoost model, linear model, and so on.

3. TensorFlow Serving and Performance Optimization

3.1 Introduction to TensorFlow Serving

TensorFlow Serving is a high-performance open source library for serving machine learning models. It deploys trained models online and exposes them to external callers over gRPC. It supports hot model updates and automatic model version management, which makes it very flexible.

The figure below shows the overall architecture of TensorFlow Serving. The Client continuously sends requests to the Manager. The Manager manages model updates according to the version policy and returns the results computed by the latest model to the Client.

TensorFlow Serving architecture (image from the official TensorFlow Serving documentation)

Within the company, TensorFlow Serving is provided by the data platform team and runs distributed on the cluster through YARN. It periodically scans an HDFS path to check the model version and updates automatically. TensorFlow Serving can also be installed on any local machine for testing.
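
The scan-and-update behaviour can be sketched as follows. This is a simplified illustration in plain Python, not TensorFlow Serving's code: a temporary local directory stands in for the HDFS path, and `latest_version` and `VersionWatcher` are hypothetical names. It assumes the common layout in which each model version is exported to a sub-directory named with an integer, and the largest number is the version to serve.

```python
import os
import tempfile

def latest_version(base_path):
    """Return the largest numeric sub-directory name under base_path, or None."""
    versions = [int(name) for name in os.listdir(base_path)
                if name.isdigit() and os.path.isdir(os.path.join(base_path, name))]
    return max(versions) if versions else None

class VersionWatcher:
    """Remembers the served version and reports when a different latest one appears."""
    def __init__(self, base_path):
        self.base_path = base_path
        self.served = None

    def poll(self):
        newest = latest_version(self.base_path)
        if newest is not None and newest != self.served:
            self.served = newest  # a real server would load the new model here
            return True
        return False

# Demo: a temporary local directory stands in for the HDFS model path.
with tempfile.TemporaryDirectory() as base:
    watcher = VersionWatcher(base)
    os.mkdir(os.path.join(base, "1"))
    os.mkdir(os.path.join(base, "tmp_export"))  # ignored: not a version number
    first = watcher.poll()       # version 1 appears
    unchanged = watcher.poll()   # nothing new
    os.mkdir(os.path.join(base, "2"))
    second = watcher.poll()      # version 2 appears
    served_version = watcher.served

print(first, unchanged, second, served_version)
```

In a real deployment `poll` would run on a timer, which is the periodic scan described above.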

In our off-site advertising scenario, each time a user arrives, the online requester converts the information for the user and the 100 recalled ads into the model's input format and sends it to TensorFlow Serving as one batch. TensorFlow Serving then computes the estimated CTRs and returns them to the requester.

When we deployed the first version of TensorFlow Serving, at a QPS of about 500, packaging the request took about 5 ms, network overhead about 3 ms, and model inference alone 10 ms, so the TP50 of the whole process was about 18 ms. This did not meet the online requirements. The rest of this section describes the performance optimization in detail.

3.2 Performance Optimization

3.2.1 Request-Side Optimization

Optimization on the requester side mainly consists of processing the 100 ads in parallel. We use OpenMP multithreading to process the data in parallel, which reduces the request packaging time from 5 ms to about 2 ms.

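// Build the tensorflow::Example for each ad in parallel with OpenMP.
#pragma omp parallel for
for (int i = 0; i < request->ad_feat_size(); ++i) {
    tensorflow::Example example;
    // ... fill `example` with the features of ad i (elided in the source) ...
}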

3.2.2 Optimizing the OPS Used to Build the Model

Before optimization, the model's input was raw, unprocessed data. For example, the value of the channel feature might be a string such as 'channel 1' or 'channel 2', and the One Hot encoding was then done inside the model.

Initially, the model used many high-level tf.feature_column operations to process the data, converting it to One Hot and embedding formats. The advantage of tf.feature_column is that the raw input needs no preprocessing, because many common feature transformations can be done inside the model through the feature_column API. For example, tf.feature_column.bucketed_column buckets numeric features, and tf.feature_column.crossed_column crosses categorical features. The downside is that the cost of feature processing falls on the model.
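
To show what these two transformations compute, here is a small sketch in plain Python. It is not TensorFlow code. `bucketize`, `cross`, and `one_hot` are illustrative helpers, the boundaries and feature values are invented, and MD5 is only a stand-in for the fingerprint function TensorFlow uses internally, so the crossed ids will not match what TensorFlow produces.

```python
import bisect
import hashlib

def bucketize(value, boundaries):
    """Index of the bucket that value falls into; boundaries must be sorted.
    With boundaries [b0, b1] the buckets are (-inf, b0), [b0, b1), [b1, +inf)."""
    return bisect.bisect_right(boundaries, value)

def cross(values, hash_bucket_size):
    """Hash a combination of categorical values into a fixed number of buckets."""
    key = "_X_".join(values).encode("utf-8")
    return int(hashlib.md5(key).hexdigest(), 16) % hash_bucket_size

def one_hot(index, depth):
    """One Hot vector of length depth with a 1 at position index."""
    return [1 if i == index else 0 for i in range(depth)]

# Hypothetical age feature with 4 boundaries, which gives 5 buckets.
age_boundaries = [18, 25, 35, 50]
age_bucket = bucketize(27, age_boundaries)            # 27 falls in [25, 35)
age_vector = one_hot(age_bucket, len(age_boundaries) + 1)

# Hypothetical cross of two categorical features.
crossed_id = cross(["channel 1", "ios"], hash_bucket_size=1000)

print(age_bucket, age_vector, crossed_id)
```

Doing this work inside the graph for every feature of every example is what makes the feature_column approach convenient but costly at prediction time.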

To analyze the time cost of feature_column further, we used the tf.profiler tool to profile the whole offline training process. Using tf.profiler within the Estimator framework is very convenient: it only takes one added line of code.

with tf.contrib.tfprof.ProfileContext(job_dir + '/tmp/train_dir') as pctx:
    estimator = tf.estimator.Estimator(model_fn=get_model_fn(job_dir),
                                       # ... remaining arguments are elided in the source ...
                                       )

The figure below shows the time distribution of the network's forward propagation as recorded by tf.profiler. Feature processing with the feature_column API takes up a large share of the time.

Profiler record before optimization: forward propagation accounts for 55.78% of the total training time, most of it spent in feature_column OPS preprocessing the raw data.

To solve the problem of time-consuming feature processing inside the model, we map all raw string values to One Hot indices in advance when processing the offline data, and write the mapping to a local feature_index file for both online and offline use. This removes the One Hot computation from the model and replaces it with an O(1) dictionary lookup. In addition, when building the model, we use lower-level APIs with more predictable performance in place of high-level APIs such as feature_column. The figure below shows the share of forward propagation in the whole training process after optimization. It has dropped considerably.
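
The idea of the feature_index mapping can be sketched in plain Python. The file format and the helper names below are assumptions made for illustration (the format of the real feature_index file is not described here, and JSON is used only because it is easy to show): the mapping is built once offline, persisted, and then used as a dictionary lookup both offline and online.

```python
import json

def build_feature_index(raw_values):
    """Offline step: assign each distinct raw string a fixed integer id."""
    return {value: idx for idx, value in enumerate(sorted(set(raw_values)))}

def encode(feature_index, raw_value, unknown_id=-1):
    """O(1) dictionary lookup that replaces One Hot computation inside the model."""
    return feature_index.get(raw_value, unknown_id)

# Offline: build the mapping from training data and persist it.
channel_index = build_feature_index(
    ["channel 2", "channel 1", "channel 2", "channel 3"])
serialized = json.dumps({"channel": channel_index})

# Online: load the same mapping and convert raw strings before calling the model.
loaded = json.loads(serialized)["channel"]
ids = [encode(loaded, v) for v in ["channel 1", "channel 3", "channel 9"]]

print(channel_index)
print(ids)
```

Sharing one persisted mapping between training and serving is what keeps the ids consistent. How to treat values that were never seen offline (here a hypothetical `unknown_id` of -1) is a design decision that has to be made explicitly.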

Profiler record after optimization: forward propagation accounts for 39.53% of the total training time.

3.2.3 XLA and JIT Compilation Optimization

TensorFlow expresses the whole computation as a directed dataflow graph. Each Node represents an operation (OPS), data is represented by Tensors, and the directed edges between Nodes represent the direction of data flow.

XLA (Accelerated Linear Algebra) is a compiler designed specifically to optimize linear algebra operations in TensorFlow. The XLA compiler is used when JIT (Just In Time) compilation mode is turned on. The whole compilation flow is shown in the figure below:

TensorFlow computation flow

First, the whole TensorFlow computation graph is optimized and redundant computation is pruned. The optimized graph is then converted into HLO (High Level Optimizer) primitive operations. The XLA compiler optimizes these HLO operations and finally hands them to LLVM IR, which generates machine code for the target back-end device.

With JIT, LLVM IR can generate more efficient machine code from the HLO primitive operations. In addition, multiple HLO operations that can be fused are merged into a single, more efficient operation. However, JIT compilation happens at run time, which means there is some extra compilation overhead while the code is running.
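
Operation fusion can be illustrated with a toy example in plain Python. This is not what XLA generates, and the function names are hypothetical. It only shows the principle: three element-wise operations executed as separate passes each produce an intermediate result, while the fused version computes the same output in a single pass.

```python
def unfused(x, w, b):
    """Three separate element-wise passes, each materializing an intermediate list."""
    t1 = [xi * w for xi in x]            # multiply
    t2 = [ti + b for ti in t1]           # add
    return [max(ti, 0.0) for ti in t2]   # relu

def fused(x, w, b):
    """One pass computing multiply, add and relu per element, with no intermediates."""
    return [max(xi * w + b, 0.0) for xi in x]

x = [-2.0, -0.5, 0.0, 1.5, 3.0]
out_unfused = unfused(x, w=2.0, b=1.0)
out_fused = fused(x, w=2.0, b=1.0)

print(out_unfused)
print(out_fused)
```

In a compiled kernel the saving comes from avoiding the memory traffic of the intermediate buffers and the launch overhead of separate operations, which is why the benefit depends on the model and the hardware.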

Influence of network structure and Batch Size on JIT performance [7]

The figure above shows the ratio of run time with JIT compilation to run time without it, for different network structures and Batch Sizes. The gain is pronounced at larger Batch Sizes, while the number of layers and neurons has little effect on the gain from JIT compilation.

In practical application, the specific effect will vary due to network structure, model parameters, hardware equipment and other reasons.

3.2.4 Final Performance

After this series of optimizations, model inference time dropped from 10 ms to 1.1 ms and request packaging time dropped from 5 ms to 2 ms. The whole process, from packaging and sending the request to receiving the result, takes about 6 ms.

Model inference metrics: QPS 1308, TP50 1.1 ms, TP999 3.0 ms. The four charts show, respectively: the latency distribution, with most requests completing within 1 ms; the request count, about 80,000 requests per minute, equivalent to 1308 QPS; the average latency, 1.1 ms; and the success rate, 100%.
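
The latency figures quoted in this section can be checked with simple arithmetic. The sketch below uses only the numbers reported in the text, plus one assumption that is not stated explicitly: that the network overhead stays at about 3 ms after optimization.

```python
# Times in microseconds so the arithmetic stays exact.
before_us = {"packaging": 5_000, "network": 3_000, "inference": 10_000}
# Assumption: network overhead is unchanged at about 3 ms after optimization.
after_us = {"packaging": 2_000, "network": 3_000, "inference": 1_100}

total_before_ms = sum(before_us.values()) / 1000
total_after_ms = sum(after_us.values()) / 1000
budget_ms = 10  # average response-time limit of the ranking module

requests_per_minute = 1308 * 60  # from the reported QPS

print("before:", total_before_ms, "ms, within budget:", total_before_ms <= budget_ms)
print("after: ", total_after_ms, "ms, within budget:", total_after_ms <= budget_ms)
print("requests per minute at 1308 QPS:", requests_per_minute)
```

Under that assumption the components add up to the reported totals of about 18 ms before and about 6 ms after, and 1308 QPS corresponds to a little under 80,000 requests per minute.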

3.3 Latency Spikes During Model Switching

Monitoring showed that a large number of requests time out whenever the model is updated. As the figure below shows, every update causes a burst of request timeouts, which has a significant impact on the system. By analyzing the TensorFlow Serving logs and code, we found two main causes. First, the threads that update and load models share a thread pool with the threads that handle TensorFlow Serving requests, so requests cannot be processed while the model is being switched. Second, after a model is loaded, the computation graph is initialized lazily, so the first request has to wait for graph initialization.

Request timeouts caused by model switching

The first problem is mainly a matter of configuring the thread pools for loading and unloading models. In the source code:

uint32 num_load_threads = 0;
uint32 num_unload_threads = 0;

Both parameters default to 0, which means that no separate thread pool is used and that loading and unloading run in the same thread as the Serving Manager. Changing them to 1 effectively solves this problem.

The core operation of model loading is RestoreOp, which reads the model file from storage, allocates memory, looks up the corresponding Variables, and so on. It is executed by calling the Session's run method. By default, all Sessions in a process use the same thread pool, so model loading and Serving requests share a thread pool while a model is loading, which delays Serving requests. The solution is to configure multiple thread pools through the configuration file and to specify a dedicated thread pool for the load operation when the model is loaded.
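
The effect of a shared versus a dedicated thread pool can be reproduced with a small experiment in plain Python. This is an analogy, not TensorFlow Serving code: `load_model` and `handle_request` are stand-ins, and the pools are limited to a single worker so that the blocking is deterministic.

```python
import threading
from concurrent.futures import ThreadPoolExecutor, TimeoutError

def load_model(release):
    """Stand-in for a slow model load: blocks until release is set."""
    release.wait()
    return "loaded"

def handle_request():
    """Stand-in for a fast prediction request."""
    return "ok"

def request_served_during_load(separate_load_pool):
    """Return True if a request completes while a model load is still running."""
    release = threading.Event()
    serving_pool = ThreadPoolExecutor(max_workers=1)
    load_pool = ThreadPoolExecutor(max_workers=1) if separate_load_pool else serving_pool
    load_future = load_pool.submit(load_model, release)
    request_future = serving_pool.submit(handle_request)
    try:
        request_future.result(timeout=0.5)
        served = True
    except TimeoutError:
        served = False
    finally:
        release.set()          # let the load finish
        load_future.result()
        serving_pool.shutdown()
        if separate_load_pool:
            load_pool.shutdown()
    return served

shared = request_served_during_load(separate_load_pool=False)
dedicated = request_served_during_load(separate_load_pool=True)
print("request served during load, shared pool:   ", shared)
print("request served during load, dedicated pool:", dedicated)
```

With a shared pool the request waits behind the load and times out. With a dedicated load pool it is answered immediately, which mirrors the behaviour described above.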

For the second problem, the slow first run of a model, a Warm Up operation can be performed right after the model is loaded so that it does not affect request latency. Here, Warm Up takes the input data types from the signature defined when the model was exported, constructs fake input data accordingly, and runs it through the model to initialize it.
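
The Warm Up step can be sketched in plain Python as follows. Everything here is hypothetical: the signature contents, the default values, and the `predict_fn` callable are placeholders for whatever the exported model actually declares. The sketch only shows the mechanism of deriving fake inputs from the declared types and shapes and sending one request before real traffic arrives.

```python
# Hypothetical signature: input name -> (dtype, shape); None marks the batch dimension.
signature_inputs = {
    "user_ids": ("int64", [None, 1]),
    "ad_features": ("float", [None, 4]),
    "channel": ("string", [None, 1]),
}

DEFAULTS = {"int64": 0, "float": 0.0, "string": ""}

def fake_tensor(dtype, shape, batch_size):
    """Build a nested list of default values matching dtype and shape."""
    dims = [batch_size if d is None else d for d in shape]

    def build(remaining):
        if not remaining:
            return DEFAULTS[dtype]
        return [build(remaining[1:]) for _ in range(remaining[0])]

    return build(dims)

def build_warmup_request(inputs, batch_size=1):
    """Fake request with one entry per input declared in the signature."""
    return {name: fake_tensor(dtype, shape, batch_size)
            for name, (dtype, shape) in inputs.items()}

def warm_up(predict_fn, inputs, batch_size=1):
    """Run one fake request through the model right after it is loaded."""
    return predict_fn(build_warmup_request(inputs, batch_size))

warmup_request = build_warmup_request(signature_inputs, batch_size=2)
print(warmup_request)
```

A batch size close to the real traffic (100 ads per request in this scenario) is a reasonable choice for the fake request, although whether that matters depends on the model.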

With these two optimizations, the request delay after model switching is largely resolved. As the figure below shows, the latency spike during model switching dropped from 84 ms to about 4 ms.

After optimization, the latency spike during model switching is reduced.

4. Summary and Outlook

This article has described the User Growth team's exploration of deep learning online prediction based on TensorFlow Serving, including how we located, analyzed, and solved performance problems. The result is an online service with high performance and strong stability that supports a variety of deep learning models.

With a complete pipeline for offline training and online prediction in place, we will speed up strategy iteration. On the model side, we can quickly try new models and explore combining reinforcement learning with bidding. On the performance side, driven by engineering requirements, we will further explore TensorFlow's graph optimization, low-level operators, and operation fusion. In addition, TensorFlow Serving's prediction capability can be used for model analysis, and Google has built the What-If Tool on top of it to help model developers analyze models in depth. Finally, we will use model analysis to re-examine our data and features.


[1] Cheng, H. T., Koc, L., Harmsen, J., Shaked, T., Chandra, T., Aradhye, H., … & Anil, R. (2016, September). Wide & deep learning for recommender systems. In Proceedings of the 1st Workshop on Deep Learning for Recommender Systems (pp. 7-10). ACM.
[2] Wang, R., Fu, B., Fu, G., & Wang, M. (2017, August). Deep & cross network for ad click predictions. In Proceedings of the ADKDD’17 (p. 12). ACM.
[3] Guo, H., Tang, R., Ye, Y., Li, Z., & He, X. (2017). DeepFM: a factorization-machine based neural network for CTR prediction. arXiv preprint arXiv:1703.04247.
[4] Lian, J., Zhou, X., Zhang, F., Chen, Z., Xie, X., & Sun, G. (2018). xDeepFM: Combining Explicit and Implicit Feature Interactions for Recommender Systems. arXiv preprint arXiv:1803.05170.
[5] Abadi, M., Barham, P., Chen, J., Chen, Z., Davis, A., Dean, J., … & Kudlur, M. (2016, November). TensorFlow: a system for large-scale machine learning. In OSDI (Vol. 16, pp. 265-283).
[6] Goyal, P., Dollár, P., Girshick, R., Noordhuis, P., Wesolowski, L., Kyrola, A., … & He, K. (2017). Accurate, large minibatch SGD: training ImageNet in 1 hour. arXiv preprint arXiv:1706.02677.
[7] Neill, R., Drebes, A., Pop, A. (2018). Performance Analysis of Just-in-Time Compilation for Training TensorFlow Multi-Layer Perceptrons.

About the Authors

Zhongda graduated from the University of Rochester with a degree in Data Science in 2017, then worked at Stentor Technology Company in the Bay Area, California, and joined Meituan in 2018. He is mainly responsible for applying deep learning and reinforcement learning to the User Growth team's business scenarios.

Hongjie joined Meituan-Dianping in 2015. He is the algorithm lead of the User Growth team of the Meituan platform and the hotel and travel business group. He previously worked at Alibaba and is mainly devoted to increasing the number of active users on the Meituan-Dianping platform through machine learning. As technical lead, he has led the algorithm work for projects such as Meituan's DSP ad delivery and on-site new-user acquisition, effectively improving marketing efficiency and reducing marketing costs.

Ting Wen joined Meituan-Dianping in 2015 and has worked on the offline computing side, successively building YARN resource scheduling and the GPU computing platform.


