How can WeChat "Scan" object recognition training cope gracefully with an exploding dataset?


WeChat's "Scan" recognition feature has been online for some time. In the early stage it mainly used product images (shoes / bags / beauty / clothing / household appliances / toys / books / food / jewelry / furniture / others) as the medium to surface valuable information in the WeChat content ecosystem, and it has since expanded to recognition in various vertical domains, including plants / animals / cars / fruits and vegetables / wine labels / vegetables / landmarks, etc. The recognition core relies on deep convolutional neural network models. With tens of millions of new images arriving every day and model parameters growing ever larger, deep learning training takes about a week. How to quickly train, optimize, and deploy models has become an urgent problem.

1、 Introduction

Today, relying on the powerful computing capability of GPUs, deep learning has developed rapidly, setting off an unprecedented revolution in image processing and speech recognition. Compared with traditional methods, deep learning methods represented by convolutional neural networks (CNNs) can learn the characteristics of data directly from the data itself, and have achieved a dominant position in the field of image processing.

With the continuous increase in daily calls to "Scan", image data grows by tens of millions of items every day. In this race against time, whoever has the data wins. At the same time, the complexity of neural networks is growing explosively: the ResNet image-classification model proposed by Microsoft in 2015 requires about 7 ExaFLOPs of training compute with 60 million parameters, and Google's 2017 neural machine translation model requires about 10 ExaFLOPs with 8.7 billion parameters.

In most scenarios, the model can be trained on a single GPU server using one or more GPUs. However, as the dataset grows, training time grows accordingly, sometimes taking a week or even longer. Therefore, how to quickly iterate and optimize deep learning models has become an urgent problem for our algorithm developers.

The following sections explain how the distributed training method was selected, describe the principles of multi-machine communication, and introduce the distributed training of the deep learning models behind WeChat "Scan" object recognition, based on the Horovod framework and experimental results on WeChat's self-developed training platform.

2、 Distributed training

1. Parallel mode

Compared with single-machine single-card training, distributed training can greatly shorten model training time. A single server generally supports at most 8 GPU cards, while distributed multi-machine multi-card training can schedule dozens or even hundreds of servers to train one model together, further raising the upper limit of model training.

According to the distributed parallel training mode, distributed training is generally divided into data parallel and model parallel.

(1) Data parallel

Different GPUs in the distributed system each hold a complete copy of the same model. Each GPU processes a different part of the overall data, and the gradients from all GPUs are then combined in a synchronous or asynchronous manner.

(2) Model parallelism

In the distributed system, every GPU uses the same data, but each GPU holds only part of the model, and the activations of the neural network are exchanged between GPUs during training.


Because the parts of a model-parallel system depend on each other, the number of GPUs cannot be increased arbitrarily; scalability is poor, and it is rarely used in practice. In data parallelism, each part is independent, scalability is good, and the speed-up is better, so it is more commonly used in actual training. In terms of implementation, fault tolerance, and cluster utilization, data parallelism is also superior to model parallelism.

2. System architecture

Distributed training system architectures mainly fall into two types: the parameter server architecture (commonly called the PS architecture) and the Ring AllReduce architecture.

(1) Parameter server architecture

In PS architecture, nodes in the cluster are divided into two categories: parameter server and worker. The parameter server stores the parameters of the model, and the worker is responsible for calculating the gradient of the parameters.

In each iteration, a worker obtains parameters from the parameter server, and then returns the computed gradients to it. The parameter server aggregates the gradients returned by the workers, updates the parameters, and broadcasts the new parameters back to the workers.
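That loop can be sketched in a toy, single-process simulation (pure Python, no real networking; `worker_gradient` and `ps_step` are illustrative names, not part of any framework):

```python
# Toy simulation of synchronous parameter-server iterations for a
# 1-D linear model y = w * x trained with mean squared error.

def worker_gradient(params, data_shard):
    # Each worker computes a gradient on its own data shard.
    w = params["w"]
    g = sum(2 * (w * x - y) * x for x, y in data_shard)
    return g / len(data_shard)

def ps_step(params, shards, lr=0.1):
    # The parameter server gathers one gradient per worker,
    # averages them, and applies a single update.
    grads = [worker_gradient(params, s) for s in shards]
    avg = sum(grads) / len(grads)
    params["w"] -= lr * avg
    return params

params = {"w": 0.0}
shards = [[(1.0, 2.0)], [(2.0, 4.0)]]  # two workers; true w is 2
for _ in range(50):
    ps_step(params, shards)
# params["w"] converges toward 2.0
```

The averaging step is where the PS architecture's network bottleneck appears in practice: every worker's gradient must travel to the server and the new parameters must travel back, every iteration.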

(2) Ring all reduce architecture

In the Ring AllReduce architecture, each device is a worker and the workers form a ring. There is no central node to aggregate the gradients computed by all workers. In one iteration, each worker completes its own mini-batch training, computes its gradient, and passes the gradient to the next worker in the ring while receiving the gradient from the previous worker. For a ring containing N workers, each worker can update the model parameters after receiving the gradients of the other N-1 workers.
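The ring data flow can be illustrated with a toy scalar version (pure Python; real NCCL splits each tensor into N chunks and pipelines a reduce-scatter followed by an allgather, but the neighbor-to-neighbor flow is the same):

```python
def ring_allreduce(grads):
    # One scalar gradient per worker; workers form a ring.
    n = len(grads)
    send = list(grads)  # what each worker forwards this step
    acc = list(grads)   # each worker's running sum
    for _ in range(n - 1):
        # Every worker i receives from its ring predecessor i-1 ...
        recv = [send[(i - 1) % n] for i in range(n)]
        # ... adds it to its running sum, and forwards it next step.
        acc = [a + r for a, r in zip(acc, recv)]
        send = recv
    return acc  # after N-1 steps, all workers hold the global sum
```

For example, `ring_allreduce([1.0, 2.0, 3.0, 4.0])` leaves every worker with `10.0`. Each step moves only one value per link, which is why per-worker traffic stays constant as the ring grows.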


Distributed computing with the PS model usually runs into network bottlenecks: as the number of workers grows, its speed-up ratio deteriorates rapidly. In the Ring AllReduce architecture, by contrast, network traffic does not grow with the number of workers (GPUs) but stays constant, and the bandwidth of every node in the cluster is fully utilized.

3. Parameter update

(1) Synchronous update

All GPUs exchange and fuse gradients with the parameter server at the same point in time. In each round of training, the gradients obtained by all workers are summed and averaged before the model parameters on the parameter server are updated.

(2) Asynchronous update

All GPUs communicate with the parameter server independently. Before each round of training, each worker fetches the model parameters from the parameter server, reads its training data, and trains. After training, its gradients are applied immediately to update the model parameters on the parameter server.

Asynchronous updates communicate efficiently and run fast, but often converge poorly, because slow nodes keep contributing stale, misleading gradient directions. Synchronous updates communicate less efficiently and train more slowly, but converge stably, because a synchronous update is basically equivalent to single-card training with a larger batch size. However, the traditional synchronous update method (each GPU card computes a gradient, then the gradients are summed and averaged) produces a huge amount of communication traffic when fusing gradients.
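The claim that synchronous updates are "basically equivalent to single-card large-batch training" can be checked numerically for a mean loss: averaging equal-sized per-worker gradients equals the gradient of the combined batch (a toy 1-D linear model; all names here are illustrative):

```python
# Gradient of mean squared error for y = w * x over a batch.
def grad_mse(w, batch):
    return sum(2 * (w * x - y) * x for x, y in batch) / len(batch)

w = 0.5
batch = [(1.0, 1.0), (2.0, 3.0), (3.0, 2.0), (4.0, 5.0)]

# Two "workers", each holding half of the batch.
per_worker = [grad_mse(w, batch[:2]), grad_mse(w, batch[2:])]
sync_grad = sum(per_worker) / len(per_worker)  # synchronous averaging

# Identical to computing the gradient over one big batch on a single card.
assert abs(sync_grad - grad_mse(w, batch)) < 1e-12
```

The equivalence holds exactly when worker shards are equal in size and the loss is a mean over samples, which is the common setup in data-parallel training.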

Comparing the different distributed parallel modes, system architectures, and parameter-update strategies, WeChat "Scan" object recognition finally chose data-parallel, synchronous-update Ring AllReduce distributed training.

3、 Multi computer communication technology

Compared with single-machine multi-card training, multi-machine multi-card distributed training must ensure that the machines can communicate with each other and that gradients can be transferred between them.

The communication of parallel tasks can generally be divided into point-to-point communication and collective communication. Point-to-point communication has only one sender and one receiver and is relatively simple to implement. Distributed training generally involves multiple servers and uses collective communication, with multiple senders and multiple receivers. Common collective communication operations include broadcast, gather, scatter, reduce, allreduce, etc.


1. MPI

On WeChat's self-developed training platform, multi-machine communication is implemented with the Message Passing Interface (MPI). MPI is a parallel programming technology based on message passing; it defines a set of portable programming interfaces and is a programming interface standard.

In the MPI programming model, a computation consists of one or more processes that communicate by calling library functions. An MPI communicator defines a group of processes that can send messages to each other; each process in the group is assigned a sequence number called its rank, and processes communicate explicitly by specifying ranks. Operations in MPI include data movement, aggregation, synchronization, etc.

Because most deep learning training parameters live on the GPU, synchronizing parameters with MPI alone would require moving them from GPU to CPU, communicating between the CPUs of different machines, and then moving them back from CPU to GPU. The communication efficiency of this process is very low. Therefore, to improve communication efficiency, NVIDIA's NCCL is used for communication during training.


NCCL is short for the NVIDIA Collective Communications Library, a library developed by NVIDIA for multi-GPU collective communication that can easily be integrated into any deep learning training framework. Its implementations of allreduce, reduce, broadcast, allgather, etc. are heavily optimized and achieve high communication speed over PCIe, NVLink, and InfiniBand.

NCCL 1.0 supports only single-machine multi-card setups, with cards communicating via PCIe, NVLink, and GPU Direct P2P. NCCL 2.0 supports multi-machine multi-card setups, with machines communicating via sockets (Ethernet) or InfiniBand with GPU Direct RDMA.

4、 Horovod training framework

There are many distributed training frameworks today. Horovod is a deep learning tool open-sourced by Uber that supports TensorFlow, Keras, PyTorch, and Apache MXNet.

Moreover, Horovod's gradient and weight synchronization use the AllReduce algorithm based on MPI and NCCL rather than a parameter server architecture, so communication is more efficient. Horovod can take advantage of NVLink, RDMA, and GPUDirect RDMA, automatically detects the communication topology, and falls back to PCIe and TCP/IP communication when needed. Meanwhile, converting existing training code into distributed training code requires few changes, which simplifies launching distributed training jobs.

Based on this, Horovod was selected as the distributed training framework for WeChat "Scan" object recognition, with training carried out on WeChat's self-developed training platform.


Horovod's multi-machine communication is initialized via MPI, which sets up the communication environment and process allocation. Several common environment parameters:

  • size: the number of processes, i.e. the total number of GPUs;
  • rank: the unique global ID of the process, from 0 to size-1;
  • local_size: the number of processes on each node;
  • local_rank: the unique local ID of a process within its node.

These parameters are used to control the communication between machine processes.
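As a toy illustration of how these four identifiers relate (values computed by hand here, not by Horovod; in real code they come from `hvd.size()`, `hvd.rank()`, `hvd.local_size()`, and `hvd.local_rank()` after `hvd.init()`), suppose 2 nodes each run 4 GPU processes:

```python
nodes, gpus_per_node = 2, 4
size = nodes * gpus_per_node            # total processes == total GPUs

ids = []
for rank in range(size):                # global ID: 0 .. size-1
    local_rank = rank % gpus_per_node   # ID within the node; typically
                                        # used to pin the process to a GPU
    local_size = gpus_per_node          # processes on this node
    ids.append((rank, local_rank))

# e.g. rank 5 is the second process (local_rank 1) on the second node
```

`local_rank` is the value usually passed to `torch.cuda.set_device`, so that each process drives exactly one GPU on its node.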

Because training uses the data-parallel mode, the data must be sampled in a distributed manner. Horovod can directly use the distributed sampler provided by PyTorch (torch.utils.data.distributed.DistributedSampler).
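A minimal pure-Python sketch of what PyTorch's `DistributedSampler` does: pad the index list to a multiple of the number of replicas, then give each rank every `num_replicas`-th index (the real class also reshuffles per epoch via `set_epoch`; `shard_indices` is an illustrative name):

```python
import random

def shard_indices(n, num_replicas, rank, shuffle=True, seed=0):
    idx = list(range(n))
    if shuffle:
        random.Random(seed).shuffle(idx)          # same seed on every rank
    total = -(-n // num_replicas) * num_replicas  # ceil to a multiple
    idx += idx[: total - n]                       # pad by wrapping around
    return idx[rank:total:num_replicas]           # strided per-rank shard

# With shuffle off, 10 samples over 4 ranks:
shards = [shard_indices(10, 4, r, shuffle=False) for r in range(4)]
# Every rank gets the same number of samples, and together the shards
# cover the whole dataset (with slight duplication from the padding).
```

The equal shard sizes matter for synchronous training: every rank must contribute the same number of batches per epoch, or the collective operations would deadlock waiting for a straggler.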


This approach works for simple distributed training tasks. However, for the retrieval training of object recognition, we want the DataLoader to do balanced sampling or triplet sampling, while the sampler above only supports plain distributed sampling.

Because some initialization parameters of PyTorch's DataLoader are mutually exclusive, if a custom sampler is used, the parameters batch_size, shuffle, batch_sampler, and drop_last must keep their default values. So we rewrote the batch sampler: the distributed sampler is passed as a parameter to a newly constructed batch_sampler.
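A minimal sketch of that rewrite (the class name `DistributedBatchSampler` is illustrative, not WeChat's actual code): wrap the per-rank index stream from a distributed sampler into fixed-size batches, and hand the result to `DataLoader` via `batch_sampler`, leaving `batch_size`, `shuffle`, `sampler`, and `drop_last` at their defaults:

```python
class DistributedBatchSampler:
    """Group the indices yielded by a (distributed) sampler into batches."""

    def __init__(self, sampler, batch_size, drop_last=False):
        self.sampler = sampler        # e.g. a DistributedSampler instance
        self.batch_size = batch_size
        self.drop_last = drop_last

    def __iter__(self):
        batch = []
        for idx in self.sampler:      # indices already restricted to this rank
            batch.append(idx)
            if len(batch) == self.batch_size:
                yield batch
                batch = []
        if batch and not self.drop_last:
            yield batch               # trailing partial batch

# Usage sketch (hypothetical):
#   loader = DataLoader(dataset, batch_sampler=DistributedBatchSampler(
#       DistributedSampler(dataset), batch_size=64))
```

Custom sampling strategies (balanced or triplet sampling) would go inside `__iter__`, regrouping the rank-local indices before yielding each batch.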

Horovod implements a broadcast operation internally so that the model is initialized consistently across all worker processes. When loading model weights, it suffices to load them on the rank-0 machine and then use the broadcast mechanism to synchronize the parameters to the other machines for weight initialization.

During training, computing the loss involves an allreduce operation that reduces the losses of all workers before gradient propagation. Finally, when saving the model, only one designated machine needs to save it.
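The two collective steps described above can be sketched in pure Python (these stand-ins mimic the roles of `hvd.broadcast_parameters` and `hvd.allreduce`; they are not the Horovod API):

```python
def broadcast(workers, root=0):
    # All workers adopt the root worker's weights -> consistent init.
    root_weights = dict(workers[root]["weights"])
    for wk in workers:
        wk["weights"] = dict(root_weights)

def allreduce_mean(values):
    # Average a per-worker scalar (e.g. the loss) across all workers.
    return sum(values) / len(values)

# Four workers start with different weights; broadcast unifies them.
workers = [{"weights": {"w": float(i)}} for i in range(4)]
broadcast(workers)                  # everyone now holds rank 0's weights

losses = [0.9, 1.1, 1.0, 1.2]       # per-worker mini-batch losses
mean_loss = allreduce_mean(losses)  # the value each worker would log
```

In real Horovod code the broadcast happens once after model construction, while the allreduce on gradients is wired in automatically by wrapping the optimizer with `hvd.DistributedOptimizer`.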

5、 Experimental results

Besides communication during the training phase, distributed training must also consider data I/O. The retrieval model of "Scan" is trained on a large amount of image data; during distributed training, every machine must be able to read the training data, and the image files are stored on WeChat's self-developed distributed storage system.

During training, the speed-up of distributed training is positively correlated with the number of GPUs. We measured distributed training of ResNet-50 on the MNIST dataset: a single machine takes 78 min to run 100 epochs, while multi-machine training with 4 GPUs takes 23 min for 100 epochs, a speed-up of about 3.4x.

In our actual project's model training, distributed training shortened what used to take 5 days or even a week to less than 1 day. In the same amount of time, algorithm developers can run more experiments and iterate quickly, which greatly improves the efficiency of algorithm research and development.

6、 Summary and Prospect

At present, "Scan" object recognition runs distributed training successfully on WeChat's self-developed training platform, but one problem remains: how to efficiently store and read massive numbers of small image files to reduce I/O time. We will explore this in follow-up work.

