1 Introduction
As neural networks are widely deployed in server and edge computing, both training and inference need to become more efficient in terms of runtime, energy consumption and memory cost. On both servers and edge devices, it is critical to reduce computation cost in order to enable fast training, e.g., online training of neural networks for click prediction [6, 7], and fast inference, e.g., click prediction under latency constraints of less than 10 ms [12] and real-time video processing at 30 frames per second [11]. Reducing computation cost also reduces energy consumption in those systems, since the energy consumption of a GPU is mostly proportional to runtime [8].
In particular, training is constrained by the memory capacity of the GPU. A large batch of a deep and wide model requires a large amount of memory during training. For instance, training a neural network for vision tasks on high-end smartphones or self-driving cars with 4K images requires 24MB just for the input to the first layer of the network. During training, we need to store the activations of all the intermediate layers. Considering that the size of the activations at each intermediate layer is comparable to that of the input, the required memory size (batch size x total activation size of the network) can easily exceed the memory capacity of state-of-the-art GPUs.
Reduced precision has the potential to resolve the problems of runtime, energy consumption and memory cost by reducing the data size, thereby enabling more parallel and energy-efficient computation, e.g., four int8 operations instead of a single fp32 operation, at a smaller memory footprint. The state-of-the-art quantization techniques are 16-bit training [4] and 8-bit inference [16]. Considering the ever-increasing demand for training and inference on both servers and edge devices, further optimizations in quantization, e.g., 4 bits, will be increasingly required.
In this paper, we propose a novel quantization method based on the fact that the distributions of weights and activations have the majority of data concentrated in narrow regions, while a small number of large data are scattered over large regions. By exploiting this fact, we apply reduced precision only to the narrow regions, thereby reducing quantization errors for the majority of data, while separately handling large data in high precision. For very deep networks such as ResNet152 and DenseNet201, our proposed quantization method enables training with 3-bit activations (2% large data). Our method also offers low-precision inference with 4- to 5-bit weights and activations (1% large data) even for optimized networks such as SqueezeNet1.1 and MobileNetv2 as well as deeper networks.
2 Related Work
Recently, several methods of memory-efficient training have been presented. In [2], Chen et al. propose a checkpointing method that stores the intermediate activations of only some layers, which reduces the memory cost of stored activations, and recalculates the other activations during backpropagation. In [5], Gomez et al. present a reversible network which, during backpropagation, recomputes input activations from output activations, thereby minimizing the storage of intermediate activations. The existing methods of checkpointing and reversible networks are effective in reducing memory cost. However, they share a critical limitation: the additional computation needed to recompute activations during backpropagation. Considering that computation cost determines the runtime and energy consumption of training on GPUs, this additional computation cost needs to be minimized. As will be explained in the experiments, our proposed quantization method gives much smaller cost in both memory and computation than the state-of-the-art ones. More importantly, it has the potential to offer less computation cost than the conventional training method.
The state-of-the-art quantization methods of training and inference for deep networks, e.g., ResNet152, are 16-bit training [4] and 8-bit inference [16]. In [4], Ginsburg et al. propose 16-bit training based on loss scaling (for small activations or local gradients) and fp32 accumulation. In [16], Migacz proposes utilizing Kullback-Leibler (KL) divergence in determining the linear range to apply 8-bit quantization with clipping. There are studies towards more aggressive quantization for training, e.g., [27, 3]. In [3], De Sa et al. propose bit centering to exploit the fact that gradients tend to get smaller as training continues. However, these aggressive methods are limited to small networks and do not preserve full-precision accuracy for very deep models such as ResNet152.
We classify quantization methods for inference into two types, linear and nonlinear ones. The linear methods utilize uniform spacing between quantization levels, thereby being more hardware friendly, while the nonlinear ones have nonuniform spacing mostly based on clustering. As the simplest form of linear quantization, in [24], Rastegari et al. show that a weight binarization of AlexNet does not lose accuracy. In [9], Hubara et al. propose a multi-bit linear quantization method to offer a trade-off between computation cost and accuracy. In [27], Zhou et al. propose a multi-bit quantization which truncates activations to reduce quantization errors for the majority of data. In [19], Miyashita et al. propose logarithm-based quantization and show that AlexNet can be quantized with 4-bit weights and 5-bit activations at 1.7% additional loss of top-5 accuracy. In [29], Zhu et al. show that deep models can be quantized with separately scaled ternary weights while utilizing full-precision activations. In [20], Park et al. propose a clustering method based on weighted entropy and show that 5-bit weights and 6-bit activations can be applied to deep models such as ResNet101 at less than 1% additional loss of top-5 accuracy. In [28], Zhou et al. propose a clustering method called balanced quantization which tries to balance the frequency of each cluster, thereby improving the utility of quantization levels. Recently, several studies report that increasing the number of channels [18] and adopting teacher-student models [17, 22] help to reduce the accuracy loss due to quantization. These methods can be utilized together with our proposed quantization method.
Compared with the existing quantization methods, our proposed method, which is a linear method, enables a smaller bitwidth, effectively 4 bits for inference, in very deep networks such as ResNet101 and DenseNet121, for which no accurate 4-bit quantization has been reported in the existing works.
3 Motivation
Figure 1 (a) and (b) illustrate the distributions (y-axis in log scale) of activations and weights in the second convolutional layer of GoogLeNet. As the figures show, both distributions are wide due to a small number of large data. Given a bitwidth for low precision, e.g., 3 bits, the wider the distribution is, the larger the quantization errors we obtain. Figure 1 (c) exemplifies the conventional 3-bit linear quantization applied to the distribution of activations in Figure 1 (a). As the figure shows, the spacing between quantization levels (vertical bars) is large due to the wide distribution, which incurs large quantization errors.
When comparing Figure 1 (a) and (c), it is clear that the majority of quantization levels are not fully utilized. In particular, the levels assigned to large values have far fewer data than those assigned to small values, which motivates our idea. Figure 1 (d) illustrates our idea. We propose applying low precision only to small data, i.e., the majority of data, not all. As the figure shows, the spacing between quantization levels gets much smaller than that in the conventional linear quantization in Figure 1 (c). Such a small spacing can significantly reduce quantization error for the majority of data. Large data have a larger impact on the quality of the network output. Thus, we propose handling the remaining large data in high precision, e.g., in 32 or 16 bits. The computation and memory overhead of handling high-precision data is small because their frequency, which is called the ratio of large activations, in short, activation ratio (AR), is small, e.g., 1-3% of total activation data. (We use two ratios of large data, one for large weights and the other for large activations. We use AR to denote the ratio of large activations.)
4 Proposed Method
Our basic approach is first to perform value profiling to identify large data during training and inference. Then, we apply reduced precision to the majority of data, i.e., the small ones, while keeping high precision for the large data. We call this method value-aware quantization (VQuant).
We apply VQuant to training to reduce the memory cost of activations. We also apply it to inference to reduce the bitwidth of weights and activations of the trained neural network. To do that, we address new problems as follows.

(Sections 4.1 and 4.2) In order to prevent the quality degradation of training results due to quantization, we propose a novel scheme called quantized activation back-propagation, in short, quantized back-propagation. We apply our quantization only to the activations used in the backward pass of training and perform the forward pass with full-precision activations.

(Sections 4.5 and 4.6) We present new methods for further reduction in the memory cost of training. In order to reduce the overhead of the mask information required for the ReLU function during backpropagation, we propose ReLU and value-aware quantization. For further reduction in memory cost, we also propose exploiting the fact that, as training continues, fewer large activations are required.
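The basic idea of VQuant described above can be sketched in a few lines of plain Python. This is only an illustrative sketch, not the implementation used in the paper: the function names, the selection of large data by magnitude, and the min-max placement of the quantization levels are assumptions made here for clarity.

```python
def vquant(values, bits, large_ratio):
    """Illustrative value-aware quantization (level placement is an assumption).

    The `large_ratio` fraction of largest-magnitude entries is kept untouched
    (high precision); the remaining small entries are linearly quantized to
    2**bits levels spanning only the range of the small entries.
    Returns (restored_values, indices_of_large_entries).
    """
    n = len(values)
    num_large = int(round(n * large_ratio))
    # Sort indices by magnitude; the tail of the order holds the large data.
    order = sorted(range(n), key=lambda i: abs(values[i]))
    small_idx = order[: n - num_large]
    large_idx = sorted(order[n - num_large:])

    out = list(values)  # large entries stay as they are
    if small_idx:
        lo = min(values[i] for i in small_idx)
        hi = max(values[i] for i in small_idx)
        step = (hi - lo) / (2 ** bits - 1)  # spacing between levels
        for i in small_idx:
            code = round((values[i] - lo) / step) if step > 0 else 0
            out[i] = lo + code * step
    return out, large_idx


def max_error(original, restored):
    return max(abs(a - b) for a, b in zip(original, restored))


# 99 small values in [0, 0.98] plus a single large outlier.
data = [i / 100 for i in range(99)] + [50.0]
value_aware, large_idx = vquant(data, bits=3, large_ratio=0.01)
conventional, _ = vquant(data, bits=3, large_ratio=0.0)
```

With the outlier handled separately, the 3-bit levels only have to cover the range of the small values, so `max_error(data, value_aware)` is much smaller than `max_error(data, conventional)`, where the single outlier stretches the level spacing.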
4.1 Quantized Back-Propagation
Figure 2 shows how to integrate the proposed method with the existing training pipeline. As the figure shows, we add a new component of value-aware quantization to the existing training flow. In the figure, thick arrows represent the flow of full-precision activations (in black) and gradients (in red).
First, we perform the forward pass with full-precision activations and weights, which gives the same loss as that of the existing full-precision forward pass (step 1 in the figure). During the forward pass, after obtaining the output activations of each layer, e.g., layer $l$, the next layer (layer $l+1$) of the network takes the full-precision activations as input. Then, we apply our quantization method to them (those of layer $l$) in order to reduce their size (step 2). As the result of the forward pass, we obtain the loss and the quantized activations.
During the backward pass, when the activations of a layer are required for weight update, we convert the quantized, mostly low-precision, activations, which are stored in the forward pass, into full-precision ones (step 3). Note that this step only converts the data type from low to high precision, e.g., from 3 to 32 bits. Then, we perform the weight update with the back-propagated error (red thick arrow) and the activations (step 4).
Note that there is no modification in the computation of the existing forward and backward passes. In particular, as will be explained in the next subsection, when ReLU is used as the activation function, the backward error propagation (step 5 in the figure) keeps full-precision accuracy. The added component of value-aware quantization performs conversions between full-precision and reduced-precision activations and compresses the small number of remaining large high-precision activations, which are sparse, utilizing a conventional sparse data representation, e.g., compressed sparse row (CSR).
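As a reminder of what a CSR representation stores, the following plain-Python sketch encodes a mostly-zero matrix of large activations and restores it. The helper names `to_csr` and `from_csr` are hypothetical; the GPU compression actually used in the experiments is the existing work cited there, not this code.

```python
def to_csr(dense_rows):
    """Encode a dense 2-D list as CSR: (values, column indices, row pointers)."""
    values, col_idx, row_ptr = [], [], [0]
    for row in dense_rows:
        for j, v in enumerate(row):
            if v != 0:
                values.append(v)
                col_idx.append(j)
        # row_ptr[r + 1] marks where row r ends inside `values`.
        row_ptr.append(len(values))
    return values, col_idx, row_ptr


def from_csr(values, col_idx, row_ptr, num_cols):
    """Rebuild the dense 2-D list from its CSR encoding."""
    rows = []
    for r in range(len(row_ptr) - 1):
        row = [0.0] * num_cols
        for k in range(row_ptr[r], row_ptr[r + 1]):
            row[col_idx[k]] = values[k]
        rows.append(row)
    return rows


# Only the few large activations are nonzero; the small ones are stored
# separately in low precision.
large_only = [
    [0.0, 0.0, 7.5, 0.0],
    [0.0, 0.0, 0.0, 0.0],
    [9.1, 0.0, 0.0, 4.2],
]
values, col_idx, row_ptr = to_csr(large_only)
restored = from_csr(values, col_idx, row_ptr, num_cols=4)
```

Each stored value carries one column index, which is why a sparse representation of this kind roughly doubles the size of the large data themselves.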
The conversion from full to reduced precision (step 2) reduces memory cost, while that from reduced to full precision (step 3) changes the data type back to full precision, thereby increasing the memory cost back to that of full precision. Note that the full-precision activations, obtained from the quantized ones, are discarded after the weight update for their associated layer. Thus, we need memory for the stored quantized activations of the entire network and the full-precision input/output activations of only one layer, which we call working activations, for the forward/backward computation.
As will be explained later in this section, for further reduction in memory cost, the ReLU function consults the value-aware quantization component for the mask information which is required to determine to which neuron to back-propagate the error (step 6).
4.2 Back-Propagation of Full-Precision Loss
Our proposed method can suffer from quantization error in weight update since we utilize quantized activations. We try to reduce the quantization error by applying reduced precision only to narrow regions having the majority of data while separately handling the large data in high precision.
Moreover, in state-of-the-art networks where ReLU is utilized as the activation function, the back-propagated error is not affected by our quantization of activations, as is explained below. Equation (1) shows how we calculate the weight update during backpropagation for a multilayer perceptron (MLP).
$$\Delta w_{ji} = \eta \, \delta_j \, y_i \qquad (1)$$
where $\Delta w_{ji}$ represents the update of the weight from neuron $i$ (of layer $l$) to neuron $j$ (of layer $l+1$), $\eta$ the learning rate, $\delta_j$ the local gradient of neuron $j$ (back-propagated error to this neuron), and $y_i$ the activation of neuron $i$. Equation (1) shows that the quantization error of activation $y_i$ can affect the weight update. In order to reduce the quantization error in Equation (1), we apply VQuant to activations $y_i$.
The local gradient $\delta_j$ is calculated as follows.
$$\delta_j = \varphi'(v_j) \sum_k \delta_k \, w_{kj} \qquad (2)$$
where $\varphi'$ represents the derivative of the activation function, $v_j$ the input to neuron $j$ and $w_{kj}$ the weight from neuron $j$ (of layer $l+1$) to neuron $k$ (of layer $l+2$). Equation (2) shows that the local gradient is a function of the input to the neuron, which is the weighted sum of activations. However, if ReLU is used as the activation function, then $\varphi'(v_j)$ becomes 1 for every neuron with a nonzero output, yielding $\delta_j = \sum_k \delta_k w_{kj}$ (the error is simply blocked where the output is zero), which means the local gradient becomes independent of the activation values. Thus, aggressive quantization of intermediate activations, e.g., 3-bit activations, can hurt only the weight update in Equation (1), not the local gradient in Equation (2). This is the main reason why our proposed method can offer full-precision training accuracy even under aggressive quantization of intermediate activations, as will be shown in the experiments.
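The argument above can be checked numerically on a tiny example. The sketch below is a hypothetical illustration: the weights, errors and the quantized copy `y_quant` are made-up numbers, and the helper names are not from the paper. It computes the local gradient of Equation (2) from the ReLU mask only, and the weight update of Equation (1) once with full-precision and once with quantized activations.

```python
def forward_pre_activation(weights, y):
    """v_j = sum_i w_ji * y_i for one fully connected layer."""
    return [sum(w * yi for w, yi in zip(row, y)) for row in weights]


def relu_mask(v):
    """Derivative of ReLU: 1 where the neuron is active, 0 elsewhere."""
    return [1.0 if x > 0 else 0.0 for x in v]


def local_gradient(mask_j, delta_next, w_next):
    """Equation (2) with ReLU: delta_j = mask_j * sum_k delta_k * w_kj."""
    return [m * sum(d * w_next[k][j] for k, d in enumerate(delta_next))
            for j, m in enumerate(mask_j)]


def weight_update(eta, delta_j, y_i):
    """Equation (1): dw_ji = eta * delta_j * y_i."""
    return [[eta * d * y for y in y_i] for d in delta_j]


y_full = [0.30, 1.20, 0.05]    # activations of layer l (full precision)
y_quant = [0.25, 1.25, 0.00]   # hypothetical quantized copy kept for backward
w = [[0.5, -0.2, 0.1], [-0.3, 0.4, 0.2]]   # w[j][i], layer l -> l+1
w_next = [[0.7, -0.5], [0.2, 0.3]]         # w_next[k][j], layer l+1 -> l+2
delta_next = [0.1, -0.2]                   # local gradients of layer l+2

# The forward pass runs in full precision, so the mask is exact.
mask = relu_mask(forward_pre_activation(w, y_full))
# The local gradient never touches the stored activations.
delta = local_gradient(mask, delta_next, w_next)
# Only the weight update sees the quantization error of the activations.
dw_full = weight_update(0.1, delta, y_full)
dw_quant = weight_update(0.1, delta, y_quant)
```

The local gradient `delta` is computed without any reference to `y_full` or `y_quant`, so it is identical in both cases; only `dw_quant` deviates from `dw_full`, in proportion to the quantization error of each activation.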
4.3 Potential of Further Reduction in Computation Cost
Compared with the existing methods of low memory cost in training [2][5], our proposed method reduces computation cost by avoiding recomputation during backpropagation. More importantly, our proposed method has the potential for further reduction in computation cost, especially in Equation (1), because the activation is mostly in low precision in our method. Thus, utilizing the capability of 8-bit multiplication on GPUs, our method can transform a single 16-bit x 16-bit multiplication in Equation (1) into an 8-bit x 16-bit multiplication. In state-of-the-art GPUs, we can perform two 8-bit x 16-bit multiplications at the same computation cost, i.e., execution cycle, as one 16-bit x 16-bit multiplication, which means our proposed method can double the performance of Equation (1) on the existing GPUs.
Assume that the forward pass takes $M$ multiplications and the backward pass takes $2M$ multiplications, with Equations (1) and (2) taking $M$ multiplications each. Then, the 2x improvement in the computation cost of Equation (1) can reduce the total computation cost of training by up to 1/6. In order to realize this potential, further study is needed to prove that our proposed method enables 8-bit low-precision activations (with a small number of 16-bit high-precision activations) without losing the accuracy of 16-bit training [4].
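The 1/6 figure follows from a short calculation. Writing $M$ for the number of multiplications in the forward pass (the symbol and the cost names $C_{\text{base}}$, $C_{\text{ours}}$ are chosen here only for illustration), and counting one unit of cost per 16-bit x 16-bit multiplication:

```latex
\begin{align*}
C_{\text{base}} &= \underbrace{M}_{\text{forward}}
                 + \underbrace{M}_{\text{Eq. (1)}}
                 + \underbrace{M}_{\text{Eq. (2)}} = 3M, \\
C_{\text{ours}} &= M + \tfrac{1}{2}M + M = \tfrac{5}{2}M, \\
\frac{C_{\text{base}} - C_{\text{ours}}}{C_{\text{base}}}
                &= \frac{\tfrac{1}{2}M}{3M} = \frac{1}{6}.
\end{align*}
```

This is an upper bound, since it assumes that every multiplication in Equation (1) runs at twice the throughput and ignores the small number of high-precision large activations.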
Although our method can currently reduce computation cost utilizing only 8-bit multiplications on GPUs, its reduced-precision computation, e.g., 3-bit multiplications, offers opportunities for further reduction in computation cost for training on future hardware platforms supporting aggressively low precision, e.g., [25].
4.4 Local Sorting in Data Parallel Training
VQuant requires sorting activations. Assuming that we adopt data parallelism in multi-GPU training, the sorting can incur significant overhead in training runtime since it requires exchanging the activations of each layer between GPUs. What is worse, in reality, such communication is not easily supported in some training environments, e.g., PyTorch. In order to address the problem of activation exchange, we propose performing sorting locally on each GPU, which eliminates inter-GPU communication for activation exchange. Then, each GPU performs VQuant locally by applying the same AR, i.e., the same ratio of large activations. Compared with the global solution that collects all the activations and applies the AR to the global distribution of activations, the proposed local solution can lose accuracy in selecting large values. However, our experiments show that the proposed method of local sorting works well, which means that the selection of large values does not need to be accurate.
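The difference between the global and the local selection can be illustrated with two small shards standing in for the activations held by two GPUs. This is a toy sketch with made-up values and hypothetical helper names; it only shows which entries each strategy marks as large.

```python
def top_ratio_indices(values, ratio):
    """Indices of the `ratio` fraction of largest-magnitude entries."""
    k = int(round(len(values) * ratio))
    order = sorted(range(len(values)), key=lambda i: abs(values[i]),
                   reverse=True)
    return set(order[:k])


def local_selection(shards, ratio):
    """Apply the same AR independently on every shard (no communication)."""
    selected, offset = set(), 0
    for shard in shards:
        selected |= {offset + i for i in top_ratio_indices(shard, ratio)}
        offset += len(shard)
    return selected


def global_selection(shards, ratio):
    """Gather all shards first, then apply the AR to the whole distribution."""
    merged = [v for shard in shards for v in shard]
    return top_ratio_indices(merged, ratio)


gpu0 = [0.1, 0.2, 9.0, 0.3, 8.0, 0.1, 0.2, 0.3, 0.4, 0.5]
gpu1 = [0.2, 0.1, 0.3, 7.0, 0.2, 0.4, 0.1, 0.3, 0.2, 0.6]
local_large = local_selection([gpu0, gpu1], ratio=0.1)
global_large = global_selection([gpu0, gpu1], ratio=0.1)
```

Both strategies keep the same number of large values, but the local one picks the largest entry of each shard while the global one picks the two largest entries overall, so the two sets agree only partially. The experiments indicate that this inexact selection is tolerable.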
4.5 ReLU and Value-aware Quantization (RVQuant)
The error is back-propagated through the neurons whose output activations are nonzero. When ReLU is adopted as the activation function, the output activations often become zero. In such a case, in order to identify which neurons to propagate errors to, we need a bit mask, i.e., 1-bit memory cost per neuron. When the activations are quantized to a very small number of bits, e.g., 3 bits, the overhead of the bit mask is significant, e.g., one additional bit for a 3-bit activation on each neuron. In order to reduce the overhead of the mask information, we exploit the fact that each neuron needs to have either the output activation (for weight update) or the mask information (to block error back-propagation), not both at the same time. Thus, given $K$ bits for low precision, we allocate one of the $2^K$ quantization levels to the mask information while representing the activation value with the remaining $2^K - 1$ levels. We call this quantization ReLU and value-aware quantization (RVQuant). As will be shown in the experiments, RVQuant removes the overhead of the bit mask while keeping training accuracy.
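The level allocation of RVQuant can be sketched as follows. The concrete encoding is an assumption made for illustration: code 0 is reserved for "output is zero, block the error", the remaining codes are spaced uniformly up to `max_small`, and large activations are assumed to be handled separately in high precision.

```python
def rvquant_encode(activations, bits, max_small):
    """Encode ReLU outputs with `bits` bits per neuron and no extra mask bit.

    Code 0 is reserved for neurons whose output is zero; codes
    1 .. 2**bits - 1 carry the value of active neurons.
    """
    num_value_levels = 2 ** bits - 1
    step = max_small / num_value_levels
    codes = []
    for a in activations:
        if a <= 0:
            codes.append(0)  # mask information only
        else:
            # Active neurons never use code 0, so the mask stays recoverable.
            code = round(a / step)
            codes.append(min(max(code, 1), num_value_levels))
    return codes, step


def rvquant_decode(codes, step):
    """Recover approximate activations and the ReLU mask from the codes."""
    values = [c * step for c in codes]
    mask = [1 if c != 0 else 0 for c in codes]
    return values, mask


relu_outputs = [0.0, 0.2, 3.3, 7.0, 0.0, 5.1]
codes, step = rvquant_encode(relu_outputs, bits=3, max_small=7.0)
values, mask = rvquant_decode(codes, step)
```

Note that the small positive activation 0.2 is rounded up to the first value level rather than down to zero, so the neuron is still treated as active during back-propagation.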
4.6 Activation Annealing
According to our investigation, the required amount of large activations varies across training phases. To be specific, the early stage of training tends to require more large activations while the later stage tends to need fewer. We propose exploiting this fact and gradually adjusting the AR from large to small across training phases, which we call activation annealing. As will be shown in the experiments, activation annealing can maintain training quality while reducing the average memory cost across the entire training.
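One simple way to express such a schedule is a lookup from the current epoch to a (bitwidth, AR in %) pair. The sketch below assumes, as in the experiments, that the configuration changes together with the learning rate after 30 and 60 epochs; the function name and default boundaries are illustrative.

```python
def annealed_config(epoch, schedule, boundaries=(30, 60)):
    """Return the (bits, AR in %) pair to use at a given epoch.

    `schedule` holds one configuration per training phase and `boundaries`
    holds the epochs at which the next phase starts.
    """
    stage = sum(epoch >= b for b in boundaries)
    return schedule[stage]


# Keep 3 bits throughout while lowering the AR from 2% to 1% to 0%.
schedule = [(3, 2), (3, 1), (3, 0)]
configs = [annealed_config(e, schedule) for e in (0, 29, 30, 59, 60, 89)]
```

The training loop would call `annealed_config` once per epoch and pass the returned bitwidth and AR to the quantization component.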
4.7 Quantized Inference
In order to obtain quantized neural networks for inference, we perform VQuant as a post-processing step of training, i.e., we apply VQuant to the weights and activations of trained networks. In order to recover from the accuracy loss due to quantization, we perform fine-tuning as follows. We perform the forward pass while utilizing the quantized network, i.e., applying VQuant to weights and activations. During backpropagation, we update full-precision weights. As will be shown in the experiments, the fine-tuning incurs a very small overhead in training time, i.e., only a few additional epochs of training. Note that we apply local sorting in Section 4.4 to avoid communication overhead when multiple GPUs are utilized in fine-tuning.
During fine-tuning, we evaluate candidate ratios for large weights and activations and, among those candidates, select the best configuration which minimizes the bitwidth while meeting accuracy requirements. Note that, as will be explained in the experiments, the total number of candidate combinations is small.
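The fine-tuning loop described above (quantized forward pass, full-precision weight update) can be illustrated on a one-weight linear model. Two simplifications are assumed here and are not taken from the paper: a plain uniform quantizer stands in for VQuant, and the gradient computed with the quantized weight is applied directly to the full-precision copy, i.e., the quantizer is treated as the identity in the backward pass.

```python
def quantize_uniform(weights, step):
    """Plain uniform quantizer (a stand-in for VQuant in this sketch)."""
    return [round(w / step) * step for w in weights]


def finetune_step(w_full, x, target, lr, step):
    """One fine-tuning step: quantized forward, full-precision update."""
    w_q = quantize_uniform(w_full, step)
    pred = sum(wq * xi for wq, xi in zip(w_q, x))
    err = pred - target
    # Gradient of 0.5 * err**2 taken with respect to the quantized weights
    # and applied to the full-precision weights (an assumption of the sketch).
    w_new = [wf - lr * err * xi for wf, xi in zip(w_full, x)]
    return w_new, err


w = [0.0]
err = None
for _ in range(200):
    w, err = finetune_step(w, x=[1.0], target=1.0, lr=0.1, step=0.25)
```

The full-precision weight accumulates small updates that individually would not change the quantized weight, which is what lets the quantized model eventually move from one level to the next and fit the target.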
In order to identify large activations meeting the AR, we need to sort activations, which can be expensive in inference. In order to avoid the sorting overhead, we need low-cost sorting solutions, e.g., sampling activations to obtain an approximate distribution of activations. Detailed implementations of quantized models including the low-cost sorting are beyond the scope of this paper and left for further study.
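As one possible reading of the sampling idea, the threshold separating large from small activations could be estimated from a random subset instead of a full sort. The sketch below is only a hypothetical illustration of that idea, with synthetic data and an arbitrary sample size; the paper leaves the actual low-cost solution for further study.

```python
import random


def approx_threshold(values, ratio, sample_size, rng):
    """Estimate the magnitude threshold for the top `ratio` fraction.

    Only `sample_size` randomly chosen entries are sorted, instead of the
    whole list.
    """
    sample = rng.sample(values, sample_size)
    sample.sort(key=abs, reverse=True)
    k = max(1, int(round(sample_size * ratio)))
    return abs(sample[k - 1])


rng = random.Random(0)
# Synthetic, skewed "activations" used only for this illustration.
activations = [rng.expovariate(1.0) for _ in range(20000)]
threshold = approx_threshold(activations, ratio=0.02, sample_size=2000, rng=rng)
fraction_large = sum(abs(a) >= threshold for a in activations) / len(activations)
```

The resulting `fraction_large` is close to, but in general not exactly equal to, the requested ratio, which is consistent with the observation in Section 4.4 that the selection of large values does not need to be accurate.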
5 Experiments
We evaluate our proposed method on ImageNet classification networks, AlexNet, VGG16, SqueezeNet1.1, MobileNetv2, Inceptionv3, ResNet18/50/101/152 and DenseNet121/201. We test the trained/quantized networks with the ILSVRC2012 validation set (50k images) utilizing a single center crop of the 256x256 resized image. We also use an LSTM for word-level language modeling [26, 23, 10]. We implemented our method on the PyTorch framework [21] and use the training data at Torchvision [14].
The initial learning rate is set to 0.1 (ResNet18/50/152 and DenseNet201), or 0.01 (AlexNet and VGG16). The learning rate is decreased by 10x at every multiple of 30 epochs and the training stops at 90 epochs. In SqueezeNet1.1, MobileNetv2 and Inceptionv3, we use the same parameters in the papers except that we use a minibatch of 256 and SGD instead of RMSprop. In addition, we replace ReLU6 in MobileNetv2 with ReLU to apply VQuant.
We apply VQuant and RVQuant to training to minimize memory cost. During training, in order to compress the sparse large activations on the GPU, we use the existing work in [1]. In order to obtain quantized networks for inference, we perform fine-tuning with VQuant for a small number of additional epochs, e.g., 1-3 epochs after the total 90 epochs of original training.
We compare classification accuracy between full-precision models and those under RVQuant (training) and VQuant (training/inference). For each network, we use the same randomly initialized condition and perform training for different RVQuant and VQuant configurations.
5.1 Training Results
Table 1 shows the top-1/top-5 accuracy of ResNet50 obtained under VQuant while varying the bitwidth of low-precision activations and the ratio of large activations, AR. The table shows that the configuration of 3-bit activations with the AR of 2% (in bold) gives training results equivalent to the full-precision (32-bit) training in terms of top-1 accuracy, which corresponds to a 6.1X (=1/((3+1)/32 + 0.04)) reduction in the memory cost of stored activations at the same quality of training. (Note that VQuant still requires 1-bit mask information for each neuron. In addition, the sparse data representation of large data, e.g., CSR, doubles the size of the original sparse data, yielding a memory cost of 4% with the AR of 2%.) The table also shows that a very aggressive quantization of 2-bit activations and 1% AR loses only 0.264%/0.246% in top-1/top-5 accuracy, which is comparable to the case of 5-bit quantization without large data (5-bit with AR 0% in the table).
Note that the total memory cost of activations includes that of the stored activations of the entire network and that of the full-precision working activations (input to the associated layer) required for weight update. Thus, the above-mentioned reduction of 6.1X is only for the memory cost of stored activations. We will give the comparison of the total memory cost of activations later in this section.
Table 1. Top-1 / top-5 accuracy [%] of ResNet50 under VQuant.

| Bitwidth | AR 0% | AR 1% | AR 2% | AR 3% | AR 4% | AR 5% |
|---|---|---|---|---|---|---|
| 1-bit | 5.302 / 15.228 | 74.510 / 92.048 | 75.172 / 92.500 | 75.214 / 92.482 | 75.698 / 92.656 | 75.568 / 92.662 |
| 2-bit | 65.754 / 86.718 | 75.652 / 92.658 | 75.638 / 92.702 | 75.660 / 92.512 | 75.338 / 92.660 | 75.576 / 92.615 |
| 3-bit | 75.486 / 92.608 | 75.708 / 92.592 | **75.920 / 92.858** | 75.930 / 92.964 | 75.892 / 92.938 | 75.734 / 92.630 |
| 4-bit | 75.700 / 92.750 | 75.784 / 92.670 | 75.880 / 92.926 | 75.790 / 92.712 | 75.846 / 92.694 | 75.916 / 92.858 |
| 5-bit | 75.600 / 92.610 | | | | | |
| 6-bit | 75.922 / 92.832 | | | | | |
| 7-bit | 75.887 / 92.792 | | | | | |
| 8-bit | 75.670 / 92.846 | | | | | |
Table 2 shows the top-1/top-5 accuracy of ResNet50 under RVQuant. As the table shows, RVQuant gives similar results to VQuant, e.g., the top-1 accuracy of 3-bit 2% RVQuant is equivalent to that of full precision. Compared with VQuant, RVQuant reduces the memory cost by 1 bit per neuron. Thus, the configuration of 3-bit 2% RVQuant gives a 7.5X (=1/(3/32 + 0.04)) reduction in the memory cost of stored activations. In addition, we can further reduce the memory cost of stored activations by applying traditional compression techniques to the reduced-precision activations. In the case of 3-bit 2% RVQuant for ResNet50, by applying Lempel-Ziv compression, we can further reduce the memory cost of the 3-bit data by 24.4%, which corresponds to a 9.0x reduction in the memory cost of the whole stored activations.
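Table 2. Top-1 / top-5 accuracy [%] of ResNet50 under RVQuant.

| Bitwidth | AR 0% | AR 1% | AR 2% | AR 3% | AR 4% | AR 5% |
|---|---|---|---|---|---|---|
| 2-bit | 35.518 / 60.864 | 75.338 / 92.560 | 75.408 / 92.490 | 75.666 / 92.594 | 75.498 / 92.460 | 75.272 / 92.646 |
| 3-bit | 75.156 / 92.548 | 75.876 / 92.798 | 75.932 / 92.698 | 75.658 / 92.744 | 75.906 / 92.752 | 75.488 / 92.580 |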
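The reduction factors quoted for the stored activations (6.1X for VQuant, 7.5X for RVQuant and 9.0x with Lempel-Ziv compression) all follow from the same formula. The helper below reproduces them; the function and parameter names are chosen here for illustration, while the numbers are those given in the text.

```python
def stored_activation_reduction(bits, ar, mask_bits=0, full_bits=32,
                                sparse_overhead=2.0, low_precision_scale=1.0):
    """Reduction factor of stored activations relative to full precision.

    bits:                bitwidth of the low-precision activations
    ar:                  ratio of large activations (0.02 for 2%)
    mask_bits:           extra ReLU mask bits per neuron (1 for VQuant)
    sparse_overhead:     size factor of the sparse format (2 for CSR)
    low_precision_scale: remaining size of the low-precision data after
                         additional compression (1.0 means none)
    """
    low = (bits + mask_bits) / full_bits * low_precision_scale
    large = ar * sparse_overhead
    return 1.0 / (low + large)


vquant_3bit_2pct = stored_activation_reduction(3, 0.02, mask_bits=1)
rvquant_3bit_2pct = stored_activation_reduction(3, 0.02)
rvquant_with_lz = stored_activation_reduction(3, 0.02,
                                              low_precision_scale=1 - 0.244)
```

The formula makes the trade-off visible: at 3 bits and 2% AR, the large activations already account for roughly a third of the stored size, so lowering the AR later in training (Section 4.6) directly reduces memory cost.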
Table 3 compares the accuracy of neural networks under full-precision training and two RVQuant configurations. As the table shows, 3-bit 2% RVQuant gives almost the same training accuracy as full-precision training for all the networks.
Table 3. Top-1 / top-5 accuracy [%] under full-precision training and two RVQuant configurations.

| Training | AlexNet | ResNet18 | SqueezeNet1.1 | MobileNetv2 | VGG16 | Inceptionv3 | ResNet152 | DenseNet201 |
|---|---|---|---|---|---|---|---|---|
| Full | 56.354 / 79.020 | 69.908 / 89.384 | 58.672 / 81.052 | 70.104 / 89.736 | 71.862 / 90.484 | 74.194 / 91.920 | 77.954 / 94.024 | 77.418 / 93.586 |
| 3-bit 2% | 56.142 / 78.986 | 69.920 / 89.230 | 58.528 / 80.942 | 70.116 / 89.764 | 71.744 / 90.462 | 74.140 / 91.916 | 77.758 / 93.894 | 77.276 / 93.442 |
| 8-bit 0% | 56.238 / 78.948 | 70.010 / 89.276 | 58.750 / 81.290 | 70.294 / 89.638 | 71.774 / 90.660 | 74.224 / 92.084 | 78.354 / 93.948 | 77.320 / 93.508 |
Table 4 compares the total memory cost of activations (both stored quantized and full-precision working activations) in training with a mini-batch size of 256. We compare two existing methods and three RVQuant configurations. 'Full' represents the memory cost of conventional training with full-precision activations. As a baseline, we use the checkpointing method of Chen et al. [2] since it is superior to others including [5], especially for deep neural networks. We calculate the memory cost of the checkpointing method to account for the minimum amount of intermediate activations needed to recompute correct activations while having a memory cost of $O(\sqrt{N})$, where $N$ is the number of layers [2].
The table shows that, compared with the checkpointing method, RVQuant gives significant reductions in the total memory cost of activations. For instance, in the case of ResNet152 which is favorable to the checkpointing method due to the simple structure as well as a large number of layers, ours reduces the memory cost by 41.6% (from 5.29GB to 3.09GB). In networks having more complex subnetworks, e.g., Inception modules, ours gives more reductions. In the case of Inceptionv3, ours gives a reduction of 53.7% (3.87GB to 1.79GB). Note that in the case of AlexNet, the reduction is not significant. It is because the input data occupy the majority of stored activations and we store them in full precision. However, the impact of input data storage diminishes in deep networks.
We also measured the training runtime of ResNet50 with a mini-batch of 64 on an NVIDIA Tesla M40 GPU. Compared to the runtime of existing full-precision training, our method requires a small additional runtime of 8.8%, while the checkpointing method has a much larger runtime overhead of 32.4%. Note that, as mentioned in Section 4.3, our method has the potential for further reduction in training time on hardware platforms supporting reduced-precision computation.
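Table 4. Total memory cost of activations [GB] in training with a mini-batch size of 256 (percentage of Full in parentheses). RVQuant configurations are given as (bitwidth, AR [%]).

| Method | AlexNet | ResNet18 | SqueezeNet1.1 | MobileNetv2 | ResNet50 | VGG16 | Inceptionv3 | ResNet152 | DenseNet201 |
|---|---|---|---|---|---|---|---|---|---|
| Full | 0.35 | 1.86 | 1.58 | 7.34 | 9.27 | 9.30 | 9.75 | 20.99 | 24.53 |
| Chen et al. [2] | x | 0.98 (52.1 %) | 1.05 (66.9 %) | 4.21 (52.1 %) | 3.70 (39.9 %) | x | 3.87 (39.8 %) | 5.29 (25.2 %) | 6.62 (27.0 %) |
| (2,0) | 0.23 (66.4 %) | 0.42 (22.6 %) | 0.59 (37.5 %) | 0.74 (10.0 %) | 1.22 (13.2 %) | 3.65 (39.2 %) | 1.16 (11.9 %) | 1.64 (7.78 %) | 2.09 (8.51 %) |
| (3,0) | 0.23 (67.8 %) | 0.46 (24.3 %) | 0.61 (38.8 %) | 0.84 (11.4 %) | 1.34 (14.5 %) | 3.75 (40.3 %) | 1.43 (14.8 %) | 2.27 (10.8 %) | 2.85 (11.6 %) |
| (3,2) | 0.24 (69.5 %) | 0.50 (26.5 %) | 0.64 (40.4 %) | 1.13 (15.4 %) | 1.52 (16.4 %) | 3.88 (41.7 %) | 1.79 (18.4 %) | 3.09 (14.7 %) | 3.83 (15.6 %) |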
Table 5 shows the impact of RVQuant configurations on the training accuracy of ResNet50. We change the configuration when the learning rate changes (from the initial value of 0.1) to 0.01 and to 0.001. For instance, (F)(3,2)(2,0) represents the case that, as the initial configuration, we use full-precision activations (F) during backpropagation. After 30 epochs, the configuration is changed to 3-bit 2% RVQuant. Then, after 60 epochs, it is changed to 2-bit 0% RVQuant.
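Table 5. Top-1 / top-5 accuracy [%] of ResNet50 for different sequences of RVQuant configurations.

| Configuration | Accuracy | Configuration | Accuracy | Configuration | Accuracy | Configuration | Accuracy |
|---|---|---|---|---|---|---|---|
| (3,2)(2,1)(2,0) | 75.012 / 92.424 | (2,0)(2,1)(3,2) | 47.348 / 72.314 | (3,2)(3,1)(3,0) | 75.720 / 92.694 | (3,0)(3,1)(3,2) | 75.604 / 92.768 |
| (F)(3,2)(2,0) | 75.454 / 92.628 | (2,0)(3,2)(F) | 50.360 / 75.024 | (3,2)(3,1)(2,0) | 75.336 / 92.554 | (2,0)(3,1)(3,2) | 48.672 / 73.536 |
| (F)(2,1)(2,0) | 75.380 / 92.438 | (2,0)(2,1)(F) | 52.724 / 76.764 | | | | |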
In Table 5, the key observation is that it is important to have high precision at the beginning of training. Compared with the case that training starts with full-precision activations and ends with aggressively reduced precision, (F)(3,2)(2,0), the opposite case, (2,0)(3,2)(F), gives significantly lower accuracy, 75.454% vs. 50.360%. Another important observation is that activation annealing works. For instance, (3,2)(3,1)(3,0) gives almost the same result as (3,2)(3,2)(3,2) in Table 2 and a more aggressive case, (3,2)(3,1)(2,0), gives an accuracy that is lower by only 0.584%. Thus, as training advances, we need a smaller amount of large data, which means we can have a smaller memory cost of activations. This can be exploited for memory management in servers. We expect it can also be utilized in memory-efficient server-mobile co-training in federated learning [13], where the later stage of training, which requires a smaller memory cost, can be performed on memory-limited mobile devices while meeting the requirements of user-specific adaptation using private data.
Figure 3 shows the training loss of different RVQuant configurations during training. First, the figure shows that too aggressive quantization at the beginning of training, i.e., (2,0)(3,2)(F), does not catch up with the loss of full-precision training (Full in the figure). The figure also shows that the configuration of 3-bit 2% RVQuant gives almost the same loss as the full-precision training.
5.2 Inference Results
Figure 4 shows the accuracy of quantized models across different configurations of bitwidth and AR. We apply the same bitwidth of low precision to both weights and activations and 16 bits to large values of weights and activations. In addition, we quantize all the layers including the first (quantized weights) and last convolutional layers. As the figure shows, VQuant with fine-tuning, at 4 bits and an AR of 1%, gives accuracy comparable to full precision in all the networks, within 1% of top-1 accuracy. If VQuant is applied without fine-tuning, a larger AR needs to be used to compensate for the accuracy drop due to quantization. However, the figure shows that fine-tuning successfully closes the accuracy gap between VQuant and full-precision networks.
Figure 5 illustrates the effect of large values on the classification ability. The figure shows the principal component analysis (PCA) results of the last convolutional layer of AlexNet for four classes (four colors). Figure 5 (a) shows the PCA result of the full-precision network. As Figure 5 (b) shows, when the conventional 4-bit linear quantization, i.e., 4-bit 0% VQuant, is applied to weights/activations, it is difficult to successfully classify the four groups of data. However, as Figure 5 (c) shows, only a very small amount (0.1%) of large values can improve the situation. As more large values are utilized, the classification ability continues to improve (3% in Figure 5 (d)). The figure demonstrates that our idea of reducing quantization errors for the majority of data by separately handling large data is effective in keeping good representations.
5.3 LSTM Language Model
We apply VQuant to an LSTM for word-level language modeling [26, 23, 10]. Table 6 shows the results for the models. Each of the large and small models has two layers. The large model has 1,500 hidden units and the small one 200 units. We measure word-level perplexity on the Penn Treebank data [15]. We apply VQuant only to the weights of the models since clipping is applied to the activations.^{3}
^{3} The distribution of activations obtained by clipping tends to have a large population near the maximum/minimum values. Considering that clipped activation functions like ReLU6 are useful, it will be interesting to further investigate clipping-aware quantization to take such large values into account.
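For reference, word-level perplexity is the exponential of the average negative log-likelihood that the model assigns to each target word. The sketch below shows that standard definition; the function name `perplexity` and the choice of natural-log inputs are assumptions for illustration, not details taken from the evaluation code.

```python
import math


def perplexity(token_log_probs):
    """Word-level perplexity from natural-log probabilities of the target words.

    `token_log_probs` holds log p(w_t | w_<t) for every word in the corpus.
    """
    if not token_log_probs:
        raise ValueError("need at least one token")
    avg_nll = -sum(token_log_probs) / len(token_log_probs)
    return math.exp(avg_nll)
```

A model that assigns probability 1/4 to every target word has a perplexity of 4 under this definition, so lower values mean the model is less uncertain about the next word.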
As Table 6 shows, we evaluate three bitwidths (2, 3 and 4 bits) and two ratios of large weights (1% and 3%). For the large model, the 4-bit 1% VQuant preserves the accuracy of the full-precision model. However, the small model requires a larger ratio of large weights (3%) in order to keep the accuracy.
Table 6: Validation and test perplexity of the large and small LSTM models for 1% and 3% ratios of large weights.
        Large, 1%        Large, 3%        Small, 1%         Small, 3%
        Valid   Test     Valid   Test     Valid    Test     Valid    Test
float   75.34   72.31    75.34   72.31    103.64   99.24    103.64   99.24
2-bit   79.92   77.31    77.87   74.99    140.70   135.11   122.25   117.76
3-bit   76.19   73.22    75.79   72.72    107.60   102.82   105.99   101.44
4-bit   75.46   72.48    75.44   72.44    104.22   99.83    103.95   99.57
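To make the gaps in the table above easier to compare across model sizes, the following sketch computes the relative increase in test perplexity over the float baseline. The numbers are copied from the table; the dictionary layout and the helper name `relative_increase` are illustrative assumptions.

```python
# Test perplexities copied from the table above (lower is better).
FLOAT_TEST = {"large": 72.31, "small": 99.24}

# Keys: (model size, ratio of large weights in %) -> {bitwidth: test perplexity}
TEST_PPL = {
    ("large", 1): {2: 77.31, 3: 73.22, 4: 72.48},
    ("large", 3): {2: 74.99, 3: 72.72, 4: 72.44},
    ("small", 1): {2: 135.11, 3: 102.82, 4: 99.83},
    ("small", 3): {2: 117.76, 3: 101.44, 4: 99.57},
}


def relative_increase(model, ratio, bits):
    """Percentage increase in test perplexity relative to the float model."""
    base = FLOAT_TEST[model]
    return (TEST_PPL[(model, ratio)][bits] - base) / base * 100.0


if __name__ == "__main__":
    for (model, ratio), by_bits in TEST_PPL.items():
        for bits in sorted(by_bits):
            print(f"{model:5s} {ratio}% {bits}-bit: "
                  f"+{relative_increase(model, ratio, bits):.2f}%")
```

Expressed this way, the 4-bit results stay within a fraction of a percent of the float baseline for both models, while the small model degrades far more sharply than the large one at 2 bits.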
6 Conclusions
We presented a novel value-aware quantization to reduce memory cost in training and computation/memory cost in inference. In order to realize aggressively low precision, we proposed separately handling a small amount of large data and applying reduced precision to the majority of small data, which contributes to reducing total quantization errors. In order to apply our idea to training, we proposed quantized backpropagation, which utilizes quantized activations only during backpropagation. For inference, we proposed applying fine-tuning to quantized networks to recover from accuracy loss due to quantization. Our experiments show that our proposed method significantly outperforms the state-of-the-art method for low-memory-cost training of deep networks, e.g., 41.6% and 53.7% smaller memory cost in ResNet152 and Inceptionv3, respectively. It also enables 4-bit inference (with 1% large data) for deep networks such as ResNet101 and DenseNet121, and 5-bit inference for efficient networks such as SqueezeNet1.1 and MobileNetv2, within 1% of additional top-1 accuracy loss.
References
 [1] BakunasMilanowski, D., et al.: Efficient algorithms for stream compaction on gpus. International Journal of Networking and Computing (IJNC) 7(2), 208–226 (2017)
 [2] Chen, T., et al.: Training deep nets with sublinear memory cost. arXiv:1604.06174 (2016)
 [3] De Sa, C., et al.: Highaccuracy lowprecision training. arXiv:1803.03383 (2018)
 [4] Ginsburg, B., et al.: NVIDIA Mixed Precision Training on Volta GPUs. GPU Technology Conference (2017)
 [5] Gomez, A.N., et al.: The reversible residual network: Backpropagation without storing activations. In: Advances in Neural Information Processing Systems (NIPS). pp. 2211–2221 (2017)
 [6] Hazelwood, K., et al.: Applied machine learning at facebook: A datacenter infrastructure perspective. International Symposium on High-Performance Computer Architecture (HPCA) (2018)
 [7] He, X., et al.: Practical lessons from predicting clicks on ads at facebook. In: International Workshop on Data Mining for Online Advertising (ADKDD). pp. 1–9. ACM (2014)
 [8] Hong, S., Kim, H.: An integrated GPU power and performance model. In: International Symposium on Computer Architecture (ISCA). pp. 280–289 (2010)
 [9] Hubara, I., et al.: Quantized neural networks: Training neural networks with low precision weights and activations. arXiv:1609.07061 (2016)
 [10] Inan, H., Khosravi, K., Socher, R.: Tying word vectors and word classifiers: A loss framework for language modeling. arXiv:1611.01462 (2016)
 [11] Jia, Y., Peter, V.: Delivering realtime ai in the palm of your hand. https://code.facebook.com/posts/196146247499076/deliveringrealtimeaiinthepalmofyourhand/, accessed: 2018314
 [12] Jouppi, N.P., et al.: In-datacenter performance analysis of a tensor processing unit. In: International Symposium on Computer Architecture (ISCA). pp. 1–12 (2017)
 [13] Konečnỳ, J., et al.: Federated learning: Strategies for improving communication efficiency. arXiv:1610.05492 (2016)
 [14] Marcel, S., Rodriguez, Y.: Torchvision the machine-vision package of torch. ACM Multimedia (2010)
 [15] Marcus, M.P., Marcinkiewicz, M.A., Santorini, B.: Building a large annotated corpus of English: The Penn Treebank. Computational linguistics 19(2), 313–330 (1993)
 [16] Migacz, S.: NVIDIA 8-bit inference with TensorRT. GPU Technology Conference (2017)
 [17] Mishra, A., Marr, D.: Apprentice: Using knowledge distillation techniques to improve lowprecision network accuracy. arXiv:1711.05852 (2017)
 [18] Mishra, A., et al.: Wrpn: Wide reducedprecision networks. arXiv:1709.01134 (2017)
 [19] Miyashita, D., Lee, E.H., Murmann, B.: Convolutional neural networks using logarithmic data representation. arXiv:1603.01025 (2016)
 [20] Park, E., Ahn, J., Yoo, S.: Weighted-entropy-based quantization for deep neural networks. Computer Vision and Pattern Recognition (CVPR) pp. 7197–7205 (2017)
 [21] Paszke, A., et al.: Pytorch (2017)
 [22] Polino, A., Pascanu, R., Alistarh, D.: Model compression via distillation and quantization. International Conference on Learning Representation (ICLR) (2018)
 [23] Press, O., Wolf, L.: Using the output embedding to improve language models. In: the European Chapter of the Association for Computational Linguistics (EACL). pp. 157–163 (2017)
 [24] Rastegari, M., et al.: XNOR-Net: ImageNet classification using binary convolutional neural networks. the European Conference on Computer Vision (ECCV) pp. 525–542 (2016)
 [25] Umuroglu, Y., et al.: Finn: A framework for fast, scalable binarized neural network inference. In: ACM/SIGDA International Symposium on FieldProgrammable Gate Arrays (FPGA). pp. 65–74 (2017)
 [26] Zaremba, W., Sutskever, I., Vinyals, O.: Recurrent neural network regularization. arXiv:1409.2329 (2014)
 [27] Zhou, S., et al.: Dorefanet: Training low bitwidth convolutional neural networks with low bitwidth gradients. arXiv:1606.06160 (2016)
 [28] Zhou, S., et al.: Balanced quantization: An effective and efficient approach to quantized neural networks. J. Comput. Sci. Technol. 32(4), 667–682 (2017)
 [29] Zhu, C., et al.: Trained ternary quantization. arXiv:1612.01064 (2016)