TResNet: High Performance GPU-Dedicated Architecture, from Alibaba DAMO Academy, was published at WACV 2021. The paper introduces a series of architectural modifications aimed at improving the accuracy of neural networks while maintaining their GPU training and inference efficiency.
The paper first discusses the bottleneck caused by FLOP-oriented optimization. Designs that make better use of GPU structures are then suggested. Finally, a new GPU-specific model is introduced, called TResNet.
The table above compares ResNet50 to popular newer architectures with similar ImageNet top-1 accuracy: ResNet50-D, ResNeXt50, SEResNeXt50 (SENet+ResNeXt), EfficientNet-B1 [36] and MixNet-L (MixConv). Compared with ResNet50, the FLOP reductions and new tricks in these newer networks did not translate into higher GPU throughput.
Some recent networks such as EfficientNet, ResNeXt and MixNet (MixConv) make extensive use of depthwise and 1×1 convolutions, which require far fewer FLOPs than 3×3 convolutions. GPUs, however, are usually limited by memory access cost rather than by the number of computations, especially for low-FLOP layers. Networks such as ResNeXt and MixNet (MixConv) also make extensive use of multi-path designs. During training, this creates a large number of activation maps that must be stored for backpropagation. The extra memory use reduces the maximum batch size and thus GPU throughput.
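To make the memory-access argument concrete, the following rough sketch estimates FLOPs and arithmetic intensity (FLOPs per byte of memory traffic) for three layer types. The layer shape (56×56 feature map, 256 channels), the 2-byte storage size, and the simplified traffic model (each tensor is read or written exactly once) are illustrative assumptions, not figures from the paper.

```python
def conv_cost(h, w, c_in, c_out, k, groups=1, bytes_per_value=2):
    """Return (flops, bytes_moved) for one stride-1, same-padded conv layer.

    Idealized model (an assumption): the input and the weights are read once,
    the output is written once, and every value takes `bytes_per_value` bytes.
    """
    weights = c_out * (c_in // groups) * k * k
    macs = h * w * weights                 # one multiply-accumulate per weight per output pixel
    flops = 2 * macs                       # a multiply and an add
    values_moved = h * w * c_in + h * w * c_out + weights
    return flops, values_moved * bytes_per_value


# Hypothetical layer shape, chosen only for illustration.
H = W = 56
C = 256

layers = {
    "dense 3x3": conv_cost(H, W, C, C, k=3),
    "dense 1x1": conv_cost(H, W, C, C, k=1),
    "depthwise 3x3": conv_cost(H, W, C, C, k=3, groups=C),
}

for name, (flops, nbytes) in layers.items():
    print(f"{name:14s} {flops / 1e6:10.1f} MFLOPs {flops / nbytes:8.1f} FLOPs/byte")
```

Under these assumptions the depthwise layer needs far fewer FLOPs than the dense 3×3 layer, but it also performs far fewer FLOPs per byte moved. That is the regime in which memory access, rather than arithmetic, tends to bound GPU time.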
TResNet was proposed to achieve high accuracy while maintaining high GPU utilization.
TResNet: Improvements and Changes to ResNet
TResNet comes in three variants, TResNet-M, TResNet-L and TResNet-XL, which differ only in depth and number of channels.
The ResNet50 stem consists of a stride-2 conv7×7 followed by a max pooling layer. ResNet-D replaces the conv7×7 with three conv3×3 layers, which improves accuracy but reduces training throughput. TResNet instead uses a dedicated SpaceToDepth transformation layer that rearranges blocks of spatial data into the depth (channel) dimension. The SpaceToDepth layer is followed by a simple convolution that produces the required number of channels.
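The rearrangement can be sketched without any framework. The function below is a minimal illustration on nested lists in channel-height-width layout. The function name, the channel ordering of the output, and the block size used in the demo are assumptions made for the example, and a real implementation would use a tensor reshape and permute instead of Python loops.

```python
def space_to_depth(x, block):
    """Move each `block` x `block` spatial patch into the channel dimension.

    x is a nested list indexed as x[channel][row][col]. The result has
    channels * block * block channels and spatial size (H / block, W / block).
    No value is dropped or changed, only moved.
    """
    channels, height, width = len(x), len(x[0]), len(x[0][0])
    assert height % block == 0 and width % block == 0, "size must divide evenly"
    out = []
    for dy in range(block):            # row offset inside each patch
        for dx in range(block):        # column offset inside each patch
            for ch in range(channels):
                out.append([
                    [x[ch][i * block + dy][j * block + dx]
                     for j in range(width // block)]
                    for i in range(height // block)
                ])
    return out


# One channel, 4x4 image holding the values 0..15 in row-major order.
image = [[[r * 4 + c for c in range(4)] for r in range(4)]]
packed = space_to_depth(image, block=2)

print(len(packed), len(packed[0]), len(packed[0][0]))  # 4 2 2
print(packed[0])                                       # [[0, 2], [8, 10]]
```

Because the operation only moves data, the spatial resolution drops while all input information is kept, and the following convolution is left to mix the stacked channels.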
Anti-Alias Downsampling (AA)
Each stride-2 convolution is replaced by a stride-1 convolution followed by a 3×3 blur filter with stride 2.
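The blur-then-subsample step can be illustrated on a single 2D feature map. This is a minimal sketch: the binomial kernel weights ([1, 2, 1] outer product, normalized to sum to 1) and the edge-replicating padding are assumptions for the example, since the text above only specifies a 3×3 blur filter with stride 2. The preceding stride-1 convolution is omitted.

```python
BLUR_1D = [1, 2, 1]
# 3x3 binomial kernel, weights sum to 1 (assumed; the exact weights are not given above).
BLUR_3X3 = [[a * b / 16 for b in BLUR_1D] for a in BLUR_1D]


def blur_downsample(fm, stride=2):
    """Apply a 3x3 blur at every `stride`-th position of a 2D feature map."""
    height, width = len(fm), len(fm[0])

    def at(i, j):
        # Replicate the nearest edge value for out-of-range indices (assumed padding).
        return fm[min(max(i, 0), height - 1)][min(max(j, 0), width - 1)]

    out = []
    for i in range(0, height, stride):
        row = []
        for j in range(0, width, stride):
            acc = 0.0
            for di in range(3):
                for dj in range(3):
                    acc += BLUR_3X3[di][dj] * at(i + di - 1, j + dj - 1)
            row.append(acc)
        out.append(row)
    return out


# A checkerboard is the highest-frequency pattern a grid can hold.
checker = [[float((i + j) % 2) for j in range(4)] for i in range(4)]
naive = [row[::2] for row in checker[::2]]   # plain stride-2 subsampling
smooth = blur_downsample(checker)

print(naive)         # [[0.0, 0.0], [0.0, 0.0]]
print(smooth[1][1])  # 0.5
```

Plain subsampling sees only the zeros of the checkerboard, so the result depends entirely on how the grid is aligned. With the blur, the interior output equals the local average, which is the aliasing problem this layer is meant to reduce.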
In-Place Activated BatchNorm (Inplace-ABN)
All BatchNorm+ReLU layers are replaced by Inplace-ABN layers, which fuse BatchNorm and the activation into a single in-place operation. This significantly reduces the memory required to train deep networks, at only a slight increase in computational cost. In addition, Leaky-ReLU replaces the plain ReLU of ResNet50.
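One way to see why an invertible activation fits this design is the round trip below. It is a minimal sketch, assuming the layer keeps only its output and recovers the normalized input in the backward pass by inverting the activation and the scale-and-shift step. The slope, gamma and beta values are hypothetical, and gamma must be non-zero for the inversion to work.

```python
import math


def leaky_relu(x, slope):
    return x if x >= 0 else slope * x


def leaky_relu_inverse(y, slope):
    # A positive slope preserves the sign, so the sign of y tells which branch was taken.
    return y if y >= 0 else y / slope


def scale_shift(x_hat, gamma, beta):
    # Affine part of BatchNorm, applied to an already normalized value.
    return gamma * x_hat + beta


def scale_shift_inverse(z, gamma, beta):
    return (z - beta) / gamma


# Hypothetical parameters, chosen only for illustration.
SLOPE, GAMMA, BETA = 0.01, 1.5, -0.2

normalized = [-2.0, -0.5, 0.0, 0.4, 3.0]
outputs = [leaky_relu(scale_shift(v, GAMMA, BETA), SLOPE) for v in normalized]

# Only `outputs` would need to be stored; the normalized values are recomputed from it.
recovered = [scale_shift_inverse(leaky_relu_inverse(y, SLOPE), GAMMA, BETA)
             for y in outputs]

print(all(math.isclose(a, b, abs_tol=1e-9) for a, b in zip(normalized, recovered)))
```

A plain ReLU maps every negative input to zero, so the same round trip would lose information. Recomputing the values instead of storing them is also where the slight extra computational cost comes from.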
Novel Block-Type Selection
Bottleneck layers have higher GPU usage than BasicBlock layers and provide better accuracy, whereas BasicBlock layers have a larger receptive field. BasicBlock layers are therefore placed in the first two stages of the network, and Bottleneck layers in the last two stages. As in earlier work, the initial number of channels and the number of residual blocks in stage 3 are also modified. The architecture details are shown in the table above.
Optimized SE Layers
TResNet BasicBlock and Bottleneck design (stride 1). IBN = Inplace-BatchNorm, r = reduction factor
SE layers are placed only in the first three stages of the network to gain the best speed-accuracy trade-off. For Bottleneck units, an SE module is added after the conv3×3 operation with a reduction factor of 8 (r = 8). For BasicBlock units, an SE module is added just before the residual sum with a reduction factor of 4 (r = 4).
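The block-type and SE placement rules described so far can be summarized as a small configuration. This is only a sketch of the layout as described in this review: the class and helper names are hypothetical, and block counts and channel widths are left out because they are given only in the architecture table.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class StagePlan:
    stage: int
    block: str                    # "BasicBlock" or "Bottleneck"
    se_reduction: Optional[int]   # None means the stage has no SE module


# Stages 1-2 use BasicBlock with SE (r = 4); stage 3 uses Bottleneck with SE (r = 8);
# stage 4 uses Bottleneck without SE.
TRESNET_STAGES = [
    StagePlan(1, "BasicBlock", 4),
    StagePlan(2, "BasicBlock", 4),
    StagePlan(3, "Bottleneck", 8),
    StagePlan(4, "Bottleneck", None),
]


def se_hidden_channels(channels, reduction):
    """Width of the SE squeeze layer for a given reduction factor (standard SE definition)."""
    return max(channels // reduction, 1)


for plan in TRESNET_STAGES:
    se = f"SE, r = {plan.se_reduction}" if plan.se_reduction else "no SE"
    print(f"stage {plan.stage}: {plan.block:10s} {se}")

# Hypothetical channel count, used only to show the effect of the reduction factor.
print(se_hidden_channels(256, 8))  # 32
```

A larger reduction factor gives a narrower SE module, which suggests why the more expensive Bottleneck stage uses r = 8 while the BasicBlock stages use r = 4.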
In addition to the architectural improvements, the following code optimizations were made.
JIT compilation dynamically compiles high-level code into efficient, optimized machine code at execution time. This is in contrast to the default Pythonic option of running code dynamically through the interpreter. For the AA and SpaceToDepth modules, JIT compilation was found to reduce GPU cost by almost a factor of two.
Inplace operations directly change the content of a given tensor without making a copy in memory, which avoids creating activation maps that are not needed for backpropagation. In-place operations are therefore used wherever possible. As a result, the maximum batch size of TResNet-M reaches 512, almost twice that of ResNet50.
Fast Global Average Pooling is a simple specialized implementation of GAP, with code optimized for the specific case of (1,1) spatial output. On GPU it is 5x faster than the boilerplate implementation.
The GPU throughput of TResNet-M is similar to ResNet50, and the validation accuracy on ImageNet is significantly improved (+1.8%). It takes less than 24 hours to train the TResNet-M and ResNet50 models on an 8×V100 GPU machine, which shows that the training scheme is also efficient and economical. Another advantage of the TResNet model is its ability to handle much larger batch sizes than other models.
- While an increase in GPU throughput is expected, the fact that the accuracy is also (slightly) improved when replacing the ResNet stem with a "cheaper" SpaceToDepth unit is somewhat surprising.
- Block type selection provides significant improvements on all metrics.
- Inplace-ABN significantly increases the maximum batch size. Its impact on speed is mixed: inference speed increases, but training speed decreases.
- The optimized SE + Anti-Aliasing layer significantly improves ImageNet top-1 accuracy at the cost of reduced model GPU throughput.
The impact of code optimizations on inference speed, training speed, and maximum batch size for the TResNet-M model
Among the optimizations, in-place operations provide the biggest boost: they not only increase GPU throughput but also significantly increase the maximum batch size, because they avoid creating unneeded activation maps for backpropagation.
A TResNet model pretrained on ImageNet at an input resolution of 224 was used as a starting point and fine-tuned at an input resolution of 448 for 10 epochs. The TResNet models scale well to high resolution. Even a relatively small and compact model like TResNet-M achieves 83.2% top-1 accuracy on ImageNet with high-resolution inputs.
Comparison with EfficientNet Model
Along the top-1 accuracy curve, the TResNet models provide better inference-speed versus accuracy and training-speed versus accuracy trade-offs than the EfficientNet models.
TResNet is compared with state-of-the-art models on transfer learning datasets (considering only results based on ImageNet transfer learning), using ImageNet pre-training and fine-tuning the models for 80 epochs. TResNet surpasses or matches state-of-the-art accuracy on 3 of 4 datasets, with 8-15x faster GPU inference.
The TResNet-based solution significantly outperforms the previous top solution on the MSCOCO multi-label dataset, substantially improving the known SOTA from 83.7 mAP to 86.4 mAP.
Using FCOS as the object detector, TResNet-M outperforms ResNet50 on this object detection task, improving the COCO mAP score from 42.8 to 44.0.
[2021 WACV] TResNet: High Performance GPU-Dedicated Architecture
Author: Sik-Ho Tsang