Using a TPU to Implement ResNet50 in PyTorch


By Dr. Vaibhav Kumar
Compiled by VK
Source: Analytics India Magazine

PyTorch has been advancing computer vision and deep learning by providing a large number of powerful tools and techniques.

In computer vision, deep learning workloads must process large image datasets, so an accelerated environment is needed to speed up execution while reaching an acceptable level of accuracy.

PyTorch provides this capability through XLA (Accelerated Linear Algebra), a linear algebra compiler that can target a variety of hardware types, including GPUs and TPUs. The PyTorch/XLA environment integrates with Google Cloud TPUs to achieve faster execution.

In this article, we demonstrate an implementation of the deep convolutional neural network ResNet50 on a TPU in PyTorch.

The model will be trained and tested in the PyTorch/XLA environment on the CIFAR-10 classification task. We will also measure the time taken by the 50-epoch training run.

Implementation of ResNet50 in PyTorch

To take advantage of the TPU, this implementation is done in Google Colab. First, we need to select TPU as the hardware accelerator under the notebook settings.

After selecting the TPU, we verify the environment with the following lines of code:

import os
assert os.environ['COLAB_TPU_ADDR']

If the TPU is enabled, these lines execute successfully; otherwise they throw KeyError: 'COLAB_TPU_ADDR'. You can also check the TPU by printing its address.

TPU_Path = 'grpc://'+os.environ['COLAB_TPU_ADDR']
print('TPU Address:', TPU_Path)
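As a side note, a slightly more defensive variant of this check uses os.environ.get(), which returns None instead of raising a KeyError when no TPU runtime is attached. A minimal sketch:

```python
import os

# os.environ.get() returns None (no exception) when the variable is absent,
# i.e. when the notebook is not attached to a TPU runtime.
tpu_addr = os.environ.get('COLAB_TPU_ADDR')
if tpu_addr:
    print('TPU Address: grpc://' + tpu_addr)
else:
    print('No TPU detected - select TPU under the notebook settings')
```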

In the next step, we install the PyTorch/XLA environment to speed up execution. (In the previous article, we implemented a convolutional neural network.)

VERSION = "20200516"
!curl https://raw.githubusercontent.com/pytorch/xla/master/contrib/scripts/env-setup.py -o pytorch-xla-env-setup.py
!python pytorch-xla-env-setup.py --version $VERSION

Now we import all of the required libraries.

from matplotlib import pyplot as plt
import numpy as np
import os
import time
import torch
import torch.nn as nn
import torch.nn.functional as F
import torch.optim as optim
import torch_xla
import torch_xla.core.xla_model as xm
import torch_xla.debug.metrics as met
import torch_xla.distributed.parallel_loader as pl
import torch_xla.distributed.xla_multiprocessing as xmp
import torch_xla.utils.utils as xu
import torchvision
from torchvision import datasets, transforms
from google.colab.patches import cv2_imshow
import cv2

After importing the libraries, we define and initialize the required parameters.

#Defining parameters
FLAGS = {}
FLAGS['data_dir'] = "/tmp/cifar"
FLAGS['batch_size'] = 128
FLAGS['num_workers'] = 4
FLAGS['learning_rate'] = 0.02
FLAGS['momentum'] = 0.9
FLAGS['num_epochs'] = 50
FLAGS['num_cores'] = 8
FLAGS['log_steps'] = 20
FLAGS['metrics_debug'] = False
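A quick sanity check of what these flags imply on an 8-core TPU: each core processes batch_size samples per step, so the effective global batch grows with the core count, which is why the training code later scales the learning rate by xm.xrt_world_size(). A sketch of that arithmetic, assuming all 8 cores are used:

```python
# Per-core flag values from above
batch_size = 128
learning_rate = 0.02
num_cores = 8  # what xm.xrt_world_size() returns on a full Cloud TPU

# Each core runs its own replica, so the effective global batch per step
# is the per-core batch times the number of cores.
global_batch = batch_size * num_cores   # 1024
scaled_lr = learning_rate * num_cores   # 0.16, the linear scaling rule
print(global_batch, scaled_lr)
```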

In the next step, we define the ResNet50 model.

class BasicBlock(nn.Module):
  expansion = 1

  def __init__(self, in_planes, planes, stride=1):
    super(BasicBlock, self).__init__()
    self.conv1 = nn.Conv2d(
        in_planes, planes, kernel_size=3, stride=stride, padding=1, bias=False)
    self.bn1 = nn.BatchNorm2d(planes)
    self.conv2 = nn.Conv2d(
        planes, planes, kernel_size=3, stride=1, padding=1, bias=False)
    self.bn2 = nn.BatchNorm2d(planes)

    self.shortcut = nn.Sequential()
    if stride != 1 or in_planes != self.expansion * planes:
      self.shortcut = nn.Sequential(
          nn.Conv2d(in_planes, self.expansion * planes, kernel_size=1,
                    stride=stride, bias=False),
          nn.BatchNorm2d(self.expansion * planes))

  def forward(self, x):
    out = F.relu(self.bn1(self.conv1(x)))
    out = self.bn2(self.conv2(out))
    out += self.shortcut(x)
    out = F.relu(out)
    return out

class ResNet(nn.Module):

  def __init__(self, block, num_blocks, num_classes=10):
    super(ResNet, self).__init__()
    self.in_planes = 64

    self.conv1 = nn.Conv2d(
        3, 64, kernel_size=3, stride=1, padding=1, bias=False)
    self.bn1 = nn.BatchNorm2d(64)
    self.layer1 = self._make_layer(block, 64, num_blocks[0], stride=1)
    self.layer2 = self._make_layer(block, 128, num_blocks[1], stride=2)
    self.layer3 = self._make_layer(block, 256, num_blocks[2], stride=2)
    self.layer4 = self._make_layer(block, 512, num_blocks[3], stride=2)
    self.linear = nn.Linear(512 * block.expansion, num_classes)

  def _make_layer(self, block, planes, num_blocks, stride):
    strides = [stride] + [1] * (num_blocks - 1)
    layers = []
    for stride in strides:
      layers.append(block(self.in_planes, planes, stride))
      self.in_planes = planes * block.expansion
    return nn.Sequential(*layers)

  def forward(self, x):
    out = F.relu(self.bn1(self.conv1(x)))
    out = self.layer1(out)
    out = self.layer2(out)
    out = self.layer3(out)
    out = self.layer4(out)
    out = F.avg_pool2d(out, 4)
    out = torch.flatten(out, 1)
    out = self.linear(out)
    return F.log_softmax(out, dim=1)

def ResNet50():
  # Note: the canonical ResNet-50 uses Bottleneck blocks; this lighter
  # variant keeps BasicBlock with the [3, 4, 6, 3] stage depths that the
  # ResNet class above expects.
  return ResNet(BasicBlock, [3, 4, 6, 3])
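To see why forward() ends with a 4x4 average pool, it helps to trace how the feature-map resolution shrinks for a 32x32 CIFAR-10 input: only the stages that start with a stride-2 block halve the spatial size. A small sketch of that arithmetic:

```python
# Spatial size of the feature maps for a 32x32 CIFAR-10 input.
# Strides taken from the layer definitions above: conv1 and layer1 use
# stride 1; layer2, layer3 and layer4 each start with a stride-2 block.
size = 32
for stride in [1, 1, 2, 2, 2]:
    size //= stride
print(size)  # 4, so F.avg_pool2d(out, 4) collapses it to 1x1
```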

The following code snippet defines functions that load the CIFAR-10 dataset, prepare the training and test sets, and run the training and testing loops.

SERIAL_EXEC = xmp.MpSerialExecutor()
#Model weights are instantiated only once in memory.
WRAPPED_MODEL = xmp.MpModelWrapper(ResNet50())

def train_resnet50():

  def get_dataset():
    norm = transforms.Normalize(
        mean=(0.4914, 0.4822, 0.4465), std=(0.2023, 0.1994, 0.2010))
    transform_train = transforms.Compose([
        transforms.RandomCrop(32, padding=4),
        transforms.RandomHorizontalFlip(),
        transforms.ToTensor(),
        norm,
    ])
    transform_test = transforms.Compose([
        transforms.ToTensor(),
        norm,
    ])
    train_dataset = datasets.CIFAR10(
        root=FLAGS['data_dir'],
        train=True,
        download=True,
        transform=transform_train)
    test_dataset = datasets.CIFAR10(
        root=FLAGS['data_dir'],
        train=False,
        download=True,
        transform=transform_test)
    return train_dataset, test_dataset

  #Using the serial executor avoids multiple processes
  #downloading the same data.
  train_dataset, test_dataset = SERIAL_EXEC.run(get_dataset)

  train_sampler = torch.utils.data.distributed.DistributedSampler(
      train_dataset,
      num_replicas=xm.xrt_world_size(),
      rank=xm.get_ordinal(),
      shuffle=True)
  train_loader = torch.utils.data.DataLoader(
      train_dataset,
      batch_size=FLAGS['batch_size'],
      sampler=train_sampler,
      num_workers=FLAGS['num_workers'],
      drop_last=True)
  test_loader = torch.utils.data.DataLoader(
      test_dataset,
      batch_size=FLAGS['batch_size'],
      shuffle=False,
      num_workers=FLAGS['num_workers'],
      drop_last=True)

  #Scale learning rate
  learning_rate = FLAGS['learning_rate'] * xm.xrt_world_size()

  #Get the loss function, optimizer, and model
  device = xm.xla_device()
  model = WRAPPED_MODEL.to(device)
  optimizer = optim.SGD(model.parameters(), lr=learning_rate,
                        momentum=FLAGS['momentum'], weight_decay=5e-4)
  loss_fn = nn.NLLLoss()

  def train_loop_fn(loader):
    tracker = xm.RateTracker()
    model.train()
    for x, (data, target) in enumerate(loader):
      optimizer.zero_grad()
      output = model(data)
      loss = loss_fn(output, target)
      loss.backward()
      xm.optimizer_step(optimizer)
      tracker.add(FLAGS['batch_size'])
      if x % FLAGS['log_steps'] == 0:
        print('[xla:{}]({}) Loss={:.2f} Time={}'.format(
            xm.get_ordinal(), x, loss.item(), time.asctime()), flush=True)

  def test_loop_fn(loader):
    total_samples = 0
    correct = 0
    model.eval()
    data, pred, target = None, None, None
    for data, target in loader:
      output = model(data)
      pred = output.max(1, keepdim=True)[1]
      correct += pred.eq(target.view_as(pred)).sum().item()
      total_samples += data.size()[0]

    accuracy = 100.0 * correct / total_samples
    print('[xla:{}] Accuracy={:.2f}%'.format(
        xm.get_ordinal(), accuracy), flush=True)
    return accuracy, data, pred, target

  #Training and evaluation loop
  accuracy = 0.0
  data, pred, target = None, None, None
  for epoch in range(1, FLAGS['num_epochs'] + 1):
    para_loader = pl.ParallelLoader(train_loader, [device])
    train_loop_fn(para_loader.per_device_loader(device))
    xm.master_print("Finished training epoch {}".format(epoch))

    para_loader = pl.ParallelLoader(test_loader, [device])
    accuracy, data, pred, target  = test_loop_fn(para_loader.per_device_loader(device))
    if FLAGS['metrics_debug']:
      xm.master_print(met.metrics_report(), flush=True)

  return accuracy, data, pred, target
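The accuracy computation inside test_loop_fn can be checked outside the TPU environment. Here is the same logic sketched in plain NumPy (the scores and labels below are made-up illustration values):

```python
import numpy as np

def batch_accuracy(output, target):
    """output: (N, num_classes) score matrix; target: (N,) class indices."""
    # argmax over classes plays the role of output.max(1, keepdim=True)[1]
    pred = output.argmax(axis=1)
    correct = (pred == target).sum()
    return 100.0 * correct / target.shape[0]

scores = np.array([[0.1, 0.9],
                   [0.8, 0.2],
                   [0.3, 0.7]])
labels = np.array([1, 0, 0])
print(batch_accuracy(scores, labels))  # 2 of 3 correct, so ~66.67
```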

Now we start training ResNet50. Training runs for the 50 epochs defined in the parameters. We record the time before training starts and print the total time after it finishes.

start_time = time.time()
#Start training process
def training(rank, flags):
  global FLAGS
  FLAGS = flags
  accuracy, data, pred, target = train_resnet50()
  if rank == 0:
    #Retrieve the tensor on TPU core 0 and draw it.
    plot_results(data.cpu(), pred.cpu(), target.cpu())

xmp.spawn(training, args=(FLAGS,), nprocs=FLAGS['num_cores'],
          start_method='fork')
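The code above calls plot_results on TPU core 0, but the article never shows its definition. A minimal, hypothetical sketch of such a helper follows: the function name and arguments mirror the call above, while everything else (grid layout, output file name) is an assumption. It draws a row of CIFAR-10 images labeled with predicted vs. true classes:

```python
import numpy as np
import matplotlib
matplotlib.use('Agg')  # headless backend so this also runs outside a notebook
from matplotlib import pyplot as plt

def plot_results(data, pred, target, max_images=8, out_file='results.png'):
    """data: (N, 3, 32, 32) images; pred, target: class indices.
    Hypothetical helper - not from the original article."""
    data = np.asarray(data)
    pred = np.asarray(pred).reshape(-1)
    target = np.asarray(target).reshape(-1)
    n = min(max_images, data.shape[0])
    fig, axes = plt.subplots(1, n, figsize=(2 * n, 2.5))
    for i, ax in enumerate(np.atleast_1d(axes)):
        img = data[i].transpose(1, 2, 0)  # CHW -> HWC for imshow
        img = (img - img.min()) / (img.max() - img.min() + 1e-8)  # to [0, 1]
        ax.imshow(img)
        ax.set_title('pred {} / true {}'.format(pred[i], target[i]), fontsize=8)
        ax.axis('off')
    fig.savefig(out_file)
    plt.close(fig)
    return out_file
```

The .cpu() calls in the training code convert the XLA tensors to CPU tensors, which np.asarray accepts directly.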

When training ends, we print the time it took.

Finally, we visualize the model's predictions on the sample test data.

end_time = time.time()
print("Time taken = ", end_time-start_time)
