PyTorch multi-card training

Time: 2022-01-08

In the previous blog post, we implemented LeNet-5 by hand in PyTorch. Because only one of the two cards on the machine was used during training, I wanted to use both graphics cards to train the network at the same time. Of course, a shallow network like LeNet with a small dataset does not actually need two cards; this post is simply about how to use both.

Existing methods

Searching online, there are three common ways to do multi-card training:

  • nn.DataParallel
  • pytorch-encoding
  • DistributedDataParallel

The first method is PyTorch's built-in multi-card training method, but as the name suggests, it is not fully parallel: the data is split and computed in parallel on the two cards, while saving the model and computing the loss are concentrated on one of the cards. This is also why the memory usage of the two cards is uneven with this method.

The second method is a third-party package developed by others. It solves the problem that the loss computation is not parallel, and it also contains many other handy utilities. The code is released on its GitHub page; interested readers can take a look.

The third method is the most complex of the three. With it, each GPU computes the gradients for the shard of data assigned to it, and the gradients are then synchronized (all-reduced) across the GPUs. This is different from DataParallel, which gathers everything onto one GPU to compute the loss and update the parameters.
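For reference, below is a minimal sketch of what a DistributedDataParallel setup typically looks like. It is not used in this post; LeNet is the model from the previous post, and the address, port, and world size are just example values.

import os
import torch
import torch.distributed as dist
import torch.multiprocessing as mp
from torch.nn.parallel import DistributedDataParallel as DDP

def worker(rank, world_size):
    # Each process drives one GPU
    os.environ.setdefault("MASTER_ADDR", "127.0.0.1")
    os.environ.setdefault("MASTER_PORT", "29500")
    dist.init_process_group("nccl", rank=rank, world_size=world_size)

    torch.cuda.set_device(rank)
    model = LeNet().to(rank)               # one model replica per GPU
    model = DDP(model, device_ids=[rank])  # gradients are all-reduced across GPUs

    # ... build a DataLoader with a DistributedSampler and train as usual ...

    dist.destroy_process_group()

if __name__ == "__main__":
    mp.spawn(worker, args=(2,), nprocs=2)  # two processes for two GPUs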

Here I choose the first method, nn.DataParallel.

Code for parallel computing

First, check whether the machine has multiple graphics cards:

import os
import torch

USE_MULTI_GPU = True

# Check whether the machine has multiple graphics cards
if USE_MULTI_GPU and torch.cuda.device_count() > 1:
    MULTI_GPU = True
    os.environ["CUDA_DEVICE_ORDER"] = "PCI_BUS_ID"
    os.environ["CUDA_VISIBLE_DEVICES"] = "0,1"
    device_ids = [0, 1]
else:
    MULTI_GPU = False
device = torch.device("cuda:0" if torch.cuda.is_available() else "cpu")

Here, os.environ["CUDA_VISIBLE_DEVICES"] = "0,1" controls which GPUs in the machine are visible to PyTorch and how they are numbered.
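To make the numbering concrete, here is a small sketch (not part of the original code) showing how the visible cards map onto cuda:0 and cuda:1; note that the environment variables must be set before CUDA is initialized:

import os

# Must be set before the first CUDA call, otherwise they have no effect
os.environ["CUDA_DEVICE_ORDER"] = "PCI_BUS_ID"
os.environ["CUDA_VISIBLE_DEVICES"] = "0,1"

import torch

print(torch.cuda.device_count())      # 2 if both cards are visible
print(torch.cuda.get_device_name(0))  # the first visible card becomes cuda:0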

The next step is to load the model:

from torch import nn

net = LeNet()  # the LeNet defined in the previous post
if MULTI_GPU:
    net = nn.DataParallel(net, device_ids=device_ids)  # wrap the model for multi-card training
net.to(device)

The only difference from single-card training is the nn.DataParallel wrapping step.
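One related detail, a hedged note rather than part of the original code: after the wrapping, the underlying model is reachable as net.module, which matters for example when saving a checkpoint (the file name below is just a placeholder):

if MULTI_GPU:
    # Unwrap the DataParallel container so the checkpoint can be loaded on a single card
    torch.save(net.module.state_dict(), "lenet.pth")
else:
    torch.save(net.state_dict(), "lenet.pth")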

Next come the definitions of the optimizer and scheduler:

from torch import optim
from torch.optim.lr_scheduler import StepLR

optimizer = optim.Adam(net.parameters(), lr=1e-3)
scheduler = StepLR(optimizer, step_size=100, gamma=0.1)
if MULTI_GPU:
    optimizer = nn.DataParallel(optimizer, device_ids=device_ids)
    scheduler = nn.DataParallel(scheduler, device_ids=device_ids)

Because the optimizer and scheduler are now wrapped, the way they are called later also changes.

For example, the code that reads the current learning rate changes from:

optimizer.state_dict()['param_groups'][0]['lr']

to:

optimizer.module.state_dict()['param_groups'][0]['lr']
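Putting it together, here is a minimal sketch of a training loop under this setup; it is not taken from the original code, and criterion, train_loader and num_epochs are placeholders:

criterion = nn.CrossEntropyLoss()

for epoch in range(num_epochs):
    for inputs, labels in train_loader:
        inputs, labels = inputs.to(device), labels.to(device)
        outputs = net(inputs)             # DataParallel splits the batch across the cards
        loss = criterion(outputs, labels)

        # The wrapped optimizer is reached through .module
        opt = optimizer.module if MULTI_GPU else optimizer
        opt.zero_grad()
        loss.backward()
        opt.step()

    # The wrapped scheduler is reached the same way
    sch = scheduler.module if MULTI_GPU else scheduler
    sch.step()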

The detailed code can be found in my GitHub repository.

Start training

The training process is the same as with a single card. During training you can look at the memory usage of the two cards, for example with nvidia-smi.
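If you prefer to check from inside Python rather than with nvidia-smi, here is a small sketch using standard torch.cuda calls:

import torch

for i in range(torch.cuda.device_count()):
    allocated = torch.cuda.memory_allocated(i) / 1024 ** 2
    print(f"cuda:{i}: {allocated:.1f} MiB allocated by PyTorch")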

You can see that both cards are in use, which shows that the code is working. However, there is also a clear difference between the two cards' memory usage: as mentioned earlier, DataParallel only parallelizes the data, while the loss computation and other operations are not parallel.

Finally

If there are any mistakes in the article or you have suggestions, please feel free to point them out.