Using PyTorch for Kaggle cat and dog image recognition


Kaggle is a platform where developers and data scientists hold machine learning competitions, host datasets, and write and share code. It offers many good projects and resources for machine learning and deep learning enthusiasts.

I recently started learning a deep learning framework, PyTorch, so today I will use PyTorch to implement an introductory project in the field of image recognition: classifying images of cats and dogs.

Data is the foundation of deep learning, so let's start with the data. This dataset contains 25,000 cat and dog images, 12,500 of cats and 12,500 of dogs. Let's first take a quick look at the pictures.

The download contains two folders, train and test, used for training and testing respectively. Take train as an example. When you open the folder, you see a lot of pictures of kittens, named 0.jpg to 9999.jpg, so there are 10,000 cat pictures for training.

The test folder holds only 2,500 kittens. If you look at them carefully, you can see that they strike different poses: some stand, some squint, and some even share the picture with other recognizable objects, such as buckets and people.

The pictures are also not of a consistent size: some are vertical rectangles and some are horizontal rectangles, but we need squares of a reasonable size. The dog pictures are similar, so I won't repeat the description here.
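The picture counts mentioned above can be checked with a few lines of standard-library Python. This is only a sketch: it assumes each split folder holds one subfolder per class (the names "cat" and "dog" below are hypothetical), which is the layout that datasets.ImageFolder expects.

```python
import os


def count_images(root, extensions=(".jpg", ".jpeg", ".png")):
    """Return {class_folder_name: number_of_image_files} for one split."""
    counts = {}
    for name in sorted(os.listdir(root)):
        class_dir = os.path.join(root, name)
        if not os.path.isdir(class_dir):
            continue  # skip stray files lying next to the class folders
        counts[name] = sum(
            1 for f in os.listdir(class_dir)
            if f.lower().endswith(extensions)
        )
    return counts


if __name__ == "__main__":
    # Assumed folder names; adjust them to wherever the data was unpacked.
    for split in ("train", "test"):
        if os.path.isdir(split):
            print(split, count_images(split))
```

If the counts differ between the two classes, the accuracy figure later on becomes harder to interpret, so it is worth checking before training.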

Next, let's look at the convolutional neural network (CNN), which is especially well suited to image recognition. Anyone who has studied neural networks has probably heard of it. It is a typical multilayer neural network that is good at machine learning problems involving images, especially large images.

Through a series of techniques, a convolutional neural network reduces the dimensionality of image recognition problems that involve large amounts of data, so that the network can finally be trained. CNNs were first proposed by Yann LeCun and applied to handwriting recognition.

A typical CNN architecture is composed of convolutional layers, pooling layers and fully connected layers. The convolutional and pooling layers work together to form several convolution groups that extract features layer by layer, and the network finally completes the classification.

Don't worry if these terms are a little confusing: these complex and abstract techniques have already been implemented one by one in PyTorch, and all we have to do is call the relevant functions correctly. After each piece of code, I will give a more detailed and easy-to-understand explanation.

from __future__ import print_function, division  # must be the first statement
import os
import shutil
import collections
import warnings

import numpy as np
import pandas as pd
import pylab
import matplotlib.pyplot as plt
from skimage import io, transform

import torch
import torch.nn as nn
import torch.nn.functional as F
from torch.autograd import Variable
from torch.utils.data import Dataset, DataLoader
from torchvision import transforms, datasets, utils

# Ignore warnings
warnings.filterwarnings("ignore")
plt.ion()  # interactive mode

A normal CNN project needs a lot of libraries.

import math
from PIL import Image

class Resize(object):
    """Resize the input PIL Image to the given size.

    size (int): Desired output size. The smaller edge of the image will be
        matched to this number and the aspect ratio is kept,
        i.e. if height > width, then image will be rescaled to
        (size * height / width, size). Unlike transforms.Resize, this
        simplified version does not accept an (h, w) sequence.
    interpolation (int, optional): Desired interpolation. Default is
        ``PIL.Image.BILINEAR``.
    """

    def __init__(self, size, interpolation=Image.BILINEAR):
        self.size = size
        self.interpolation = interpolation

    def __call__(self, img):
        w, h = img.size  # PIL reports (width, height)
        min_edge = min(img.size)
        rate = min_edge / self.size  # shrink factor for the smaller edge
        new_w = math.ceil(w / rate)
        new_h = math.ceil(h / rate)
        # Pass the stored interpolation mode so the parameter has an effect
        return img.resize((new_w, new_h), self.interpolation)

This class, called Resize, is used to scale images. Normally there is no need to write it yourself, because transforms.Resize already implements this. For some unknown reason, however, my installed library did not provide it, so I implemented it myself as a replacement for transforms.Resize.

If your torchvision already provides transforms.Resize, you don't need to do what I did.
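To make the scaling rule concrete, here is a small sketch of the size arithmetic only, with no image involved. The function name resized_dims is hypothetical, and it uses integer ceiling division instead of the float division in the class above, which is my own variation to avoid rounding surprises.

```python
def resized_dims(width, height, size):
    """Scale (width, height) so that the shorter edge becomes `size`."""
    min_edge = min(width, height)
    # -(-a // b) is ceiling division on integers, so no floats are involved.
    new_w = -(-width * size // min_edge)
    new_h = -(-height * size // min_edge)
    return new_w, new_h


print(resized_dims(500, 375, 84))  # (112, 84)
print(resized_dims(300, 400, 84))  # (84, 112)
```

The shorter edge always ends up at the requested size, and the longer edge keeps the aspect ratio, which is why a center crop is still needed afterwards to get a square.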

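data_transform = transforms.Compose([
    Resize(84),                 # shorter edge becomes 84 pixels
    transforms.CenterCrop(84),  # cut an 84 x 84 square from the center
    transforms.ToTensor(),      # pixels become numbers between 0 and 1
    transforms.Normalize(mean=[0.5, 0.5, 0.5], std=[0.5, 0.5, 0.5])
])
train_dataset = datasets.ImageFolder(root='train/', transform=data_transform)
train_loader = torch.utils.data.DataLoader(train_dataset, batch_size=4, shuffle=True, num_workers=4)
test_dataset = datasets.ImageFolder(root='test/', transform=data_transform)
test_loader = torch.utils.data.DataLoader(test_dataset, batch_size=4, shuffle=True, num_workers=4)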

transforms is a library of data conversion operations (the data here being images), and Resize is the class defined in the code above. It scales an image to a given size. Here we tentatively scale images to the 84 x 84 level. This is a tunable parameter: once the project is running, you can try changing it, for example to 200 x 200, and you will find that you have time to play a game while it trains ~_~.

CenterCrop cuts the image from the center. The target is a square whose width and height are both 84, which is convenient for the later calculations.

ToTensor() is more important. It reads the image pixels and converts them into numbers between 0 and 1.

Normalize, the last step, is also a key one. With a mean and standard deviation of 0.5 for each channel, it subtracts 0.5 from every value and divides the result by 0.5, so the data range changes from 0 to 1 into -1 to 1.
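The arithmetic behind these three steps can be sketched in plain Python, without PyTorch. The helper names below are hypothetical and only mirror what CenterCrop, ToTensor and Normalize are described as doing above; the real transforms work on whole images and tensors.

```python
def center_crop_box(width, height, crop):
    """Return (left, top, right, bottom) of a centered crop x crop square."""
    left = (width - crop) // 2
    top = (height - crop) // 2
    return left, top, left + crop, top + crop


def to_unit_range(pixel):
    """Map an 8-bit pixel value (0..255) to the range 0..1."""
    return pixel / 255.0


def normalize(value, mean=0.5, std=0.5):
    """Subtract the mean and divide by the standard deviation."""
    return (value - mean) / std


print(center_crop_box(112, 84, 84))     # (14, 0, 98, 84)
print(normalize(to_unit_range(0)))      # -1.0
print(normalize(to_unit_range(255)))    # 1.0
```

A black pixel therefore ends up at -1 and a white pixel at 1, which is the range the network sees during training.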

class Net(nn.Module):
    def __init__(self):
        super(Net, self).__init__()  # required before registering layers
        self.conv1 = nn.Conv2d(3, 6, 5)
        self.pool = nn.MaxPool2d(2, 2)
        self.conv2 = nn.Conv2d(6, 16, 5)
        self.fc1 = nn.Linear(16 * 18 * 18, 800)
        self.fc2 = nn.Linear(800, 120)
        self.fc3 = nn.Linear(120, 2)

    def forward(self, x):
        x = self.pool(F.relu(self.conv1(x)))
        x = self.pool(F.relu(self.conv2(x)))
        x = x.view(-1, 16 * 18 * 18)  # flatten for the fully connected layers
        x = F.relu(self.fc1(x))
        x = F.relu(self.fc2(x))
        x = self.fc3(x)
        return x

net = Net()

Now we come to the most complicated step. We first define a Net class, which wraps up all the network's computation steps: convolution, pooling, activation and the fully connected operations.

The __init__ function first defines all the layers that are needed, and these layers are called in forward. Let's start with conv1, which defines a convolutional layer. What do 3, 6 and 5 mean?

3 is the number of layers in the pixel array of the input image, in other words the number of channels of the input image. The kitten images used here are all color images made up of R, G and B channels, so the value is 3. 6 means that we want to apply six convolutions, each of which produces a different feature map, to extract six kinds of features of cats and dogs.

The feature maps are then stacked together to form a single output, which is used as the input of the next step. 5 is the size of the filter window: we take a 5 * 5 matrix, multiply it element by element with a patch of the same size in the image, and add up the products to produce one value.
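The "multiply and add" operation described above can be sketched in plain Python for a single channel. This is a simplified illustration and not what nn.Conv2d does internally: the real layer also sums over all input channels and adds a bias, and a 2 * 2 filter is used here instead of 5 * 5 to keep the numbers readable.

```python
def conv2d_valid(image, kernel):
    """Slide a square kernel over a 2-D image (lists of lists), no padding."""
    k = len(kernel)
    out_h = len(image) - k + 1
    out_w = len(image[0]) - k + 1
    output = []
    for top in range(out_h):
        row = []
        for left in range(out_w):
            total = 0
            # Multiply the kernel with the patch under it and add everything up.
            for i in range(k):
                for j in range(k):
                    total += image[top + i][left + j] * kernel[i][j]
            row.append(total)
        output.append(row)
    return output


image = [[1, 2, 3],
         [4, 5, 6],
         [7, 8, 9]]
kernel = [[1, 0],
          [0, 1]]
print(conv2d_valid(image, kernel))  # [[6, 8], [12, 14]]
```

Note that the output is smaller than the input by the kernel size minus one in each direction, which matters later when working out the size of the first fully connected layer.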

After the convolutional layer, we define the pooling layer. What the pooling layer does is simple. The pixel matrix produced from a large image is too large, so we need a reasonable way to reduce its dimensions without losing the features of the object. Deep learning researchers came up with a technique called pooling. Put plainly, starting from the upper left corner, every four elements (a 2 * 2 block) are merged into one element that represents all four (nn.MaxPool2d keeps the largest of them), so the image is reduced to a quarter of its original size.
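As a rough illustration of 2 * 2 max pooling with a stride of 2, here is a plain-Python sketch on a single channel. The function name is hypothetical; nn.MaxPool2d(2, 2) applies the same idea to every channel of a tensor.

```python
def max_pool_2x2(image):
    """Keep the largest value of every non-overlapping 2 x 2 block."""
    output = []
    for top in range(0, len(image) - 1, 2):
        row = []
        for left in range(0, len(image[0]) - 1, 2):
            block = [
                image[top][left], image[top][left + 1],
                image[top + 1][left], image[top + 1][left + 1],
            ]
            row.append(max(block))
        output.append(row)
    return output


image = [[1, 2, 3, 4],
         [5, 6, 7, 8],
         [9, 10, 11, 12],
         [13, 14, 15, 16]]
print(max_pool_2x2(image))  # [[6, 8], [14, 16]]
```

The 4 * 4 input becomes a 2 * 2 output: width and height are halved, so only a quarter of the values remain.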

On the next line we meet another convolutional layer, conv2. Like conv1, it takes a multi-layer pixel array as input and produces a multi-layer pixel array as output. The difference is that it does more computation this time. The parameters of conv2 are 6, 16 and 5.

The first is 6 because conv1 outputs 6 layers, so the input here has 6 layers. 16 is the number of output layers of conv2: just as with conv1, it means this convolution will learn 16 feature maps of cats and dogs. In theory, the more features there are, the better the network can learn. You can try other values to see whether the results really improve.

conv2 uses the same filter size as conv1, so I won't repeat the explanation. The last three lines of code define the fully connected layers, which should be familiar to anyone who has worked with neural networks. The main thing to explain is fc1.

I didn't understand this line at first. Why is it 16 * 18 * 18? The 16 is easy to understand: the feature array produced by the last convolution has 16 layers. But where does 18 * 18 come from? Let's go back to an earlier line of code:

transforms.CenterCrop(84),

In this line of code we crop the training image to an 84 * 84 square, so the initial input is a 3 * 84 * 84 array. After the first 5 * 5 convolution, the result is a 6 * 80 * 80 array. It is 80 because we use a 5 * 5 filter: as the filter slides from the upper left corner, its center moves from index 2 to index 81 rather than from 0 to 83, which gives 80 positions, so the result is an 80 * 80 image.

After a pooling layer, the width and height of the image are each halved, so it becomes 40 * 40.

Next comes another convolution. As before, the width and height each shrink by 4, to 36 * 36. Then the last pooling layer is applied, and the final size is 18 * 18.

Therefore, the input of the first fully connected layer has size 16 * 18 * 18. The three fully connected layers do very similar things: they transform the features step by step, and the last one outputs two values, one score for each class.
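The size calculation above can be traced with a few lines of plain Python. The helper names are hypothetical; the formulas assume no padding and a stride of 1 for the convolutions, and a 2 * 2 pooling window with a stride of 2, matching the layers defined in Net.

```python
def conv_out(size, kernel):
    """Output width/height of a convolution without padding, stride 1."""
    return size - kernel + 1


def pool_out(size, window=2):
    """Output width/height of non-overlapping pooling."""
    return size // window


def trace_sizes(input_size):
    """Return the width/height after each layer of the network above."""
    sizes = [input_size]
    sizes.append(conv_out(sizes[-1], 5))  # conv1
    sizes.append(pool_out(sizes[-1]))     # pool
    sizes.append(conv_out(sizes[-1], 5))  # conv2
    sizes.append(pool_out(sizes[-1]))     # pool
    return sizes


sizes = trace_sizes(84)
print(sizes)                       # [84, 80, 40, 36, 18]
print(16 * sizes[-1] * sizes[-1])  # 5184
```

If the crop size is changed, for example to 200, the same trace gives the new number to use in fc1 and in the x.view call.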

The forward function of the Net class represents the whole forward computation. forward accepts an input and returns the network's output; in between, it calls the layers defined in __init__.

F.relu is an activation function that sets all negative values to zero and leaves positive values unchanged. The last key step of this image recognition project is the actual training loop.
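For reference, the behaviour of ReLU on a handful of numbers can be sketched in plain Python; F.relu does the same thing element by element on a tensor.

```python
def relu(values):
    """Replace negative numbers with zero and keep the rest unchanged."""
    return [max(0.0, v) for v in values]


print(relu([-2.0, -0.5, 0.0, 1.5]))  # [0.0, 0.0, 0.0, 1.5]
```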

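import torch.optim as optim

criterion = nn.CrossEntropyLoss()
optimizer = optim.SGD(net.parameters(), lr=0.0001, momentum=0.9)

for epoch in range(3):
    running_loss = 0.0
    for i, data in enumerate(train_loader, 0):
        inputs, labels = data
        # Variable is a no-op wrapper since PyTorch 0.4; kept for older versions
        inputs, labels = Variable(inputs), Variable(labels)
        optimizer.zero_grad()              # clear the gradients
        outputs = net(inputs)              # compute the output values
        loss = criterion(outputs, labels)  # compute the error
        loss.backward()                    # back propagation
        optimizer.step()                   # correct the model
        running_loss += loss.item()
        if i % 2000 == 1999:
            print('[%d %5d] loss: %.3f' % (epoch + 1, i + 1, running_loss / 2000))
            running_loss = 0.0
print('finished training!')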

[1 2000] loss: 0.691
[1 4000] loss: 0.687
[2 2000] loss: 0.671
[2 4000] loss: 0.657
[3 2000] loss: 0.628
[3 4000] loss: 0.626
finished training!

Here we run three training sessions (epochs). Each session goes through train_loader batch by batch: fetch the training data, clear the gradients, compute the output values, compute the error, back-propagate, and correct the model. We take the average error of every 2000 batches as the observed value. We can see that the error gets smaller with every session, so the network is gradually learning how to classify the images. The code is relatively easy to understand, so I won't go through it line by line.
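To put the loss values in context, here is a plain-Python sketch of the cross-entropy error for a single sample. It is only an illustration of the formula; nn.CrossEntropyLoss additionally averages over the batch. The function name is hypothetical.

```python
import math


def cross_entropy(scores, label):
    """-log(softmax(scores)[label]) for one sample."""
    peak = max(scores)  # subtract the maximum for numerical stability
    log_sum = peak + math.log(sum(math.exp(s - peak) for s in scores))
    return log_sum - scores[label]


# Two equal scores mean the network is guessing between cat and dog.
print(round(cross_entropy([0.0, 0.0], 0), 3))  # 0.693
# A higher score for the correct class gives a smaller error.
print(cross_entropy([2.0, 0.0], 0) < cross_entropy([0.0, 0.0], 0))  # True
```

An untrained two-class network therefore starts near ln 2, about 0.693, which is close to the first value in the log above, and any value below that means the network is doing better than guessing.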

correct = 0
total = 0
for data in test_loader:
    images, labels = data
    outputs = net(Variable(images))
    # The index of the larger of the two scores is the predicted class
    _, predicted = torch.max(outputs.data, 1)
    total += labels.size(0)
    correct += (predicted == labels).sum().item()
print('Accuracy of the network on the 5000 test images: %d %%' % (100 * correct / total))

Finally we come to verifying the model's accuracy, which is the purpose of the test folder mentioned at the beginning. At this point, net is a trained neural network: pass in an image array and it outputs the corresponding classification scores. We compare the predicted class with the true label and count the matches to get the accuracy. The accuracy on my computer is currently 66%. It may differ on your machine, but not by much.
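The comparison between predicted and true classes can be sketched in plain Python. The scores and labels below are made-up illustration values, not results from the trained network, and the function names are hypothetical.

```python
def predict(scores):
    """Return the index of the largest score, i.e. the predicted class."""
    return max(range(len(scores)), key=lambda i: scores[i])


def accuracy(all_scores, labels):
    """Percentage of samples whose predicted class equals the true label."""
    correct = sum(1 for s, y in zip(all_scores, labels) if predict(s) == y)
    return 100.0 * correct / len(labels)


# Made-up network outputs for four images and their true labels.
scores = [[1.2, -0.3], [0.1, 0.4], [-0.5, 0.2], [0.9, 0.1]]
labels = [0, 1, 0, 0]
print(accuracy(scores, labels))  # 75.0
```

With two equally sized classes, random guessing gives about 50%, so that is the baseline to compare against.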

Finally, let's summarize. Implementing a CNN in PyTorch is actually not complicated. The underlying theory has already been encapsulated, and we only need to call the right functions. The parameters of the current model are far from ideal; if you are interested, adjust them and run the training several more times, and the results should keep improving.

In addition, I did not try to both explain CNN theory in depth and paste the project code in a single article, because there are already many good articles explaining CNNs on the Internet. If the code is hard to follow, search for articles about CNNs first and then come back to the project code, which should then be clearer.

This is the first time I have written an article about neural networks. If there is anything wrong in it, please forgive me.
