Face mask detector based on RetinaNet


Author: guess
Compiled by: VK
Source: Analytics Vidhya


Object detection is a very important field in computer vision, and it is essential for autonomous driving, video surveillance, medical applications, and many other domains.

We are fighting an epidemic of unprecedented scale. Researchers around the world are racing to develop a vaccine or a treatment for COVID-19, while doctors try to keep the epidemic from sweeping the world. Meanwhile, many countries have found that social distancing, together with masks and gloves, can slow the spread somewhat.

I recently had the idea of applying my deep learning knowledge to help with the current situation. In this article, I will walk you through an implementation of RetinaNet without assuming much background knowledge.

We will use RetinaNet to build a "mask detector" to help us cope with this ongoing epidemic. You could extend the same idea to build an AI-enabled solution for your smart home, for example one that only admits people wearing masks and gloves.

As the cost of drones decreases over time, we are seeing a large surge in the amount of aerial data being generated. You could use this RetinaNet model to detect different objects in aerial or even satellite imagery, such as vehicles (bicycles, cars, and so on) or pedestrians, to solve various business problems.

So, as you can see, the applications of object detection models are endless.


  1. What is RetinaNet?
  2. The need for RetinaNet
  3. The architecture of RetinaNet
    1. Backbone network
    2. Object classification subnet
    3. Object regression subnet
  4. Focal Loss
  5. Mask detector based on RetinaNet
    1. Collecting data
    2. Creating the dataset
    3. Model training
    4. Model testing
  6. Final notes

What is RetinaNet?

RetinaNet is one of the best one-stage object detection models and has proven to work well with dense and small-scale objects. For this reason, it has become a popular object detection model.

The need for RetinaNet

RetinaNet was introduced by Facebook AI Research to tackle dense detection. It was designed to make up for the shortcomings of one-stage object detectors such as YOLO and SSD, which struggle with extreme foreground-background class imbalance.

The architecture of RetinaNet

In essence, RetinaNet is a composite network composed of the following parts:

  1. Backbone network (a bottom-up pathway and a top-down pathway with lateral connections)

  2. Object classification subnet

  3. Object regression subnet

For a better understanding, let's look at each component of the architecture separately.

1. Backbone network

  1. Bottom-up pathway: the bottom-up pathway (e.g., ResNet) is used for feature extraction. It computes feature maps at different scales, irrespective of the input image size.

  2. Top-down pathway with lateral connections: the top-down pathway upsamples the coarser feature maps from higher pyramid levels, and the lateral connections merge top-down and bottom-up layers of the same spatial size. Higher-level feature maps have lower resolution but stronger semantics, so they are better suited to detecting larger objects; conversely, grid cells from lower-level feature maps have higher resolution, so they are better at detecting smaller objects. Combining the top-down pathway with lateral connections to the bottom-up pathway requires little extra computation, yet every level of the resulting feature pyramid can be semantically and spatially strong. The architecture is therefore scale-invariant and offers a good trade-off between speed and accuracy.
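The top-down merge step can be sketched in a few lines of NumPy. This is an illustrative sketch, not the actual implementation we will use later: the coarser map is upsampled (here with nearest-neighbor interpolation) and added element-wise to the bottom-up feature map of the same spatial size (in the real network, the lateral map first passes through a 1 × 1 convolution to match channel depth, which is omitted here):

```python
import numpy as np

def upsample_nearest(x, factor=2):
    # Nearest-neighbor upsampling of an (H, W, C) feature map
    return x.repeat(factor, axis=0).repeat(factor, axis=1)

def merge_level(top_down, lateral):
    # Element-wise sum of the upsampled coarse map and the lateral map,
    # as in an FPN top-down pathway (the 1x1 conv on the lateral map is omitted)
    up = upsample_nearest(top_down)
    assert up.shape == lateral.shape
    return up + lateral

# Toy feature maps: a coarse 4x4 pyramid level and a finer 8x8 level, 256 channels each
p5 = np.random.rand(4, 4, 256)
c4 = np.random.rand(8, 8, 256)
p4 = merge_level(p5, c4)
print(p4.shape)  # (8, 8, 256)
```

The merged map keeps the finer level's resolution while inheriting the coarser level's semantics, which is exactly the property described above.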

2. Object classification subnet

A fully convolutional network (FCN) is attached to each FPN level for object classification. As shown in the figure, the subnet consists of 3 × 3 convolution layers with 256 filters, followed by a 3 × 3 convolution layer with K × A filters. The output feature map therefore has size W × H × KA, where W and H are proportional to the width and height of the input feature map, and K and A are the numbers of object classes and anchor boxes, respectively.

Finally, a sigmoid layer (instead of softmax) is used for object classification.

The last convolution layer has KA filters because, for each position in the feature map obtained from the last convolution layer, there are A candidate anchor boxes, and each anchor box can be classified into one of K classes. The output feature map therefore has KA channels (filters).
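To make the W × H × KA shape concrete, here is a small hypothetical sketch with K = 2 classes (as in our mask/noMask problem) and A = 9 anchors per position: the KA channels at each grid position can be reshaped into per-anchor, per-class scores, and a sigmoid turns each one into an independent class probability:

```python
import numpy as np

K, A = 2, 9   # number of classes and anchor boxes per position
W, H = 7, 7   # spatial size of one FPN level (hypothetical)

# Raw output of the last conv layer of the classification subnet: W x H x KA
logits = np.random.randn(W, H, K * A)

# Reshape to one score per (position, anchor, class) and apply a sigmoid
scores = 1 / (1 + np.exp(-logits.reshape(W, H, A, K)))

print(scores.shape)  # (7, 7, 9, 2)
```

Because each class score goes through its own sigmoid rather than a shared softmax, the per-class predictions are independent of one another.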

3. Object regression subnet

The regression subnet is attached to each FPN feature map in parallel with the classification subnet. Its design is identical to the classification subnet's, except that the last 3 × 3 convolution layer has 4A filters, so the output feature map has size W × H × 4A.

The last convolution layer has 4A filters because, to localize objects, the regression subnet produces four numbers for each anchor box, predicting the offsets between the anchor box and the ground-truth box (in terms of center coordinates, width, and height). The output feature map of the regression subnet therefore has 4A filters (channels).
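As an illustration of what those four numbers do, here is a common R-CNN-style parameterization (the exact encoding used inside keras-retinanet may differ): the predicted offsets (dx, dy, dw, dh) shift the anchor's center proportionally to its size and scale its width and height exponentially:

```python
import numpy as np

def decode_box(anchor, offsets):
    # anchor: (cx, cy, w, h); offsets: (dx, dy, dw, dh) from the regression subnet
    cx, cy, w, h = anchor
    dx, dy, dw, dh = offsets
    # Shift the center proportionally to the anchor size,
    # and scale width/height exponentially
    pred_cx = cx + dx * w
    pred_cy = cy + dy * h
    pred_w = w * np.exp(dw)
    pred_h = h * np.exp(dh)
    # Return as corner coordinates (x1, y1, x2, y2)
    return (pred_cx - pred_w / 2, pred_cy - pred_h / 2,
            pred_cx + pred_w / 2, pred_cy + pred_h / 2)

# Zero offsets reproduce the anchor itself
print(decode_box((50, 50, 20, 40), (0, 0, 0, 0)))  # (40.0, 30.0, 60.0, 70.0)
```

A positive dw doubles nothing by magic; for instance dw = ln(2) exactly doubles the predicted width, which is why the exponential form keeps predicted sizes positive.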

Focal Loss

Focal loss (FL) is an improved version of the cross-entropy loss (CE). It deals with class imbalance by assigning more weight to hard or easily misclassified examples (e.g., backgrounds with noisy texture, partial objects, or the objects of interest) and by down-weighting easy examples (e.g., plain background).

"Focal loss" therefore reduces the loss contribution of easy examples and increases the importance of correcting misclassified ones. It is simply an extension of the cross-entropy loss that down-weights easy examples and focuses training on hard ones.

To achieve this, the researchers proposed adding a modulating factor (1 − pt)^γ to the cross-entropy loss:

FL(pt) = −(1 − pt)^γ log(pt)

Here −log(pt) is the cross-entropy loss and γ ≥ 0 is a tunable focusing parameter. The RetinaNet paper actually uses an α-balanced variant of focal loss, in which α = 0.25 and γ = 2 work best:

FL(pt) = −αt (1 − pt)^γ log(pt)

Plotting the loss for several values of γ, we can notice the following properties of focal loss:

  1. When a sample is misclassified and pt is small, the modulating factor is close to 1 and the loss is unaffected.

  2. As pt → 1, the factor goes to 0, and the loss for well-classified examples is down-weighted.

  3. The focusing parameter γ smoothly adjusts the rate at which easy examples are down-weighted. As γ increases, the effect of the modulating factor increases as well. (After many experiments, the researchers found γ = 2 to work best.)

Note: when γ = 0, FL is equivalent to CE (shown as the blue curve in the figure).

Intuitively, the modulating factor reduces the loss contribution of easy examples and extends the range of probabilities over which an example receives low loss.
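As a sanity check on the definitions above, here is a minimal NumPy sketch of the α-balanced focal loss for binary classification (an illustrative implementation, not the one inside keras-retinanet):

```python
import numpy as np

def focal_loss(p, y, alpha=0.25, gamma=2.0):
    # p: predicted probability of the positive class; y: label in {0, 1}
    p_t = np.where(y == 1, p, 1 - p)
    alpha_t = np.where(y == 1, alpha, 1 - alpha)
    # FL(p_t) = -alpha_t * (1 - p_t)^gamma * log(p_t)
    return -alpha_t * (1 - p_t) ** gamma * np.log(p_t)

p = np.array([0.9, 0.6, 0.1])  # easy, harder, and misclassified positives
y = np.array([1, 1, 1])

# With gamma = 0 and alpha_t = 1, focal loss reduces to cross-entropy
ce = focal_loss(p, y, alpha=1.0, gamma=0.0)
fl = focal_loss(p, y)
print(np.round(ce, 4))  # [0.1054 0.5108 2.3026]
print(np.round(fl, 4))
```

Comparing the two printed rows shows the behavior described above: the easy example (p = 0.9) has its loss shrunk by orders of magnitude, while the misclassified one (p = 0.1) keeps most of its loss.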

Now let's look at an implementation of RetinaNet in Python to build the mask detector.

Mask detector based on RetinaNet

Collecting data

Any deep learning model needs a large amount of training data to produce good results on test data.

Creating the dataset

We use the labelImg tool to create and validate the dataset. This excellent annotation tool lets you quickly annotate the bounding boxes of objects to train a machine learning model.

You can install it from the Anaconda command prompt using the following command:

pip install labelImg

You can annotate each JPEG file with the labelImg tool, which generates an XML file containing the coordinates of each bounding box. We will use these XML files to train our model.

Model training

Step 1: clone and install keras-retinanet
!git clone https://github.com/fizyr/keras-retinanet.git
%cd keras-retinanet/
!pip install .
!python setup.py build_ext --inplace
Step 2: import all necessary libraries
import numpy as np
import shutil
import pandas as pd
import os, sys, random
import xml.etree.ElementTree as ET
from os import listdir
from os.path import isfile, join
import matplotlib.pyplot as plt
from PIL import Image
import requests
import urllib
from keras_retinanet.utils.visualization import draw_box, draw_caption, label_color
from keras_retinanet.utils.image import preprocess_image, resize_image
Step 3: import JPEG and XML data

annotPath = 'maskDataset/'  # set this to the folder holding your JPEGs and XML annotations

#Read all files
allfiles = [f for f in listdir(annotPath) if isfile(join(annotPath, f))]

#Empty data frame, one row per bounding box
data = pd.DataFrame(columns=['fileName', 'xmin', 'ymin', 'xmax', 'ymax', 'class'])

#Parse every XML annotation file and collect the bounding-box coordinates
for file in allfiles:
    if file.split(".")[-1] == 'xml':
        #Assumes the image has the same base name as its XML file
        fileName = annotPath + file.split(".")[0] + '.jpg'
        tree = ET.parse(annotPath + file)
        root = tree.getroot()
        for obj in root.iter('object'):
            cls_name = obj.find('name').text
            xml_box = obj.find('bndbox')
            xmin = xml_box.find('xmin').text
            ymin = xml_box.find('ymin').text
            xmax = xml_box.find('xmax').text
            ymax = xml_box.find('ymax').text
            #Append a row to the data frame
            data = data.append({'fileName': fileName, 'xmin': xmin, 'ymin': ymin,
                                'xmax': xmax, 'ymax': ymax, 'class': cls_name},
                               ignore_index=True)

Step 4: write a function to display the bounding boxes on the training dataset

def show_image_with_boxes(df):
  #Randomly select an image
  filepath = df.sample()['fileName'].values[0]

  #Get all rows for this image
  df2 = df[df['fileName'] == filepath]
  im = np.array(Image.open(filepath))

  #If it is a PNG, it will have an alpha channel; keep RGB only
  im = im[:,:,:3]

  #Draw every ground-truth box
  for idx, row in df2.iterrows():
    box = [int(row['xmin']), int(row['ymin']), int(row['xmax']), int(row['ymax'])]
    draw_box(im, box, color=(255, 0, 0))

  plt.imshow(im)
  plt.show()

#Check a few data records
data.head()

#Define the labels and write them to a file
classes = ['mask', 'noMask']
with open('../maskDetectorClasses.csv', 'w') as f:
  for i, class_name in enumerate(classes):
    f.write(f'{class_name},{i}\n')

#Create a folder to store the trained model snapshots
if not os.path.exists('snapshots'):
  os.mkdir('snapshots')
Note: it is better to start from a pre-trained model than to train a model from scratch. We will use a ResNet50 model that has been pre-trained on the COCO dataset.

PRETRAINED_MODEL = './snapshots/_pretrained_model.h5'  # local path to save the weights to (choose your own)
URL_MODEL = 'https://github.com/fizyr/keras-retinanet/releases/download/0.5.1/resnet50_coco_best_v2.1.0.h5'
urllib.request.urlretrieve(URL_MODEL, PRETRAINED_MODEL)
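Before training, the annotations must be in the CSV format that keras-retinanet's CSV generator expects: one line per box, `path,x1,y1,x2,y2,class_name`, with no header row. A minimal sketch that writes the data frame from Step 3 into `maskDetectorData.csv` (the example row here is hypothetical, standing in for your real data):

```python
import pandas as pd

# Hypothetical example row in the shape of the Step 3 data frame
data = pd.DataFrame([
    {'fileName': 'images/img1.jpg', 'xmin': '10', 'ymin': '20',
     'xmax': '110', 'ymax': '220', 'class': 'mask'},
])

# keras-retinanet's CSV generator expects: path,x1,y1,x2,y2,class_name (no header)
data[['fileName', 'xmin', 'ymin', 'xmax', 'ymax', 'class']].to_csv(
    'maskDetectorData.csv', header=False, index=False)

print(open('maskDetectorData.csv').read().strip())  # images/img1.jpg,10,20,110,220,mask
```

The column order matters: the generator parses positionally, so the file path must come first and the class name last.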
Step 5: train the RetinaNet model

Note: if you use Google Colab, you can train your model with the following code snippet.

#Put the paths of your training data files here
!keras_retinanet/bin/train.py --freeze-backbone \
    --random-transform \
    --weights {PRETRAINED_MODEL} \
    --batch-size 8 \
    --steps 500 \
    --epochs 15 \
    csv maskDetectorData.csv maskDetectorClasses.csv

However, if you are training on a local Jupyter notebook or another IDE, you can execute the command at the command prompt:

python keras_retinanet/bin/train.py --freeze-backbone \
            --random-transform \
            --weights {PRETRAINED_MODEL} \
            --batch-size 8 \
            --steps 500 \
            --epochs 15 \
            csv maskDetectorData.csv maskDetectorClasses.csv

Let's analyze each parameter of train.py:

  1. freeze-backbone: freeze the backbone layers; this is especially useful when training on a small dataset, to avoid overfitting

  2. random-transform: randomly transform the dataset for data augmentation

  3. weights: initialize the model with a pre-trained model (your own, or one released by fizyr)

  4. batch-size: the training batch size; the higher the value, the smoother the learning curve

  5. steps: the number of steps per epoch

  6. epochs: the number of epochs to train

  7. csv: the annotation files generated by the scripts above

Step 6: load the trained model
from glob import glob
model_paths = glob('snapshots/resnet50_csv_0*.h5')
latest_path = sorted(model_paths)[-1]
print("path:", latest_path)

from keras_retinanet import models

model = models.load_model(latest_path, backbone_name='resnet50')
model = models.convert_model(model)

label_map = {}
for line in open('../maskDetectorClasses.csv'):
  row = line.rstrip().split(',')
  label_map[int(row[1])] = row[0]
Model testing
Step 7: predict with the trained model
#A function that randomly selects an image from the dataset and runs the trained model on it
def show_image_with_predictions(df, threshold=0.6):
  #Randomly select an image
  row = df.sample()
  filepath = row['fileName'].values[0]
  print("filepath:", filepath)
  #Get all rows for this image
  df2 = df[df['fileName'] == filepath]
  im = np.array(Image.open(filepath))
  print("im.shape:", im.shape)

  #If it is a PNG, it will have an alpha channel; keep RGB only
  im = im[:,:,:3]

  #Draw the ground-truth boxes
  for idx, row in df2.iterrows():
    box = [int(row['xmin']), int(row['ymin']), int(row['xmax']), int(row['ymax'])]
    draw_box(im, box, color=(255, 0, 0))

  ###Draw the predictions###

  #Preprocess and resize the image, then run the model
  imp = preprocess_image(im)
  imp, scale = resize_image(imp)

  boxes, scores, labels = model.predict_on_batch(
    np.expand_dims(imp, axis=0))

  #Rescale the box coordinates back to the original image
  boxes /= scale

  score, label = None, None
  #Loop over each prediction
  for box, score, label in zip(boxes[0], scores[0], labels[0]):
    #Scores are sorted, so we can stop once a score drops below the threshold
    if score < threshold:
      break

    box = box.astype(np.int32)
    color = label_color(label)
    draw_box(im, box, color=color)

    class_name = label_map[label]
    caption = f"{class_name} {score:.3f}"
    draw_caption(im, box, caption)

  plt.imshow(im)
  plt.show()
  return score, label
plt.rcParams['figure.figsize'] = [20, 10]
#You can change the threshold according to your business needs
score, label = show_image_with_predictions(data, threshold=0.6)

#You can change the threshold at will according to your business needs
label=show_image_with_predictions(data, threshold=0.6)

#You can change the threshold at will according to your business needs
label=show_image_with_predictions(data, threshold=0.6)

#You can change the threshold at will according to your business needs
label=show_image_with_predictions(data, threshold=0.6)

#You can change the threshold at will according to your business needs
score, label=show_image_with_predictions(data, threshold=0.6)

#You can change the threshold at will according to your business needs
score, label=show_image_with_predictions(data, threshold=0.6)

#You can change the threshold at will according to your business needs
score, label=show_image_with_predictions(data, threshold=0.6)


Final notes

All in all, we have walked through the entire process of building a mask detector with RetinaNet: we created a dataset, trained a model, and tested it. My notebook and the dataset are in this GitHub repository: https://github.com/Praveen76/Face-mask-detector-using-RetinaNet-model

RetinaNet is a powerful model that uses Feature Pyramid Networks and ResNet as its backbone. I was able to get good results with a very limited dataset and very few epochs (6 epochs of 500 steps each). Of course, you can also change the score threshold.

Notes:

  1. Make sure you train your model for at least 20 epochs to get good results.

  2. The goal here was to demonstrate one way to build a mask detector with the RetinaNet model. You can always tweak the model, the data, and the approach to meet your business needs.

In general, RetinaNet is a good choice for starting an object detection project, especially if you need good results quickly.

Link to the original article: https://www.analyticsvidhya.com/blog/2020/08/how-to-build-a-face-mask-detector-using-retinanet-model/
