In the last article, we solved the environment and network problems. In this article, we use AI to build a fun automatic mask-wearing application while COVID-19 is still rampant around the world. OpenCV plus a CNN is used to extract the coordinates of key points on the face, and a mask image is then pasted over the nose and mouth. If you go out during National Day and the Mid-Autumn Festival, don't forget to wear a mask to protect yourself.
The whole process consists of three stages:
- Find a face on the image
- Detect key points on the face
- Cover the nose and mouth with a mask image
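The snippets below are all excerpts from one script. As a point of reference, a minimal preamble with the imports and command-line arguments they rely on might look like the following sketch; the exact flag set is an assumption reconstructed from the run commands later in this article.

```python
import argparse
import csv
import os
from collections import OrderedDict

import cv2
import numpy as np
import torch

# hypothetical argument parser covering the flags used in this article
parser = argparse.ArgumentParser()
parser.add_argument("--cfg", help="HRNet network configuration file")
parser.add_argument("--detector_model", default="face_detector.prototxt")
parser.add_argument("--detector_weights", default="face_detector.caffemodel")
parser.add_argument("--landmark_model", help="HRNet weight file (.pth)")
parser.add_argument("--mask_image", help="mask image (PNG with an alpha channel)")
parser.add_argument("--device", default="cpu", help="inference device: cpu or cuda")
args = parser.parse_args()
```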
First of all, we need to locate the face in the image, and the DNN module in OpenCV can do this easily. The detection model was trained with the Caffe framework, so we have a network definition file face_detector.prototxt and a weight file face_detector.caffemodel.
```python
# define prototxt and caffemodel paths (face detection model)
detector_model = args.detector_model
detector_weights = args.detector_weights

# load model
detector = cv2.dnn.readNetFromCaffe(detector_model, detector_weights)
capture = cv2.VideoCapture(0)

while True:
    # capture frame-by-frame
    success, frame = capture.read()

    # get frame's height and width
    height, width = frame.shape[:2]  # 640x480

    # resize and subtract BGR mean values, since Caffe uses BGR images for input
    blob = cv2.dnn.blobFromImage(
        frame,
        scalefactor=1.0,
        size=(300, 300),
        mean=(104.0, 177.0, 123.0),
    )
    # passing blob through the network to detect a face
    detector.setInput(blob)
    # detector output format:
    # [image_id, class, confidence, left, bottom, right, top]
    face_detections = detector.forward()
```
After inference, we get each detection's image id, class (face), confidence, and the left, bottom, right, and top coordinates.
```python
# loop over the detections
for i in range(0, face_detections.shape[2]):
    # extract confidence
    confidence = face_detections[0, 0, i, 2]

    # filter detections by confidence greater than the minimum threshold
    if confidence > 0.5:
        # get coordinates of the bounding box
        box = face_detections[0, 0, i, 3:7] * np.array(
            [width, height, width, height],
        )
```
We iterate over all detected faces, filter out targets whose confidence is below 50%, and obtain the coordinates of each face's bounding box.
The input size of the face detection model is (300, 300). When converting the image, note that Caffe expects input in BGR order.
Get face key points
We have now obtained the bounding boxes of all faces. Next, we feed them into the face key-point detection model to obtain the positions of key points such as the eyes, eyebrows, nose, mouth, chin, and face contour.
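For orientation, 300-W style models like the one used here predict 68 landmarks, conventionally grouped as follows (0-indexed); this mapping is standard for the 68-point annotation scheme:

```python
# standard 68-point (300-W / iBUG) landmark groups, 0-indexed
LANDMARK_GROUPS = {
    "jaw": range(0, 17),
    "right_eyebrow": range(17, 22),
    "left_eyebrow": range(22, 27),
    "nose": range(27, 36),
    "right_eye": range(36, 42),
    "left_eye": range(42, 48),
    "mouth": range(48, 68),
}
```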
1. Model overview
The key-point model used here is HRNet (High-Resolution Network), whose defining trait is that it maintains high-resolution representations alongside low-resolution ones. The University of Science and Technology of China and Microsoft Research Asia released it as a new human pose estimation model; it broke three COCO records that year and was published at CVPR 2019.
It starts from a high-resolution subnetwork and gradually adds subnetworks from high to low resolution in parallel. In particular, it does not rely on a single low-to-high upsampling step to roughly aggregate low-level and high-level representations; instead, it continuously fuses representations of different scales throughout the whole process.
The team used exchange units to shuttle information between the subnetworks: each subnetwork obtains information from the representations produced by the others. In this way, rich high-resolution representations are obtained.
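To make the idea concrete, here is a toy two-branch exchange unit. This is an illustrative sketch only, not the actual HRNet code: the real network uses strided 3x3 convolutions for downsampling and fuses more than two branches.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class ToyExchangeUnit(nn.Module):
    """Toy two-branch fusion: each branch receives the other's
    representation, resampled to its own resolution, and adds it."""

    def __init__(self, ch_high, ch_low):
        super().__init__()
        # 1x1 convs adapt channel counts between branches
        self.high_to_low = nn.Conv2d(ch_high, ch_low, kernel_size=1)
        self.low_to_high = nn.Conv2d(ch_low, ch_high, kernel_size=1)

    def forward(self, x_high, x_low):
        # downsample the high-res branch to the low-res grid
        down = F.interpolate(self.high_to_low(x_high), size=x_low.shape[2:])
        # upsample the low-res branch to the high-res grid
        up = F.interpolate(
            self.low_to_high(x_low),
            size=x_high.shape[2:],
            mode="bilinear",
            align_corners=False,
        )
        return x_high + up, x_low + down

# example: fuse a 64x64 high-res map with a 32x32 low-res map
unit = ToyExchangeUnit(ch_high=32, ch_low=64)
h, l = unit(torch.randn(1, 32, 64, 64), torch.randn(1, 64, 32, 32))
```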
For more details, please refer to the open-source project.
2. Crop the face image
Pay attention to how the face region is resized here. Because the bounding box output by the face detection model may fit too tightly around the face, we cannot crop the image directly with the exact coordinates of the detection box. Instead, the box is enlarged by a factor of 1.5, and a 256 × 256 image is cropped around its center.
```python
(x1, y1, x2, y2) = box.astype("int")

# crop to detection and resize
resized = crop(
    frame,
    torch.Tensor([x1 + (x2 - x1) / 2, y1 + (y2 - y1) / 2]),
    1.5,
    tuple(input_size),
)
```
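The `crop` helper above comes from the HRNet project. Conceptually it does something like the simplified, hypothetical sketch below; the real version works with a center/scale convention and an affine transform, so treat this only as an approximation of the idea:

```python
import cv2
import numpy as np

def simple_center_crop(image, center, box_w, box_h, scale=1.5, out_size=(256, 256)):
    # hypothetical stand-in for the HRNet `crop` helper:
    # enlarge the detection box around its center, then crop and resize
    half_w = int(box_w * scale / 2)
    half_h = int(box_h * scale / 2)
    cx, cy = int(center[0]), int(center[1])
    h, w = image.shape[:2]
    # clamp the enlarged box to the image bounds
    x1, y1 = max(cx - half_w, 0), max(cy - half_h, 0)
    x2, y2 = min(cx + half_w, w), min(cy + half_h, h)
    return cv2.resize(image[y1:y2, x1:x2], out_size)
```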
3. Preprocess the HRNet input image
Convert the image format and normalize it:
```python
# convert from BGR to RGB since HRNet expects RGB format
resized = cv2.cvtColor(resized, cv2.COLOR_BGR2RGB)
img = resized.astype(np.float32) / 255.0

# normalize landmark net input
normalized_img = (img - mean) / std
```
Note that the input image of HRNet must be in RGB format.
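The `mean` and `std` used above come from the project configuration. As an assumption for illustration, ImageNet statistics are the usual choice for this kind of model:

```python
import numpy as np

# assumed ImageNet statistics; check the project config for the real values
mean = np.array([0.485, 0.456, 0.406], dtype=np.float32)
std = np.array([0.229, 0.224, 0.225], dtype=np.float32)
```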
4. Model construction
```python
# init landmark model (face key-point model)
model = models.get_face_alignment_net(config)

# get input size from the config
input_size = config.MODEL.IMAGE_SIZE

# load model
state_dict = torch.load(args.landmark_model, map_location=device)

# remove `module.` prefix from the pre-trained weights
new_state_dict = OrderedDict()
for key, value in state_dict.items():
    name = key[7:]
    new_state_dict[name] = value

# load weights without the prefix
model.load_state_dict(new_state_dict)

# run model on device
model = model.to(device)
```
5. Model inference
Feed the preprocessed image into the HRNet network to get the 68 facial landmarks, then call the decode_preds function to undo the earlier cropping and scaling and recover the key-point coordinates in the original image.
```python
# predict face landmarks
model = model.eval()
with torch.no_grad():
    input = torch.Tensor(normalized_img.transpose([2, 0, 1]))
    input = input.to(device)
    output = model(input.unsqueeze(0))
    score_map = output.data.cpu()
    preds = decode_preds(
        score_map,
        [torch.Tensor([x1 + (x2 - x1) / 2, y1 + (y2 - y1) / 2])],
        [1.5],
        score_map.shape[2:4],
    )
```
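decode_preds also belongs to the HRNet project. At its core, heatmap decoding takes the argmax of each landmark's score map; the minimal sketch below shows only that core idea, ignoring the sub-pixel refinement and the inverse center/scale transform the real function performs:

```python
import torch

def heatmap_argmax(score_map):
    # score_map: (n, 68, h, w) tensor of per-landmark heatmaps
    n, k, h, w = score_map.shape
    flat = score_map.view(n, k, -1)
    idx = flat.argmax(dim=2)
    xs = (idx % w).float()
    ys = torch.div(idx, w, rounding_mode="floor").float()
    # (n, 68, 2) landmark positions in heatmap coordinates
    return torch.stack([xs, ys], dim=2)
```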
Bind the mask image
Now we have the key-point information of the face. Since a mask usually covers from below the nose to above the chin, we choose key points 2 to 16, together with point 30 on the nose.
1. Annotate the mask image
In order to align the mask image well with the face, we need to annotate it with the same set of points. Here you can use Make Sense (makesense.ai), an easy-to-use open-source online annotation tool.
Finally, save the annotations in CSV format.
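For reference, a point export is a plain CSV file. The rows below are made-up placeholder values assuming the label, x, y, image, width, height column layout that the parsing code further down expects:

```csv
point_1,140,260,anti_covid.png,640,480
point_2,155,335,anti_covid.png,640,480
point_3,180,398,anti_covid.png,640,480
```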
2. Read the coordinates of key points
Key points 2-16 and 30 are selected here. Note that the key-point numbering starts from 0, so they correspond to indices 1-15 and 29 in the landmarks array.
```python
# get chosen landmarks 2-16, 30 as destination points
# note that landmarks numbering starts from 0
dst_pts = np.array(
    [
        landmarks[1],
        landmarks[2],
        landmarks[3],
        landmarks[4],
        landmarks[5],
        landmarks[6],
        landmarks[7],
        landmarks[8],
        landmarks[9],
        landmarks[10],
        landmarks[11],
        landmarks[12],
        landmarks[13],
        landmarks[14],
        landmarks[15],
        landmarks[29],
    ],
    dtype="float32",
)

# load mask annotations from csv file to source points
mask_annotation = os.path.splitext(os.path.basename(args.mask_image))[0]
mask_annotation = os.path.join(
    os.path.dirname(args.mask_image),
    mask_annotation + ".csv",
)

with open(mask_annotation) as csv_file:
    csv_reader = csv.reader(csv_file, delimiter=",")
    src_pts = []
    for i, row in enumerate(csv_reader):
        # skip the header or empty lines if present;
        # columns 1 and 2 are assumed to hold each point's x and y
        try:
            src_pts.append(np.array([float(row[1]), float(row[2])]))
        except ValueError:
            continue
src_pts = np.array(src_pts, dtype="float32")
```
3. Bind key point coordinates
dst_pts holds the detected face coordinates; src_pts holds the annotated coordinates of the mask image.
```python
# overlay with a mask only if all landmarks have positive coordinates:
if (landmarks > 0).all():
    # load mask image (keep the alpha channel)
    mask_img = cv2.imread(args.mask_image, cv2.IMREAD_UNCHANGED)
    mask_img = mask_img.astype(np.float32)
    mask_img = mask_img / 255.0

    # get the perspective transformation matrix
    M, _ = cv2.findHomography(src_pts, dst_pts)

    # transformed mask image; `result` is the output frame
    # (presumably a copy of `frame`)
    transformed_mask = cv2.warpPerspective(
        mask_img,
        M,
        (result.shape[1], result.shape[0]),
        None,
        cv2.INTER_LINEAR,
        cv2.BORDER_CONSTANT,
    )

    # mask overlay using the alpha channel
    alpha_mask = transformed_mask[:, :, 3]
    alpha_image = 1.0 - alpha_mask

    for c in range(0, 3):
        result[:, :, c] = (
            alpha_mask * transformed_mask[:, :, c]
            + alpha_image * result[:, :, c]
        )

# display the resulting frame
cv2.imshow("image with mask overlay", result)
```
Here we use OpenCV's findHomography function to find the transformation between the matching key points, and then apply the resulting transformation matrix with the warpPerspective function to map the mask onto the face.
This yields a transformed mask image with the same size as the original frame. Since the PNG format carries a transparency channel (alpha_mask), we can use it to blend the two images together.
4. Install dependencies
First install the yacs dependency, which is used to read the network configuration file.
```bash
pip install yacs
```
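yacs exposes configurations as CfgNode objects. A minimal usage sketch follows; the keys shown are placeholders, since the real default schema is defined inside the HRNet project:

```python
from yacs.config import CfgNode as CN

# placeholder defaults; the real project defines the full schema
config = CN()
config.MODEL = CN()
config.MODEL.IMAGE_SIZE = [256, 256]

# allow YAML keys that are not declared above, then merge the file
config.set_new_allowed(True)
config.merge_from_file("experiments/300w/face_alignment_300w_hrnet_w18.yaml")
```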
5. Run the model on the Raspberry Pi
```bash
python overlay_with_mask.py --cfg experiments/300w/face_alignment_300w_hrnet_w18.yaml --landmark_model HR18-300W.pth --mask_image masks/anti_covid.png --device cpu
```
The parameters are as follows:
- cfg: HRNet network configuration file
- landmark_model: HRNet weight file
- mask_image: mask image
- device: inference device
Inference is fairly slow, but it still runs correctly, which shows that the Python environment we set up earlier is fine. The Raspberry Pi runs at full load; it's about to smoke...
There is a bug in OpenVINO here, mainly caused by portability issues on a 32-bit OS: nGraph's i64 has to be replaced with i32 for size_t. Modifying the nGraph source code and recompiling solves the problem, or you can switch to the OpenCV 4.4 build we compiled ourselves earlier.
6. Run the model on a GPU
```bash
python overlay_with_mask.py --cfg experiments/300w/face_alignment_300w_hrnet_w18.yaml --landmark_model HR18-300W.pth --mask_image masks/anti_covid.png --device cuda
```
Inference is a bit faster here; on a laptop GTX 1060 graphics card it is basically usable. There are also many ways to optimize the model's speed, such as pruning.
Source code download
This time, a mask covers your face.
In the next article, we'll use AI again,
to fill in the stunning beauty underneath.