# AI Enabling One-click Automatic Detection: Page Abnormality, Control Abnormality, Text Abnormality

Time: 2019-9-27

## 1. Preface

The Idle Fish quality team is committed to delivering high-quality apps to users. As AI technology has developed and TensorFlow has become widely used, new possibilities have opened up for testing methods. This article introduces one piece of AI practice in Idle Fish testing: using AI to find bugs from screenshots.

## 2. Model Selection

Several kinds of bugs can be found without understanding the business logic, such as blank pages, abnormally displayed controls, and abnormal text. Screenshots of blank or broken pages share obvious characteristics: a large blank area, or an error shown in the center of the page. We therefore use a simple CNN model built with TensorFlow to classify screenshots as normal or abnormal. For text abnormalities such as garbled characters, a simple Chinese character recognition model built from OCR + LSTM recognizes the text in the picture, and the recognized text is then checked for garbled characters.
The training samples for these models come from screenshots of historical bugs and from positive samples generated with mock data.
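The article does not say how the recognized text is judged to be garbled. As a minimal sketch, assuming the OCR step has already produced a string, one simple heuristic is to measure the share of characters that are Unicode replacement characters, private-use code points, unassigned code points, or control characters. The function names and the `0.1` threshold are hypothetical and are not taken from the original tool.

```python
import unicodedata


def garbled_ratio(text: str) -> float:
    """Fraction of non-whitespace characters that look like encoding debris."""
    chars = [c for c in text if not c.isspace()]
    if not chars:
        return 0.0
    suspicious = 0
    for c in chars:
        category = unicodedata.category(c)
        # U+FFFD is the replacement character; Co = private use,
        # Cn = unassigned, Cc = control character.
        if c == "\ufffd" or category in ("Co", "Cn", "Cc"):
            suspicious += 1
    return suspicious / len(chars)


def looks_garbled(text: str, threshold: float = 0.1) -> bool:
    """Hypothetical rule: flag text whose suspicious share reaches the threshold."""
    return garbled_ratio(text) >= threshold
```

A rule like this only catches one family of garbling; text that decodes into valid but meaningless characters would need a language-level check.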

## 3. Model Re-training: Improving the Accuracy of Model Recognition

The initial model was trained on a limited number of samples. As the app keeps iterating, the number of screenshots to check grows, and some new pages are misclassified. To address these false alarms, the model needs to be retrained.
Manually launching retraining and replacing the old model each time is too costly, so the front end provides an entry point for submitting checked pictures for retraining. A scheduled Jenkins job reads all retraining pictures, runs the retraining script, and replaces the old model with the newly generated one. After several rounds of automatic iteration, recognition accuracy improved considerably.

## 4.1 Special Screenshots

Some screenshots contain large blank areas but are correct from a business point of view, such as the intermediate search page. If such pictures are not handled, they are reported as abnormal every time, which wastes checking time. Adding them to retraining risks the model failing to converge. To handle these pictures, a library of them is maintained, and every picture identified as abnormal is compared with the pictures in the library. If its similarity to any picture in the library exceeds a set threshold, the picture is ignored and not reported.
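The comparison against the library can be sketched as follows. The article does not name the similarity measure, so this example assumes a simple one: one minus the mean absolute pixel difference, scaled to the range 0 to 1. The function names and the `0.95` threshold are hypothetical, and the pictures are assumed to be already resized to the same shape.

```python
import numpy as np


def similarity(a: np.ndarray, b: np.ndarray) -> float:
    """Similarity in [0, 1] between two equally sized uint8 image arrays."""
    # Cast to float first: subtracting uint8 arrays directly would wrap around.
    diff = np.abs(a.astype(np.float64) - b.astype(np.float64))
    return 1.0 - float(diff.mean()) / 255.0


def can_ignore(candidate: np.ndarray, library, threshold: float = 0.95) -> bool:
    """True if the candidate is close enough to any picture in the library."""
    return any(similarity(candidate, ref) >= threshold for ref in library)
```

A picture flagged as abnormal would be reported only when `can_ignore` returns `False`.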

## 4.2 Picture Deduplication

To ensure that all elements on a page are retrieved, a single traversal task currently visits the same page at least twice. In addition, to make the page context easier to analyze, the clicked element is marked with a red box. As a result, the set of pictures to be identified contains multiple duplicate screenshots of the same page, possibly with red boxes in different places. Manually inspecting a large number of repeated recognition results inevitably causes visual fatigue, so displaying the results after deduplication greatly improves the efficiency of manual screening and reduces its cost.

### 4.2.1 Solutions

Hierarchical clustering suits this problem, where the number of pictures is large and the number of distinct pages is not known in advance. This article uses the bottom-up (agglomerative) approach: each screenshot starts as its own cluster, and the two clusters with the smallest distance are merged repeatedly until the expected number of clusters is reached or another termination condition is met.
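To make the merge loop concrete, here is a deliberately naive sketch of bottom-up clustering. It is an illustration only, not the implementation used in the tool: the distance between clusters is taken as the distance between their centroids (an assumption chosen for brevity), and merging stops once the closest pair is farther apart than a hypothetical `max_distance`.

```python
import numpy as np


def agglomerate(samples: np.ndarray, max_distance: float) -> list:
    """Naive bottom-up clustering; returns clusters as lists of row indices."""
    # Start with every sample in its own cluster.
    clusters = [[i] for i in range(len(samples))]
    while len(clusters) > 1:
        best = None  # (distance, index_a, index_b) of the closest pair
        for a in range(len(clusters)):
            for b in range(a + 1, len(clusters)):
                centroid_a = samples[clusters[a]].mean(axis=0)
                centroid_b = samples[clusters[b]].mean(axis=0)
                d = float(np.linalg.norm(centroid_a - centroid_b))
                if best is None or d < best[0]:
                    best = (d, a, b)
        # Termination condition: the closest pair is still too far apart.
        if best[0] > max_distance:
            break
        _, a, b = best
        clusters[a] = clusters[a] + clusters[b]
        del clusters[b]
    return clusters
```

This loop rescans every pair of clusters on each merge, so it is only suitable for small examples; a library routine is the practical choice for real screenshot sets.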

### 4.2.2 Implementation [2]

#### 1) Calculate the distance between pictures

First, convert each picture into a vector of length `w*h*3`. The Euclidean distance between two vectors is taken as the distance between the pictures: the more similar the pictures, the smaller the distance.

```python
import numpy as np
from PIL import Image


def get_pic_array(url, w, h):
    # PIL opens the picture; `url` must be a local path or file-like object
    img = Image.open(url)
    img = img.resize((w, h))
    try:
        # Separate the channels; handles 4-channel pictures such as RGBA
        r, g, b, k = img.split()
    except ValueError:
        r, g, b = img.split()
    # Flatten each channel into a one-dimensional array of length w*h
    r_arr = np.array(r).reshape(w * h)
    g_arr = np.array(g).reshape(w * h)
    b_arr = np.array(b).reshape(w * h)
    # Concatenate the three channel arrays into one array of length w*h*3
    image_arr = np.concatenate((r_arr, g_arr, b_arr))
    return image_arr
```

To cluster the n pictures obtained from one app traversal, each picture is first processed as above, and the resulting vectors are then stacked into an `n × (w*h*3)` matrix that serves as the sample set.
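As a small sketch of this step, the following stacks already-flattened picture vectors into the sample matrix and computes all pairwise Euclidean distances. The helper names are hypothetical. The cast to `float64` is an added precaution: arrays produced from PIL images are usually `uint8`, and arithmetic on them can overflow.

```python
import numpy as np


def build_sample_matrix(vectors) -> np.ndarray:
    """Stack n flattened pictures (each of length w*h*3) into an n x (w*h*3) matrix."""
    return np.vstack(vectors).astype(np.float64)


def pairwise_euclidean(X: np.ndarray) -> np.ndarray:
    """n x n matrix of Euclidean distances between the rows of X."""
    squared_norms = (X ** 2).sum(axis=1)
    # ||x - y||^2 = ||x||^2 + ||y||^2 - 2 * x.y
    d2 = squared_norms[:, None] + squared_norms[None, :] - 2.0 * (X @ X.T)
    # Clip tiny negative values caused by floating-point rounding.
    return np.sqrt(np.maximum(d2, 0.0))
```

Identical screenshots give a distance of 0, and the distance grows as more pixels differ.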

#### 2) The Method of Calculating the Distance between Clusters

- Single: the distance between the two nearest samples, one from each cluster, is taken as the distance between the clusters.
- Complete: the distance between the two farthest samples, one from each cluster, is taken as the distance between the clusters.
- Average: the average distance over all pairs of samples, one from each cluster. This reduces the impact of individual outliers on the result, but requires more computation.
- Ward: based on the sum of squared deviations; the formula is more complex. For the exact formula and other methods, refer to documentation on calculating the distance between clusters.

After trying these, we found that ward works best, so we chose ward as the method for calculating the distance between clusters.

```python
from scipy.cluster.hierarchy import linkage
Z = linkage(X, 'ward')
```

After executing the above statement, the clustering is completed.
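For intuition about the four criteria described above, the following sketch computes them by hand for two small clusters. The function name is hypothetical. For Ward, the value shown is the increase in the within-cluster sum of squares that merging the two clusters would cause, which is the quantity Ward's method tries to keep small; the distance that a library stores for a Ward merge may be scaled differently.

```python
import numpy as np


def linkage_distances(A, B) -> dict:
    """Distances between clusters A and B (rows are samples) under four criteria."""
    A = np.asarray(A, dtype=np.float64)
    B = np.asarray(B, dtype=np.float64)
    # Euclidean distance between every sample of A and every sample of B
    pair = np.linalg.norm(A[:, None, :] - B[None, :, :], axis=2)
    n_a, n_b = len(A), len(B)
    centroid_gap = np.linalg.norm(A.mean(axis=0) - B.mean(axis=0))
    return {
        "single": float(pair.min()),
        "complete": float(pair.max()),
        "average": float(pair.mean()),
        # Increase in within-cluster sum of squares if A and B are merged
        "ward_increase": float(n_a * n_b / (n_a + n_b) * centroid_gap ** 2),
    }
```

Single and complete linkage each depend on one pair of samples, while average and Ward take every sample into account, which makes them less sensitive to a single unusual screenshot.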

#### 3) Selection of Critical Distance

This value directly affects the clustering result. If the critical distance is too small, some similar pictures will not be clustered into one group; if it is too large, pictures of different pages will be clustered together. Choosing an appropriate distance is therefore very important.
Experiments show that pictures recognized as abnormal by the page-abnormality model tend to be more similar to one another. To cluster different abnormal pages correctly, abnormal and normal pictures are clustered separately, and the critical distance for the abnormal class is set somewhat smaller.
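One way to apply a critical distance is SciPy's `fcluster` with the `distance` criterion, which cuts the hierarchy wherever a merge exceeds the threshold. The sketch below is an assumption about how the pieces could fit together, not the tool's actual code: `dedupe` and `dedupe_separately` are hypothetical helpers, and any threshold values passed to them would have to be tuned on real screenshots.

```python
import numpy as np
from scipy.cluster.hierarchy import fcluster, linkage


def dedupe(X, critical_distance: float):
    """Cluster the rows of X with Ward linkage.

    Returns the cluster label of each row and the index of one
    representative row per cluster.
    """
    X = np.asarray(X, dtype=np.float64)
    if len(X) < 2:
        # linkage needs at least two observations
        return np.ones(len(X), dtype=int), list(range(len(X)))
    Z = linkage(X, "ward")
    labels = fcluster(Z, t=critical_distance, criterion="distance")
    representatives = [
        int(np.flatnonzero(labels == k)[0]) for k in np.unique(labels)
    ]
    return labels, representatives


def dedupe_separately(abnormal, normal, abnormal_distance, normal_distance):
    """Cluster abnormal and normal pictures separately, each with its own threshold."""
    return dedupe(abnormal, abnormal_distance), dedupe(normal, normal_distance)
```

Passing a smaller `abnormal_distance` than `normal_distance` mirrors the choice described above; only the representative of each cluster then needs to be shown for manual screening.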

## 5. Summary and Prospect

At present, the tool recognizes whole-page abnormalities well, and the accuracy of text abnormality recognition keeps improving as the samples are enriched.

Next, we will integrate the LabelImg tool and build an SSD model with TensorFlow to identify abnormal controls in pictures. We are also experimenting with detecting disordered element or text layout, recognizing pages, and checking the expected results of page operations. We will continue to explore image processing and error recognition as a means of quality assurance.

### Reference documents:

[1] Image clustering: https://haojunsui.github.io/2016/07/16/scipy-hac/

Author of this article: Idle Fish Technology – Zhenlei