I happened to see a discussion among my colleagues. I don't know the specifics, but it seemed to be about something to do with the login page's verification code (CAPTCHA).
Verification codes are a common obstacle in testing, so I couldn't help wondering: how can we recognize and handle them efficiently during automated testing, so as to improve testing efficiency?
First, for a simple drag-and-drop verification code, you can use Selenium's drag-and-drop functionality: operate on the draggable element and specify either the drag distance or the target element.
The following example uses drag-and-drop on http://pythonscraping.com/pag… to prove that the visitor is not a robot.
A message is displayed in the message element in the page, which reads:
Prove you are not a bot, by dragging the square from the blue area to the read area!
Soon after the Python script runs, it receives the second message:
You are definitely not a bot!
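In Selenium, such a drag can be scripted with ActionChains. Below is a minimal sketch, assuming the page exposes the draggable square, the drop target, and the message element under the hypothetical element IDs used in the code:

```python
# Sketch of a Selenium drag-and-drop CAPTCHA step. The element IDs below
# ("draggable", "target", "message") are hypothetical placeholders; inspect
# the actual page to find the real ones.
def solve_drag_captcha(driver, source_id="draggable", target_id="target"):
    """Drag the element `source_id` onto `target_id`, then return the page message."""
    # imports are local so the sketch parses even without selenium installed
    from selenium.webdriver.common.action_chains import ActionChains
    from selenium.webdriver.common.by import By

    source = driver.find_element(By.ID, source_id)
    target = driver.find_element(By.ID, target_id)
    ActionChains(driver).drag_and_drop(source, target).perform()
    # read back the message element to confirm success
    return driver.find_element(By.ID, "message").text
```

Calling `solve_drag_captcha(driver)` on the opened page should return the "not a bot" confirmation text.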
Of course, this is a relatively simple verification code. On an unfamiliar page such an operation may be harder to pull off, and most websites' verification codes are far more complex than this.
In real environments, the common type is the image (character) verification code, so the operations above do not apply.
To figure out how to make a machine recognize an image verification code, first consider how we, as intelligent humans, recognize the characters in one:
1) Browse the image to find out the position of n characters in the image;
2) Identify the found characters in turn, and finally get the result of the whole verification code.
In fact, if you want the computer to recognize an image verification code, the steps are similar, such as:
1) Take some strategy to split the image containing multiple characters, so that each image contains only one character as far as possible;
2) Recognize each sub-image in turn, take the most probable result for each, and concatenate all the results in order to get the final answer.
The first step is often called image segmentation. If you have some impression of dynamic programming, this will be easy to understand, because image segmentation plays a similar role: the large problem is divided into several subproblems that are solved in turn.
The second step is a fairly standard supervised-learning problem: a classification problem. Supervised learning is mainly divided into regression problems and classification problems; the difference is that regression outputs are continuous while classification outputs are discrete.
A typical example of a regression problem: given the areas of several houses in a city and their corresponding prices, predict the price of a house with a given area.
Then the known information will be expressed in images like this:
A fitting function given by a machine learning algorithm may be like this:
Real-world regression problems often have more dimensions; area may not be the only factor affecting the predicted price. Other information, such as the house's neighborhood, the surrounding schools, the distance to the nearest police station, or the floor, can all serve as "parameters" that influence the final result (the price) to different degrees. Engineers often need to consider which parameters to use, or which to emphasize, to get a better result.
As for the classification problem, looking back a little at the verification-code example should make it clear: predicting which character a sub-image contains is classification, because the output is one of a discrete set of labels. Note that for supervised learning, the known data must include both the parameters and the results; without any price information about houses, you cannot make predictions. The other big class of problems is called "unsupervised learning", which we will come to shortly.
The possible difficulties of the two steps will also be discussed later.
For traditional machine-learning algorithms, both MATLAB and Python are very common. But when it comes to neural networks, since open-source tools such as TensorFlow and PyTorch are so widely used, Python may have the edge.
Anaconda is an open-source Python distribution and package manager that makes it easy to manage multiple virtual environments and handle version dependencies between packages. If you are still using Python 2, it is recommended to switch to Python 3, because many packages no longer provide updates for Python 2.
As for other Python packages, they will be listed separately when used later.
Generating a dataset of image verification codes is not complicated. The general steps are as follows:
1) Randomly select n characters from a specified character set (such as 0-9, A-Z, or even Chinese characters);
2) Place the generated characters on a background image;
3) Add noise to the image;
4) Apply random scaling, rotation, projection, and other transforms to the generated characters;
5) Output the finished verification-code image.
If the verification codes are to be used as a dataset, there are two additional steps:
6) Store the generated verification code;
7) Record the characters corresponding to the generated verification code and write them to the file.
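The steps above can be sketched with Pillow. This is a bare-bones illustration, not a production generator: it uses the default bitmap font and skips the per-character transforms of step 4 (noted in a comment):

```python
import random
import string
from PIL import Image, ImageDraw

def make_captcha(n_chars=4, size=(160, 60),
                 charset=string.ascii_uppercase + string.digits):
    # 1) randomly pick n characters from the charset
    text = "".join(random.choices(charset, k=n_chars))
    # 2) place them on a plain background image
    img = Image.new("RGB", size, (255, 255, 255))
    draw = ImageDraw.Draw(img)
    for i, ch in enumerate(text):
        # uses the default bitmap font; a real generator would load a TTF font
        # and 4) apply random rotation/scaling/projection to each character
        draw.text((10 + i * 35, random.randint(10, 30)), ch, fill=(0, 0, 0))
    # 3) sprinkle random dark noise points over the image
    for _ in range(200):
        xy = (random.randrange(size[0]), random.randrange(size[1]))
        draw.point(xy, fill=(random.randrange(150),) * 3)
    # 5) return the finished image together with its label (steps 6-7 would
    # save the image and record the label)
    return img, text
```

Calling `make_captcha()` returns a PIL image and the ground-truth string to record alongside it.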
That's all there is to it, and Python has many libraries that can generate an image verification code directly; here I used captcha.
After installation, a verification code can be generated by using the following code:
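A minimal sketch of the captcha package's ImageCaptcha API (assuming the package was installed with `pip install captcha`; the snippet skips itself gracefully otherwise):

```python
# Sketch of generating one CAPTCHA with the `captcha` package.
try:
    from captcha.image import ImageCaptcha
    HAVE_CAPTCHA = True
except ImportError:          # package not installed: skip the demo
    HAVE_CAPTCHA = False

if HAVE_CAPTCHA:
    gen = ImageCaptcha(width=160, height=60)
    # returns a PIL image containing the four characters
    img = gen.generate_image("AB12")
    # or write it straight to disk
    gen.write("AB12", "captcha_AB12.png")
```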
The generated verification code is as follows:
Going further, a loop can generate verification codes in batches and write them to files, yielding the images one by one; these form the training set used for model training. While generating each verification code, record its character content. A simple approach is to store the labels in a list and write them to a CSV file using a library such as pandas, for later reading.
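A sketch of that bookkeeping with pandas; the filenames and labels below are hypothetical stand-ins for the real generation loop:

```python
import pandas as pd

# One row per generated image: filename plus its ground-truth label.
labels = []
for i, text in enumerate(["AB12", "X9K3", "QW07"]):  # stand-in for the loop
    labels.append({"filename": f"{i}.png", "label": text})

df = pd.DataFrame(labels)
df.to_csv("labels.csv", index=False)

# Later, read the labels back for training:
restored = pd.read_csv("labels.csv")
```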
Using a similar method, we will generate a smaller number of test sets to test the training results after model training:
It is not difficult to see that the image of a verification code still has a lot of noise.
In machine learning, images are often read as "grayscale images". A grayscale image stores each pixel's intensity as a value from 0 to 255, reflecting shades from black to white. Using grayscale does not mean color images are bad, but training on color images requires far more data; after all, there are the three extra channels R, G, and B. For a relatively simple problem like this, I don't think color images are necessary. In fact, more and more complex machine-learning tasks are carried out on color images; on the other hand, grayscale images often make foreground and background easy to distinguish, so noise reduction may be easier.
To compute a grayscale image from an RGB image, the standard formula is: Gray = 0.299·R + 0.587·G + 0.114·B.
Of course, this is Python; there is hardly any algorithm we need to implement ourselves.
Use the following code to generate an example of a grayscale image:
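A sketch of that step with NumPy and Matplotlib, using a random array as a stand-in for a real CAPTCHA image:

```python
import numpy as np
import matplotlib
matplotlib.use("Agg")          # render off-screen; no display needed
import matplotlib.pyplot as plt

def to_gray(rgb):
    """Convert an H x W x 3 RGB array to grayscale with the standard weights."""
    return rgb[..., 0] * 0.299 + rgb[..., 1] * 0.587 + rgb[..., 2] * 0.114

# stand-in image; in practice, load the CAPTCHA file instead
rgb = np.random.randint(0, 256, size=(60, 160, 3)).astype(float)
gray = to_gray(rgb)
plt.imshow(gray, cmap="gray")  # without cmap="gray", the default colormap tints it
plt.savefig("gray.png")
```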
Here the Matplotlib library is used to display the image. The cmap="gray" argument tells the library to render the image in grayscale; otherwise the whole picture is displayed with a green tint.
The result of this code is as follows:
By observation, most of the noise is darker than the characters, i.e. its gray values are lower, while the background is lighter than the characters, i.e. its gray values are higher. "Binarizing" the image with thresholds (keeping only 0 and 255, i.e. pure black and white) therefore removes noise to a certain extent and simplifies the subsequent segmentation and classification.
The following code performs this step:
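A sketch of the binarization, with two hypothetical thresholds (120 and 200) standing in for whatever values suit the actual dataset:

```python
import numpy as np

# Hypothetical thresholds: noise is darker than the characters and the
# background is lighter, so only mid-range gray values are kept as ink.
T_LOW, T_HIGH = 120, 200

def binarize(gray, lo=T_LOW, hi=T_HIGH):
    """Pixels between the two thresholds become black (0); all others white (255)."""
    return np.where((gray > lo) & (gray < hi), 0, 255)

# tiny demo: 30 = dark noise, 150 = character, 230 = background, 100 = noise
gray = np.array([[30, 150, 230],
                 [150, 150, 100]])
binary = binarize(gray)
```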
It is worth noting that these two thresholds are just the values chosen for this particular dataset.
If you use a different data set, you need to choose another way to binarize.
The binarization result is shown in the figure:
As you can see, most of the noise has been removed. Of course, we also lost some information.
For the remaining isolated noise, one idea is to traverse the whole image and examine the gray values of the eight pixels around each black pixel; within this 3×3 grid, pixels whose count of black neighbors falls below some threshold are treated as discrete noise and removed.
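One naive implementation of this idea, assuming a 0/255 binary image with 0 as ink:

```python
import numpy as np

def remove_isolated_noise(binary, min_black_neighbors=1):
    """Erase black pixels whose 3x3 neighborhood contains fewer than
    `min_black_neighbors` other black pixels. Deliberately naive loops."""
    h, w = binary.shape
    out = binary.copy()
    for y in range(h):
        for x in range(w):
            if binary[y, x] != 0:          # only inspect black pixels
                continue
            black = 0
            for dy in (-1, 0, 1):
                for dx in (-1, 0, 1):
                    if dy == dx == 0:
                        continue
                    ny, nx = y + dy, x + dx
                    if 0 <= ny < h and 0 <= nx < w and binary[ny, nx] == 0:
                        black += 1
            if black < min_black_neighbors:
                out[y, x] = 255            # treat as discrete noise and remove
    return out

# tiny demo: one isolated black pixel plus a solid 2x2 block
noisy = np.full((5, 5), 255)
noisy[0, 0] = 0        # isolated -> should be erased
noisy[3:5, 3:5] = 0    # connected -> should survive
cleaned = remove_isolated_noise(noisy)
```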
This is an inefficient way of writing, but it’s easy to understand what kind of operation has been done.
If you want, you can optimize it yourself.
The images obtained after processing are as follows:
Next, the same method is used to process the whole data set.
After preprocessing, because of the data volume, the subsequent clustering step is not only slow but also consumes considerable memory. If a crash from insufficient memory occurs along the way, the images have to be re-read and converted to grayscale all over again, which is a nuisance. One solution is to use a library such as pickle to store the resulting train_array and test_array as binary files:
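A sketch with pickle, using zero arrays as stand-ins for the real preprocessed data (the names train_array and test_array follow the text):

```python
import pickle
import numpy as np

# Stand-ins for the preprocessed datasets described above.
train_array = np.zeros((100, 60, 160), dtype=np.uint8)
test_array = np.zeros((20, 60, 160), dtype=np.uint8)

# Save once...
with open("dataset.pkl", "wb") as f:
    pickle.dump({"train": train_array, "test": test_array}, f)

# ...reload later without redoing the grayscale/binarization work.
with open("dataset.pkl", "rb") as f:
    restored = pickle.load(f)
```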
The significance of image segmentation for breaking down the larger problem was mentioned earlier. A classification problem over four characters, each drawn from digits plus uppercase and lowercase letters, has 62^4 possible results. After splitting, the classifier only has to decide among 62 possibilities per sub-image; even guessing blindly improves the odds enormously.
For the image segmentation of image verification code, I have the following ideas:
1) Dot-multiply (convolve) the image in turn with several pre-designed filters. Among the resulting responses, select those with higher values and reasonably spaced positions to complete the segmentation.
Explain with images, like this:
The difficulty with this approach is how to design the filters and how to select among the responses they generate. Because there are many characters, each with many possible transformations, experienced engineers are needed to decide what filters to design. If there are too many filters, computation grows and so does the noise in the selection step, making the optimal solution hard to obtain.
Designing excellent filters by hand is not easy; it may take the effort of a large team, which makes such an algorithm hard to implement. Still, the idea itself is very sound, and we will meet it again later.
2) Decide with some simple algorithm. The simplest example is vertical (longitudinal) cutting: cut at every column that contains no foreground pixels. The drawback is that it struggles to split characters that touch. If two characters are stuck together, roughly usable results can still be obtained by cutting the span evenly in half, but if more characters are stuck, or even all of them, this method can fail completely.
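The vertical-cutting rule can be sketched as follows, returning the column spans that contain ink:

```python
import numpy as np

def vertical_cut(binary):
    """Split a 0/255 image at every empty column (no black pixels).
    Returns a list of (start, end) column spans containing ink."""
    has_ink = (binary == 0).any(axis=0)   # per-column: any black pixel?
    spans, start = [], None
    for x, ink in enumerate(has_ink):
        if ink and start is None:         # entering a character
            start = x
        elif not ink and start is not None:  # leaving a character
            spans.append((start, x))
            start = None
    if start is not None:                 # character touches the right edge
        spans.append((start, len(has_ink)))
    return spans

# tiny demo: two well-separated "characters"
demo = np.full((5, 10), 255)
demo[2, 1:3] = 0
demo[2, 6:8] = 0
spans = vertical_cut(demo)
```

Each span can then be used to slice the image into one sub-image per character, provided no characters touch.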
3) Obtain the segmentation through some clustering algorithm. Clustering algorithms are a form of unsupervised learning, used to partition an unlabeled dataset into several classes. In traditional machine learning, common clustering algorithms include K-Means, EM-GMM, Mean-Shift, and so on. In neural networks, clustering tasks are often achieved with encoder-decoder models.
The following is an introduction to the three commonly used clustering algorithms in traditional learning:
- K-means algorithm
For k-means, the objective function is to minimize the total squared distance from each point to its assigned cluster center: J = Σᵢ ‖xᵢ − μ_c(i)‖². Assuming the cluster centers are known, we only need the assignment that minimizes each point's distance to a center, so each point is assigned to its nearest cluster center. Conversely, if we know which center each point belongs to, the centers themselves are easy to recompute as the means of their assigned points.
However, in practice we know neither of these two things in advance, which is like the chicken-and-egg question.
Fortunately, the problem is solvable in engineering.
Common methods are:
- Randomly pick several initial cluster centers;
- Compute each point's Euclidean distance to the centers to decide which cluster it belongs to;
- For each cluster, recompute the new center as the mean of its points; repeat until convergence.
For k-means, the final result depends on the initialized cluster center.
Therefore, if you are unlucky, you may need to use k-means many times to get a more reasonable clustering result.
In engineering, K-means algorithm is often used many times to obtain the clustering center and take the solution that minimizes the value of the objective equation.
Due to the use of Euclidean distance, K-means algorithm will get a circular cluster.
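A compact sketch of the procedure above (using a deterministic farthest-point initialization instead of random restarts, for reproducibility):

```python
import numpy as np

def kmeans(points, k, n_iter=50):
    """Plain k-means (Lloyd's algorithm), as outlined above."""
    # deterministic farthest-point seeding: first center is the first point,
    # each subsequent center the point farthest from all chosen centers
    centers = [points[0]]
    for _ in range(1, k):
        d = np.min([np.linalg.norm(points - c, axis=1) for c in centers], axis=0)
        centers.append(points[d.argmax()])
    centers = np.asarray(centers)
    for _ in range(n_iter):
        # assignment step: each point goes to its nearest center
        dists = np.linalg.norm(points[:, None, :] - centers[None, :, :], axis=2)
        labels = dists.argmin(axis=1)
        # update step: each center becomes the mean of its assigned points
        new_centers = np.array([points[labels == j].mean(axis=0) for j in range(k)])
        if np.allclose(new_centers, centers):   # converged
            break
        centers = new_centers
    return centers, labels

# demo: two well-separated blobs
rng = np.random.default_rng(0)
pts = np.vstack([rng.normal(0, 0.4, (30, 2)), rng.normal(8, 0.4, (30, 2))])
centers, labels = kmeans(pts, 2)
```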
The graphical demonstration of K-means algorithm is as follows:
- EM-GMM algorithm
A multivariate Gaussian mixture model (GMM) can obtain elliptical clustering.
The shape of the ellipse is determined by the covariance matrix of Gaussian distribution, and the position is determined by the mean.
A Gaussian mixture model is a weighted sum of Gaussian distributions, p(x) = Σⱼ wⱼ N(x | μⱼ, Σⱼ), where each Gaussian represents one elliptical cluster.
In practice, the expectation-maximization (EM) algorithm is commonly used to fit a GMM. EM is the standard way to obtain maximum-likelihood estimates when there are hidden/unobserved variables.
The solution process of EM algorithm for GMM is as follows:
- E-step: for each point, compute the responsibility (posterior probability) of each cluster j, and assign the point to clusters accordingly.
Note that the assignment used here is a soft assignment: a point can be assigned fractionally to several clusters.
- M-step: update each Gaussian cluster's weight, mean, and covariance, using the soft counts of points in each cluster to compute the weighted statistics.
Because a circle is just a special ellipse, when the covariances are constrained to be equal (spherical), the clustering result of EM-GMM should match that of k-means.
Such clustering results can often be obtained by using em-gmm algorithm:
- Mean shift algorithm
The idea of mean shift is to iteratively compute, for each point, the gradient of the local density and move the point accordingly; points that finally arrive at the same local optimum belong to the same cluster.
In practice, the specific steps are as follows:
1) Start from an initial point x;
2) Using a Gaussian kernel, find the neighbors of x within a certain radius;
3) Move x to the mean of those neighbors; repeat until x no longer changes.
The graphical demonstration of this step is as follows:
Put more bluntly: randomly initialize a circle of fixed radius, then keep moving its center toward the region of highest data density. After several iterations, when it no longer moves, it has reached a cluster center, and the points within the circle's radius are the points of that cluster. For mean shift, the tunable parameter is naturally the circle's radius (the bandwidth).
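A single-seed sketch of this procedure, using a flat window instead of the Gaussian kernel for simplicity:

```python
import numpy as np

def mean_shift_point(x, points, radius=2.0, n_iter=100):
    """Shift a single seed `x`: repeatedly move it to the mean of the points
    within `radius` (flat window; a Gaussian kernel would weight neighbors
    instead), until it stops moving, i.e. reaches a density peak."""
    for _ in range(n_iter):
        near = points[np.linalg.norm(points - x, axis=1) < radius]
        if len(near) == 0:                    # no neighbors: give up
            break
        new_x = near.mean(axis=0)
        if np.linalg.norm(new_x - x) < 1e-6:  # converged
            break
        x = new_x
    return x

# demo: a seed placed off to the side drifts onto the cluster's density peak
rng = np.random.default_rng(0)
cluster = rng.normal(5, 0.3, size=(50, 2))
peak = mean_shift_point(np.array([6.0, 6.0]), cluster)
```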
For this problem, an algorithm close to mean shift in implementation (and also commonly used) is the watershed algorithm.
Imagine a rolling mountain range. At this time, it suddenly rains heavily. The low place between the mountains is naturally filled with water, and the image is segmented. In the implementation, the process of “rain” is essentially the process of gradient descent.
Clustering sounds well suited to this goal, and indeed, for simpler verification codes (where the characters are clearly separated and easy to split), clustering can do the segmentation; for complex cases, however, the result may be far from ideal.
The following is the result of clustering the previous verification codes using k-means algorithm:
You can see that the characters C and 9 have been separated, while X and 6 are completely confused.
The reason may be:
the characters do not follow a circular distribution, or k-means could not get a good initialization in this case.
The more critical problem appears if we check the x-coordinates of the cluster centers and display the corresponding results from left to right:
Result 3, which looks more like "6", appears before result 4, which looks more like "X". Labeling this output directly would produce "dirty data".
The result obtained by the watershed algorithm is not ideal either:
For machine learning of traditional algorithms, solving a more difficult problem often requires a team of engineers with dozens or even hundreds of people to design for months or even years.
Given segmentation results like these, going straight to image classification is clearly not viable. So next I will introduce some common classification algorithms, and then we will see whether a neural network can solve the problem.
After image segmentation there are further techniques, such as "normalizing" the image. Here, normalization can correct the character's orientation to some extent; a well-designed "oriented bounding box" is one way to achieve this.
For classification, the main difficulty in this problem is that some characters are inherently hard to distinguish, which hampers the classifier's convergence. Take 0 and O, or I and l: can you really tell which is which?
However, I can’t think of any good solution to this problem, so I can only turn a blind eye.
A common strategy applied to the dataset before training is called "data augmentation". Generally speaking, data augmentation expands a dataset by transforming the data already in it. Be careful, however, about which transformations you apply. For this problem the simple options are mainly shearing, rotation, shifting, and zooming. Rotation must stay within a small range: rotate "7" by 180 degrees and you get a shape resembling "L" whose label is still "7", which is obviously noise. You can also add noise, such as Gaussian noise, to augment the data and make the model more robust.
The code to realize this function is as follows:
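Since small rotations, shifts, and noise are the safe transforms here, the following is a sketch of one augmentation step with Pillow and NumPy (the shift uses np.roll, which wraps around, acceptable only for small shifts on padded images):

```python
import random
import numpy as np
from PIL import Image

def augment(img, max_angle=15, max_shift=3):
    """One random, label-preserving variant of a character image: a small
    rotation, a small shift, and light Gaussian noise. Large rotations would
    corrupt labels (e.g. "7" turning into an "L"-like shape)."""
    # small rotation, filling revealed corners with white background
    angle = random.uniform(-max_angle, max_angle)
    out = img.rotate(angle, fillcolor=255)
    # small translation via np.roll (wraps around the edges)
    dx, dy = (random.randint(-max_shift, max_shift) for _ in range(2))
    arr = np.asarray(out, dtype=float)
    arr = np.roll(arr, (dy, dx), axis=(0, 1))
    # light Gaussian noise for robustness
    arr = np.clip(arr + np.random.normal(0, 8, arr.shape), 0, 255)
    return Image.fromarray(arr.astype(np.uint8))

# demo on a blank stand-in character image
base = Image.fromarray(np.full((40, 40), 255, dtype=np.uint8))
augmented = augment(base)
```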
In fact, data augmentation is not strictly necessary here, because new verification codes are so easy to generate and we already have a sizeable training set. Given enough time to produce a large, evenly distributed dataset whose distribution resembles the test set's, the model's performance on the test set will not be bad. The augmentation methods used here are the same transforms applied while generating the verification-code images in the first place, so they are neither essential nor a major accuracy booster; they are mentioned mainly as an introduction.
You may have noticed that the code uses a Python library named Keras. Keras is essentially a set of high-level APIs for building neural networks conveniently; it supports TensorFlow, Theano, and others as backends, and personally I find its syntax somewhat friendlier than native TensorFlow.
Classification algorithms can be divided into binary and multi-class. As the name suggests, a binary classification algorithm can only divide a dataset into two categories; common examples include logistic regression (LR) and support vector machines (SVM). The most common multi-class algorithm is the decision tree.
- logistic regression
For a linear function f(x) = wᵀx + b, given a feature vector x, logistic regression defines the probability of the positive class as p(y=1|x) = 1 / (1 + e^(−f(x))). Its loss function is the negative log-likelihood; minimizing it maximizes the likelihood of the training set. Overfitting is prevented through regularization, with the regularization hyperparameter C chosen by cross-validation. For a given sample x*:
- select the category with the highest probability p(y | x*);
- or, equivalently, use the sign of f(x*) directly.
For binary classifiers such as logistic regression, they can also be used to solve multi classification problems through some means.
One is to use a strategy such as "one-vs-rest": split the multi-class problem into several binary problems and solve them with multiple binary classifiers. Another is to define a multi-class objective function directly, assigning each category c its own weight vector w_c and defining the probabilities with the softmax function: p(y=c|x) = exp(w_cᵀx) / Σ_c′ exp(w_c′ᵀx).
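The two probability definitions can be written out directly:

```python
import numpy as np

def sigmoid(z):
    """Logistic function: maps a score f(x) = w.x + b to a probability in (0, 1)."""
    return 1.0 / (1.0 + np.exp(-z))

def softmax(scores):
    """Multi-class generalization: one score w_c.x per class c, normalized
    so that the class probabilities sum to 1."""
    e = np.exp(scores - scores.max())  # subtract max for numerical stability
    return e / e.sum()

# demo: three class scores; the largest score gets the largest probability
p = softmax(np.array([2.0, 1.0, 0.1]))
```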
- Support vector machine
For a classification problem, there may be several classifiers:
For SVM, its purpose is to find the classifier with the largest margin.
The advantage of the SVM decision is that it not only classifies all points correctly but does so with the largest possible margin, i.e. the least uncertainty.
For a given training set {(xᵢ, yᵢ)}, first define the margin. The distance from a point xᵢ to the hyperplane defined by w is |wᵀxᵢ + b| / ‖w‖, and the margin is the shortest of these distances over all points. The objective is to maximize that margin while classifying every point correctly, usually written as: minimize ‖w‖² subject to yᵢ(wᵀxᵢ + b) ≥ 1 for all i. At prediction time, a new data point x is classified by the sign of wᵀx + b.
It can be seen that the decision boundary generated by SVM is linear.
A "kernel function" can be used so that the SVM makes nonlinear decisions. This is effectively a method of mapping the dataset into a higher-dimensional space where it becomes linearly separable.
K-nearest neighbors is a very simple classification algorithm.
Each time a new sample arrives, the classifier finds the k nearest known samples and assigns the majority class among them to the new sample. This is not a great classification method, because the classifier has not really learned anything.
- Decision tree
Decision tree is also a relatively simple classifier. You can think of it as a flow chart.
For an and operation:
A decision tree is generated as follows:
The drawback of the decision tree is that it overfits easily. Fortunately, this can be compensated for by a bagging strategy. Because the decision tree is itself a simple classifier (low resource use, fast to run), it is often combined with boosting or bagging strategies to produce better results.
Boosting: train multiple classifiers, and each new classifier focuses on the wrong part of the previous classifier.
Bagging: train multiple classifiers, each using only a random subset of the training set. At prediction time, all classifiers vote to decide the new sample's class. A bagging strategy built on decision trees is also called a "random forest" classifier.
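A sketch of a random forest (bagged decision trees) with scikit-learn, on a synthetic stand-in for the per-character classification task (the features below are hypothetical):

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier

# Toy stand-in: each "image" is a flattened feature vector, each label one
# of three character classes clustered around different feature means.
rng = np.random.default_rng(0)
X = np.vstack([rng.normal(c, 0.5, size=(50, 4)) for c in range(3)])
y = np.repeat([0, 1, 2], 50)

# Bagging over decision trees == a random forest.
clf = RandomForestClassifier(n_estimators=50, random_state=0)
clf.fit(X, y)
acc = clf.score(X, y)   # training accuracy on the toy data
```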
Verification code recognition using neural network
Two ideas can be used to solve this problem using neural network.
1) Multi classification task
A multi-class task is, as described above, one whose final result is a single label, but that label can take one of several categories.
2) Multi label task
For multi label tasks, a piece of data may have one or more labels.
Imagine you are listening to a song. If you hear rap in it, rap is a label of the song. Similarly, this song can also have labels such as pop, classical… And so on.
Whenever a corresponding feature is present, it can become one of the song's labels, each label with its own probability.
When training a neural network, the activation and loss functions pair with the task as follows: a multi-class task uses a softmax output with categorical cross-entropy loss, while a multi-label task uses sigmoid outputs with binary cross-entropy loss.
If you want to know more about multi-label tasks, see the survey by Zhang Min-Ling and Zhou Zhi-Hua in TKDE 2014: "A Review on Multi-Label Learning Algorithms".
Owing to limits of time and space, I will not go much deeper into neural networks; otherwise this article would never end. The following is a neural-network implementation:
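Below is a hedged Keras sketch of such a network, assuming 60×160 grayscale inputs and 4 characters drawn from 36 classes (so 4 × 36 sigmoid outputs, the multi-label formulation); it skips itself if TensorFlow is not installed:

```python
# Sketch of a small multi-label CNN in Keras. The input size (60x160x1) and
# output layout (4 characters x 36 classes) are assumptions for illustration.
try:
    from tensorflow import keras
    from tensorflow.keras import layers
    HAVE_TF = True
except ImportError:
    HAVE_TF = False

def build_model(h=60, w=160, n_chars=4, n_classes=36):
    model = keras.Sequential([
        layers.Input(shape=(h, w, 1)),
        layers.Conv2D(32, 3, activation="relu"),
        layers.MaxPooling2D(),
        layers.Conv2D(64, 3, activation="relu"),
        layers.MaxPooling2D(),
        layers.Flatten(),
        layers.Dropout(0.5),
        # sigmoid + binary cross-entropy: the multi-label formulation
        layers.Dense(n_chars * n_classes, activation="sigmoid"),
    ])
    model.compile(optimizer="adam", loss="binary_crossentropy",
                  metrics=["accuracy"])
    return model

if HAVE_TF:
    model = build_model()
```

For the multi-class variant, the final Dense layer's activation would change to softmax and the loss to categorical cross-entropy.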
A convolutional neural network (CNN) is used here. CNNs are widely used in computer-vision problems. Compared with the traditional DNN (deep neural network; here meaning an MLP, multi-layer perceptron), a CNN can attend to the "whole" and the "parts" at the same time and find correlations between different regions.
This model contains many staples of neural networks (some also used in traditional machine learning), such as the "optimizer" (Adam is probably the most common in recent years, though many experts dislike it), "pooling" (most commonly max-pooling, which takes the maximum within each window and shrinks the matrix, letting the network "see more widely"), "dropout" (randomly silencing some nodes during decisions, making the network more robust against overfitting), and activation functions (ReLU, sigmoid, etc.). If you are interested, you can learn more on your own.
The way the neural network is used here is about the simplest possible: define a network structure, plug in the data, and see how the results look; if they are not ideal, adjust the structure and try again. This is also a significant difference from traditional machine-learning algorithms: a neural network is data-driven, and human intervention mostly amounts to steering it with input data, whereas a traditional algorithm is designed by engineers in advance, and poor results usually mean the designed algorithm mishandles certain cases. With a network, you can likewise modify the structure or try other networks.
The code above takes the multi-label approach. You can also modify it into a multi-class network (only the activation and loss functions of the fully connected layer need to change; from my testing this also works, with quite high accuracy). Let's look at the results.
Of course, not everyone uses this “end-to-end” method when using neural networks, that is, a neural network can solve all problems. You can also preprocess the data input to the network (preprocessing can also be completed by other networks) to see if you can get better results.
Furthermore, the CNN network adopted here is a common solution to the problem of CV (computer vision), and the idea of NLP (natural language processing) can also be adopted to solve the problem of graphic verification code.
This problem can also be solved by using some RNNs (recurrent neural network), such as the popular LSTM in previous years.
The network here actually predicts only 36 categories.
Here is a screenshot of the training process:
In the fifth epoch, the model's performance on the validation set (of the 10,000 training images, 9,000 are used for training and 1,000 as a validation set, i.e. data that does not participate in training and only measures how training is going) begins to decline, indicating that the model has started to overfit the training set. This process needs no manual intervention; it can be handled by early stopping. As you can see, I stopped training at an accuracy of about 95.81%.
A prediction result of the model is as follows. It can be seen that this is a correct prediction.
The performance of the model on the test set is 0.9390625, that is, about 93.9%.
Of course, verification codes have various forms, and more and more complex verification codes are more and more difficult to break.
For example, the image verification codes some websites now use present multiple pictures and ask the user to find and select the ones containing a given object. That is far harder than simple character recognition: the algorithm has to recognize a considerable number of object classes (hundreds of them: cars, elephants, flowers, traffic lights, and so on).
To sum up: for testing work, the best approach is to discuss with the developers and simply bypass the verification code. That saves the time of building an entire recognition system and guarantees passing; after all, no machine-learning method achieves 100% accuracy (given that the training set cannot contain every possible case).
Text / Pokemon fish