Abstract: In this era of computing power, researchers on the one hand keep studying general-purpose networks for different scenarios, and on the other hand keep optimizing the way neural networks learn. Both efforts aim to reduce the computing resources that AI requires.

This article is shared from the Huawei Cloud community: “OCR performance optimization series (2): from neural network to plasticine”, original author: hw007.

OCR refers to the recognition of printed characters in pictures. Recently we have been optimizing the performance of an OCR model: we rewrote the TensorFlow-based OCR network in CUDA C and ultimately achieved a 5-fold performance improvement. Through this work I gained a deeper understanding of the general structure of OCR networks and of the related optimization methods. I plan to record it here in a series of blog articles, which also serve as a summary and study notes of my recent work. The first article, “OCR performance optimization series (1): overview of the BiLSTM network structure”, derived step by step, from the perspective of motivation, how an OCR network based on the seq2seq structure is built. Now we go from neural networks to plasticine.

## 1. CNN: the essence of machine learning

Let us start from the input in the lower left corner of Figure 1 in “OCR performance optimization series (1)” and walk through the flow of that figure. The input is 27 text-fragment images to be recognized, each of size 32 * 132. These images are encoded by a CNN, which outputs 32 matrices of size 27 * 384, as shown in the figure below:

It is worth noting that the dimension order is adjusted in this step: the input changes from 27 * (32 * 132) to 32 * (27 * 384). You can think of each 32 * 132 picture as being stretched and flattened into a single row (1 * 4224) and then reduced to 1 * 384, similar to the example of reducing 1024 to 128 in optimization strategy 1.

How do we reduce 27 * 4224 to 27 * 384? The simplest and crudest way is to use the Y = AX + B model mentioned earlier: multiply the 27 * 4224 input by a 4224 * 384 matrix A, where A is obtained by training on data. Obviously, for the same X, different A’s give different Y’s, and the drop from 4224 dimensions to 384 is rather steep. So scholars came up with the great method of ganging up: if one A cannot do it, use several. Here 32 A’s are used. This 32 is the sequence length of the LSTM network, and it is proportional to the input image width of 132. When the input image width becomes 260, 64 A’s are needed.
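The flatten-and-multiply abstraction above can be sketched in NumPy to make the shapes concrete. This is only an illustration of the article’s mental model, not how a CNN is actually implemented: the arrays are filled with random placeholder values instead of trained weights, and the variable names (`images`, `X`, `A`, `B`, `sequence`) are chosen here for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)

# 27 text-fragment images, each 32 x 132 (sizes taken from the text).
images = rng.standard_normal((27, 32, 132))

# Flatten every image into one row of 32 * 132 = 4224 values.
X = images.reshape(27, -1)                  # shape (27, 4224)

# One hypothetical "dimension reducer": Y = X @ A + B.
# In a real model A and B are learned; here they are random placeholders.
A = rng.standard_normal((4224, 384))
B = np.zeros(384)
Y = X @ A + B                               # shape (27, 384)

# "Ganging up": 32 independent reducers, one per LSTM sequence step.
# Each A is generated on the fly so that only one is in memory at a time.
sequence = np.stack(
    [X @ rng.standard_normal((4224, 384)) for _ in range(32)]
)                                           # shape (32, 27, 384)

print(X.shape, Y.shape, sequence.shape)
```

The last shape, 32 matrices of 27 * 384, is the layout that the LSTM stage consumes.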

Someone may ask: where are these 32 A’s in a CNN? I cannot see them. Indeed, they are just my abstraction of what the CNN does. The point is to give you a vivid picture of how the dimensions change during CNN encoding and of the sequence length of the LSTM layer that follows: the sequence length of the LSTM is simply the number of “dimension reducers”. If you are sharp, you will have noticed that even the term “dimension reduction” is wrong, because the outputs of the 32 “dimension reducers” put together have 32 * 384 = 12288 dimensions, far more than 4224. After passing through the CNN, the data dimension does not decrease, it increases! The more formal name for this thing is “encoder” or “decoder”; “dimension reducer” is a term I coined in this article to make the picture more vivid. Whatever you call it, I hope you remember that in essence it is the coefficient matrix A.

Let us stay with the idea of 32 A’s. After the CNN, the data for each text image goes from 32 * 132 to 32 * 384. On the surface the amount of data has increased, but the amount of information has not. It is like my lengthy introduction to OCR here: the word count grows, but the information is still just the principle of OCR; at most the readability or interest improves. How does a CNN manage to talk so much “nonsense” (increasing the amount of data without increasing the amount of information)? Behind this lies a top-level routine of machine learning models called “parameter sharing”.

A simple example: suppose a friend happily tells you one day about his new discovery, “one apple costs 5 yuan, two apples cost 10 yuan, three apples cost 15 yuan…”. I believe you would doubt his IQ. Why not just say “n apples cost 5 * n yuan”? The essence of nonsense is giving a lot of redundant data without any abstraction or summary. The essence of not talking nonsense is summing up rules and experience. Machine learning is like this apple example: you give the machine a large number of examples, simply and crudely, and expect it to sum up the rules and experience by itself. In the OCR example above, these rules and experience correspond to the network structure and the model parameters of the whole model. As for the network structure, the current mainstream still relies on human intelligence to choose it, for example CNNs for image scenarios and RNNs for sequence scenarios. If the network structure is chosen badly, the learned model will not perform well.

As analyzed above, 32 A matrices of size 4224 * 384 would fully satisfy the CNN’s input and output dimensions, but the model trained this way would not perform well. In theory, 32 matrices of size 4224 * 384 contain 32 * 4224 * 384 = 51,904,512 parameters. This means the model has too much freedom during learning and can easily learn badly. In the apple example, the person who can state the fact in fewer words has the stronger ability to summarize, because extra words not only introduce redundancy but may also bring in irrelevant information that interferes with later use of the rule. Suppose the price of apples can be stated clearly in 10 words and you use 20. Very likely the other 10 words describe irrelevant things such as the rain or the time of day. This is the famous “Occam’s razor”: do not complicate simple things!

In machine learning, Occam’s razor is applied mainly to the design and selection of model structures, and “parameter sharing” is one of its common routines. In short, if 32 A matrices of size 4224 * 384 have too many parameters and leave the model too free during learning, we add restrictions to make the model less free. We force it to explain the apple price clearly in 10 words. The method is simple: if the 32 A’s are required to be exactly the same, the number of parameters is only 4224 * 384, a 32-fold reduction in one stroke. If a 32-fold reduction feels too harsh, I can relax the constraint a little: I do not require the 32 A’s to be exactly the same, only to look alike.
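The parameter counts in this argument are easy to check with a few lines of Python. The variable names are chosen here for illustration; the sizes come from the text.

```python
rows, cols, steps = 4224, 384, 32

# One A matrix of size 4224 * 384.
params_per_matrix = rows * cols

# 32 independent A matrices: the "too free" model.
params_independent = steps * params_per_matrix

# Full parameter sharing: all 32 steps reuse one and the same A.
params_shared = params_per_matrix

print(params_independent)                    # 51904512
print(params_shared)                         # 1622016
print(params_independent // params_shared)   # 32
```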

I call this razor method “you never know how good a model can be until you force it”. Some people may object: even if I allow you 20 words to explain the price of apples, that does not rule out that you are highly motivated, pursue perfection, and explain it clearly in 10 words anyway. By the same reasoning, if a single A matrix really is enough to cover the problem, then whether I give the model 32 A’s or 64 A’s, it should learn A’s that are exactly the same. This is true in theory, but it is unrealistic at present.

To discuss this problem, we have to go back to the essence of machine learning. Every model can be expressed abstractly as Y = AX, where X is the model input, Y is the model output, and A is the model. Note that the A in this paragraph is different from the A above: here it contains both the structure and the parameters of the model. Training the model means solving for A when X and Y are known. The question above then becomes: is the solution A unique? Let me first take a step back. Suppose there is a law in the real world, that is, A exists and is unique, just like the laws of physics. Can our model capture this law from a large amount of training data?

Take the mass-energy equation E = m * c^2 as an example. The structure of this model is E = m * c^2, and its parameter is the speed of light c. This model, proposed by Einstein, can be called a crystallization of human intelligence. If we tackle the problem with today’s AI methods, there are two cases: strong AI and weak AI.

Consider first the most extreme weak AI method, which is also the current mainstream approach: a large share of human intelligence plus a small share of machine intelligence. Specifically, humans use their own intelligence to discover that E and m satisfy E = m * c^2, then feed a lot of E and m data to the machine and let it learn the parameter c of the model. In this example the solution for c is unique, and feeding the machine only a small amount of E and m data is enough to obtain it. Clearly, most of the intelligent work lies in how Einstein ruled out all kinds of messy factors such as time, temperature and humidity, and determined that E depends only on m and satisfies E = m * c^2. In machine learning this part of the work is called “feature selection”, which is why many machine learning engineers jokingly call themselves “feature engineers”.
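The weak AI case, where the structure E = m * c^2 is fixed by a human and the machine only learns c, can be sketched as a one-parameter least-squares fit. The data below are synthetic and noise-free: they are generated from the value 3.0 * 10^8 m/s, and the mass values are made up for illustration. `fit_c` is a hypothetical helper name.

```python
import math

C_TRUE = 3.0e8  # m/s; used only to generate the synthetic data below

masses = [0.5, 1.0, 2.0, 4.0]                   # kg, made-up sample values
energies = [m * C_TRUE ** 2 for m in masses]    # J, synthetic and noise-free

def fit_c(masses, energies):
    """Least-squares fit of k in E = k * m, then c = sqrt(k)."""
    k = sum(m * e for m, e in zip(masses, energies)) / sum(m * m for m in masses)
    return math.sqrt(k)

c_all = fit_c(masses, energies)          # fit on all four samples
c_one = fit_c(masses[:1], energies[:1])  # fit on a single sample

print(c_all, c_one)
```

Because the structure is already correct and the data carry no noise, a single (m, E) pair recovers the same c as the full data set, which is the point the text makes about needing only a small amount of data.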

By contrast, the expectation of strong AI is that we feed the machine a pile of data, such as energy, mass, temperature, volume, time and speed, and the machine tells us that energy depends only on mass, that the relationship is E = m * c^2, and that the constant c is 3.0 * 10^8 (m/s). Here the machine has to learn not only the parameters of the model but also its structure. To achieve this, the first step is to find a generalized model that, after processing, can describe every model structure in the world, just as plasticine can be squeezed into any shape. In the field of AI this plasticine is the neural network, which is why many theoretical AI books and courses like to begin by demonstrating the expressive power of neural networks, to prove that they are the plasticine of AI.

Now that we have the plasticine, the next step is to knead it, and that is the hard part. Not every problem and scenario in life can be perfectly represented by a mathematical model. Even when it can, nobody knows what the model looks like before its structure has been discovered. How can you ask the machine to knead something whose shape you do not know yourself? The only way is to feed the machine many examples and tell it that whatever it kneads should be able to walk, fly and so on. This problem has no unique solution: the machine may knead a bird for you, or a cockroach. There are two reasons. First, you cannot feed every possible example to the machine; there will always be “black swan” events. Second, when there are too many examples, the demand on the machine’s computing power is also high. This is why neural networks were proposed long ago but have become popular only in recent years.

After the above discussion, I hope you now have a more intuitive understanding of model structure and model parameters in machine learning. If the model structure is designed by human intelligence and the parameters are then learned by the machine, then, provided the structure is accurate, as with E = m * c^2 above, we only need to feed the machine a little data, and the model may even have a neat analytical solution! But there is only one Einstein, so ordinary people like us mostly feed the machine a large number of examples and hope it can knead out a shape that we ourselves do not know, let alone enjoy the beautiful property of an analytical solution.

Readers familiar with the training process will know that stochastic gradient descent, the method machine learning uses to knead the plasticine, works like this: knead a little, check whether the result meets your requirements (that is, the training data), and if it does not, knead a little more, repeating until it does. So the machine’s only motivation to keep kneading is that what it has kneaded does not yet meet your requirements. This shows that the machine is a very “lazy” thing: once it has described the apple price rule in 20 words, it has no motivation to describe it in 10. Therefore, when you do not know what you want the machine to knead, it is better not to give it too much freedom. Otherwise it will knead something very complicated that meets your requirements but does not work well, because it violates the razor rule.

In machine learning models, more parameters often mean more freedom and a more complex structure, which leads to “overfitting”. Therefore many classic network structures use some technique for “parameter sharing” in order to reduce the number of parameters. For example, CNNs use convolution to share parameters, and LSTM introduces a matrix C that changes less drastically to achieve parameter sharing.
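The “knead a little, check, knead again” loop can be sketched with the apple example, using stochastic gradient descent on a model with a single parameter (the price per apple). This is a minimal sketch: the learning rate, tolerance and function name `sgd_fit_price` are assumptions made here for illustration, not values from the article.

```python
import random

# Apple examples from the article: n apples cost 5 * n yuan.
examples = [(1, 5.0), (2, 10.0), (3, 15.0)]

def sgd_fit_price(examples, lr=0.05, tol=1e-6, max_steps=10_000, seed=0):
    """Learn w in cost = w * n by stochastic gradient descent."""
    rng = random.Random(seed)
    w = 0.0                                # the single model parameter
    for step in range(max_steps):
        n, cost = rng.choice(examples)     # look at one random example
        grad = 2 * (w * n - cost) * n      # gradient of (w * n - cost) ** 2
        w -= lr * grad                     # knead a little
        # Stop as soon as every example is satisfied: no further motivation.
        if all(abs(w * k - c) < tol for k, c in examples):
            return w, step + 1
    return w, max_steps

price, steps_used = sgd_fit_price(examples)
print(price, steps_used)
```

With one parameter the model has no room to describe rain or time of day, so the only thing it can learn is the price rule itself.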

## 2. LSTM is waiting for you

After the discussion in the previous section, I believe you have picked up some routines for analyzing machine learning. Finally, let us practice on the bidirectional LSTM.

As shown in the figure above, first look at the input and output of the LSTM network. The most obvious input is the 32 purple matrices of size 27 * 384, and the output is 32 matrices of size 27 * 256, where each 27 * 256 matrix is made up of two 27 * 128 matrices, output by the forward LSTM and the reverse LSTM respectively. For simplicity, let us look only at the forward LSTM for now. In that case the input is 32 matrices of size 27 * 384 and the output is 32 matrices of size 27 * 128. According to the “dimension reducer” routine analyzed above, 32 matrices of size 384 * 128 would be needed. According to the “parameter sharing” routine, the structure of a real single LSTM unit is shown in the following figure:

It can be seen from the figure that a real LSTM unit is not a simple 384 * 128 matrix A. Instead, the output node H of the previous unit in the LSTM sequence is pulled down and concatenated with the input X to form a 27 * 512 input. This input is multiplied by a 512 * 512 parameter matrix, and the result is then processed together with the control node C output by the previous unit, reducing the 512 dimensions to 128. The unit finally produces two outputs: the new output H of size 27 * 128 and the new control node C of size 27 * 128. The new H and C are passed on to the next LSTM unit and influence its output.
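The shapes described above match a standard LSTM step, which can be sketched in NumPy as follows. The text does not spell out the gate equations, so the usual input/forget/candidate/output formulation and the gate ordering used here are assumptions; the weights are random placeholders and `lstm_step` is a hypothetical name.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def lstm_step(x, h_prev, c_prev, W, b):
    """One LSTM unit: (27, 384) input plus previous H and C -> new H and C."""
    xh = np.concatenate([x, h_prev], axis=1)     # (27, 384 + 128) = (27, 512)
    gates = xh @ W + b                           # (27, 512)
    i, f, g, o = np.split(gates, 4, axis=1)      # four (27, 128) blocks
    c_new = sigmoid(f) * c_prev + sigmoid(i) * np.tanh(g)   # new control node C
    h_new = sigmoid(o) * np.tanh(c_new)                     # new output H
    return h_new, c_new

rng = np.random.default_rng(0)
x = rng.standard_normal((27, 384))               # one sequence step of CNN output
h0 = np.zeros((27, 128))
c0 = np.zeros((27, 128))
W = rng.standard_normal((512, 512)) * 0.05       # the 512 * 512 parameter matrix
b = np.zeros(512)

h1, c1 = lstm_step(x, h0, c0, W, b)
print(h1.shape, c1.shape)
```

The 512 columns of the product are split into four blocks of 128, which is how 512 dimensions end up as the 128-dimensional H and C.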

Here it can be seen that, because of the matrices C and H, even if the 512 * 512 parameter matrices in the 32 units of the LSTM sequence are exactly the same, the relationship between H and X differs from unit to unit. Yet because every unit multiplies by the same 512 * 512 matrix, these relationships should still be fairly similar, since they all follow one set of rules (the same 512 * 512 matrix). So LSTM takes the previous unit’s output H together with the input X as its input and introduces the control matrix C, and in this way shares parameters and simplifies the model. This network structure also ties the output of the current sequence unit to the output of the previous one, which makes it suitable for modeling sequence scenarios such as OCR, NLP, machine translation and speech recognition.
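Sharing one 512 * 512 matrix across all 32 units can be sketched by unrolling the forward LSTM in a loop. As before, this is a minimal sketch with random placeholder weights and an assumed standard gate formulation; only the sizes come from the text.

```python
import numpy as np

rng = np.random.default_rng(0)
T, N, D, H = 32, 27, 384, 128      # steps, images, input width, hidden width

xs = rng.standard_normal((T, N, D))              # placeholder for the CNN output
W = rng.standard_normal((D + H, 4 * H)) * 0.05   # ONE 512 * 512 matrix for all steps
b = np.zeros(4 * H)

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

h = np.zeros((N, H))
c = np.zeros((N, H))
outputs = []
for x in xs:                       # the same W at every step; only x, h, c change
    gates = np.concatenate([x, h], axis=1) @ W + b
    i, f, g, o = np.split(gates, 4, axis=1)
    c = sigmoid(f) * c + sigmoid(i) * np.tanh(g)
    h = sigmoid(o) * np.tanh(c)
    outputs.append(h)

forward = np.stack(outputs)                      # (32, 27, 128)
shared_params = W.size + b.size                  # 512 * 512 + 512
unshared_params = T * shared_params              # if every step owned its own copy

print(forward.shape, shared_params, unshared_params)
```

Every step produces a different output even though the weights are identical, because H and C carry the history forward. A reverse LSTM would run the same loop over `xs[::-1]` with its own weights, and concatenating the two results along the last axis would give the 27 * 256 outputs mentioned earlier.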

Here we can also see that although the neural network is a piece of plasticine, in current AI applications we still design the network structure carefully according to the actual scenario and our own prior knowledge, because we cannot feed it all the data, because the machine’s computing power cannot support it, or because we want the machine to learn faster. We then hand this plasticine, already almost fully shaped by humans, to the machine. That is why I prefer to call this an era of weak AI. In this era of modest computing power, researchers on the one hand keep studying general-purpose networks for different scenarios, such as CNNs for images and RNN, LSTM and GRU for sequences. On the other hand they keep optimizing the way neural networks learn, for example with the various optimized variants of SGD and the training methods of reinforcement learning. All of these efforts aim to reduce the computing resources that AI requires.

I believe that with the efforts of human beings, the era of strong AI will come.
