The History of Deep Learning Classification Networks

Copyright notice: This article was originally written by Colin CAI. Reposting is welcome, but you must credit the original source.

  Author: window

  QQ / wechat: 6679072

  E-mail:[email protected]

  Artificial intelligence (AI) is one of the most cutting-edge fields in IT, and deep learning is the hottest direction within AI. Deep learning refers to deep neural networks, so called because a network can express richer meaning once it becomes deep, and the basic problem of neural networks is classification. This article starts from neural networks and introduces the development history of deep learning classification networks and the techniques used in them.

Artificial intelligence has been studied for a long time. It was once divided into three schools: symbolism, behaviorism, and connectionism. Symbolists believe symbolic deduction is the essence of AI, and LISP/Prolog are masterpieces of that school; behaviorists are more inclined to analyze behavior, as in genetic evolution models, artificial ant colonies, particle swarm optimization, and reinforcement learning (artificial ant colony methods are in fact a special case of reinforcement learning). Since this article is about deep learning, it naturally covers the research results of connectionism. At present connectionism is indeed the best developed of the three, but the AI of the future may well be a confluence of all three schools.


  Neural networks


  At the beginning of the last century, people discovered and began to study brain neurons.


   We know that neurons transmit information through dendrites and axons, forming networks like the following.



  The history of neural networks goes back a long way. As early as 1943, McCulloch and Pitts proposed the concept of a neural network based on the biology of brain neurons, marking the beginning of this bionic approach.

A single neuron in a neural network is treated as a function of several inputs, where each input and the output are scalars. For generality and ease of computation, the mainstream model considers neurons of the following form:



  Here sum is the weighted sum of the inputs, and f is called the activation function. Each weight represents how much the corresponding input signal participates, while b is a bias term. Readers familiar with neural networks will immediately recognize that training a neural network means training the weights and biases of each neuron.
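As an illustration, here is a minimal NumPy sketch of such a neuron; the weights and bias are hypothetical values, not trained ones:

```python
import numpy as np

def sigmoid(x):
    # Smooth, S-shaped activation; output lies in (0, 1)
    return 1.0 / (1.0 + np.exp(-x))

def neuron(inputs, weights, bias):
    # Weighted sum of the inputs plus bias, passed through the activation
    return sigmoid(np.dot(weights, inputs) + bias)

x = np.array([1.0, 0.5, -0.5])   # input signals
w = np.array([0.2, -0.4, 0.6])   # hypothetical weights
b = 0.1                          # hypothetical bias
y = neuron(x, w, b)              # a scalar in (0, 1)
```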


  The activation function is also very important. The weighted sum itself is linear, and if all neurons were linear, a gradual change in the input signals would only produce a gradual change in the final output. Such smooth, gradual behavior does not meet our actual needs: what we generally need is not gradual change but abrupt change, such as a step function like this:


  However, a step function is discontinuous, and continuity is very important for neural networks (as will be explained later); discontinuities of the first and second kind are often disastrous. Therefore, we usually use other functions to approximate the step-like "transition". For example, the sigmoid and tanh functions have graphs similar to the following:



  The width of the intermediate transition zone can be adjusted through the weights.


A neural network is composed of multiple neurons; the following is a single-layer neural network:

  Where there are single layers there are naturally multiple layers; the following is a multilayer network:


   There are even a variety of classical neural network topologies:


   Learning in neural networks


For a neural network to be useful, it has to learn. Here we consider only supervised learning.

Supervised learning is like "cramming": the purpose is to generalize from examples. The supervisor prepares n questions in advance together with n standard answers, teaches those n questions and answers to the AI model (by whatever teaching method), and then stops. The hope is that the model can then solve problems beyond the original n.

To make learning possible, a neural network defines, for each training example, a distance between its inference result and the standard answer, called the loss. This is quite different from other AI models (such as KNN or decision trees).

The MSE loss function is commonly used; it comes from the least-squares method of curve fitting. It is the average of the squared differences between the network's actual inference result and the standard answer.


The 1/2n factor performs the averaging; since it is a constant for a given network, it can also be dropped.

Cross entropy is also very common:
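A minimal NumPy sketch of both losses; the predictions and one-hot labels are made-up values for illustration:

```python
import numpy as np

def mse_loss(pred, target):
    # Mean of squared differences; the extra 1/2 factor is often folded in
    return np.mean((pred - target) ** 2) / 2.0

def cross_entropy_loss(pred, target):
    # pred: predicted class probabilities, target: one-hot labels
    eps = 1e-12  # avoid log(0)
    return -np.sum(target * np.log(pred + eps)) / pred.shape[0]

pred = np.array([[0.7, 0.2, 0.1],
                 [0.1, 0.8, 0.1]])   # hypothetical softmax outputs
target = np.array([[1.0, 0.0, 0.0],
                   [0.0, 1.0, 0.0]]) # one-hot standard answers
mse = mse_loss(pred, target)
ce = cross_entropy_loss(pred, target)
```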


The purpose of our training is to make the loss as small as possible, so that the inference result is closer to the actual expected result.

So we think of loss as a function of all parameters to be trained,


For the loss to reach a minimum, at the minimizing point every partial derivative must vanish:

  $\frac{\partial F}{\partial w_{m}}=0$

Early on, people devised the gradient descent algorithm: compute the gradient

  $(\frac{\partial F}{\partial w_{0}},\frac{\partial F}{\partial w_{1}},\dots,\frac{\partial F}{\partial w_{n}})$

and adjust the parameters in the opposite direction of the gradient, descending toward a point where the gradient is zero.
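A minimal gradient descent sketch on a hypothetical two-parameter loss F(w) = (w0 - 3)^2 + (w1 + 1)^2, whose gradient is known in closed form:

```python
import numpy as np

def gradient_descent(grad_fn, w0, lr=0.1, steps=100):
    # Repeatedly move the parameters opposite to the gradient
    w = np.array(w0, dtype=float)
    for _ in range(steps):
        w -= lr * grad_fn(w)
    return w

# Hypothetical loss F(w) = (w0-3)^2 + (w1+1)^2, minimized at (3, -1);
# its gradient is 2*(w - [3, -1])
grad = lambda w: 2.0 * (w - np.array([3.0, -1.0]))
w_min = gradient_descent(grad, [0.0, 0.0])  # converges near (3, -1)
```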




Because a neural network is layered, the loss, as a function of the weights, is actually a composition of many functions, so differentiation obeys the chain rule. The gradient component for each weight can be computed from back to front along the chain rule, which is why this procedure is called backpropagation.
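As a sketch, here is backpropagation through a tiny one-hidden-layer network with sigmoid activations and MSE loss (all weights are hypothetical), with a finite-difference check of one gradient component:

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

# Hypothetical tiny network: 2 inputs, 2 hidden units, 1 output
x = np.array([0.5, -0.2])
W1 = np.array([[0.1, 0.4], [-0.3, 0.2]])
W2 = np.array([0.7, -0.5])
target = 1.0

# Forward pass
h = sigmoid(W1 @ x)              # hidden activations
y = sigmoid(W2 @ h)              # network output
loss = 0.5 * (y - target) ** 2   # MSE loss for one sample

# Backward pass: chain rule applied layer by layer, back to front
dy = (y - target) * y * (1 - y)  # dL/d(pre-activation of output)
dW2 = dy * h                     # gradient for output weights
dh = dy * W2 * h * (1 - h)       # propagated to hidden pre-activations
dW1 = np.outer(dh, x)            # gradient for hidden weights

# Finite-difference check of one component of dW1
eps = 1e-6
W1p = W1.copy(); W1p[0, 0] += eps
num = (0.5 * (sigmoid(W2 @ sigmoid(W1p @ x)) - target) ** 2 - loss) / eps
# num should closely match dW1[0, 0]
```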


Convolutional neural networks



Let's take a look at the multilayer perceptron above. In each layer, every input is connected to every output; this is called full connection. Full connection has several problems:

(1) Full connection has too many parameters; the network easily becomes very heavy, which is not conducive to making the network deeper.
(2) Furthermore, full connection treats the image as an undifferentiated whole. But everyday experience, and traditional signal-processing-based image processing, tell us that image recognition often depends on local structure. In other words, recognition is usually based on local logic in the image rather than on an arbitrary pile of pixels.
(3) For the same reason, full connection is insensitive to the structure behind transformations such as translation and rotation; that is, it has neither translation invariance nor rotation invariance.


So we need a new method, and that method is convolution.



After introducing convolution, the network has far fewer connections than with full connection, and it gains translation invariance (though not rotation invariance). Since then, the convolutional neural network (CNN) has become the mainstream of neural networks.
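A minimal NumPy sketch of the 2-D convolution operation (strictly, cross-correlation, which is what deep learning frameworks compute), with a hypothetical horizontal edge kernel:

```python
import numpy as np

def conv2d(image, kernel):
    # "Valid" 2-D cross-correlation: slide the kernel over the image
    # and take the elementwise product-sum at each position
    kh, kw = kernel.shape
    oh = image.shape[0] - kh + 1
    ow = image.shape[1] - kw + 1
    out = np.zeros((oh, ow))
    for i in range(oh):
        for j in range(ow):
            out[i, j] = np.sum(image[i:i+kh, j:j+kw] * kernel)
    return out

img = np.arange(16, dtype=float).reshape(4, 4)  # toy 4x4 "image"
edge = np.array([[1.0, -1.0]])                  # hypothetical edge kernel
result = conv2d(img, edge)                      # responds to horizontal change
```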

In the chain rule used to differentiate through each layer, the number of product terms grows with the number of layers. With traditional activation functions, whose derivatives are close to 0 almost everywhere, the product easily vanishes; this is the vanishing gradient problem. Conversely, the derivative in the transition zone can be very large, which may lead to the opposite extreme, gradient explosion. Both are very harmful to training: learning is either too slow or jumps too far.

Instead, we use the following activation function, called ReLU:

There are also variants such as PReLU, Leaky ReLU, and so on.

In addition, downsampling operations such as pooling were introduced, which also give the network some degree of rotation invariance.




The above is max pooling. Average pooling is sometimes used as well, but note that max pooling is a nonlinear operation and is the more common choice.
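A minimal sketch of non-overlapping max pooling (size 2, stride 2, the LeNet-style configuration) on a hypothetical feature map:

```python
import numpy as np

def max_pool(feature_map, size=2, stride=2):
    # Non-overlapping max pooling: take the maximum of each window
    h = (feature_map.shape[0] - size) // stride + 1
    w = (feature_map.shape[1] - size) // stride + 1
    out = np.zeros((h, w))
    for i in range(h):
        for j in range(w):
            window = feature_map[i*stride:i*stride+size,
                                 j*stride:j*stride+size]
            out[i, j] = window.max()
    return out

fm = np.array([[1.0, 2.0, 5.0, 6.0],   # hypothetical 4x4 feature map
               [3.0, 4.0, 7.0, 8.0],
               [9.0, 1.0, 2.0, 3.0],
               [5.0, 6.0, 4.0, 0.0]])
pooled = max_pool(fm)  # halves each spatial dimension
```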



With this, the basic components of a convolutional neural network are in place: feature extraction layers built from convolution, activation functions, and pooling generate the necessary feature information, while one or more fully connected layers produce the logical output; these are called the logic layers.





As a classification system, the final output generally uses one-hot coding, which is more symmetrical. In other words, for N-way classification the final output of the network is an N-dimensional vector.
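One-hot coding can be sketched in a few lines; the label value here is just an example:

```python
import numpy as np

def one_hot(label, num_classes):
    # N-way classification: label k becomes an N-dimensional vector
    # with a 1 at position k and 0 everywhere else
    v = np.zeros(num_classes)
    v[label] = 1.0
    return v

encoded = one_hot(3, 10)  # e.g. the digit "3" among the 10 digit classes
```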








LeNet is the classic network for early handwriting recognition. It uses 5×5 convolution kernels, and each convolution is followed by a 2×2 non-overlapping max pooling.

In addition, the C3 layer connects each of its 16 output maps to only a subset of the 14×14 input feature maps. The paper describes the benefits of this approach:

(1) Parameter reduction

(2) The asymmetric structure provides a variety of combined features

By the way, the famous MNIST dataset consists of handwritten samples of the ten digits 0-9 and is generally used by beginners.


Although it works well for English handwriting recognition, the network is not suited to the new applications that were emerging.

Extending LeNet is not easy, and new improvements were urgently needed to adapt to those emerging applications.

The ImageNet ILSVRC competition, which started in 2010, also spurred the teams on.

As a result, deep learning technology began to advance.




AlexNet was the ILSVRC champion in 2012. It is a network with many parameters: the first convolutional layer uses 11×11 kernels, and the fully connected layers behind it are very large. It introduced many innovations:

(1) It was the first of the classic networks to use ReLU as the activation function, to resist vanishing gradients.

(2) The feature extraction layers are split into two independent groups that can run in parallel; the authors used two GPUs working in parallel for feature extraction.

(3) It adopted local response normalization (LRN), inspired by real neural networks. The idea of normalization is to use neighboring input tensors to adjust the current one. Before normalization, end-to-end BP training was not even realistic for deeper networks; the earliest deep models were trained layer by layer. The normalization idea made whole-network BP training a reality.


Although the authors of VGG later found that LRN seemed to have no effect in VGG, and LRN itself is now rarely used and has essentially been abandoned, the idea of normalization has endured.

(4) Unlike LeNet's 2×2 stride-2 max pooling, AlexNet introduced overlapping pooling: max pooling of size 3×3 with stride 2. Although the output size is essentially the same, experiments showed better results.

(5) The dropout mechanism was introduced: some neural connections in the logic layers are randomly discarded during training to reduce overfitting.
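Dropout can be sketched as follows. Note this uses the "inverted dropout" scaling that is the modern convention (survivors are scaled up during training), which differs in bookkeeping from AlexNet's original test-time scaling but has the same effect:

```python
import numpy as np

def dropout(activations, p=0.5, training=True, rng=None):
    # During training, zero each activation with probability p and scale
    # survivors by 1/(1-p) so the expected value is unchanged.
    # At inference time, pass activations through untouched.
    if not training:
        return activations
    if rng is None:
        rng = np.random.default_rng(0)  # fixed seed for reproducibility
    mask = rng.random(activations.shape) >= p
    return activations * mask / (1.0 - p)

a = np.ones(1000)
dropped = dropout(a, p=0.5)  # roughly half zeros, half 2.0
```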



All of the above were essentially AlexNet's own inventions, and they had a great influence on later networks.




You read that right: this network was designed by Google. As a pioneer of AI, Google was bound to take part. And the capital L in "GoogLeNet" is not a typo; it is Google's tribute to the classic network LeNet.




The winner of ILSVRC 2014, GoogLeNet is a classic of modular design.

Its modularity lies in the following module, called Inception:



Image recognition involves the problem of receptive fields, so Inception applies 1×1, 3×3, and 5×5 convolutions, and even 3×3 max pooling, to the same feature map, extracting features at different views and scales, and then concatenates the results so that features from different scales sit side by side. This is a genuinely good idea.

Then, to reduce the number of parameters, 1×1 convolutions are used for dimensionality reduction, so the module becomes this:



The modular design makes the network easy to modify. GoogLeNet uses average pooling instead of full connection, but for convenient fine-tuning it still provides a fully connected layer at the end. Dropout is still used in this network. To keep gradients from vanishing, two auxiliary softmax outputs are introduced in the middle of the network, which is why the network diagram above shows two extra outputs.

The idea of normalization continued: batch normalization was created in Inception v2 to replace the earlier LRN.



N input tensors (images) are grouped into what is called a mini-batch, and the values at the same position across all tensors in the batch are used to normalize each tensor at that position uniformly.

BN became a standard component of neural networks and has been used ever since.


The next version, Inception v3, continued to find ways to reduce parameters and shrink the network.


The figure above shows the principle of replacing one 5×5 convolution kernel with two 3×3 kernels: two stacked 3×3 convolutions cover the same receptive field as one 5×5.
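The parameter saving is easy to verify with a little arithmetic; the channel count here is a hypothetical example:

```python
# Per-output-channel parameter count, ignoring biases, for C input channels
C = 64  # hypothetical number of input channels

params_5x5 = 5 * 5 * C            # one 5x5 convolution
params_two_3x3 = 2 * (3 * 3 * C)  # two stacked 3x3 convolutions

# Two 3x3 layers cover the same 5x5 receptive field with 28% fewer
# parameters, plus an extra nonlinearity between them
savings = 1 - params_two_3x3 / params_5x5
```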




After this attempt, convolutional neural networks basically settled on small 3×3 kernels and no longer use large ones.

Even this was not enough, and parameters were reduced further.


To reduce the parameter count further, 1×3 and 3×1 convolution kernels are stacked to cover the receptive field of a 3×3.




A dual-path structure is provided to reduce the feature map size:





One branch is convolved, the other is pooled, and the results are concatenated.

Later, Inception got a fourth version. Although it solidified the module designs, it did not bring as many merits as the previous three versions.




Since this is deep learning, networks naturally kept moving in the direction of depth; only by becoming deeper can a network gain more flexible modeling ability. Many of the earlier measures, such as ReLU, normalization, dropout, and splitting into small kernels, were intended either to prevent gradient anomalies or to combat overfitting and excessive parameters; ultimately they all aimed to deepen the network in hope of better results.

ResNet was the first network to truly make deep learning worthy of its name.

Inspired by residual analysis, ResNet introduces the concept of the residual: the feature map obtained after several layers is summed with the earlier feature map, as shown in the figure:




ResNet is still a modular design. The weight layers above are convolutional layers, with ReLU introducing nonlinearity into F(x). Making a residual of a single convolutional layer would be meaningless, because it would be equivalent to a single convolution.
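A residual block can be sketched as follows. For simplicity the two weight layers are small dense matrices here rather than convolutions, and all values are hypothetical; the structure F(x) + x is what matters:

```python
import numpy as np

def relu(x):
    return np.maximum(x, 0.0)

def residual_block(x, W1, W2):
    # F(x) computed by two weight layers with a ReLU between them
    # (dense layers here for simplicity; convolutions in real ResNet),
    # then added to the identity shortcut x
    fx = W2 @ relu(W1 @ x)
    return relu(fx + x)

x = np.array([1.0, -2.0, 3.0])
W1 = np.eye(3) * 0.1  # hypothetical small weights
W2 = np.eye(3) * 0.1
y = residual_block(x, W1, W2)  # output stays close to x plus a small F(x)
```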

Of course, deeper residual modules are also possible; the addition can span more layers.





Why does this enable truly deep learning? To put it bluntly, the most fundamental need is to combat vanishing gradients so that the network can become deeper and remain learnable.

The red 1 in the chain-rule derivation above is what prevents the gradient from vanishing: since the derivative of F(x) + x with respect to x is F'(x) + 1, the constant 1 keeps the product of chain-rule factors from shrinking toward zero.

Introducing this structure greatly delays the onset of vanishing gradients, allowing networks to go much deeper. From then on, explosive growth began.




As a result, networks of 100+ layers are no longer a big problem.

ResNet v2 makes some module changes compared with v1:


However, these are not essential changes. ResNet's residual method has been adopted by later generations of networks, and from then on neural networks entered the era of deep learning.


  Directions of network development


With the development of embedded systems, many AI applications can now be deployed on embedded devices. Because of the resource limitations of such devices, it is urgent to push networks toward lightweight computation and data.

Even beyond embedded devices, hardware resources are finite and computing speed is always a goal; we have already seen GoogLeNet's efforts at parameter reduction.

Later lightweight networks aim to "hollow out" the connections, moving toward fewer parameters, while at the same time making the network graph more complex, which gives the network more expressive power.



The above is a small network for object detection. Its connectivity is clearly much stranger than a linear network's: although it has fewer parameters, the network is more complex.




Inspired by ResNet, DenseNet proposes the dense block module, in which every pair of layers within a dense block is connected, giving denser residual connections than ResNet.

Here is a dense block:



The whole network is composed of several dense blocks together with other convolution, pooling, and fully connected layers:





SqueezeNet introduces a module called the Fire module, which uses small convolutions, even 1×1 convolutions.



Also inspired by ResNet, it introduces a bypass structure that is essentially ResNet's residual block:






MobileNet transforms convolution and introduces depthwise convolution.

In standard convolution, every source feature map participates in each convolution kernel, so the computation is heavy.




Depthwise convolution looks like this:



The convolution kernels are thus smaller: each feature map alone generates one output feature map, so the convolution does not change the number of feature maps, and different feature maps do not interact directly during the convolution.

To change the number of feature maps, a pointwise convolution, i.e. a convolution with 1×1 kernels, is added afterwards.
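The cost saving of the depthwise + pointwise pair over standard convolution is easy to count; the feature map and channel sizes below are hypothetical:

```python
# Multiplication counts for one layer, with hypothetical sizes
H, W = 32, 32        # output feature map size
Cin, Cout = 64, 128  # input / output channels
K = 3                # kernel size

standard = H * W * K * K * Cin * Cout     # full KxK convolution
depthwise = H * W * K * K * Cin           # one KxK filter per input channel
pointwise = H * W * 1 * 1 * Cin * Cout    # 1x1 conv mixes the channels
separable = depthwise + pointwise

# The ratio works out to exactly 1/Cout + 1/K^2 (about 12% here)
ratio = separable / standard
```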

Thanks to depthwise convolution, MobileNet achieves fewer parameters. In essence, depthwise convolution is the extreme form of group convolution. In group convolution, the incoming feature maps are divided into groups, each group is processed as an ordinary convolution, and the outputs are concatenated. Group convolution was not a new concept even then: recall AlexNet, whose two parallel paths are in fact a group convolution.


Even LeNet's C3 layer already used group convolution.


MobileNet is widely used on embedded systems in various settings, hence the name.







ShuffleNet also adopts depthwise convolution and group convolution.




It is called ShuffleNet because the network shuffles channels in the middle, repacking each group of feature maps before the next round of group convolution:



Where MobileNet used 1×1 convolutions to mix information across feature maps, ShuffleNet simply uses the shuffle to let information permeate between groups, achieving a similar effect without the extra computation.
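The channel shuffle itself is just a reshape and transpose; the tiny tensor below is a made-up example with labelled channels:

```python
import numpy as np

def channel_shuffle(x, groups):
    # x: (channels, H, W). Reshape to (groups, channels_per_group, H, W),
    # swap the first two axes, and flatten back, so the next grouped
    # convolution sees channels drawn from every previous group.
    c, h, w = x.shape
    x = x.reshape(groups, c // groups, h, w)
    x = x.transpose(1, 0, 2, 3)
    return x.reshape(c, h, w)

# Six 1x1 channels labelled 0..5; with 2 groups, [0,1,2 | 3,4,5]
# interleaves to [0,3,1,4,2,5]
x = np.arange(6, dtype=float).reshape(6, 1, 1)
shuffled = channel_shuffle(x, groups=2)
```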




Techniques from early network development continue to be inherited by later networks, while inefficient techniques are eliminated or modified.

Deep learning is very popular at present. What will AI technology look like in the future? Are neural networks our ultimate AI model? Will symbolism, behaviorism, and connectionism eventually merge thoroughly? Let's wait and see.