“Reinforcement learning, frankly speaking, is establishing a mapping from distribution to distribution”? A personal take from a mathematical perspective


Introduction: Senior F is one of the most important mentors I have had, both in mathematical modeling competitions and in learning how to do research. He was admitted to Tsinghua University last year, and, coincidentally, his research direction is also reinforcement learning. During one of our long discussions, he said something that left a deep impression on me: “Reinforcement learning, frankly speaking, is establishing a mapping from distribution to distribution.” I had never heard this point of view before, and it made me sit down and rethink the mathematical assumptions behind reinforcement learning. In this post I analyze that view.

The structure of this article

Let me start with the conclusion: I agree with this view. To argue for it, I will begin with the easiest case to understand, supervised learning. Starting from the consensus that the essence of a classification problem is fitting the distributions of the different classes of data, I will propose the view that “this kind of learning system looks different from the micro and the macro perspective”, and then extend the argument to reinforcement learning.


  • Micro and macro perspectives of deep learning
  • The micro and macro perspectives of reinforcement learning

Micro and macro perspectives of deep learning

Listening to Professor Li’s deep learning course, probably the most frequently heard word is “distribution”: it shows up in the assumptions behind the basic derivations as well as in every branch technique, such as GANs and adversarial-attack models. Deep learning practitioners carry a statistical principle in their heads: the differences that matter show up in the population, not in the individual. In other words, the distribution of the data matters far more than what any single piece of data looks like.

At the beginning of the course, in the lecture Classification: Probabilistic Generative Model, Professor Li proves that when we do classification we are actually fitting the parameters of the class distributions; it is only after some derivation that we arrive at the shortcut of fitting the neural network’s parameters directly and conveniently, without worrying about the distribution parameters. The specific explanation is as follows.

When we solve a binary classification problem, for a newly arrived data point $x$, the probability $p(C_1 \mid x)$ that it belongs to class $C_1$ can be written using Bayes’ rule. With a few simple manipulations this formula turns into a sigmoid function:

$$p(C_1 \mid x) = \sigma(z), \qquad z = \ln \frac{p(x \mid C_1)\,p(C_1)}{p(x \mid C_2)\,p(C_2)}$$

So to know the probability that $x$ belongs to class $C_1$, we only need to know the value of $z$.

Starting from two assumptions, namely (1) each class of data follows some distribution (a Gaussian is used as the example) and (2) the two classes share the same covariance matrix, one can derive that $z$ reduces to the form $z = w^{T} x + b$. The vector $w$ and the scalar $b$ are functions of the constants in the Gaussians, such as the means and the covariance, but that is not important: as long as we know $w$ and $b$, we achieve the goal of “knowing the probability that $x$ belongs to class $C_1$”. The means and covariance could be obtained by maximum likelihood, but compared with fitting $w$ and $b$ directly with a neural network, maximum likelihood has drawbacks: it is not an iterative algorithm (high memory requirements, inflexible), and it requires choosing the correct distribution model in the first place.

So the neural-network parameters $w$ and $b$ we obtain are actually functions of the distribution parameters: $w = f_1(\text{distribution parameters})$ and $b = f_2(\text{distribution parameters})$. From $w$ and $b$ we can hardly recover the distribution parameters themselves, and we don’t need to.
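The derivation above can be checked numerically. Below is a minimal sketch (my own illustration, with invented numbers, not taken from the course): for two Gaussian classes sharing a covariance, the exact Bayes posterior $p(C_1 \mid x)$ coincides with $\sigma(w^T x + b)$, where $w$ and $b$ are computed purely from the distribution parameters.

```python
import numpy as np

# Two Gaussian classes with a shared covariance (all numbers invented)
mu1, mu2 = np.array([2.0, 0.0]), np.array([-2.0, 0.0])
sigma = np.array([[1.0, 0.3],
                  [0.3, 1.0]])          # shared covariance matrix
p1 = p2 = 0.5                           # class priors
sigma_inv = np.linalg.inv(sigma)

# w and b are functions of the distribution parameters: w = f1(...), b = f2(...)
w = sigma_inv @ (mu1 - mu2)
b = (-0.5 * mu1 @ sigma_inv @ mu1
     + 0.5 * mu2 @ sigma_inv @ mu2
     + np.log(p1 / p2))

def gauss_pdf(x, mu):
    """Density of N(mu, sigma) at x."""
    d = x - mu
    norm = np.sqrt((2 * np.pi) ** len(mu) * np.linalg.det(sigma))
    return np.exp(-0.5 * d @ sigma_inv @ d) / norm

def posterior_bayes(x):
    """p(C1 | x) computed directly from Bayes' rule."""
    a1, a2 = p1 * gauss_pdf(x, mu1), p2 * gauss_pdf(x, mu2)
    return a1 / (a1 + a2)

def posterior_sigmoid(x):
    """p(C1 | x) computed as sigmoid(w.x + b)."""
    return 1.0 / (1.0 + np.exp(-(w @ x + b)))

# The two computations agree to machine precision at any test point
x = np.array([0.7, -1.2])
```

Note how the code never needs the Gaussian densities once $w$ and $b$ are in hand, which is exactly why fitting them directly is enough.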

This is why I say this kind of learning system looks different from the macro and the micro perspective.

Micro perspective

Treat the deep learning model / neural network as a system, and take a cat-vs-dog classifier as an example:

  • The input to this system is a single piece of data (e.g., the RGB matrix corresponding to one image)
  • The output of this system is a pair of probabilities, e.g., 0.72 that the picture is a cat and 0.28 that it is a dog

This is the micro perspective: the neural network behaves like a function, which is the way most people find easy to understand.
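As a tiny sketch of this micro view (the shapes and weights here are invented by me, and the “network” is deliberately minimal), the whole system really is just a function from one data point to class probabilities:

```python
import numpy as np

rng = np.random.default_rng(1)
# A made-up, untrained linear "network" for 32x32 RGB images, 2 classes
W = rng.normal(scale=0.01, size=(2, 32 * 32 * 3))
b = np.zeros(2)

def classify(image):
    """One image in, one probability vector out: [p(cat), p(dog)]."""
    z = W @ image.reshape(-1) + b
    e = np.exp(z - z.max())             # numerically stable softmax
    return e / e.sum()

probs = classify(rng.random((32, 32, 3)))
# After training, probs might read e.g. [0.72, 0.28]; untrained, it is near uniform
```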

Macro perspective

With the mathematical groundwork above, from a macro point of view this system is more than just “input in, output out”.

Again take the cat-vs-dog classifier as an example:

  • We feed data (pictures) into the neural network so that it can learn
  • Its “learning” manifests as changes in the neural network’s parameters
  • We know that, given a pile of data, we could pick a distribution and fit that distribution’s parameters to the data

    • The neural network skips the step of choosing a distribution (neural networks have been shown to be expressive enough to fit essentially any distribution / function)
    • The neural network’s parameters keep changing, and those parameters are really $f(\text{distribution parameters})$

So, from the macro perspective, we can understand this learning system as follows:

  • We keep feeding the neural network large amounts of data, really in order to give it an ever more accurate “feeling” for the data distribution
  • But this “feeling” means the network never knows the concrete distribution parameters; it just gets better and better at the task
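This macro claim, that the trained parameters are a function of the distribution parameters, can be illustrated with a sketch I put together under assumed toy settings: train a one-layer “network” (logistic regression) by gradient descent on samples from two identity-covariance Gaussians, and the learned $w$ lines up with the closed-form $\Sigma^{-1}(\mu_1 - \mu_2)$, even though the network never sees $\mu$ or $\Sigma$ explicitly.

```python
import numpy as np

rng = np.random.default_rng(0)
mu1, mu2 = np.array([1.5, 0.0]), np.array([-1.5, 0.0])
n = 4000
# Samples from two Gaussian classes with identity covariance (toy settings)
X = np.vstack([rng.normal(mu1, 1.0, size=(n, 2)),
               rng.normal(mu2, 1.0, size=(n, 2))])
y = np.concatenate([np.ones(n), np.zeros(n)])

# Fit w, b directly by gradient descent on the cross-entropy loss
w, b = np.zeros(2), 0.0
for _ in range(2000):
    p = 1.0 / (1.0 + np.exp(-(X @ w + b)))
    w -= 0.5 * (X.T @ (p - y)) / len(y)
    b -= 0.5 * np.mean(p - y)

# Closed-form direction implied by the distribution parameters:
# Sigma^-1 (mu1 - mu2) = mu1 - mu2 when Sigma = I
w_closed = mu1 - mu2
cos = w @ w_closed / (np.linalg.norm(w) * np.linalg.norm(w_closed))
# cos is very close to 1: the learned parameters track f(distribution parameters)
```

The network only ever sees individual samples, yet what it ends up encoding is a property of the population, which is the whole point of the macro view.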

The micro and macro perspectives of reinforcement learning

The same holds for reinforcement learning. My understanding is: the distribution that reinforcement learning learns is not the distribution the data itself is drawn from, but the distribution of transition probabilities from data to data.

Specifically, non-reinforcement learning fits the distribution $p(x_1)$ itself, whereas reinforcement learning fits the conditional distribution $p(x_2 \mid x_1)$.

What I want to express here is that the Markov process is one of the basic assumptions of reinforcement learning. In principle, every reinforcement learning test environment can be described by the state-transition matrix of a Markov process, and that is exactly the “data-to-data distribution” I mean.
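To make the “data-to-data distribution” concrete, here is a small sketch with an invented two-state Markov chain: from a rolled-out state sequence alone, counting consecutive pairs recovers the transition matrix $p(x_2 \mid x_1)$.

```python
import numpy as np

rng = np.random.default_rng(0)
# An invented two-state Markov chain: row s holds p(next state | state s)
P_true = np.array([[0.9, 0.1],
                   [0.4, 0.6]])

# Roll out a long state sequence from the chain
states = [0]
for _ in range(50_000):
    states.append(rng.choice(2, p=P_true[states[-1]]))

# Estimate the data-to-data distribution by counting consecutive pairs
counts = np.zeros((2, 2))
for s, s_next in zip(states[:-1], states[1:]):
    counts[s, s_next] += 1
P_est = counts / counts.sum(axis=1, keepdims=True)
# P_est is close to P_true: the sequence alone determines the transition matrix
```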

Micro perspective

If we treat the reinforcement learning agent as a system, take the “Jump Jump” mini-game as an example:

  • The input to this system is the current state (current position, target position)
  • The output of this system is an action (how far to jump)

Again, in the micro perspective the neural network behaves like a function.
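A bare-bones sketch of this micro view (the game dynamics and the coefficient are invented by me): the policy is just a function from state to action.

```python
def policy(state):
    """Micro view of a 'Jump Jump' agent: (current, target) -> jump distance."""
    current_pos, target_pos = state
    strength = 0.95                 # an invented 'learned' coefficient
    return strength * (target_pos - current_pos)

action = policy((0.0, 3.0))         # how far to jump from 0.0 toward 3.0
```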

Macro perspective

However, with the mathematical groundwork above, we know that:

  • We keep feeding the neural network data, and here the data come at least in pairs that capture context, such as $(x_1, x_2)$, $(x_2, x_3)$, so that it can learn
  • Its “learning” manifests as changes in the neural network’s parameters
  • The neural network’s parameters here are really $f(\text{state-transition matrix})$
  • At the macro level, the output of the neural network is a strategy based on its “feeling” for the state transitions (for example, in shooting practice: “at this distance I should probably shoot with about this much strength”, without being able to say which formula produced that number)

Similarly, it is almost impossible to recover the state-transition matrix from the neural network’s parameters.
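As a sketch of this one-way street (my own toy example, not from the original discussion), consider tabular Q-learning on an invented three-state chain. The agent only ever consumes (state, action, reward, next state) transitions; it ends up acting well with respect to the hidden transition matrix, yet that matrix cannot be read back out of the learned table.

```python
import numpy as np

rng = np.random.default_rng(0)
n_states, n_actions = 3, 2

def step(s, a):
    """Hidden environment: action 1 usually moves right; state 2 pays reward 1."""
    if a == 1 and rng.random() < 0.9:
        s_next = min(s + 1, n_states - 1)
    else:
        s_next = max(s - 1, 0)
    return s_next, float(s_next == n_states - 1)

# Tabular Q-learning from sampled transitions only (the matrix is never seen)
Q = np.zeros((n_states, n_actions))
s = 0
for _ in range(30_000):
    if rng.random() < 0.3:                      # epsilon-greedy exploration
        a = int(rng.integers(n_actions))
    else:
        a = int(np.argmax(Q[s]))
    s_next, r = step(s, a)
    Q[s, a] += 0.1 * (r + 0.9 * Q[s_next].max() - Q[s, a])
    s = s_next

# The greedy policy moves right in every state, matching the hidden dynamics,
# but the 0.9 transition probability cannot be read off from Q
```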


These thoughts came after I heard my senior’s view that “reinforcement learning, frankly speaking, is establishing a mapping from distribution to distribution”.

To sum up: like supervised learning, reinforcement learning establishes a mapping between distributions, but reinforcement learning is special in that

  • the distribution being mapped from is the distribution of the state-transition matrix
  • the distribution being mapped to is the distribution of the action strategy

Finally, let me explain the “distribution of the action strategy”, taking my shooting practice as an example:

  • I practiced many shots, obtaining “sequence data” and learning a “feeling” (really the distribution of the state-transition matrix)
  • With this “feeling”, I know better what to do in each situation (this is the distribution of the action strategy)

Thanks to my senior for his guidance. Although I don’t always agree with him (I was born to question...), his level really is high. I wish him success in his studies, and I hope you gain something from this post or raise questions of your own.

Thank you for reading to the end! I am Xiaopai, a computer technology enthusiast! If you think this article is good, you can tap “Looking” to support me! For criticism, suggestions, or cooperation, email me at [email protected] or follow my official account Piper's nest and reply “wechat” to add me on WeChat~