Off-policy reinforcement learning: gaining successful experience from failure, a case study of chasing girls | the mathematical meaning of the importance sampling ratio

Time: 2020-11-26

Introduction: My friend Xiaopai has dated many girls and is as steady as an old dog. The girls often wonder: Xiaopai knows how to eat, how to play, how to chat; he knows how to dress up, does reliable work, and carries himself well. Has he really never successfully chased a girl? If you ask me, it is nothing to make a fuss about: when I first met Xiaopai, he was still a "steel-straight" man. From the perspective of reinforcement learning, what Xiaopai has done in recent years is good off-policy learning, that is, learning from his own and other people's experiences of "chasing girls". Whether the outcome was success or failure, Xiaopai could learn something from it effectively. In this article, we take "chasing girls" as an example to discuss off-policy methods, and then extend to the mathematical meaning of the "importance sampling ratio" (mainly following the viewpoint of Professor Hung-yi Lee of National Taiwan University).

Table of contents:

  • Learning from other people’s experiences
  • Importance sampling ratio: correcting the bias of "taking things for granted"

Learning from other people’s experiences

In reinforcement learning, the control policy we want to learn / converge to should be optimal. Taking chasing girls as an example, Xiaopai has only one goal in mind: to pursue success in the best possible way.


Obviously, the policy we want to learn must be a "policy that leads to success". But here is the problem:

  • Xiaopai has never succeeded; he only has experiences of failure. What can he learn from them?
  • Other people's experiences, successful or not, cannot simply be copied by Xiaopai. What can he learn from them?

For reinforcement learning, the answer to both questions is yes.

In Sutton's classic book, the concept of "off-policy" is introduced for the first time in Section 5.5 of Chapter 5.


Although Chapter 5 only introduces the concept, "off-policy" learning is nearly indispensable in reinforcement learning practice, because:

  • In reinforcement learning, data can often only be obtained by interacting with the environment; because of this, data is expensive to acquire and scarce;
  • The simple, direct iteration methods before Section 5.5 of the book can only use the current control policy while improving that same policy (on-policy). This easily leaves some actions unexplored and never tried, and it means we cannot reuse past data or other people's data.


Xiaopai made an analogy for us:

  • On-policy: This attempt failed, so this method is no good. Fine, improve this method and try again next time!
  • Off-policy: The method I use this time is not necessarily the one I currently believe is best; or rather, no matter what method I use, I can learn from it, improve myself, and find my own best method. If others have experience, I can learn from that too!

As you can probably see, on-policy learning is a special case of off-policy learning: when designing an algorithm, if it meets the requirements of off-policy learning, it can certainly also learn in the on-policy setting.

And in practice, it is hard not to use off-policy methods (see the sketch after this list):

  • When interacting with the environment, we try not to always use the current optimal policy (the on-policy way of learning), because that would make us "cautious" and afraid to make creative attempts;
  • We want to reuse past data, and past data was generated under a policy different from the current policy.
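
To make this distinction concrete, here is a minimal sketch (my own illustration, not code from the original post) contrasting the two classic tabular updates: SARSA learns about the policy it actually executes (on-policy), while Q-learning executes an exploratory behavior policy but updates toward the greedy target policy (off-policy), so even "failed" exploratory data improves its estimate of the optimal policy. All names and hyperparameters here are assumed for illustration.

```python
import numpy as np

# Hypothetical tabular setting: Q[state, action] holds value estimates.
n_states, n_actions = 10, 4
alpha, gamma, epsilon = 0.1, 0.99, 0.2
Q = np.zeros((n_states, n_actions))

def epsilon_greedy(state):
    """Behavior policy: mostly greedy, sometimes exploratory."""
    if np.random.rand() < epsilon:
        return np.random.randint(n_actions)
    return int(np.argmax(Q[state]))

def sarsa_update(s, a, r, s_next, a_next):
    # On-policy: the target uses the action the behavior policy
    # actually takes next -- we evaluate the very policy we run.
    td_target = r + gamma * Q[s_next, a_next]
    Q[s, a] += alpha * (td_target - Q[s, a])

def q_learning_update(s, a, r, s_next):
    # Off-policy: the target uses the greedy action, regardless of
    # what the exploratory behavior policy actually does next, so
    # data gathered under any behavior policy still improves our
    # estimate of the optimal policy.
    td_target = r + gamma * np.max(Q[s_next])
    Q[s, a] += alpha * (td_target - Q[s, a])
```

Q-learning is the textbook example of what this post calls "finding my best method no matter which method generated the data".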

Importance sampling ratio: correcting the bias of "taking things for granted"

Under an off-policy setting, we cannot use the "take it for granted" iteration method, because it introduces bias in the mathematical theory and, in the end, gives poor learning results. To prevent this "taken for granted" bias, we need the importance sampling ratio formula to correct data collected under a policy different from the current one.

If you prefer rigorous mathematical derivation, check out Sutton's Reinforcement Learning: An Introduction (Second Edition). But to be honest, when I first studied this part in January this year, I did not understand off-policy learning or the concept of the importance sampling ratio very well.

[Figure: screenshot of my CSDN notes on this part]

As shown above, I found my notes on this part on CSDN. Looking at them now, they only recorded the effect; at the time I did not write down the "why".

Later I took Professor Hung-yi Lee's deep learning course, in which he covered some reinforcement learning. He did not even introduce the basic assumptions of the MDP, but some of his views on reinforcement learning caught my eye, especially the derivation of the importance sampling ratio that he gave before introducing PPO.

Here, we discuss the importance sampling ratio only from the angle of data sampling.

[Figure: derivation relating an expectation under p to an expectation under q]

After a simple derivation, we obtain the relation between sampling under p and sampling under q:

$$E_{x \sim p}[f(x)] = \int f(x)\,p(x)\,dx = \int f(x)\,\frac{p(x)}{q(x)}\,q(x)\,dx = E_{x \sim q}\!\left[f(x)\,\frac{p(x)}{q(x)}\right]$$

That is, from $E_{x \sim q}$ we can easily obtain $E_{x \sim p}$, and the fraction $\frac{p(x)}{q(x)}$ is our importance sampling ratio!
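
As a quick numerical check of this identity (a sketch of my own; the distributions p, q and the function f are made up for illustration), we can estimate E_{x~p}[f(x)] using only samples drawn from q, weighted by the ratio p(x)/q(x):

```python
import numpy as np
from scipy.stats import norm

rng = np.random.default_rng(0)

p = norm(loc=-1.0, scale=1.0)   # target distribution p(x)
q = norm(loc=1.0, scale=1.0)    # the distribution we can actually sample
f = lambda x: x                 # any function whose expectation we want

x = q.rvs(size=1_000_000, random_state=rng)  # samples from q only
w = p.pdf(x) / q.pdf(x)                      # importance sampling ratio

print(np.mean(f(x)))      # ~ +1: the naive average estimates E_q[f(x)]
print(np.mean(w * f(x)))  # ~ -1: the weighted average recovers E_p[f(x)]
```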

Below, we explain this in detail with a concrete example of two distributions.

[Figure: example distributions p(x) (blue line, high on the left), q(x) (green line, high on the right), and the function f(x) (red line)]

As shown in the figure above, the values of f(x) are drawn as the red line. We can see that if f(x) is sampled under p(x), the resulting expected value should be negative, because p(x) tends to sample on the left side of f(x)'s domain (the blue line in the figure is high on the left), where f(x) is negative.

However, at the moment we can only collect data under q(x), and q(x) tends to sample on the right side (the green line in the figure is high on the right), where f(x) is positive. Without the importance sampling ratio, we would mistakenly conclude that the expectation of f(x) sampled under p(x) is positive.

How do we eliminate this bias? Enough samples + the importance sampling ratio formula.

[Figure: with enough samples, a rare point drawn from q(x) on the left receives a large importance weight]

As shown above, once we have drawn enough samples: although the probability of getting a data point on the left side under q(x) is very small, once we do get one, the importance sampling ratio lets us use it "very well".

For example, consider the green dot on the left in the figure above. Because q(x) is very small on the left while p(x) is very large there, the importance sampling ratio formula gives this left-side data point a large weight, and the bias is thereby "corrected". With enough samples plus the importance sampling ratio, we can correctly estimate that the expectation of f(x) sampled under p(x) is negative. The sketch below reproduces this numerically.
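
Continuing the sketch above (same assumed p, q, and f, with p high on the left and q high on the right): with only a few samples from q we almost never see a left-side point, so even the weighted estimate is unreliable; with enough samples, the occasional left-side point arrives carrying a large weight p(x)/q(x) and pulls the estimate down to its correct negative value.

```python
import numpy as np
from scipy.stats import norm

rng = np.random.default_rng(1)
p, q = norm(loc=-1.0, scale=1.0), norm(loc=1.0, scale=1.0)
f = lambda x: x

for n in [100, 10_000, 1_000_000]:
    x = q.rvs(size=n, random_state=rng)
    w = p.pdf(x) / q.pdf(x)
    # A point near x = -2 is rare under q, but once drawn it gets
    # weight p(-2)/q(-2) = exp(4) ~ 55, so it is used "very well".
    print(n, np.mean(w * f(x)))

# Small n: high variance, often the wrong magnitude or even sign;
# large n: the estimate converges to E_p[f(x)] = -1.
```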

I attach great importance to the correctness of my article. If you have different opinions, please send me an email: [email protected]


Postscript: the original title of this article was "How to understand the off-policy importance sampling ratio in reinforcement learning? Let's do a simple derivation". But as I was writing, I had a flash of insight: off-policy learning means learning the optimal policy from non-optimal ones; is that not learning successful experience from failure! Combined with my friend's personal experience (is there anything my friend has kept failing at, yet kept improving?)... So from the perspective of reinforcement learning, having chased so many girls, rounding up, Xiaopai can be counted as having been in love! That's all, brothers and sisters. Follow my official account "Piper egg nest".