An introduction: my friend Xiao Pai has many female friends and is as steady as they come. The girls often wonder about him:

He knows how to eat, how to play, how to chat. He knows how to dress well, how to do reliable work, and how to behave. Has he really never pursued a girl? If you ask me, it is nothing worth fussing over. When I first met Xiao Pai, he was still a "steel-straight guy." From the perspective of reinforcement learning, what Xiao Pai has done in recent years is good off-policy learning: learning from his own and other people's experiences of "chasing girls." Whether the outcome was a success or a failure, Xiao Pai could learn something useful from it. In this article, we take "chasing girls" as an example to discuss off-policy learning, and then work out the mathematical meaning of the importance sampling ratio (mainly following the viewpoint of Professor Hung-yi Lee of National Taiwan University).

**Table of contents:**

- Learning from other people’s experiences
- Importance sampling ratio: correcting the "taken-for-granted" bias

### Learning from other people’s experiences

In reinforcement learning, **the control policy we want to learn must converge to (or approximate) the optimal one.** Taking chasing girls as an example, Xiao Pai has only one goal in mind: **pursue success in the best possible way.**

Obviously, **the policy we want to learn must be a "policy that leads us to success."** But here are the problems:

- Xiao Pai himself **has never succeeded and only has experiences of failure.** What can he learn from those?
- Other people's experiences, whether successful or not, **can never be copied by Xiao Pai wholesale.** What can he learn from them?

For reinforcement learning, the answer to both questions is **yes**.

In Sutton's classic book, **Section 5.5 of Chapter 5** introduces the concept of **"off-policy"** for the first time.

Although Chapter 5 only introduces the concept, **"off-policy"** learning is nevertheless almost everywhere in **reinforcement learning practice**, because:

- In reinforcement learning, **data can usually only be obtained by interacting with the environment**, so acquiring data is expensive and data is scarce;
- The simple, direct iteration methods before Section 5.5 of the book can only **use the current control policy while improving that same policy (on-policy)**. That tends to leave some methods unexplored and never tried, **and it means earlier data and other people's data cannot be reused**.

Xiao Pai offered us an analogy:

- **On-policy:** this attempt failed, so the method is no good. Fine, improve the method and try again next time!
- **Off-policy:** the method I use is not necessarily the one I currently believe is best; or rather, no matter what method I use, I can learn from it, improve myself, and find my own best method. If other people have experience, I can learn from that too!

As you can probably see, **on-policy learning is a special case of off-policy learning.** When designing an algorithm, **if it meets the requirements of off-policy learning, it can certainly also do on-policy learning.**

And in practice, it is **hard to avoid** off-policy learning:

- When interacting with the environment, we try not to always act with the current optimal policy **(the on-policy way of learning)**, because that makes us "overly cautious" and afraid to make creative attempts;
- We want to reuse earlier data, and that data was also generated under a policy different from the current one.
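To make this concrete, here is a minimal sketch I am adding (the toy 5-state chain environment is my own invention, not from the article) of off-policy learning in code: Q-learning improves the greedy target policy even though every transition comes from a uniformly random behavior policy.

```python
import random

random.seed(0)

# Hypothetical toy environment: a 5-state chain. Moving right from state 3
# reaches the terminal state 4 and earns reward 1; everything else earns 0.
N_STATES, ACTIONS = 5, (-1, +1)
GAMMA, ALPHA = 0.9, 0.1

Q = {(s, a): 0.0 for s in range(N_STATES) for a in ACTIONS}

for _ in range(2000):                      # episodes driven by a RANDOM policy
    s = 0
    while s != N_STATES - 1:
        a = random.choice(ACTIONS)         # behavior policy: uniform random
        s2 = min(max(s + a, 0), N_STATES - 1)
        r = 1.0 if s2 == N_STATES - 1 else 0.0
        # Off-policy update: bootstrap on the BEST next action,
        # not on the action the behavior policy actually took.
        Q[(s, a)] += ALPHA * (r + GAMMA * max(Q[(s2, b)] for b in ACTIONS) - Q[(s, a)])
        s = s2

# The greedy target policy learned from purely random experience:
greedy = [max(ACTIONS, key=lambda a: Q[(s, a)]) for s in range(N_STATES - 1)]
print(greedy)   # moves right (+1) in every state
```

The random behavior policy explores freely (and its episodes could just as well come from a replay buffer or from someone else), while the `max` in the update always improves the greedy target policy — exactly the separation between "the policy generating data" and "the policy being learned."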

### Importance sampling ratio: correcting the "taken-for-granted" bias

Under **off-policy learning**, **we cannot use the "taken-for-granted" iteration directly, because mathematically it introduces bias**, and in the end we get poor learning results. To prevent this "taken-for-granted" bias, we use the importance sampling ratio to correct data obtained under a policy different from the current one.

If you prefer rigorous mathematical derivation, check out Sutton's *Reinforcement Learning: An Introduction* (Second Edition). But to be honest, when I first studied this part in January this year, I didn't understand off-policy learning or the concept of the `importance sampling ratio` very well.

As mentioned above, I found my notes on this part on CSDN. Looking at them now, **they only described the effect, without explaining the "why."**

Later I took **Professor Hung-yi Lee's deep learning course**, in which he covered some reinforcement learning. He didn't even introduce the basic assumptions of MDPs, but some of his views on reinforcement learning caught my eye, especially **the derivation of the importance sampling ratio he presents before introducing PPO.**

Here, we discuss the sampling ratio only from the perspective of **data sampling**.

After a simple derivation, we find the relation between sampling under `p` and sampling under `q`: from `E_{x~p}[f(x)]` it is easy to get

`E_{x~p}[f(x)] = ∫ f(x) p(x) dx = ∫ f(x) (p(x)/q(x)) q(x) dx = E_{x~q}[f(x) p(x)/q(x)]`

And that fraction `p(x)/q(x)` is our sampling ratio!
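This identity can be checked numerically. Here is a quick sketch of my own (the particular Gaussians `p = N(-1, 1)` and `q = N(1, 1)` are assumptions for illustration, not from the article): we sample only from `q`, weight each sample by `p(x)/q(x)`, and recover the expectation under `p`.

```python
import math
import random

random.seed(0)

# Target distribution p = N(-1, 1); behavior distribution q = N(1, 1).
# We want E_{x~p}[f(x)] with f(x) = x, whose true value is -1.
def log_pdf_normal(x, mu):
    """Log density of N(mu, 1)."""
    return -0.5 * (x - mu) ** 2 - 0.5 * math.log(2 * math.pi)

n = 200_000
xs = [random.gauss(1.0, 1.0) for _ in range(n)]   # samples from q ONLY

# Importance weight p(x)/q(x), computed in log space for stability.
weights = [math.exp(log_pdf_normal(x, -1.0) - log_pdf_normal(x, 1.0)) for x in xs]

# Importance-weighted estimate of E_{x~p}[x]: close to the true value -1,
# even though every sample was drawn from q (whose mean is +1).
est = sum(w * x for w, x in zip(weights, xs)) / n
print(round(est, 2))
```

Note that the weighted estimate is unbiased but can have high variance when `p` and `q` differ a lot, which is exactly why "enough sampling" matters below.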

Below, we explain this in detail **with a concrete distribution example**.

As in the figure above, the value of f(x) over the data distribution is drawn as a `red line`. We can see that: **if f(x) is sampled under p(x), the final expected value should be negative, because p(x) tends to sample on the left side of f(x) (the blue line in the figure is high on the left).**

**However, at present we can only obtain data under q(x), and q(x) tends to sample on the right side of f(x) (the green line in the figure is high on the right). This yields positive values of f(x).** Without the sampling ratio, we would mistakenly conclude: `the expected value of f(x) under p(x) is positive.`

How can this bias be eliminated? **Enough samples + the sampling ratio formula.**

As shown above, once we have sampled enough: **although the probability of drawing left-side data under q(x) is very small, once we do get such a sample, the sampling ratio lets us make "very good" use of it.**

For example, consider the green dot on the left in the figure above: because q(x) is very small on the left while p(x) is very large there, the sampling ratio formula gives that left-side data point a large weight, which "corrects" the bias. With **enough samples + the sampling ratio**, we can correctly estimate: `the expected value of f(x) under p(x) is negative.`
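The figure's scenario can be mimicked numerically. In this sketch of mine (the choices `p = N(-0.5, 1)` leaning left, `q = N(0.5, 1)` leaning right, and `f(x) = x` are hypothetical stand-ins for the figure's curves), the naive average of q-samples has the wrong sign, while the importance-weighted average recovers the negative expectation under p:

```python
import math
import random

random.seed(1)

# p leans left (mean -0.5), q leans right (mean +0.5); f(x) = x.
# True E_{x~p}[f(x)] = -0.5, but data is only available under q.
n = 100_000
xs = [random.gauss(0.5, 1.0) for _ in range(n)]   # samples from q only

naive = sum(xs) / n                               # wrong: estimates E_q[x] ~ +0.5

# For these two unit-variance Gaussians the ratio p(x)/q(x) simplifies to exp(-x):
# rare left-side samples (x << 0) receive exponentially LARGE weights.
weighted = sum(x * math.exp(-x) for x in xs) / n  # estimates E_p[x] ~ -0.5

print(naive > 0, weighted < 0)
```

The left-side samples are drawn rarely under q, but each one carries a huge weight `exp(-x)`, so with enough samples they pull the estimate back to the correct negative value.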

I care a lot about the correctness of my articles. If you disagree with anything, please email me: [email protected]

Postscript: the original title of this article was

`How to understand the "importance sampling ratio of off-policy learning" in reinforcement learning? A simple derivation`

But as I wrote, it hit me: off-policy learning means learning the optimal policy from non-optimal ones; isn't that learning successful experience from failure? Combined with my friend's personal experience (is there anything my friend keeps failing at, yet keeps improving?)... So from a reinforcement learning perspective, having chased so many girls, rounding up, Xiao Pai can be said to have been in love! That's all, brothers and sisters. Follow the official account "Piper egg nest."