[RL] Prediction and Control: MC, TD(λ), SARSA, Q-learning, etc.


This reinforcement learning series is based on David Silver's course: http://www0.cs.ucl.ac.uk/staff/D.Silver/web/Teaching.html

After introducing the basic concepts of RL and MDPs in the previous article, this article covers the model-free setting (i.e., when the rewards R_s and the state transition matrix P_ss' are unknown): how to perform prediction, that is, estimate the state value function v(s) of the current policy so we can tell how good the policy is; and how to perform control, that is, find the optimal policy (find Q*(s, a), from which π*(a|s) follows immediately).

The prediction part introduces three sample-based estimation methods: Monte Carlo (MC), TD(0), and TD(λ). The practical meaning of λ is how far into the future we look when sampling from a state. When λ = 0 we get TD(0), which considers only the next state; when λ = 1 it is essentially MC, which considers all T−1 subsequent states, i.e., to the end of the whole episode; TD(λ) with λ ∈ (0, 1) considers something in between, n ∈ (1, T−1) subsequent states. Prediction is also divided into online updates and offline updates, i.e., whether the policy π is being optimized at the same time as the current π is used to obtain the next state's return V(s').

The control part introduces how to find the optimal policy, that is, Q*(s, a). It is divided into on-policy and off-policy methods, i.e., learning without or with reference to another agent's behaviour (for example, a robot proposing better behaviours by observing humans). The on-policy section introduces MC control, SARSA, and SARSA(λ); the off-policy section introduces Q-learning.

Details are given below.


1. Model-free prediction

  1. Monte Carlo (MC) Learning

  2. Temporal-Difference TD(0)

  3. TD(λ)

2. Model-free control

  1. On-policy Monte Carlo (MC) control

  2. On-policy temporal-difference (TD) learning – SARSA and SARSA(λ)

  3. Off-policy Q-learning


1. Model-free prediction

The prediction section does not involve actions at all. Because it measures the quality of the current policy, we only need to estimate the state value function v(s) of each state. The previous article introduced the Bellman expectation equation for obtaining the expected value V(s) of a state, which can be solved by dynamic programming (DP) over all states. However, that is very inefficient. The following methods obtain V(s) by sampling instead.

1. Monte Carlo (MC) Learning

Monte Carlo is a frequently encountered term. Its core idea is to randomly sample many complete episodes starting from state s_t; the actual return of each complete episode can be observed, so the average of those returns is V(s_t). Note that MC can only be used on terminating (complete) sequences.

In the MC sampling process, each state's value function is maintained with a visit counter and a return accumulator: on each visit, N(s) += 1, and S(s) += G_t, where G_t is the return obtained in the t-th sample and S(s) is the cumulative return. As the number of samples grows, the average V(s) = S(s) / N(s) approaches the true value v_π(s). While collecting complete sequences, loops are likely, i.e., a state point may be passed many times. MC handles this in two ways: first-visit (N(s) += 1 only the first time the state is passed in an episode) and every-visit (N(s) += 1 every time the state is passed).
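As a concrete sketch (the function name and episode format are my own assumptions, not from the lecture), first-visit MC prediction can be written as:

```python
from collections import defaultdict

def first_visit_mc(sample_episode, num_episodes=1000, gamma=1.0):
    """First-visit Monte Carlo prediction.

    sample_episode() must return one complete episode as a list of
    (state, reward) pairs, reward being received on leaving the state.
    """
    N = defaultdict(int)    # visit counter N(s)
    S = defaultdict(float)  # cumulative return S(s)
    V = defaultdict(float)
    for _ in range(num_episodes):
        episode = sample_episode()
        # compute the return G_t at every step, working backwards
        G, step_returns = 0.0, []
        for state, reward in reversed(episode):
            G = reward + gamma * G
            step_returns.append((state, G))
        step_returns.reverse()
        seen = set()
        for state, G in step_returns:
            if state in seen:   # first-visit: count a state once per episode
                continue
            seen.add(state)
            N[state] += 1
            S[state] += G
            V[state] = S[state] / N[state]
    return V
```

Switching to every-visit MC just means dropping the `seen` check.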

V(s) can be regarded as the average of all the returns observed at s. The average need not be computed by summing and then dividing; it can also be maintained incrementally, by nudging the existing average a little toward each new sample, which is the form of the left formula below: the existing average is V(s_t), the return obtained in this sample is G_t, and the correction is proportional to the difference (G_t − V(s_t)). Computed this way, however, a counter N(s_t) must be maintained the whole time, while all a running average really needs is a step in the right direction. So 1/N(s_t) is replaced by a constant α ∈ (0, 1), giving the right formula below. The practical meaning of α is a forgetting coefficient: older sampling results are discounted to an appropriate degree, so the samples need not all be remembered exactly.
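Both update forms can be sketched in a few lines (function names are illustrative, not from the lecture):

```python
def incremental_mean_update(mu, x, k):
    """Exact running mean (left form): mu_k = mu_{k-1} + (x_k - mu_{k-1}) / k."""
    return mu + (x - mu) / k

def constant_alpha_update(v, g, alpha=0.1):
    """Constant-step form (right form): V <- V + alpha * (G - V);
    alpha acts as a forgetting coefficient over old samples."""
    return v + alpha * (g - v)

# running mean of [1, 2, 3, 4], without storing the samples
mu = 0.0
for k, x in enumerate([1.0, 2.0, 3.0, 4.0], start=1):
    mu = incremental_mean_update(mu, x, k)
```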


2. Temporal-Difference TD(0)

An obvious drawback of Monte Carlo sampling is that a complete sequence must be sampled before its return can be observed. TD(0) does not need that. It uses the Bellman equation: the current state's return depends only on the immediate reward R_{t+1} and the next state's value (as below). The red part is the TD target, and the bracketed term multiplied by α is the TD error. TD(0) therefore only samples the next state point s_{t+1}, and computes the target from R_{t+1} and the current estimate V(s_{t+1}). This method of estimating with existing estimates is called bootstrapping (updating a guess towards a guess); MC averages actually observed values, with no bootstrapping. Since TD(0) only needs to sample the next state s_{t+1}, it can be used on non-terminating (incomplete) sequences.
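A minimal sketch of one TD(0) backup, assuming value estimates kept in a dictionary (the function name is my own):

```python
from collections import defaultdict

def td0_update(V, s, r, s_next, alpha=0.1, gamma=1.0, terminal=False):
    """One TD(0) backup: V(s) <- V(s) + alpha * (TD target - V(s))."""
    td_target = r + (0.0 if terminal else gamma * V[s_next])  # TD target
    td_error = td_target - V[s]                               # TD error
    V[s] += alpha * td_error

V = defaultdict(float)                    # estimates start at 0
td0_update(V, "A", 1.0, "B", alpha=0.5)   # V(B) is still 0, so target is 1.0
```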

Compared with MC, TD(0) builds its target from existing estimates, so its bias is larger than that of MC, which averages actually observed returns. However, TD(0) only needs to sample the next state rather than a complete sequence, so the variance of the TD(0) estimate is smaller than MC's.

A concrete example of computing V(s): take 8 sampled episodes, ignoring the discount factor γ. What are the resulting V(A) and V(B) in the figure below?


As the figure shows, both TD and MC give V(B) = 0.75 by averaging; but MC gives V(A) = 0, because A appears only once and that sampled return is 0, whereas TD(0) gives V(A) = 0.75, because the next state after A is B, with r = 0 and V(B) = 0.75. From this point of view, the TD algorithm makes better use of the Markov property. TD(0) only samples the next state point and does not need to wait for the end of the episode on each sample, so it is more efficient than MC; however, because of bootstrapping it is more affected by the initial values, and its fit is not as good as MC's. The figure below shows the relationship between the DP algorithm (which sweeps all states), the TD(0) algorithm (which samples only the next state), and the MC algorithm (which samples a complete sequence of T−1 states). Sampling any number of steps between 1 and T−1 is also possible, which leads to the more general expression, TD(λ).
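The A/B example can be checked numerically (the episode layout is reconstructed from the description above: one episode A→B with zero rewards, six episodes B with reward 1, one episode B with reward 0):

```python
# gamma = 1, so returns are just summed rewards.
episodes = [[("A", 0.0), ("B", 0.0)]] + [[("B", 1.0)]] * 6 + [[("B", 0.0)]]

# MC: average the actually observed returns per state
returns = {"A": [], "B": []}
for ep in episodes:
    G = 0.0
    for state, reward in reversed(ep):
        G += reward
        returns[state].append(G)
V_mc = {s: sum(gs) / len(gs) for s, gs in returns.items()}

# TD: B is estimated from its observed rewards (also 0.75); A bootstraps
# from B, since A's only observed transition is r = 0 followed by B.
V_td = {"B": V_mc["B"], "A": 0.0 + V_mc["B"]}
```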


3. TD(λ)

Here λ is a real number in [0, 1] that determines how many future state points s_{t+n} are considered, with n running from 1 upward. In the left formula, G_t^(n) denotes the n-step return obtained by sampling up to s_{t+n}. How are G_t^(1) through G_t^(n) combined? Each G_t^(n) is multiplied by the coefficient (1 − λ)λ^(n−1); since Σ_{n≥1} λ^(n−1) = 1/(1 − λ), the coefficients in the right formula sum to 1.


The weighting inside G_t^λ is shown in the figure below; the total area under the curve is 1. At each step the weight shrinks by a factor of λ (λ ∈ [0, 1]), so the weight of later states gradually decreases: reducing λ effectively reduces n. For example, when λ = 0 the expression is identical to TD(0), sampling only the one future state s_{t+1}; when λ = 1 the weight does not decay at all, and the sampling runs to the end of the episode, equivalent to MC.
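The geometric weights can be verified directly (a small illustrative helper, not from the lecture):

```python
def lambda_weights(lmbda, horizon):
    """Weight placed on the n-step return G_t^(n): (1 - lambda) * lambda**(n-1)."""
    return [(1 - lmbda) * lmbda ** (n - 1) for n in range(1, horizon + 1)]

w = lambda_weights(0.5, 200)  # weights decay geometrically and sum to ~1
```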


Temporal-difference sampling is divided into forward-view (below left) and backward-view (below right). The forward view is what was just described: sample n future state points (controlled by λ) and look at future returns; like MC, it can only be used on complete sequences. The backward view, however, can be used on incomplete sequences: during sampling, each update step maintains an eligibility trace recording information about the unfinished sequence, which becomes the weight used when updating the value function.


   The eligibility trace (left formula below) is like a frog jumping in a well: every jump raises it, and without jumping it gradually sinks back down. It combines the frequency heuristic and the recency heuristic; a "jump" here corresponds to visiting state s. With it, the backward-view value update can be written as the right formula below. When λ = 0, E_t(s) is 1 only at the moment s is visited, and the update is identical to TD(0); when λ = 1, every state point in the whole sequence keeps an eligibility trace value, which can be regarded as MC. Each value-function update refers both to the TD error δ_t and to the eligibility trace E_t(s).
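A sketch of backward-view TD(λ) with an accumulating eligibility trace, assuming episodes of (state, reward, next_state, terminal) steps (names are my own):

```python
from collections import defaultdict

def td_lambda_episode(steps, V, alpha=0.1, gamma=1.0, lmbda=0.9):
    """Backward-view TD(lambda) over one episode of
    (state, reward, next_state, terminal) steps."""
    E = defaultdict(float)  # eligibility traces E_t(s)
    for s, r, s_next, terminal in steps:
        td_target = r + (0.0 if terminal else gamma * V[s_next])
        delta = td_target - V[s]   # TD error
        E[s] += 1.0                # visiting s: frequency + recency bump
        for state in list(E):      # every traced state shares the update
            V[state] += alpha * delta * E[state]
            E[state] *= gamma * lmbda  # traces decay each time step
```

With lmbda = 0, every trace but the current state's is zeroed by the decay, recovering TD(0); with lmbda = 1 and γ = 1, the traces never decay, behaving like every-visit MC.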


The effect of the eligibility trace can be seen below: the vertical axis is the accumulated trace value, the horizontal axis is time, and each | marks a time point at which state s is visited.


When λ = 0, only the current state is updated; the left formula in the figure below is exactly equivalent to the TD(0) update on the right.




  At the same time, TD(1) is exactly equivalent to every-visit MC. From the formula below, when λ = 1 we have E_t(s) = γ^(t−k), and the sum of successive TD errors telescopes into the MC error of the complete sequence of T−1 steps.
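The telescoping behind this equivalence can be written out (a sketch assuming V is held fixed during the episode and the terminal state's value is 0, with δ_k = R_{k+1} + γV(S_{k+1}) − V(S_k)):

```latex
\begin{aligned}
\sum_{k=t}^{T-1}\gamma^{\,k-t}\delta_k
&= \sum_{k=t}^{T-1}\gamma^{\,k-t}\bigl(R_{k+1}+\gamma V(S_{k+1})-V(S_k)\bigr)\\
&= \underbrace{\sum_{k=t}^{T-1}\gamma^{\,k-t}R_{k+1}}_{G_t}
 \;+\; \underbrace{\sum_{k=t}^{T-1}\bigl(\gamma^{\,k-t+1}V(S_{k+1})-\gamma^{\,k-t}V(S_k)\bigr)}_{\gamma^{\,T-t}V(S_T)-V(S_t)\;=\;-V(S_t)}\\
&= G_t - V(S_t)
\end{aligned}
```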



When the sequence environment uses offline updates, the forward view is the same as the backward view (see the formula below; the left side can be simplified to the right form), and the TD error reduces to the λ-error form.


  When the sequence environment uses online updates, i.e., the policy is optimized while sampling, the backward view keeps accumulating an error, as shown below. So if s is visited multiple times, the error can become very large.


   A summary table is as follows:




2. Model-free control

Here, on-policy can be seen as "learn on the job", i.e., reflecting on and optimizing one's own behaviour; off-policy can be seen as "look over someone's shoulder", i.e., proposing optimizations by observing the behaviour of other agents.

The search for the optimal policy can be seen as two parts, policy evaluation and policy improvement, the most basic improvement being the greedy algorithm, as shown in the figure below. Each iteration evaluates V under the current policy π, then greedily updates π toward the states and actions with better value; this guarantees that V and π eventually converge to the optimal V* and π*.




The methods below all follow this evaluation-and-improvement framework. However, since they are model-free, greedy improvement over V(s) cannot be carried out without the state transition matrix P; therefore the policy can only be optimized through the action value function Q(s, a).

Moreover, for the policy evaluation of the sampling methods described below, the pure greedy algorithm is not reliable: some action at a state may yield a higher return, but if that higher return is never sampled, the path is ignored and the search falls into a local optimum. So ε-greedy (ε ∈ [0, 1]) is used instead of the absolute greedy algorithm, making it "softer": with probability 1 − ε the best known action is chosen, and with probability ε an action is chosen uniformly from the whole action set, as in the left formula below. The right formula shows that ε-greedy still improves the policy: V_π'(s) ≥ V_π(s).

It is a simple method, but it explores a larger part of the state space and so obtains a more global optimum.
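A minimal ε-greedy action selection sketch (names are my own):

```python
import random

def epsilon_greedy_action(Q, state, actions, epsilon=0.1):
    """With probability 1 - epsilon exploit the best known action,
    otherwise pick uniformly from the whole action set."""
    if random.random() < epsilon:
        return random.choice(actions)          # explore
    return max(actions, key=lambda a: Q.get((state, a), 0.0))  # exploit
```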


1. On-policy Monte Carlo (MC) control

The MC method is used to estimate the action value function, and then ε-greedy performs the policy improvement. An optimization can be made over basic MC control: there is no need to estimate the exact Q_π on every iteration. As long as we improve on the current Q estimate, we are still moving in the direction of the optimal Q*.
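Putting evaluation and improvement together, a sketch of on-policy every-visit MC control with ε-greedy improvement (the environment interface and names are assumptions for illustration):

```python
import random
from collections import defaultdict

def mc_control(sample_episode, actions, num_episodes=1000, gamma=1.0, epsilon=0.1):
    """On-policy every-visit MC control with epsilon-greedy improvement.

    sample_episode(policy) must roll out one complete episode under the
    given policy and return a list of (state, action, reward) triples.
    """
    Q = defaultdict(float)  # running estimate of Q(s, a)
    N = defaultdict(int)    # visit counter per (s, a)

    def policy(state):
        # epsilon-greedy over the current (not fully evaluated) Q
        if random.random() < epsilon:
            return random.choice(actions)
        return max(actions, key=lambda a: Q[(state, a)])

    for _ in range(num_episodes):
        episode = sample_episode(policy)
        G = 0.0
        for state, action, reward in reversed(episode):
            G = reward + gamma * G
            N[(state, action)] += 1
            # incremental mean: move Q a step toward the sampled return
            Q[(state, action)] += (G - Q[(state, action)]) / N[(state, action)]
    return Q, policy
```

A one-state, bandit-style environment is already enough to see Q separate the two actions.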



To be continued...