Reinforcement learning: solving an intelligent maze with the policy iteration algorithm

Time:2021-12-31

0x00 fundamentals of machine learning

Machine learning can be divided into three categories

  • Supervised learning
  • Unsupervised learning
  • Reinforcement learning

Key points of the three categories

  • Supervised learning requires labels to be set manually; the model then learns to assign data to the right labels.
  • Unsupervised learning also needs some parameters to be set, but it groups an unlabeled data set on its own.
  • Reinforcement learning starts from manually set initial parameters and keeps adjusting them based on feedback from the data, until the function settles on the optimal solution, i.e. what we consider the best possible strategy.

Principles of machine learning

  • Provide data (training data or learning data) to the system, and the system's parameters are determined automatically from that data (a minimal sketch follows below).
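For instance, here is a minimal sketch of that principle (my own illustration, not part of the maze code): the "system" is a straight line y = w·x + b, and its parameters are determined automatically from the training data by least squares.

import numpy as np

x = np.array([0.0, 1.0, 2.0, 3.0, 4.0])
y = np.array([1.1, 2.9, 5.2, 7.1, 8.8])   # noisy samples of roughly y = 2x + 1

w, b = np.polyfit(x, y, 1)                # parameters determined from the data
print(w, b)                               # approximately 2 and 1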

There are many common machine learning algorithms, such as:

  • Logistic regression
  • Support vector machines
  • Decision trees
  • Random forests
  • Neural networks

But do you really need to learn in the order probability theory → linear algebra → calculus → machine learning → deep learning → neural networks?

  • Not necessarily. By the time you finish, you may find it is not what you actually needed; meanwhile, you will have read through plenty of topics once and never use them again.

0x01 reinforcement learning background

Reinforcement learning was very popular when it first appeared, but interest later cooled. The main reason is that classical reinforcement learning cannot represent large state spaces well. An agent is described mainly by states and actions: think of the state as a person's position on Earth and the action as the person walking. If we know the position and the action, we can in principle predict the next state, but in practice the agent has to enumerate every possible state into a table. In our example, a person has far too many possible next states: I am in Beijing now and might fly to Shanghai or to Nanjing, and nobody knows where I will fly. Why hasn't AI broken through to strong artificial intelligence? In my view the main reason is insufficient computing power: there is simply no way to list all of the agent's states. Today, deep learning can reduce the dimensionality of large amounts of data, so that a data set keeps its essential features while shrinking in size. Under the same computing power, reinforcement learning can therefore cover far more of the agent's states and get closer to an optimal solution. This gave rise to a new concept, deep reinforcement learning: an enhanced version of reinforcement learning that combines it with deep learning and can complete tasks that are very difficult for humans.

0x02 establishment of maze

import numpy as np
import matplotlib.pyplot as plt

fig=plt.figure(figsize=(5,5))
ax=plt.gca()
#Draw walls
plt.plot([1,1],[0,1],color='red',linewidth=3)
plt.plot([1,2],[2,2],color='red',linewidth=2)
plt.plot([2,2],[2,1],color='red',linewidth=2)
plt.plot([2,3],[1,1],color='red',linewidth=2)
#Painting state
plt.text(0.5,2.5,'S0',size=14,ha='center')
plt.text(1.5,2.5,'S1',size=14,ha='center')
plt.text(2.5,2.5,'S2',size=14,ha='center')
plt.text(0.5,1.5,'S3',size=14,ha='center')
plt.text(1.5,1.5,'S4',size=14,ha='center')
plt.text(2.5,1.5,'S5',size=14,ha='center')
plt.text(0.5,0.5,'S6',size=14,ha='center')
plt.text(1.5,0.5,'S7',size=14,ha='center')
plt.text(2.5,0.5,'S8',size=14,ha='center')
plt.text(0.5,2.3,'START',ha='center')
plt.text(2.5,0.3,'END',ha='center')
#Set drawing range
ax.set_xlim(0,3)
ax.set_ylim(0,3)
plt.tick_params(axis='both',which='both',bottom=False,top=False,labelbottom=False,right=False,left=False,labelleft=False)
#The current position S0 is circled in green
line,=ax.plot([0.5],[2.5],marker="o",color='g',markersize=60)
#Display diagram
plt.show()

Operation results

0x03 policy iteration algorithm

For us humans, one glance is enough to see how to get from start to goal: S0 → S3 → S4 → S7 → S8.
For a machine, the usual approach would be to hard-code that route in a program, but such a program only reflects our own idea. What we want to do instead is reinforcement learning: let the machine learn from data how to choose the route by itself.

Basic concepts

  • In reinforcement learning, the rule that defines the agent's behavior is called a policy. It is written π_θ(s, a): the probability of taking action a in state s under a policy π determined by the parameter θ.

Here, the state is the agent's position in the maze, and the actions are the four moves: up, right, down and left.
π can be expressed in various ways, sometimes as a function.
Here, the probability of the agent's next move can be written explicitly as a table, where each row is a state, each column is an action, and the value is the probability of taking that action.

If π is a function, then θ is the parameter of that function; in the table used here, θ is a value that is converted into the probability of taking action a in state s.

Define initial value

theta_0=np.array([[np.nan,1,1,np.nan], #S0
                      [np.nan,1,np.nan,1], #S1
                      [np.nan,np.nan,1,1], #S2
                      [1,1,1,np.nan], #S3
                      [np.nan,np.nan,1,1], #S4 
                      [1,np.nan,np.nan,np.nan], #S5
                      [1,np.nan,np.nan,np.nan], #S6
                      [1,1,np.nan,np.nan],  #S7
                      ])  # S8 is the goal, so no policy is needed

Operation results
![theta_0 output](https://img2020.cnblogs.com/blog/2097957/202106/2097957-20210607094752371-640969868.png)
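To see how the np.nan entries encode the walls, here is a small sketch (assuming theta_0 as defined above; the columns are ordered up, right, down, left, matching the direction list used later):

directions = ["up", "right", "down", "left"]   # column order of theta_0
for s in range(theta_0.shape[0]):
    allowed = [d for d, v in zip(directions, theta_0[s, :]) if not np.isnan(v)]
    print("S" + str(s) + ":", allowed)   # np.nan marks a move blocked by a wall or the maze edge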

Convert the θ values into probabilities (percentages)

def int_convert_(theta):
    [m, n] = theta.shape  # get the size of the matrix
    pi = np.zeros((m, n))
    for i in range(0, m):
        pi[i, :] = theta[i, :] / np.nansum(theta[i, :])  # convert each row into percentages
    pi = np.nan_to_num(pi)  # convert nan to 0
    return pi

Operation results

This already gives a policy that moves at random without walking into walls.
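As a quick check, every row of the converted matrix should sum to 1 (a small sketch, assuming int_convert_ and theta_0 from above):

pi_0 = int_convert_(theta_0)   # convert theta_0 into a probability table
print(pi_0)
print(pi_0.sum(axis=1))        # every row should sum to 1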

Define how the state changes when the agent moves

def get_next_s(pi, s):
    direction = ["up", "right", "down", "left"]
    next_direction = np.random.choice(direction, p=pi[s, :])
    # choose a direction according to the probabilities
    if next_direction == "up":
        s_next = s - 3  # state number decreases by 3 when moving up
    if next_direction == "right":
        s_next = s + 1  # state number increases by 1 when moving right
    if next_direction == "down":
        s_next = s + 3  # state number increases by 3 when moving down
    if next_direction == "left":
        s_next = s - 1  # state number decreases by 1 when moving left
    return s_next

Keep moving until the goal state is reached

def goal_maze(pi):  # move continuously according to the defined policy
    s = 0  # start position
    state_history = [0]  # list recording the agent's trajectory
    while (1):  # loop until the agent reaches the goal
        next_s = get_next_s(pi, s)
        state_history.append(next_s)  # record the next state in the history
        if next_s == 8:  # the goal has been reached
            break
        else:
            s = next_s
    return state_history

By assumption, the shortest path runs from the start S0 down and to the right to the bottom-right corner S8; state 8 therefore marks the end of a path.
In practice, however, the agent moves according to the initial θ we set, which means it moves at random and stops only when it happens to reach state 8, so all kinds of paths can occur. What we need is a way for the agent to learn the shortest path by itself.
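To see how varied these random trajectories are, one can run the random policy many times and compare the path lengths. A small sketch (assuming goal_maze, int_convert_ and theta_0 from above):

pi_0 = int_convert_(theta_0)
lengths = [len(goal_maze(pi_0)) - 1 for _ in range(100)]   # 100 random episodes
print("shortest:", min(lengths), "longest:", max(lengths),
      "average:", sum(lengths) / len(lengths))
# the true shortest path needs only 4 steps, but a random walk usually needs many more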

Example run of the above code

The first output is the original θ; the second is θ converted into probabilities; the third sets only the agent's state (no actions yet) and then moves the agent continuously according to the policy derived from θ, which can produce many different trajectories.
The complete code for the above:

import numpy as np
import matplotlib.pyplot as plt
def plot():
    fig=plt.figure(figsize=(5,5))
    ax=plt.gca()
    #Draw walls
    plt.plot([1,1],[0,1],color='red',linewidth=3)
    plt.plot([1,2],[2,2],color='red',linewidth=2)
    plt.plot([2,2],[2,1],color='red',linewidth=2)
    plt.plot([2,3],[1,1],color='red',linewidth=2)
    #Painting state
    plt.text(0.5,2.5,'S0',size=14,ha='center')
    plt.text(1.5,2.5,'S1',size=14,ha='center')
    plt.text(2.5,2.5,'S2',size=14,ha='center')
    plt.text(0.5,1.5,'S3',size=14,ha='center')
    plt.text(1.5,1.5,'S4',size=14,ha='center')
    plt.text(2.5,1.5,'S5',size=14,ha='center')
    plt.text(0.5,0.5,'S6',size=14,ha='center')
    plt.text(1.5,0.5,'S7',size=14,ha='center')
    plt.text(2.5,0.5,'S8',size=14,ha='center')
    plt.text(0.5,2.3,'START',ha='center')
    plt.text(2.5,0.3,'END',ha='center')
    #Set drawing range
    ax.set_xlim(0,3)
    ax.set_ylim(0,3)
    plt.tick_params(axis='both',which='both',bottom=False,top=False,labelbottom=False,right=False,left=False,labelleft=False)
    #The current position S0 is circled in green
    line,=ax.plot([0.5],[2.5],marker="o",color='g',markersize=60)
    #Display diagram
    plt.show()
def int_convert_(theta):  # compute the policy from the parameter theta
    [m, n] = theta.shape  # get the size of the matrix
    pi = np.zeros((m, n))
    for i in range(0, m):
        pi[i, :] = theta[i, :] / np.nansum(theta[i, :])  # percentage of each value within its row
        # np.nansum(theta[i, :]) sums the row while ignoring nan
    pi = np.nan_to_num(pi)  # convert nan to 0
    return pi

def get_next_s(pi, s):  # choose the next state of the agent
    direction = ["up", "right", "down", "left"]
    next_direction = np.random.choice(direction, p=pi[s, :])
    # choose a direction according to the probabilities
    if next_direction == "up":
        s_next = s - 3  # state number decreases by 3 when moving up
    if next_direction == "right":
        s_next = s + 1  # state number increases by 1 when moving right
    if next_direction == "down":
        s_next = s + 3  # state number increases by 3 when moving down
    if next_direction == "left":
        s_next = s - 1  # state number decreases by 1 when moving left
    return s_next

def goal_maze(pi):  # move continuously according to the defined policy
    s = 0  # start position
    state_history = [0]  # list recording the agent's trajectory
    while (1):  # loop until the agent reaches the goal
        next_s = get_next_s(pi, s)
        state_history.append(next_s)  # record the next state in the history
        if next_s == 8:  # the goal has been reached
            break
        else:
            s = next_s
    return state_history

if __name__=="__main__":
    theta_0=np.array([[np.nan,1,1,np.nan], #S0
                      [np.nan,1,np.nan,1], #S1
                      [np.nan,np.nan,1,1], #S2
                      [1,1,1,np.nan], #S3
                      [np.nan,np.nan,1,1], #S4 
                      [1,np.nan,np.nan,np.nan], #S5
                      [1,np.nan,np.nan,np.nan], #S6
                      [1,1,np.nan,np.nan],  #S7
                      ])  # S8 is the goal, so no policy is needed
    print(theta_0)
    print(int_convert_(theta_0))
    state_history=goal_maze(int_convert_(theta_0))
    print(state_history)
    Print ("the number of steps to solve the maze path is" + str (len (state_history) - 1))
    plot()

View agent trajectory

import numpy as np
import matplotlib.pyplot as plt
from matplotlib import animation
from IPython.display import HTML
def plot():
    fig=plt.figure(figsize=(5,5))
    ax=plt.gca()
    #Draw walls
    plt.plot([1,1],[0,1],color='red',linewidth=3)
    plt.plot([1,2],[2,2],color='red',linewidth=2)
    plt.plot([2,2],[2,1],color='red',linewidth=2)
    plt.plot([2,3],[1,1],color='red',linewidth=2)
    #Painting state
    plt.text(0.5,2.5,'S0',size=14,ha='center')
    plt.text(1.5,2.5,'S1',size=14,ha='center')
    plt.text(2.5,2.5,'S2',size=14,ha='center')
    plt.text(0.5,1.5,'S3',size=14,ha='center')
    plt.text(1.5,1.5,'S4',size=14,ha='center')
    plt.text(2.5,1.5,'S5',size=14,ha='center')
    plt.text(0.5,0.5,'S6',size=14,ha='center')
    plt.text(1.5,0.5,'S7',size=14,ha='center')
    plt.text(2.5,0.5,'S8',size=14,ha='center')
    plt.text(0.5,2.3,'START',ha='center')
    plt.text(2.5,0.3,'END',ha='center')
    #Set drawing range
    ax.set_xlim(0,3)
    ax.set_ylim(0,3)
    plt.tick_params(axis='both',which='both',bottom=False,top=False,labelbottom=False,right=False,left=False,labelleft=False)
    #The current position S0 is circled in green
    line,=ax.plot([0.5],[2.5],marker="o",color='g',markersize=60)
    #Display diagram
    plt.show()
def int_convert_(theta):  # compute the policy from the parameter theta
    [m, n] = theta.shape  # get the size of the matrix
    pi = np.zeros((m, n))
    for i in range(0, m):
        pi[i, :] = theta[i, :] / np.nansum(theta[i, :])  # percentage of each value within its row
        # np.nansum(theta[i, :]) sums the row while ignoring nan
    pi = np.nan_to_num(pi)  # convert nan to 0
    return pi

def get_next_s(pi, s):  # choose the next state of the agent
    direction = ["up", "right", "down", "left"]
    next_direction = np.random.choice(direction, p=pi[s, :])
    # choose a direction according to the probabilities
    if next_direction == "up":
        s_next = s - 3  # state number decreases by 3 when moving up
    if next_direction == "right":
        s_next = s + 1  # state number increases by 1 when moving right
    if next_direction == "down":
        s_next = s + 3  # state number increases by 3 when moving down
    if next_direction == "left":
        s_next = s - 1  # state number decreases by 1 when moving left
    return s_next

def goal_maze(pi):  # move continuously according to the defined policy
    s = 0  # start position
    state_history = [0]  # list recording the agent's trajectory
    while (1):  # loop until the agent reaches the goal
        next_s = get_next_s(pi, s)
        state_history.append(next_s)  # record the next state in the history
        if next_s == 8:  # the goal has been reached
            break
        else:
            s = next_s
    return state_history

#Animation display
def init():
    #Initialization background
    line.set_data([],[])
    return (line,)

def animate(i):
    #Picture of each frame
    state=state_history[i]
    x=(state % 3)+0.5
    y=2.5-int(state/3)
    line.set_data([x], [y])
    return (line,)


if __name__=="__main__":
    theta_0=np.array([[np.nan,1,1,np.nan], #S0
                      [np.nan,1,np.nan,1], #S1
                      [np.nan,np.nan,1,1], #S2
                      [1,1,1,np.nan], #S3
                      [np.nan,np.nan,1,1], #S4 
                      [1,np.nan,np.nan,np.nan], #S5
                      [1,np.nan,np.nan,np.nan], #S6
                      [1,1,np.nan,np.nan],  #S7
                      ])  # S8 is the goal, so no policy is needed
    print(theta_0)
    print(int_convert_(theta_0))
    state_history=goal_maze(int_convert_(theta_0))
    print(state_history)
    Print ("the number of steps to solve the maze path is" + str (len (state_history) - 1))
    fig=plt.figure(figsize=(5,5))
    ax=plt.gca()
    #Draw walls
    plt.plot([1,1],[0,1],color='red',linewidth=3)
    plt.plot([1,2],[2,2],color='red',linewidth=2)
    plt.plot([2,2],[2,1],color='red',linewidth=2)
    plt.plot([2,3],[1,1],color='red',linewidth=2)
    #Painting state
    plt.text(0.5,2.5,'S0',size=14,ha='center')
    plt.text(1.5,2.5,'S1',size=14,ha='center')
    plt.text(2.5,2.5,'S2',size=14,ha='center')
    plt.text(0.5,1.5,'S3',size=14,ha='center')
    plt.text(1.5,1.5,'S4',size=14,ha='center')
    plt.text(2.5,1.5,'S5',size=14,ha='center')
    plt.text(0.5,0.5,'S6',size=14,ha='center')
    plt.text(1.5,0.5,'S7',size=14,ha='center')
    plt.text(2.5,0.5,'S8',size=14,ha='center')
    plt.text(0.5,2.5,'S0',size=14,ha='center')
    plt.text(0.5,2.3,'START',ha='center')
    plt.text(2.5,0.3,'END',ha='center')
    #Set drawing range
    ax.set_xlim(0,3)
    ax.set_ylim(0,3)
    plt.tick_params(axis='both',which='both',bottom='off',top='off',labelbottom='off',right='off',left='off',labelleft='off')
    #The current position S0 is circled in green
    line,=ax.plot([0.5],[2.5],marker="o",color='g',markersize=60)
    anim=animation.FuncAnimation(fig,animate,init_func=init,frames=len(state_history),interval=20,repeat=False,blit=True)
    plt.show()
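If the animation window closes too quickly, the result can also be written to a file. A small optional sketch (assuming the anim object from the listing above and that the pillow package is installed):

# Optional: save the animation to a GIF file instead of only showing it.
# Assumes `anim` from the listing above and an installed pillow package.
anim.save("maze_random_walk.gif", writer="pillow")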

The animation makes it clear that although the agent eventually arrives at S8, it does not go there directly. We need a reinforcement learning algorithm so that the agent learns to reach S8 along the shortest path.

There are two main approaches:

  • When the goal is reached under the current policy, treat the actions of episodes that reached the goal faster as more important: update the policy so that those actions are taken more often in the future, emphasizing the actions of successful cases (the policy iteration method).
  • Starting from the goal, work backwards to compute values for the states one step, two steps, ... before the goal, and guide the agent's behavior step by step; in other words, assign values to positions other than the goal (the value iteration method).

Here the softmax function is used to convert θ into probabilities through exponentials.

def softmax_convert_into_pi_from_theta(theta):
    beta = 1.0
    [m, n] = theta.shape
    pi = np.zeros((m, n))
    exp_theta = np.exp(beta * theta)
    for i in range(0, m):
        pi[i, :] = exp_theta[i, :] / np.nansum(exp_theta[i, :])
    pi = np.nan_to_num(pi)  # convert nan to 0
    return pi
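Why softmax rather than the earlier ratio conversion? Both give the same uniform probabilities for the initial all-ones θ, but softmax stays well defined and positive as θ is pushed up or down during learning. A small comparison sketch on a hypothetical θ row (assuming int_convert_ and softmax_convert_into_pi_from_theta from above):

row = np.array([[np.nan, 1.5, 0.5, np.nan]])    # a theta row after some updates
print(int_convert_(row))                        # [[0.   0.75 0.25 0.  ]]
print(softmax_convert_into_pi_from_theta(row))  # approximately [[0.   0.73 0.27 0.  ]]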

Then get both the agent's action and its next state

def get_action_and_next_s(pi, s):  # choose an action and the next state of the agent
    direction = ["up", "right", "down", "left"]
    next_direction = np.random.choice(direction, p=pi[s, :])
    # choose a direction according to the probabilities
    if next_direction == "up":
        action = 0
        s_next = s - 3  # state number decreases by 3 when moving up
    if next_direction == "right":
        action = 1
        s_next = s + 1  # state number increases by 1 when moving right
    if next_direction == "down":
        action = 2
        s_next = s + 3  # state number increases by 3 when moving down
    if next_direction == "left":
        action = 3
        s_next = s - 1  # state number decreases by 1 when moving left
    return [action, s_next]

Define the goal-reaching function that also records the actions

def goal_maze_ret_s_a(pi):  # move continuously according to the defined policy
    s = 0  # start position
    s_a_history = [[0, np.nan]]  # list recording the agent's states and actions
    while (1):  # loop until the agent reaches the goal
        [action, next_s] = get_action_and_next_s(pi, s)
        s_a_history[-1][1] = action  # fill in the action taken in the current (last recorded) state

        s_a_history.append([next_s, np.nan])
        # append the next state; its action is not known yet, so use nan
        if next_s == 8:
            break
        else:
            s = next_s
    return s_a_history
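Called with the initial policy, a run might look like the following; the last entry keeps np.nan because no action is taken at the goal. A small usage sketch (assuming the functions defined above):

pi_0 = softmax_convert_into_pi_from_theta(theta_0)
s_a_history = goal_maze_ret_s_a(pi_0)
print(s_a_history[:3])             # e.g. [[0, 2], [3, 0], [0, 2]] -- (state, action) pairs
print(len(s_a_history) - 1, "steps")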

Compared with the previous version, the new function also records the agent's actions; now we really have the prototype of an agent.

But as it stands, this reinforcement learning is still lifeless.
We have to make the θ data in the policy change. Because this is policy iteration, shorter paths found in each run should be preferred; on that basis we keep shortening the path through training until the shortest path is found.

Update the policy according to the policy gradient method

  • New parameters = old parameters + learning rate × parameter increment:

    θ_new(s_i, a_j) = θ_old(s_i, a_j) + η · Δθ(s_i, a_j), where
    Δθ(s_i, a_j) = { N(s_i, a_j) − P(s_i, a_j) · N(s_i, a) } / T

θ(s_i, a_j) is the parameter that determines the probability of taking action a_j in state s_i. η is the learning rate and controls how much θ(s_i, a_j) changes in a single update; if η is too small, learning is slow, and if η is too large, learning does not proceed normally.
N(s_i, a_j) is the number of times action a_j was taken in state s_i during the run, P(s_i, a_j) is the probability of taking a_j in s_i under the current policy, N(s_i, a) is the total number of actions taken in s_i, and T is the total number of steps taken to reach the goal.
Look closely at how the update behaves. If the agent starts at S0 and moves down once, then N(S0, down) = 1, P(S0, down) = 0.5 and N(S0, a) = 1, so Δθ = (1 − 0.5·1)/T = 0.5/T, a small positive increment.
If S0 is visited twice and the agent goes down once and right once, then for the "down" entry N(S0, down) = 1, P(S0, down) = 0.5 and N(S0, a) = 2, so Δθ = (1 − 0.5·2)/T = 0 and θ does not change.
Similarly, if S3 (three possible moves: up, right, down, each with probability 1/3) is visited three times and each move is taken once, Δθ = (1 − (1/3)·3)/T = 0 and θ is effectively unchanged.

If the agent wanders back and forth, e.g. S0 → S3 → S0 → S3, then "down" at S0 is taken twice: N(S0, down) = 2, P(S0, down) = 0.5 and N(S0, a) = 2, giving a numerator of 2 − 0.5·2 = 1; but such wandering also makes T larger, which shrinks every increment.
In short, actions taken more often than the current policy predicts receive a positive increment, and dividing by T means that short, successful episodes change θ more than long, wandering ones; once each state is left in essentially one direction and T stops shrinking, θ hardly changes any more.
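As a concrete check of the formula, here is a minimal numeric sketch (my own example, not from the original article) for the ideal 4-step episode S0 → S3 → S4 → S7 → S8 under the initial uniform policy:

# Minimal numeric sketch of the update rule for the 4-step episode
# S0 -> S3 -> S4 -> S7 -> S8 (actions: down, right, down, right),
# assuming the initial uniform policy where S0 allows only right/down.
T = 4                # total steps in the episode
N_s0_down = 1        # "down" was taken once in S0
P_s0_down = 0.5      # current policy: right and down each with probability 0.5
N_s0_total = 1       # S0 was visited once

delta = (N_s0_down - P_s0_down * N_s0_total) / T
print(delta)         # 0.125 -> theta for "down" at S0 gets a clear positive push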

code implementation

def update_theta(theta, pi, s_a_history):
    eta = 0.1  # learning rate
    T = len(s_a_history) - 1  # total number of steps taken to reach the goal
    [m, n] = theta.shape
    delta_theta = theta.copy()

    for i in range(0, m):
        for j in range(0, n):
            if not (np.isnan(theta[i, j])):
                SA_i = [SA for SA in s_a_history if SA[0] == i]
                # all visits to state i
                SA_ij = [SA for SA in s_a_history if SA == [i, j]]
                # visits to state i in which action j was taken
                N_i = len(SA_i)    # total number of actions taken in state i
                N_ij = len(SA_ij)  # number of times action j was taken in state i
                delta_theta[i, j] = (N_ij - pi[i, j] * N_i) / T
    new_theta = theta + eta * delta_theta
    return new_theta

Then we repeatedly search the maze and update the parameter θ until the agent can walk straight through and solve the maze.
Critical code:

stop_epsilon = 10 ** -4  # stop learning once the policy changes by less than 10^-4
theta = theta_0
pi = pi_0
is_continue = True
count = 1
while is_continue:
    s_a_history = goal_maze_ret_s_a(pi)  # search the maze under the policy pi
    new_theta = update_theta(theta, pi, s_a_history)  # update the parameters
    new_pi = softmax_convert_into_pi_from_theta(new_theta)  # update the policy

    print(np.sum(np.abs(new_pi - pi)))  # print the change in the policy
    print("Steps to solve the maze: " + str(len(s_a_history) - 1))
    if np.sum(np.abs(new_pi - pi)) < stop_epsilon:
        is_continue = False
    else:
        theta = new_theta
        pi = new_pi
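After the loop converges, a quick way to see what has been learned is to look at the most probable move in each state, or to run one final episode. A small sketch (assuming new_pi and the functions above; the exact output depends on the random episodes):

# Inspect the learned policy (assumes new_pi, goal_maze_ret_s_a and numpy from above).
directions = ["up", "right", "down", "left"]
for s in range(8):
    print("S" + str(s), "->", directions[int(np.argmax(new_pi[s, :]))])

# One more episode under the converged policy should now take about 4 steps,
# following S0 -> S3 -> S4 -> S7 -> S8.
final_history = goal_maze_ret_s_a(new_pi)
print("steps:", len(final_history) - 1)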

The final result is really satisfying! Note that the parameters and the policy must both be updated inside the loop: once the first policy update is made, the following rounds are continuous trial and error, and if moving in some direction leads to a large T, that direction's probability is reduced in later rounds. The stopping threshold is there to detect when θ is essentially unchanged: when a state is visited repeatedly, the numerator of its increment grows but the denominator T grows as well, while for a state that is passed through only once the increment shrinks together with T. So why does the difference between successive policies become small? The policy update follows from the parameter update, and the parameter update pushes each state towards being left in a single direction: at the start a state may be left in three different directions, after some updates in two, and the policies still differ noticeably. Through continued trial and error T settles down, the increments to θ keep shrinking, and a single shortest path is determined.

0x04 simple summary

The policy iteration method really shows the charm of reinforcement learning. The algorithm first defines the agent's states and actions through a policy, and uses the parameter θ as the control quantity that also encodes the probability of the agent's next move. Policy iteration then keeps updating the policy by updating the parameters, where a parameter's change is determined by the difference between the number of times a direction was actually taken and the number the current policy would predict, divided by the number of steps recorded in the last run. There is also an implicit condition: once a state is left in only one direction and the policy already predicts that behaviour, the parameter change is almost 0. The action probabilities of each state shift after every parameter update, but towards the end of the fitting T approaches a constant and the increments to θ keep shrinking, so it is enough to set a threshold on the change of θ: once θ is essentially unchanged, the shortest path can be considered found.

0x05 gitee reference

URL: https://gitee.com/arg1nt/RL.git