Using the Monte Carlo method to find the optimal solution of the Blackjack (21-point) problem (including Python source code)



1、 Purpose of the experiment

Implement an optimal policy for the Blackjack problem based on the Monte Carlo method, understand the basic principles of reinforcement learning, understand the Monte Carlo method, and write the corresponding code.

2、 Experiment content

The goal of Blackjack, a popular casino card game, is to obtain cards whose sum is as large as possible without exceeding 21. All face cards count as 10, and an ace can count as either 1 or 11. Our experiment only considers the version in which each player competes with the dealer independently. At the beginning of the game, both the dealer and the player have two cards; one of the dealer's cards is face up and the other is face down. If the player's two cards total 21 (an ace and a 10-card), it is called a natural. He wins, unless the dealer also has a natural, in which case the game is a draw. If the player does not have a natural, he can request extra cards (hits) until he either stops (sticks) or exceeds 21 (goes bust). If he goes bust, he loses. If he sticks, it is the dealer's turn. The dealer hits or sticks according to a fixed rule: he sticks once the sum of his cards is 17 or more, and loses if he goes bust. Win, loss, or draw is decided by whose final sum is closer to 21.
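The scoring rule above (face cards count as 10, an ace counts as 1 or 11) can be sketched as a small helper. This is only an illustration of the rule, not part of the experiment's code; the function name `hand_value` is our own:

```python
def hand_value(cards):
    """Return (best_total, usable_ace) for a list of card values.

    Cards are given as 1 (ace) through 10 (a 10 or any face card).
    One ace counts as 11 ("usable") whenever that does not bust the hand.
    """
    total = sum(cards)
    # Promote one ace from 1 to 11 if the hand stays at or below 21
    if 1 in cards and total + 10 <= 21:
        return total + 10, True
    return total, False

print(hand_value([1, 10]))    # a natural: (21, True)
print(hand_value([1, 5, 9]))  # the ace must count as 1: (15, False)
```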

3、 Experimental process

This experiment needs to import the following packages:

import gym
import numpy as np
from collections import defaultdict
import matplotlib
import matplotlib.pyplot as plt

Next we use the Blackjack environment provided by gym:

env = gym.make('Blackjack-v0')
observation = env.reset()
print(env.action_space, env.observation_space, sep='\n')

The observation is a tuple: the player's current sum ∈ {0, 1, ..., 31}, the dealer's face-up card ∈ {1, ..., 10}, and whether the player holds a usable ace (no = 0, yes = 1). The agent can perform two actions: stick = 0 and hit = 1.

This experiment uses on-policy first-visit MC control. The on-policy method removes, to a certain extent, the exploring-starts assumption, so that the policy is both greedy and exploratory, and the final policy is also optimal to a certain extent. As shown in the figures below:

Figure 1 On-policy first-visit MC control

Figure 2 Tree diagram

We define a nested function that returns an ε-greedy policy:

def make_epsilon_greedy_policy(Q_table, nA, epsilon):
    def generate_policy(observation):
        # Every action gets a base probability of epsilon / nA ...
        prob_A = np.ones(nA) * epsilon / nA
        # ... and the greedy action receives the remaining 1 - epsilon on top
        optimal_a = np.argmax(Q_table[observation])
        prob_A[optimal_a] += (1.0 - epsilon)
        return prob_A

    return generate_policy
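A quick sanity check of the returned policy on a toy Q-table (the state key `(13, 2, False)` and its values are made up for illustration):

```python
import numpy as np
from collections import defaultdict

def make_epsilon_greedy_policy(Q_table, nA, epsilon):
    def generate_policy(observation):
        prob_A = np.ones(nA) * epsilon / nA
        optimal_a = np.argmax(Q_table[observation])
        prob_A[optimal_a] += (1.0 - epsilon)
        return prob_A
    return generate_policy

Q = defaultdict(lambda: np.zeros(2))
Q[(13, 2, False)] = np.array([0.1, 0.5])  # hit (action 1) looks better here
policy = make_epsilon_greedy_policy(Q, nA=2, epsilon=0.1)
probs = policy((13, 2, False))
# The greedy action gets 1 - epsilon + epsilon/nA = 0.95,
# the other gets epsilon/nA = 0.05, and the probabilities sum to 1.
print(probs)
```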

The MC algorithm works episode by episode, so we need to generate one episode of data according to the policy.

Note here: generate_policy is the return value of make_epsilon_greedy_policy, and the value returned by generate_policy is the policy π. The loop of up to 1000 steps is only there to make sure a complete episode is generated.
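The episode generator itself is not shown above; below is a minimal sketch of what it could look like, assuming the classic gym step API (`step` returns `(observation, reward, done, info)`). The `FixedEnv` stub exists only so the sketch runs without gym installed:

```python
import numpy as np

def generate_one_episode(env, generate_policy):
    """Roll out one episode, returning a list of (state, action, reward)."""
    trajectory = []
    state = env.reset()
    for _ in range(1000):  # upper bound; a Blackjack episode ends long before this
        probs = generate_policy(state)
        action = int(np.random.choice(len(probs), p=probs))  # sample a ~ pi(.|s)
        next_state, reward, done, _ = env.step(action)
        trajectory.append((state, action, reward))
        if done:
            break
        state = next_state
    return trajectory

# Tiny stand-in environment: the episode ends after one action with reward +1.
class FixedEnv:
    def reset(self):
        return (20, 10, False)
    def step(self, action):
        return (20, 10, False), 1.0, True, {}

episode = generate_one_episode(FixedEnv(), lambda s: np.array([1.0, 0.0]))
print(episode)  # [((20, 10, False), 0, 1.0)]
```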

Next is the main part of MC control. We need to loop enough times for the value function to converge. Each iteration first generates an episode (a sample sequence) according to the policy, then traverses each (state, action) pair, using the average of all first-visit returns as the estimate of its action value.

Note here: Return and Count are dictionaries whose keys are (state, action) pairs; each key's value accumulates the returns of that pair over episodes. As more and more iterations accumulate, by the law of large numbers the average converges to the expected value. And when the next iteration generates another episode, generate_policy acts according to the updated Q_table.
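The first-visit return described above can be checked by hand on a tiny made-up trajectory (the states `"s1"`, `"s2"` and the rewards are arbitrary):

```python
# One episode as a list of (state, action, reward) triples
trajectory = [("s1", 1, 0.0), ("s2", 1, 0.0), ("s2", 0, 1.0)]
discount_factor = 0.9

state, action = "s2", 1
# Index of the first visit of this (state, action) pair in the episode
first_visit_id = next(i for i, x in enumerate(trajectory)
                      if x[0] == state and x[1] == action)
# Discounted sum of rewards from that first visit onwards:
# G = 0.0 * 0.9^0 + 1.0 * 0.9^1 = 0.9
G = sum(x[2] * (discount_factor ** t)
        for t, x in enumerate(trajectory[first_visit_id:]))
print(first_visit_id, G)  # 1 0.9
```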

def MC_control(env, iteration_times=500000, epsilon=0.1, discount_factor=1.0):
    Return, Count = defaultdict(float), defaultdict(float)
    Q_table = defaultdict(lambda: np.zeros(env.action_space.n))
    policy = make_epsilon_greedy_policy(Q_table, env.action_space.n, epsilon)
    for i in range(iteration_times):
        if i % 1000 == 0:
            print(str(i) + " times")

        trajectory = generate_one_episode(env, policy)
        # Deduplicate (state, action) pairs; each is evaluated from its first visit
        s_a_pairs = set([(x[0], x[1]) for x in trajectory])
        for state, action in s_a_pairs:
            s_a = (state, action)
            first_visit_id = next(idx for idx, x in enumerate(trajectory)
                                  if x[0] == state and x[1] == action)
            # Discounted return from the first visit onwards
            G = sum([x[2] * (discount_factor ** t)
                     for t, x in enumerate(trajectory[first_visit_id:])])
            Return[s_a] += G
            Count[s_a] += 1.
            # Running average of the first-visit returns
            Q_table[state][action] = Return[s_a] / Count[s_a]
    return policy, Q_table
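Once Q_table has converged, the final deterministic policy is just the argmax over actions in each state. For instance (with made-up action values):

```python
import numpy as np
from collections import defaultdict

Q_table = defaultdict(lambda: np.zeros(2))
Q_table[(20, 10, False)] = np.array([0.55, -0.85])  # sticking on 20 is clearly better
Q_table[(13, 2, False)] = np.array([-0.29, -0.25])  # hitting is slightly better

# Greedy policy: pick the action with the largest estimated value
greedy_policy = {state: int(np.argmax(q)) for state, q in Q_table.items()}
print(greedy_policy[(20, 10, False)])  # 0 = stick
print(greedy_policy[(13, 2, False)])   # 1 = hit
```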

The next step is to visualize the value function

def plot_value_function(Q_table):
    x = np.arange(12, 22)  # player sum 12..21 (arange excludes the end point)
    y = np.arange(1, 11)   # dealer's showing card 1..10
    X, Y = np.meshgrid(x, y)
    # State value = best action value in that state
    Z_noace = np.apply_along_axis(lambda s: np.max(Q_table[(s[0], s[1], False)]), 2, np.dstack([X, Y]))
    Z_ace = np.apply_along_axis(lambda s: np.max(Q_table[(s[0], s[1], True)]), 2, np.dstack([X, Y]))

    def plot_surface(X, Y, Z, title):
        ...  # body omitted in the original ("the code is too long")
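The omitted plotting helper can be filled in along these lines. This is a sketch using matplotlib's 3D toolkit; the axis labels and figure size are our guesses at what the original produced:

```python
import numpy as np
import matplotlib
matplotlib.use("Agg")  # render off-screen, no display needed
import matplotlib.pyplot as plt

def plot_surface(X, Y, Z, title):
    fig = plt.figure(figsize=(10, 5))
    ax = fig.add_subplot(111, projection="3d")
    ax.plot_surface(X, Y, Z, cmap="coolwarm")
    ax.set_xlabel("Player sum")
    ax.set_ylabel("Dealer showing")
    ax.set_zlabel("Value")
    ax.set_title(title)
    return fig

# Smoke test with a dummy flat surface over the state grid
X, Y = np.meshgrid(np.arange(12, 22), np.arange(1, 11))
fig = plot_surface(X, Y, np.zeros_like(X, dtype=float), "dummy")
fig.savefig("value_function.png")
```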

This concludes the experiment.

4、 Experimental results

Running the above code (in the code file RL) produces the following output figures:

Figure 3 Value function without a usable ace

Figure 4 Value function with a usable ace