ML-Agents Project Practice (I)

Date: 2021-09-19

This article was first published on: Walker AI

Reinforcement learning is a class of problems in machine learning and artificial intelligence: it studies how to achieve a specific goal through a sequence of decisions. It is also a family of algorithms that lets a computer start out knowing nothing about a task, try actions repeatedly, learn from its mistakes, and eventually discover the regularities that let it reach the goal. That is a complete reinforcement learning process. The figure below gives a more intuitive picture.

[Figure: the agent-environment loop: the agent outputs an action, the environment returns an observation and a reward]

The agent is our algorithm, which plays the role of the player in the game. The agent uses its policy to output an action that acts on the environment; the environment then returns the resulting state, i.e. the observation and reward shown in the figure. When the environment returns the reward and updates its state, the agent receives a new observation, and the cycle repeats.

1. ML-Agents

1.1 Introduction

Unity games are numerous, the engine is mature, and it provides a good training environment. Because Unity is cross-platform, a model trained on Windows or Linux can be converted to WebGL and published on a web page. ML-Agents is an open-source Unity plug-in that lets developers train agents inside a Unity environment, even without writing Python code or deeply understanding algorithms such as PPO and SAC. As long as developers configure the parameters, they can easily use reinforcement learning to train their own models.

If you are interested in the algorithms themselves, click here to learn more about PPO and SAC.


1.2 Installing Anaconda, TensorFlow, and TensorBoard

The ML-Agents setup described in this article communicates with TensorFlow through Python. During training, observations, actions, rewards, and done flags are collected on the Unity side of ML-Agents and passed to TensorFlow for training, and the model's decisions are then passed back to Unity. Therefore, before installing ML-Agents you need to install TensorFlow by following the link below.

TensorBoard makes it easy to visualize the training data and check whether the model behaves as expected.

Click here for the installation details.

1.3 ML-Agents installation steps

(1) Download ML-Agents from the ml-agents repository on GitHub (Release 6 is used in this example).


(2) Unzip the package, copy the com.unity.ml-agents and com.unity.ml-agents.extensions folders into the Packages directory of your Unity project (create the directory if it does not exist), and add both packages to manifest.json.


(3) After the packages are imported into the project, create a new script and add the following references to verify that the installation succeeded:

using Unity.MLAgents;
using Unity.MLAgents.Sensors;
using Unity.MLAgents.Policies;

public class MyAgent : Agent

{

}

2. ML-Agents training example

2.1 Overview and project


The environment is usually described as a Markov process. The agent follows some policy to generate an action, interacts with the environment, and receives a reward; it then adjusts and optimizes its current policy according to that reward.

The project in this example follows match-3 (Xiao Xiao Le) rules: points are scored by lining up three squares of the same color. The extra rewards for matching four or more in a row are removed to simplify the environment design.

[Figure: the match-3 game board used in this example]

Click here to download the example project.

For how to build and export the Unity project, please refer to the official documentation.

The sections below share the practical approach from four angles: separating the AI interfaces, choosing the algorithm, designing the environment, and tuning the parameters.

2.2 Separating the game framework's AI interfaces

Separate the interfaces needed for the project's observations and actions from the rest of the game. They are used to read the current state of the game and to execute actions in the game.

static List<ML_Unit> states = new List<ML_Unit>();

public class ML_Unit
{
    public int color = (int)CodeColor.ColorType.MaxNum;
    public int widthIndex = -1;
    public int heightIndex = -1;
}
//Read the information of every square from the current board: the x position (width index), the y position (height index), and the color. The origin of the coordinate axes is the top-left corner
public static List<ML_Unit> GetStates()
{
    states.Clear();
    var xx = GameMgr.Instance.GetGameStates();
    for(int i = 0; i < num_widthMax;i++)
    {
        for(int j = 0; j < num_heightMax; j++)
        {
            ML_Unit tempUnit = new ML_Unit();
            try
            {
                tempUnit.color = (int)xx[i, j].getColorComponent.getColor;
            }
            catch
            {
                Debug.LogError($"GetStates i:{i} j:{j}");
            }
            tempUnit.widthIndex = xx[i, j].X;
            tempUnit.heightIndex = xx[i, j].Y;
            states.Add(tempUnit);
        }
    }
    return states;
}

public enum MoveDir
{
    up,
    right,
    down,
    left,
}

public static bool CheckMoveValid(int widthIndex, int heightIndex, int dir)
{
    var valid = true;
    if (widthIndex == 0 && dir == (int)MoveDir.left)
    {
        valid = false;
    }
    if (widthIndex == num_widthMax - 1 && dir == (int)MoveDir.right)
    {
        valid = false;
    }

    if (heightIndex == 0 && dir == (int)MoveDir.up)
    {
        valid = false;
    }

    if (heightIndex == num_heightMax - 1 && dir == (int)MoveDir.down)
    {
        valid = false;
    }
    return valid;
}

//The interface for executing an action: it calls the game logic to move a square from the given position in the given direction. widthIndex 0-13, heightIndex 0-6, dir 0-3 (0 up, 1 right, 2 down, 3 left)
public static void SetAction(int widthIndex, int heightIndex, int dir, bool immediately)
{
    if (CheckMoveValid(widthIndex, heightIndex, dir))
    {
        GameMgr.Instance.ExcuteAction(widthIndex, heightIndex, dir, immediately);
    }
}
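A short usage sketch shows how these interfaces fit together. This is not code from the project: the indices are arbitrary, and it assumes the project's constants and GameMgr are in scope.

//Usage sketch (hypothetical call site): read the board, validate a move, then execute it
var units = GetStates();                        //14 * 7 entries, each with position and color
if (CheckMoveValid(0, 0, (int)MoveDir.right))   //can the top-left square move to the right?
{
    SetAction(0, 0, (int)MoveDir.right, false); //same "immediately" flag as used during training below
}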

2.3 Game AI algorithm selection

When starting a reinforcement learning project, choosing a suitable algorithm from the many available ones gets twice the result with half the effort. If you are not familiar with the characteristics of each algorithm, you can simply use the PPO and SAC implementations provided by ML-Agents.

In this example the author first used the PPO algorithm and tried many adjustments. On average it took about 9 moves to make one correct move, which is a fairly poor result.

Later, the game environment was analyzed more carefully. Because the project is a match-3 game, the board is completely different every time, the result of one step has little influence on the next, and the Markov-chain property is weak. PPO is an on-policy, policy-based algorithm; each policy update is very conservative, which makes it hard for the result to converge (the author tried XX steps and it still did not converge).

DQN, by contrast, is an off-policy, value-based algorithm: it can collect a large amount of experience from the environment, store it, and gradually learn which action has the maximum value in each state.

In short, PPO learns on-policy: it runs a few hundred steps, goes back to learn which of those steps were right and which were wrong, updates the policy, then runs a few hundred steps again. Learning this way is slow, and it is difficult to find the global optimum.

DQN learns off-policy: it can first run a very large number of steps, then go back and learn from all of that stored experience at once, which makes it much easier to find the global optimum.

(This example uses PPO for the demonstration. A later article will share how to plug an external algorithm into ML-Agents, using the external tool stable_baselines3 to train with DQN.)

2.4 Game AI environment design

Once the algorithm framework is fixed, how observation, action, and reward are designed becomes the decisive factor for the training result. In this game the environment has two main variables: the position of each square and its color.

–Observation:

As the figure above shows, the board in this example is 14 squares wide and 7 squares high, with 6 colors.

ML-Agents uses swish as its activation function, so small floating-point values (roughly -10f to 10f) can be fed in directly. However, to give the agent a cleaner environment and a better training result, it is still worth encoding the environment.

In this example the author one-hot encodes the environment, with the coordinate origin in the upper-left corner. The cyan square in the upper-left corner is then encoded as width [1,0,0,0,0,0,0,0,0,0,0,0,0,0] (14 values), height [1,0,0,0,0,0,0] (7 values), and, with the colors taken in the fixed enumeration order (yellow, green, purple, pink, blue, red), color [0,0,0,0,1,0] (6 values).

The observation therefore contains (14 + 7 + 6) * 14 * 7 = 2646 values in total.
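As a quick sanity check, the expected observation size can be computed and compared with the vector observation size configured for the agent. This is a sketch with assumed constant names; the project itself uses num_widthMax, num_heightMax and CodeColor.ColorType.MaxNum.

public static class ObservationSizeCheck
{
    //Sanity-check sketch: per-square one-hot sizes times the number of squares
    public const int num_widthMax = 14;
    public const int num_heightMax = 7;
    public const int num_colorMax = 6;   //hypothetical constant; the project uses (int)CodeColor.ColorType.MaxNum

    public static int Expected()
    {
        //(14 + 7 + 6) * 14 * 7 = 2646, which must match the configured vector observation size
        return (num_widthMax + num_heightMax + num_colorMax) * num_widthMax * num_heightMax;
    }
}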

Code example:

public class MyAgent : Agent
{
    static List<ML_Unit> states = new List<ML_Unit>();
    public class ML_Unit
    {
        public int color = (int)CodeColor.ColorType.MaxNum;
        public int widthIndex = -1;
        public int heightIndex = -1;
    }

    public static List<ML_Unit> GetStates()
    {
        states.Clear();
        var xx = GameMgr.Instance.GetGameStates();
        for(int i = 0; i < num_widthMax;i++)
        {
            for(int j = 0; j < num_heightMax; j++)
            {
                ML_Unit tempUnit = new ML_Unit();
                try
                {
                    tempUnit.color = (int)xx[i, j].getColorComponent.getColor;
                }
                catch
                {
                    Debug.LogError($"GetStates i:{i} j:{j}");
                }
                tempUnit.widthIndex = xx[i, j].X;
                tempUnit.heightIndex = xx[i, j].Y;
                states.Add(tempUnit);
            }
        }
        return states;
    }

    List<ML_Unit> curStates = new List<ML_Unit>();
    public override void CollectObservations(VectorSensor sensor)
    {
        //Make sure the squares have finished moving and the settlement has finished before collecting observations
        var receiveReward = GameMgr.Instance.CanGetState();
        var codeMoveOver = GameMgr.Instance.IsCodeMoveOver();
        if (!codeMoveOver || !receiveReward)
        {
            return;
        }

        //Get the status information of the environment
        curStates = MlagentsMgr.GetStates();
        for (int i = 0; i < curStates.Count; i++)
        {
            sensor.AddOneHotObservation(curStates[i].widthIndex, MlagentsMgr.num_widthMax);
            sensor.AddOneHotObservation(curStates[i].heightIndex, MlagentsMgr.num_heightMax);
            sensor.AddOneHotObservation(curStates[i].color, (int)CodeColor.ColorType.MaxNum);
        }
    }
}

–Action:

Each square can move up, down, left, or right. The minimum information we need to cover is 14 * 7 squares, each of which can move in 4 directions; in this example the directions are enumerated as (up, right, down, left).

With the zero point in the upper-left corner, the cyan square there occupies the first four actions: (move the upper-left cyan square up, move it right, move it down, move it left).

The action space therefore contains 14 * 7 * 4 = 392 discrete actions in total.

Careful readers may notice that the cyan square in the upper-left corner cannot move up or left. For such cases we need an action mask that blocks the moves the rules forbid; the indexing it relies on is sketched below, followed by the full code example.
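Before looking at the code, it helps to pin down the action indexing. The small sketch below maps a (width, height, direction) triple to a discrete action id; ComposeAction is our illustrative name, not a project method. DecomposeAction and the action mask in the code example must use the same formula.

//Sketch: compose a discrete action id from board position and direction
//widthIndex 0-13, heightIndex 0-6, dir 0-3; this is the exact inverse of DecomposeAction below
public static int ComposeAction(int widthIndex, int heightIndex, int dir)
{
    const int num_heightMax = 7;
    const int num_dirMax = 4;
    return widthIndex * (num_heightMax * num_dirMax) + heightIndex * num_dirMax + dir;
}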

Code example:

public class MyAgent : Agent
{
    public enum MoveDir
    {
        up,
        right,
        down,
        left,
    }


    public void DecomposeAction(int actionId,out int width,out int height,out int dir)
    {
        width = actionId / (num_heightMax * num_dirMax);
        height = actionId % (num_heightMax * num_dirMax) / num_dirMax;
        dir = actionId % (num_heightMax * num_dirMax) % num_dirMax;
    }

    //Execute the action and get the reward for the action
    public override void OnActionReceived(float[] vectorAction)
    {
        //Make sure the squares have finished moving and the settlement has finished
        var receiveReward = GameMgr.Instance.CanGetState();
        var codeMoveOver = GameMgr.Instance.IsCodeMoveOver();
        if (!codeMoveOver || !receiveReward)
        {
            Debug.LogError($"OnActionReceived CanGetState = {GameMgr.Instance.CanGetState()}");
            return;
        }

        if (invalidNums.Contains((int)vectorAction[0]))
        {
            //This is the settlement callback where the reward is normally granted; here it grants a punishment instead, because the action was masked (during training every action may still be sampled; outside training this branch is never entered)
            GameMgr.Instance.OnGirdChangeOver?.Invoke(true, -5, false, false);
        }
        DecomposeAction((int)vectorAction[0], out int widthIndex, out int heightIndex, out int dirIndex);
        //Execute the action: move the chosen square in the chosen direction. Once it finishes, the reward is granted and the scene is reset if necessary
        MlagentsMgr.SetAction(widthIndex, heightIndex, dirIndex, false);
    }

    //Called after MlagentsMgr.SetAction has finished executing the action
    public void RewardShape(int score)
    {
        //Calculate the reward obtained
        var reward = (float)score * rewardScaler;
        AddReward(reward);
        //Add the data into tensorboard for statistical analysis
        Mlstatistics.AddCumulativeReward(StatisticsType.action, reward);
        //Add a small penalty on every step to improve exploration efficiency
        var punish = -1f / MaxStep * punishScaler;
        AddReward(punish);
        //Add the data into tensorboard for statistical analysis
        Mlstatistics.AddCumulativeReward( StatisticsType.punishment, punish);
    }

    //Set the action mask: squares on the border cannot be moved off the board
    public override void CollectDiscreteActionMasks(DiscreteActionMasker actionMasker)
    {
        // Mask the moves that would push a square off the edge of the board.
        // The index formula must match DecomposeAction: id = width * (num_heightMax * num_dirMax) + height * num_dirMax + dir
        checkinfo.Clear();
        invalidNums.Clear();
        int invalidNumber = -1;
        for (int i = 0; i < MlagentsMgr.num_widthMax; i++)
        {
            for (int j = 0; j < MlagentsMgr.num_heightMax; j++)
            {
                if (i == 0)
                {
                    invalidNumber = i * (num_heightMax * num_dirMax) + j * num_dirMax + (int)MoveDir.left;
                    invalidNums.Add(invalidNumber);
                    actionMasker.SetMask(0, new[] { invalidNumber });
                }
                if (i == num_widthMax - 1)
                {
                    invalidNumber = i * (num_heightMax * num_dirMax) + j * num_dirMax + (int)MoveDir.right;
                    invalidNums.Add(invalidNumber);
                    actionMasker.SetMask(0, new[] { invalidNumber });
                }

                if (j == 0)
                {
                    invalidNumber = i * (num_heightMax * num_dirMax) + j * num_dirMax + (int)MoveDir.up;
                    invalidNums.Add(invalidNumber);
                    actionMasker.SetMask(0, new[] { invalidNumber });
                }

                if (j == num_heightMax - 1)
                {
                    invalidNumber = i * (num_heightMax * num_dirMax) + j * num_dirMax + (int)MoveDir.down;
                    invalidNums.Add(invalidNumber);
                    actionMasker.SetMask(0, new[] { invalidNumber });
                }
            }
        }
    }
}
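Because the mask indices and DecomposeAction must share the same formula, a small editor-only check is worth running once. This is a sketch rather than project code (ValidateActionIndexing is our name), assuming the constants used above.

public class MyAgent : Agent
{
    //Sketch: verify that every action id round-trips through the indexing formula
    //shared by DecomposeAction and the action mask
    void ValidateActionIndexing()
    {
        for (int w = 0; w < MlagentsMgr.num_widthMax; w++)
        {
            for (int h = 0; h < MlagentsMgr.num_heightMax; h++)
            {
                for (int d = 0; d < num_dirMax; d++)
                {
                    int id = w * (MlagentsMgr.num_heightMax * num_dirMax) + h * num_dirMax + d;
                    DecomposeAction(id, out int w2, out int h2, out int d2);
                    Debug.Assert(w == w2 && h == h2 && d == d2, $"Index mismatch at action {id}");
                }
            }
        }
    }
}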

The elimination logic of the original project uses a large number of coroutines, which introduce long delays. For training, this waiting time needs to be squeezed out.

To avoid touching the main game logic, the fillTime in the coroutines' yield return new WaitForSeconds(fillTime) is simply changed to 0.001f, so that the model receives its reward almost immediately after choosing an action, without heavily modifying the game, as sketched below.
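A sketch of that change; the coroutine below is hypothetical (the real project has its own fill and settlement coroutines), and MlagentsMgr.b_isTrain is used here as the training switch.

//Sketch (hypothetical coroutine): keep the normal delay in the game, shrink it while training
private IEnumerator FillBoard()
{
    //original: yield return new WaitForSeconds(fillTime);
    yield return new WaitForSeconds(MlagentsMgr.b_isTrain ? 0.001f : fillTime);
    //...continue with the fill and settlement logic of the game
}

The decision request itself is driven manually from FixedUpdate instead of the DecisionRequester component: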

public class MyAgent : Agent
{
    private void FixedUpdate()
    {
        var codeMoveOver = GameMgr.Instance.IsCodeMoveOver();
        var receiveReward = GameMgr.Instance.CanGetState();
        if (!codeMoveOver || !receiveReward /*||!MlagentsMgr.b_isTrain*/)
        {        
            return;
        }
        //Because the game's coroutines need time to finish, we must wait until the reward has been generated before requesting a decision, so the DecisionRequester component provided by ML-Agents cannot be used
        RequestDecision();
    }
}
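When an episode ends (for example when MaxStep is reached), the board has to be rebuilt before the next episode. A minimal sketch of that hook, assuming a hypothetical GameMgr.ResetBoard() helper; the real project resets the scene inside its own game logic, as noted in OnActionReceived above.

public class MyAgent : Agent
{
    //Sketch: ML-Agents calls OnEpisodeBegin at the start of each episode,
    //a natural place to rebuild the board. ResetBoard() is a hypothetical helper.
    public override void OnEpisodeBegin()
    {
        GameMgr.Instance.ResetBoard();
    }
}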

2.5 Parameter tuning

After designing the model, we first train a preliminary version to see how far the results are from our design expectations.

First, configure the yaml file to initialize the parameters of the network:

behaviors:
  SanXiaoAgent:
    trainer_type: ppo
    hyperparameters:
      batch_size: 128
      buffer_size: 2048
      learning_rate: 0.0005
      beta: 0.005
      epsilon: 0.2
      lambd: 0.9
      num_epoch: 3
      learning_rate_schedule: linear
    network_settings:
      normalize: false
      hidden_units: 512
      num_layers: 2
      vis_encode_type: simple
      memory: null
    reward_signals:
      extrinsic:
        gamma: 0.99
        strength: 1.0
    init_path: null
    keep_checkpoints: 25
    checkpoint_interval: 100000
    max_steps: 1000000
    time_horizon: 128
    summary_freq: 1000
    threaded: true
    self_play: null
    behavioral_cloning: null
    framework: tensorflow

For the training command, please refer to the official documentation. Release 6 is used in this example, and the command is as follows:

mlagents-learn config/ppo/sanxiao.yaml --env=G:\mylab\ml-agent-buildprojects\sanxiao\windows\display1001display\fangkuaixiaoxiaole --run-id=121001xxl --train --width 800 --height 600 --num-envs 2 --force --initialize-from=121001

After training, open Anaconda, go to the ML-Agents project root, and run tensorboard --logdir=results --port=6006, then open http://PS20190711FUOV:6006/ in a browser to see the training results.

(mlagents) PS G:\mylab\ml-agents-release_6> tensorboard --logdir=results --port=6006
TensorBoard 1.14.0 at http://PS20190711FUOV:6006/ (Press CTRL+C to quit)

The training effect diagram is as follows:

[Figure: TensorBoard training curves]

Move count is the average number of moves needed to eliminate one square. It takes about 9 moves to make one correct move; with the action mask, one square can be eliminated in about 6 moves.

–Reward:

Check the average reward against the curves above to validate the reward design. The author prefers to keep it between 0.5 and 2; if it is much larger or smaller, adjust rewardScaler.

//Called after MlagentsMgr.SetAction has finished executing the action
public void RewardShape(int score)
{
    //Calculate the reward obtained
    var reward = (float)score * rewardScaler;
    AddReward(reward);
    //Add the data into tensorboard for statistical analysis
    Mlstatistics.AddCumulativeReward(StatisticsType.action, reward);
    //Add a small penalty on every step to improve exploration efficiency
    var punish = -1f / MaxStep * punishScaler;
    AddReward(punish);
    //Add the data into tensorboard for statistical analysis
    Mlstatistics.AddCumulativeReward( StatisticsType.punishment, punish);
}
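A worked example with made-up numbers: if one successful match returns score = 100 and rewardScaler = 0.01f, each match contributes a reward of 1.0, which sits inside the preferred 0.5 to 2 range. If the cumulative reward in TensorBoard drifts far outside that range, scale rewardScaler accordingly.

//Worked example (hypothetical numbers): choose rewardScaler so a typical match
//lands in the preferred 0.5 - 2 reward range
static float ExampleReward()
{
    const float rewardScaler = 0.01f;   //tuning knob
    const int score = 100;              //assumed score for one successful match
    return score * rewardScaler;        //1.0f
}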

3. Summary and remarks

At present, the official ML-Agents practice also makes use of imitation learning and expert data when training the network.

The author tried PPO in this example and it works to a certain extent. However, PPO is currently difficult to train on this match-3 game: it struggles to converge and to find the global optimum.

Designing the environment and the reward requires rigorous testing; otherwise it introduces large errors into the results that are hard to track down.

Reinforcement learning algorithms are iterating quickly at present. If there are mistakes above, please point them out so we can improve together.

Due to limited space, we cannot release all the code of the whole project. If you are interested in studying it, you can leave a message below and I will send you the complete project by email.

Later, we will share how to use an external algorithm with ML-Agents, training with the DQN algorithm via the external tool stable_baselines3.