PPO algorithm with custom RL environment made with Unity engine

Dhyey Thumar · Published in Analytics Vidhya · Oct 10, 2020


Using the ML-Agents Python low-level APIs to train RL agents

I am writing this article because there are hardly any resources that explain how to apply an RL algorithm to a custom environment made with the Unity engine. So if you have implemented a custom environment in Unity, used the ML-Agents toolkit for training, and now want to apply PPO (or any other RL algorithm) to your environment without using the ML-Agents inbuilt trainer, then you are at the right place.

There are two additional features provided by ML-Agents:

  • Python low-level APIs, which let us interact directly with the learning environment, so we can implement new reinforcement learning algorithms and test them in our environment.
  • A gym wrapper, which exposes the learning environment as a standard Gym environment so it can be used like any other Gym environment.

Here I will explain how to use the Python APIs to interact with your learning environment using the PPO algorithm (but similar steps can be followed for any other RL algorithm).

Note: If you are looking to implement the PPO algorithm itself, then read this amazing article: Proximal Policy Optimization Tutorial (Part 1/2: Actor-Critic Method).

Step 0: Installation

This step is only applicable if you haven’t installed the ML-Agents toolkit.

$ git clone --branch release_1 https://github.com/Unity-Technologies/ml-agents.git
$ python -m venv myvenv
$ myvenv\Scripts\activate
$ pip install -e ./ml-agents/ml-agents-envs

You only need to install ml-agents-envs from the cloned repo.

Step 1: Load the environment

Python-side communication happens through UnityEnvironment.

file_name is the name of the environment binary (provide the correct path to the binary). If you want to interact with the Editor instead, use file_name=None and press the Play button in the Editor when the message “Start training by pressing the Play button in the Unity Editor” is displayed on the screen.

seed indicates the seed to use when generating random numbers during the training process.

side_channels provides a way to exchange data with the Unity simulation that is not related to the reinforcement learning loop. For example, here we use it to set the screen dimensions and the time scale.
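Below is a minimal sketch of this step, based on the mlagents_envs package installed in Step 0. The binary path and the configuration values are placeholders, so adjust them for your own build.

from mlagents_envs.environment import UnityEnvironment
from mlagents_envs.side_channel.engine_configuration_channel import (
    EngineConfigurationChannel,
)

# Side channel used to configure the engine (screen size, time scale, ...).
channel = EngineConfigurationChannel()

env = UnityEnvironment(
    file_name="envs/YourEnvBinary",  # placeholder path; use None to connect to the Editor
    seed=1,
    side_channels=[channel],
)

# Run the simulation faster than real time and shrink the window (example values).
channel.set_configuration_parameters(width=84, height=84, time_scale=20.0)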

Step 2: Get environment details

reset() sends a signal to reset the environment.

get_behavior_names() returns a list of BehaviorName. For a single-agent environment, there is only one behavior. You can have multiple behaviors, which requires different models (in the case of the PPO algorithm, you need a separate actor-critic model for each behavior).

get_behavior_spec() provides multiple fields (see the sketch after this list):

  • action_size corresponds to the number of actions your agent expects (the action space).
  • observation_shapes returns a list of tuples. Its length depends on how many different methods you use to collect observations, such as the ray-cast method or adding values through a VectorSensor.
  • is_action_continuous() returns a boolean value that depends on how you defined your action space in the behavior parameters while editing the environment.
  • is_action_discrete() is similar to the above method. If you are using a discrete action space, then use discrete_action_branches to get a tuple of action choices. (For example, if action_size = 2 then discrete_action_branches might return (3, 2), which indicates 3 different values for the first branch and 2 for the second branch.)
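Here is a short sketch of this step, assuming env is the UnityEnvironment created in Step 1.

env.reset()

# One behavior name per model; a single-agent environment has exactly one.
behavior_name = env.get_behavior_names()[0]
spec = env.get_behavior_spec(behavior_name)

n_actions = spec.action_size          # size of the action space
obs_shapes = spec.observation_shapes  # list of tuples, one per sensor

if spec.is_action_continuous():
    print("Continuous action space with", n_actions, "actions")
elif spec.is_action_discrete():
    # e.g. (3, 2) -> 3 choices for the first branch, 2 for the second
    print("Discrete action branches:", spec.discrete_action_branches)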

Step 3: Collecting the initial observations

get_steps() returns a tuple of (DecisionSteps, TerminalSteps) for a group of agents. We can also access a specific agent’s DecisionStep using its id, as step_result[0][agent_id], and its TerminalStep as step_result[1][agent_id].

DecisionStep contains the following fields:

  • obs is a list of NumPy arrays holding the observations collected by an agent (you can accumulate them into a single vector).
  • reward corresponds to the rewards collected by the agent since the last step.
  • agent_id is a unique identifier for the corresponding Agent.
  • action_mask is an optional list of one-dimensional boolean arrays. It is only used with a multi-discrete action space type. (If true, the action is not available for the agent during this simulation step.)

TerminalStep is empty until an agent encounters the end-of-episode trigger and contains the following fields (both step types are illustrated in the sketch after this list):

  • obs, reward, agent_id are similar to DecisionStep fields.
  • max_step is a boolean. True if the Agent reached its maximum number of steps during the last simulation step.
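The sketch below ties this step together; it assumes env and behavior_name from the previous snippets.

import numpy as np

decision_steps, terminal_steps = env.get_steps(behavior_name)

# Agents that are waiting for a new action.
for agent_id in decision_steps.agent_id:
    step = decision_steps[agent_id]  # DecisionStep for this agent
    obs_vector = np.concatenate([o.flatten() for o in step.obs])
    reward = step.reward

# Agents whose episode just ended.
for agent_id in terminal_steps.agent_id:
    step = terminal_steps[agent_id]  # TerminalStep for this agent
    final_reward = step.reward
    hit_step_limit = step.max_step   # True if the step limit was reached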

Step 4: Apply actions and step the environment

set_actions() sets the actions for a whole agent group. It requires a 2D NumPy array whose shape is (num_of_agents, n_actions).

set_action_for_agent(agent_group: str, agent_id: int, action: np.array) sets the action for a specific agent in an agent group. agent_group is the name of the group (behavior) the agent belongs to. Here the action is a 1D NumPy array of shape (n_actions,).

step() sends a signal to step the environment. When step() or reset() is called, the Unity simulation will move forward until an agent in the simulation needs input from Python to act.
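A sketch of this step is shown below; the random actions stand in for the output of your PPO policy, and the shapes follow the spec from Step 2.

import numpy as np

n_agents = len(decision_steps)

if spec.is_action_continuous():
    # Shape (num_of_agents, n_actions) with continuous values.
    actions = np.random.uniform(-1.0, 1.0, size=(n_agents, spec.action_size))
else:
    # One integer per discrete branch, sampled within each branch's range.
    branches = spec.discrete_action_branches
    actions = np.column_stack(
        [np.random.randint(0, b, size=n_agents) for b in branches]
    ).astype(np.int32)

env.set_actions(behavior_name, actions)

# Or set the action for a single agent instead of the whole group:
# env.set_action_for_agent(behavior_name, agent_id, actions[0])

env.step()  # advance the simulation until an agent needs a new decision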

Step 5: Exception handling

Unity ML-Agents also provides custom exceptions, which are raised when the running environment generates an error (for example, if you want to interrupt the training process, you can catch these exceptions to safely close the Unity window and save the trained model).
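A minimal sketch of this idea is shown below. The exception classes come from mlagents_envs.exception, while the training loop body and the save_model helper are placeholders for your own PPO code.

from mlagents_envs.exception import (
    UnityCommunicationException,
    UnityEnvironmentException,
)

total_steps = 100_000  # training budget (example value)

try:
    for step in range(total_steps):
        # ... get_steps(), compute actions with your PPO model, set_actions() ...
        env.step()
except (UnityCommunicationException, UnityEnvironmentException, KeyboardInterrupt):
    # The environment crashed or training was interrupted: checkpoint first.
    # save_model(policy)  # hypothetical helper, replace with your own saving code
    pass
finally:
    env.close()  # always close the Unity window cleanly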

Here we come to the end of this article. If you have any doubts, suggestions, or improvements, then please do let me know.

If you want to interact with the environment used in this article, then check out this link. You can also use this simple script, which doesn’t use any algorithm, so you can explore the ML-Agents APIs more easily.

Check out this GitHub repo for the complete implementation of the PPO algorithm on a custom environment made with the Unity engine.
