
RL Training

We follow the pipeline of the Unity ML-Agents plugin. During each iteration, we observe the behavior of the RL agents and adjust the parameters based on those observations. This section introduces the fundamental pipeline of Unity ML-Agents and gives a detailed explanation of each training iteration.

  1. Pipeline
    1. Introduction
    2. Terms
      1. Area
      2. Agent
        1. Actions
        2. Observations
        3. Rewards
      3. Academy
      4. Brain
        1. Player Brain
        2. Learning Brain
      5. Curricula
      6. Training parameters
    3. Training Process
  2. Training Iterations
    1. Introduction
      1. Environment
      2. Crocodile Rewards
      3. Hippo Speed
    2. Iteration 1: Abusive Hippos
      1. Reasoning
      2. Rewards
      3. Time
      4. Curricula
      5. Training Time
      6. Observation
    3. Iteration 2: Selfish Hippos
      1. Reasoning
      2. Rewards
      3. Time
      4. Curricula
      5. Training Time
      6. Observation
    4. Iteration 3: Extreme Rewards and Punishments
      1. Reasoning
      2. Rewards
      3. Time
      4. Curricula
      5. Training Time
      6. Observation
    5. Iteration 4: Same Rewards
      1. Reasoning
      2. Rewards
      3. Time
      4. Curricula
      5. Training Time
      6. Observation
    6. Iteration 5: Well Balanced
      1. Reasoning
      2. Rewards
      3. Time
      4. Curricula
      5. Training Time
      6. Observation

Pipeline

Introduction

We used the Unity ML-Agents plugin to incorporate reinforcement learning into our project. This section presents the different components of the plugin, how we used them to train our agents, and a basic overview of our training process.


Terms

Area

An Area defines the training environment’s initialization, reset behavior, and other environment-related features.

In our game, the environment is the pond that the hippos reside in. At initialization, this class randomly places fish, hippos, and the crocodile within the environment. A list of fish and a list of hippos track the objects present in the environment, so that adding and deleting animals is handled properly.

When the reset condition is triggered, the Area deletes all remaining elements in the environment and randomly places new ones, just as in the initialization process.
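
As a concrete illustration, a minimal Area script along these lines might look as follows. This is a hedged sketch, not our exact project code: the class name, the prefab fields, and the pond bounds in RandomPosition are all hypothetical.

    using System.Collections.Generic;
    using UnityEngine;

    // Sketch of an Area: spawns animals at random positions and
    // respawns everything when the environment resets.
    public class HippoArea : MonoBehaviour
    {
        public GameObject fishPrefab, hippoPrefab, crocodilePrefab;
        public int fishCount = 15, hippoCount = 3;

        private readonly List<GameObject> fish = new List<GameObject>();
        private readonly List<GameObject> hippos = new List<GameObject>();
        private GameObject crocodile;

        void Start() { ResetArea(); }

        public void ResetArea()
        {
            // Delete all remaining elements from the previous episode.
            foreach (var f in fish) Destroy(f);
            foreach (var h in hippos) Destroy(h);
            if (crocodile != null) Destroy(crocodile);
            fish.Clear();
            hippos.Clear();

            // Randomly place new animals, just as at initialization.
            for (int i = 0; i < fishCount; i++)
                fish.Add(Instantiate(fishPrefab, RandomPosition(), Quaternion.identity, transform));
            for (int i = 0; i < hippoCount; i++)
                hippos.Add(Instantiate(hippoPrefab, RandomPosition(), Quaternion.identity, transform));
            crocodile = Instantiate(crocodilePrefab, RandomPosition(), Quaternion.identity, transform);
        }

        private Vector3 RandomPosition()
        {
            // Uniform point within the pond bounds (the 10 x 10 extent is illustrative).
            return transform.position
                + new Vector3(Random.Range(-10f, 10f), 0f, Random.Range(-10f, 10f));
        }
    }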


Agent

An Agent component is attached to a GameObject that is designed to be the RL agent. It defines the agent’s rewards, observations, and actions. An agent is always associated with a brain.

Actions

The actions define what the RL agent is able to do. In our game, the hippos and the crocodile can perform only three actions: advance, turn left, and turn right. This simple action set allows them to explore the space and trigger the different rewards within the environment.
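
For illustration, the action handler inside the hippo’s Agent subclass could look roughly like this. It is a sketch assuming the v0.x AgentAction signature; moveSpeed and turnSpeed are hypothetical fields.

    // Sketch: map the single discrete action branch to movement.
    public override void AgentAction(float[] vectorAction, string textAction)
    {
        switch (Mathf.FloorToInt(vectorAction[0]))
        {
            case 0: // advance
                transform.position += transform.forward * moveSpeed * Time.deltaTime;
                break;
            case 1: // turn left
                transform.Rotate(0f, -turnSpeed * Time.deltaTime, 0f);
                break;
            case 2: // turn right
                transform.Rotate(0f, turnSpeed * Time.deltaTime, 0f);
                break;
        }
    }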

Observations

The observations define what information the agent receives in order to make its decisions. In our game, each hippo has a 180-degree ray perception that recognizes a predefined set of objects: fish, the crocodile, rocks, and borders. The ray perception returns the closest valid object hit, along with its distance and direction relative to the hippo. The hippos also know the relative direction and distance of the crocodile, and whether the crocodile is currently eating a hippo or not.
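
The hand-picked portion of the observation vector could be collected roughly as follows. This sketch assumes the v0.x CollectObservations/AddVectorObs API; the 180-degree ray perception itself comes from the plugin’s ray perception component and is omitted here, and the crocodile reference with its isEatingHippo flag is hypothetical.

    // Sketch: observations beyond the 180-degree ray perception.
    public override void CollectObservations()
    {
        Vector3 toCrocodile = crocodile.transform.position - transform.position;
        // Direction to the crocodile in the hippo's local frame, plus distance.
        AddVectorObs(transform.InverseTransformDirection(toCrocodile.normalized));
        AddVectorObs(toCrocodile.magnitude);
        // Whether the crocodile is currently eating a bitten hippo.
        AddVectorObs(crocodile.isEatingHippo ? 1f : 0f);
    }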

Rewards

The rewards are a crucial decision factor for the agents. We define the encouraged and discouraged actions, then assign specific values to them. Rewards heavily influence the training results, as the agent’s sole goal is to maximize its cumulative reward. In our game, the crocodile is rewarded for biting a hippo and for completely eating one. A hippo is rewarded for eating fish and for helping another bitten hippo. It is punished if it approaches the crocodile while the crocodile is not eating another hippo.
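
For example, the hippo’s rewards could be wired up as in this sketch. The reward values are the ones from Iteration 1 below; the tags, the area reference, and the isEatingHippo flag are hypothetical.

    // Sketch: reward hooks on the hippo agent (values from Iteration 1).
    void OnTriggerEnter(Collider other)
    {
        if (other.CompareTag("fish"))
        {
            AddReward(1f);                     // ate a fish
            area.RemoveFish(other.gameObject); // hypothetical Area helper
        }
        else if (other.CompareTag("crocodileZone") && !crocodile.isEatingHippo)
        {
            AddReward(-0.1f);                  // too close to an idle crocodile
        }
    }

    public void OnBitten()     { AddReward(-1f); }          // bitten by the crocodile
    public void OnEaten()      { AddReward(-1f); Done(); }  // eaten: end this hippo's episode
    public void OnSavedOther() { AddReward(2f); }           // freed a bitten hippo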


Academy

An academy orchestrates the agents and their decision-making processes. It also handles the communication between the editor and the Python API. We drag the agents’ brains (the hippo learning brain and the crocodile learning brain) into the academy, and whenever we want to train a particular brain, we tick the checkbox next to it. The academy also handles setting the curricula’s parameters; in our case, that is the speed of the crocodile.
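
A minimal sketch of such an Academy subclass, assuming the v0.x resetParameters API (the class and field names are illustrative):

    // Sketch: apply the current curriculum lesson's crocodile speed on reset.
    public class HippoAcademy : Academy
    {
        public CrocodileAgent crocodile; // hypothetical agent script

        public override void AcademyReset()
        {
            // resetParameters is filled in from the active curriculum lesson.
            crocodile.speed = resetParameters["crocodile_speed"];
        }
    }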


Brain

A brain controls the actual decisions of the agent according to its observations. Roughly speaking, the agent sets the rules, and the brain acts according to those rules and the information it receives in different situations.

Player Brain

A player brain lets the developer simulate the actions of an agent by mapping them to the keyboard. It permits developers to test the agent script, for example to verify that the rewards are triggered in the correct situations.

Learning Brain

A learning brain is the brain actually used for training. It can also host a pre-trained model to be integrated into the gameplay.


Curricula

A curriculum is like a lesson plan for training. We define a set of rewards, goals, and parameters; once a reward goal is achieved, training moves on to the next lesson, where the parameters and reward goals are usually adjusted to greater difficulty. In our game, the curriculum’s parameter is the speed of the crocodile, which gradually increases during training. We observed that establishing a curriculum improves the training results, as it keeps the hippos from failing too fast and allows them to improve gradually.
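
For reference, a curriculum file in the JSON format that ML-Agents v0.x expects might look like the following, shown here with the thresholds and speeds we used in Iteration 2 (the measure, min_lesson_length, and signal_smoothing settings are illustrative):

    {
        "measure": "reward",
        "thresholds": [-0.1, 0.7, 1.7, 3, 1.7, 2.7, 2.7, 4.5, 5],
        "min_lesson_length": 100,
        "signal_smoothing": true,
        "parameters": {
            "crocodile_speed": [0.1, 0.5, 0.5, 0.5, 0.5, 1.0, 1.0, 1.5, 2, 2]
        }
    }

Nine thresholds define ten lessons, one per crocodile_speed value; the lesson advances whenever the measured reward crosses the next threshold.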


Training parameters

The training parameters are the specific hyperparameter values used for training. Tuning them can greatly improve the training results or shorten the training time, but at the cost of much trial and error.
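
These live in the trainer configuration file. A typical PPO entry for our hippo brain might look like this sketch (the values are illustrative defaults, and the exact keys vary with the ML-Agents version):

    HippoLearningBrain:
        trainer: ppo
        batch_size: 1024
        buffer_size: 10240
        learning_rate: 3.0e-4
        beta: 5.0e-3
        epsilon: 0.2
        hidden_units: 128
        num_layers: 2
        time_horizon: 64
        max_steps: 5.0e5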


Training Process

After defining all of the above components, we duplicate the setup into around eight training areas to speed up the process. We make sure the brain we want to train is checked in the academy, then type a few commands in the terminal to associate the curriculum and run ID we want to use. After making sure that communication is established, we finally click Play in the editor to kick off the training.
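
That terminal step boils down to a single mlagents-learn call along these lines (the config paths and run ID here are illustrative):

    mlagents-learn config/trainer_config.yaml --curriculum=config/curricula/hippo/ --run-id=hippo-iter2 --train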

After the training, a .nn file (the pre-trained model) is generated, which we can drag onto the specified brain to test the results.

For more details on the training process and terms, please consult Unity ML-Agents’ documentation.


Training Iterations

Introduction

The following sections record every major training iteration we went through to train our hippos. Each includes a detailed record of the parameters, our observations, and our reasoning for the next iteration.


Environment

Hippo Number: 3

Crocodile Number: 1

Fish Number: 15


Crocodile Rewards

We are using a pre-trained crocodile agent to train against the hippo agents. Its rewards are the following:

Bite Hippo: 0.1

Eat Hippo: 1


Hippo Speed

Hippo speed: 2


Iteration 1: Abusive Hippos

Reasoning

We want to train the hippos to actively seek out fish but also to save another hippo when it is bitten by the crocodile. Since this is the first iteration, the numbers were set in a more or less arbitrary manner.

Rewards

Eat Fish: 1

Bitten: -1

Save other hippos: 2

Eaten: -1

Enter Crocodile Zone (a bigger collider around the crocodile) when the crocodile is not eating: -0.1

Time

Struggle Time (time between the crocodile biting a hippo and the hippo being eaten): 10s

Crocodile Freeze Time (how long the crocodile cannot move after a hippo saves the bitten one): 2s
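
As a sketch of how these two timers could interact on the crocodile’s side (all names here, such as wasSaved and canMove, are hypothetical; OnBitten and OnEaten are the reward hooks described in the Rewards term above):

    // Sketch: the bitten hippo can be saved during the struggle window;
    // otherwise it is eaten when the window expires.
    IEnumerator BiteRoutine(HippoAgent hippo)
    {
        hippo.OnBitten();                           // bitten punishment applies immediately
        float elapsed = 0f;
        while (elapsed < struggleTime)              // 10 s in this iteration
        {
            if (hippo.wasSaved)
            {
                StartCoroutine(Freeze(freezeTime)); // crocodile frozen for 2 s
                yield break;
            }
            elapsed += Time.deltaTime;
            yield return null;
        }
        hippo.OnEaten();                            // struggle window expired
    }

    IEnumerator Freeze(float seconds)
    {
        canMove = false;
        yield return new WaitForSeconds(seconds);
        canMove = true;
    }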

Curricula

No curriculum. Crocodile speed: 1.5

Training Time

20-30 minutes

Observation

At first, the crocodile was able to eat all the hippos. Towards the end of the training, however, the hippos learned to game the system: they would gather around the crocodile so they could keep saving each other and collecting the big saving reward.
[Video demonstration]


Iteration 2: Selfish Hippos

Reasoning

In the previous training, though interesting, the hippos would game the system rather than behave in a more dynamic way. We suspect that was due to the very high reward for saving other hippos and the relatively small punishment for being bitten. To prevent them from abusing the system and treating the helping mechanic as the best strategy, we tweaked the parameters in the following ways:

  • Decrease the saving reward
  • Increase the bitten punishment
  • Decrease the struggle time (to make saving less easy)
  • Decrease the freeze time
  • Increase the fish reward to give the hippos more incentive to make searching for food their primary behavior

Rewards

Eat Fish: 1.3

Bitten: -1.5

Save other hippos: 1.5

Eaten: -1

Enter Crocodile Zone (a bigger collider around the crocodile) when the crocodile is not eating: -0.1

Time

Struggle Time (time between the crocodile biting a hippo and the hippo being eaten): 7s

Crocodile Freeze Time (how long the crocodile cannot move after a hippo saves the bitten one): 2s

Curricula

"thresholds": [-0.1, 0.7, 1.7, 3, 1.7, 2.7, 2.7, 4.5, 5]
"crocodile_speed": [0.1, 0.5, 0.5, 0.5, 0.5, 1.0, 1.0, 1.5, 2, 2]

Training Time

2.5 hours

Observation

The hippos would run away from the crocodile when it approached, and would seek food in safe areas (farther from the crocodile). At some point, they would hide in a corner of the island to avoid the crocodile. They would still go out for fish, but seemed pretty reluctant. The helping mechanic was rarely triggered.
[Video demonstration]


Iteration 3: Extreme Rewards and Punishments

Reasoning

We want to maximize both the punishment for being bitten and the reward for saving at the same time, to see whether the result differs from the first iteration.

Rewards

Eat Fish: 1.3

Bitten: -2

Save other hippos: 2.5

Eaten: -1

Enter Crocodile Zone (a bigger collider around the crocodile) when the crocodile is not eating: -0.1

Time

Struggle Time (time between the crocodile biting a hippo and the hippo being eaten): 7s

Crocodile Freeze Time (how long the crocodile cannot move after a hippo saves the bitten one): 2s

Curricula

"thresholds": [-0.1, 1.7, 3, -0.1, 1.7, 3, -0.1, 1.7, 3]
"crocodile_speed": [0.5, 0.5, 0.5, 0.5, 1.0, 1.0, 1.0, 1.5, 1.5, 1.5]

Training Time

100 minutes

Observation

As described in the reasoning above, we increased the reward for saving other hippos as well as the punishment for being bitten, to see whether the hippos would tend to help each other more or to hide from the crocodile. The results show that even though the hippos sometimes hide from the crocodile, they tend to gather around it and abuse it even more. Also, because the positions of the animals are randomly generated, when two of the three hippos are eaten by the crocodile at the beginning, the last hippo does not last long.


Iteration 4: Same Rewards

Reasoning

We want to see what decisions the hippos will make when the reward for eating a fish is exactly the same as the reward for saving a hippo.

Rewards

Eat Fish: 1.5

Bitten: -1

Save other hippos: 1.5

Eaten: -1

Enter Crocodile Zone (a bigger collider around the crocodile) when the crocodile is not eating: -0.1

Time

Struggle Time (time between the crocodile biting a hippo and the hippo being eaten): 7s

Crocodile Freeze Time (how long the crocodile cannot move after a hippo saves the bitten one): 2s

Curricula

"thresholds": [-0.1, 1.7, 3, -0.1, 1.7, 3, -0.1, 1.7, 3]
"crocodile_speed": [0.5, 0.5, 0.5, 0.5, 1.0, 1.0, 1.0, 1.5, 1.5, 1.5]

Training Time

45 minutes

Observation

We found out that when the rewards are equal, the hippos no longer help each other. They focus on saving themselves, hiding from the crocodile, and running away from it. This training shows the selfishness of the hippos, and the result is not desirable.
[Video demonstration]


Iteration 5: Well Balanced

Reasoning

Building on iteration 2’s training result: though the hippos would run away from the crocodile, they would rarely help each other out. They also seemed very reluctant to go out for food when the crocodile was around, and tended to hide in corners. We want to tweak the rewards to still encourage the helping mechanic while making the hippos braver when seeking out food. We also trained on the new map, which has more rock obstacles, to see how the hippos perform there.

  • We increased the struggle time and crocodile freeze time slightly to give them enough time to rescue other hippos
  • We increased the reward of helping back to 2 to encourage more helping
  • The punishment of being bitten remains the same

Rewards

Eat Fish: 1.3

Bitten: -1.5

Save other hippos: 2

Eaten: -1

Enter Crocodile Zone (a bigger collider around the crocodile) when the crocodile is not eating: -0.1

Time

Struggle Time (time between the crocodile biting a hippo and the hippo being eaten): 7s

Crocodile Freeze Time (how long the crocodile cannot move after a hippo saves the bitten one): 2s

Curricula

"thresholds": [-0.1, 1.7, 3, -0.1, 1.7, 3, -0.1, 1.7, 3]
"crocodile_speed": [0.1, 0.1, 0.1, 0.1, 0.5, 0.5, 0.5, 1.0, 1.0, 1.0]

Training Time

1 hour 40 minutes

Observation

The hippos have very interesting behaviors in different situations. They are pretty good at finding and collecting fish. They also run away from the crocodile when it chases them and hide behind rocks. When a hippo is bitten, a nearby hippo is now more likely to help it out; if the other hippos are too far away, they tend not to approach and attempt a rescue. If they are gathered in a corner, they are more likely to temporarily “abuse” the crocodile (that is, go in circles to keep gathering saving rewards), often because one of the hippos gets stuck in a corner of the environment. But if they end up in more open spaces, they stop the abuse and start to flee.

The starting positions of the hippos within the environment affect their decisions and behaviors.
[Video demonstration: trained in the old map]

However, after we trained the hippos in the new map, this stuck-in-a-corner collider issue no longer appeared.
[Video demonstration: trained in the new map]