In the latest ML-Agents blog post, we announced new features for authoring cooperative behaviors with reinforcement learning. Today, we are excited to share a new environment to further demonstrate what ML-Agents can do. DodgeBall is a competitive team vs team shooter-like environment where agents compete in rounds of Elimination or Capture the Flag. The environment is open source, so be sure to check out the repo.
The recent addition of the MA-POCA algorithm in ML-Agents allows anyone to train cooperative behaviors for groups of agents. This novel algorithm is an implementation of centralized learning with decentralized execution. A centralized critic (neural network) processes the states of all agents in the group to estimate how well the agents are doing, while several decentralized actors (one per agent) control the agents. This allows each agent to make decisions based only on what it perceives locally, and simultaneously evaluate how good its behavior is in the context of the whole group. The diagram below illustrates MA-POCA’s centralized learning and decentralized execution.
One of the novelties of the MA-POCA algorithm is that it uses a special type of neural network architecture called attention networks that can process a non-fixed number of inputs. This means that the centralized critic can process any number of agents, which is why MA-POCA is particularly well-suited for cooperative behaviors in games. It allows agents to be added or removed from a group at any point – just as video game characters can be eliminated or spawn in the middle of a team fight. MA-POCA is also designed so that agents can make decisions for the benefit of the team, even if it is to their own detriment. This altruistic behavior is difficult to achieve with a hand-coded behavior but can be learned based on how useful the last action of an agent was for the overall success of the group. Finally, many multi-agent reinforcement learning algorithms assume that all agents choose their next action at the same time, but in real games with numerous agents, it is usually better to have them make decisions at different times to avoid frame drops. That’s why MA-POCA does not make this assumption, and will still work even if the agents’ decisions in a single group are not in sync. To show you how well MA-POCA works in games, we created the DodgeBall environment – a fun team vs team game with an AI fully trained using ML-Agents.
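To make the "non-fixed number of inputs" idea concrete, here is a minimal sketch of attention pooling in plain Python. It is not the network MA-POCA actually uses: the function name `attention_pool`, the fixed `query` vector, and the toy state vectors are all assumptions made for illustration. It only shows that a softmax-weighted sum produces an output of the same size whether the group holds three agents or two.

```python
import math


def attention_pool(agent_states, query):
    """Pool any number of agent state vectors into one fixed-size vector.

    Illustrative single-query attention; `query` stands in for what would be
    learned parameters in a real network (assumption).
    """
    scale = math.sqrt(len(query))
    # Scaled dot-product score between the query and each agent's state.
    scores = [
        sum(q * s for q, s in zip(query, state)) / scale
        for state in agent_states
    ]
    # Softmax over however many agents are currently in the group.
    max_score = max(scores)
    exps = [math.exp(score - max_score) for score in scores]
    total = sum(exps)
    weights = [e / total for e in exps]
    # Weighted sum of the states: the output size depends only on the state
    # dimension, not on the number of agents.
    dim = len(agent_states[0])
    pooled = [
        sum(w * state[i] for w, state in zip(weights, agent_states))
        for i in range(dim)
    ]
    return pooled, weights


query = [0.5, -0.25, 1.0]  # hypothetical values
team_of_three = [[1.0, 0.0, 0.5], [0.0, 1.0, 0.5], [0.5, 0.5, 0.0]]
team_of_two = team_of_three[:2]  # one agent was eliminated

pooled3, weights3 = attention_pool(team_of_three, query)
pooled2, weights2 = attention_pool(team_of_two, query)
print(len(pooled3), len(pooled2))  # both have the state dimension, 3
```

Because the pooled vector has a fixed size, whatever consumes it (here, a critic) does not need to change when agents join or leave the group.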
The DodgeBall environment is a third-person shooter where players try to pick up as many balls as they can, then throw them at their opponents. It comprises two game modes: Elimination and Capture the Flag. In Elimination, each group tries to eliminate all members of the other group – two hits, and they’re out. In Capture the Flag, players try to steal the other team’s flag and bring it back to their base (they can only score when their own flag is still safe at their base). In this mode, getting hit by a ball means dropping the flag and being stunned for ten seconds, before returning to base. In both modes, players can hold up to four balls, and dash to dodge incoming balls and go through hedges.
In reinforcement learning, agents observe the environment and take actions to maximize a reward. The observations, actions, and rewards for training agents to play DodgeBall are described below.
In DodgeBall, the agents observe their environment through the following three sources of information:
The DodgeBall environment also makes use of hybrid actions, which are a mix of continuous and discrete actions. The agent has three continuous actions for movement: One is to move forward, another is to move sideways, and the last is to rotate. At the same time, there are two discrete actions: One to throw a ball and another to dash. This action space corresponds to the actions that a human player can perform in both the Capture the Flag and Elimination scenarios.
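The hybrid action space described above could be modeled roughly as follows. This is a sketch, not the environment's actual data structure: the `HybridAction` class, the `[-1, 1]` range for continuous values, and the encoding of each discrete action as `0` or `1` are assumptions chosen for illustration.

```python
import random
from dataclasses import dataclass


@dataclass(frozen=True)
class HybridAction:
    # Three continuous actions (assumed to lie in [-1, 1]).
    forward: float
    sideways: float
    rotate: float
    # Two discrete actions (assumed encoding: 0 = do nothing, 1 = act).
    throw: int
    dash: int

    def __post_init__(self):
        for name in ("forward", "sideways", "rotate"):
            value = getattr(self, name)
            if not -1.0 <= value <= 1.0:
                raise ValueError(f"{name} must be in [-1, 1], got {value}")
        for name in ("throw", "dash"):
            value = getattr(self, name)
            if value not in (0, 1):
                raise ValueError(f"{name} must be 0 or 1, got {value}")


def sample_random_action(rng):
    """Draw a random hybrid action, as an untrained policy might."""
    return HybridAction(
        forward=rng.uniform(-1.0, 1.0),
        sideways=rng.uniform(-1.0, 1.0),
        rotate=rng.uniform(-1.0, 1.0),
        throw=rng.randint(0, 1),
        dash=rng.randint(0, 1),
    )


action = sample_random_action(random.Random(0))
print(action)
```

Keeping the continuous and discrete parts in one structure mirrors how a human player moves and presses buttons at the same time.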
We intentionally keep the rewards given to the agents simple. We give a large final reward for winning (and a corresponding penalty for losing), and a few intermediate rewards for learning how to play the game.
For Capture the Flag:
While it is tempting to give agents many small rewards to encourage desirable behaviors, we must avoid overprescribing the strategy that agents should pursue. For instance, if we gave a reward for picking up balls in Elimination, agents might focus solely on picking up balls rather than hitting their opponents. By making our rewards as “sparse” as possible, the agents are free to discover their own strategies in the game, even if it prolongs the training period.
Because there are so many different possible winning strategies that can earn agents these rewards, we had to determine what optimal behaviors would look like. For instance, would the best strategy be to hoard the balls or move them around to conveniently grab later? Would it be wise to stick together as a team, or spread out to find the enemy faster? The answers to these questions depended on game design choices that we made: If balls were scarce, agents would hold on to them longer to prevent the enemies from getting them. If agents were allowed to know where the enemy was at all times, they would stay together as a group as much as possible. That said, when we wanted to make changes to the game, we did not have to make any code changes to the AI. We simply retrained a new behavior that would adapt to the new environment.
Training a group of agents to cooperate is more complex than training a single agent to solve a task. To help manage a group of agents, we created the DodgeBallGameController.cs script. This script initializes and resets the playground (this includes spawning the balls and resetting the agents’ positions). It assigns agents to their SimpleMultiAgentGroup and manages the rewards that each group receives. For example, this is how the DodgeBallGameController.cs script handles an agent hitting another with a ball.
In this code, the ball thrower is given a small reward for hitting an opponent – but only once the last opponent is eliminated will the whole group be rewarded for their collective effort.
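The hit-handling logic described above can be approximated in Python as follows. This is not the code from DodgeBallGameController.cs: the `Player` and `Game` classes, the `on_hit` method, and the reward values `HIT_REWARD` and `WIN_REWARD` are hypothetical, and the penalty for the losing group is an assumption. Only the two-hit elimination rule comes from the description of the game.

```python
from dataclasses import dataclass, field

HIT_REWARD = 0.1       # hypothetical small individual reward
WIN_REWARD = 1.0       # hypothetical large group reward
HITS_TO_ELIMINATE = 2  # from the game rules: two hits and a player is out


@dataclass
class Player:
    team: int
    hits_taken: int = 0
    reward: float = 0.0

    @property
    def eliminated(self):
        return self.hits_taken >= HITS_TO_ELIMINATE


@dataclass
class Game:
    players: list
    group_rewards: dict = field(default_factory=lambda: {0: 0.0, 1: 0.0})
    finished: bool = False

    def on_hit(self, thrower, target):
        """Handle a ball thrown by `thrower` hitting `target`."""
        if self.finished or target.eliminated or thrower.team == target.team:
            return
        target.hits_taken += 1
        # The thrower is rewarded individually for every successful hit.
        thrower.reward += HIT_REWARD
        # The whole group is rewarded only once every opponent is out.
        opponents = [p for p in self.players if p.team == target.team]
        if all(p.eliminated for p in opponents):
            self.group_rewards[thrower.team] += WIN_REWARD
            self.group_rewards[target.team] -= WIN_REWARD  # assumed penalty
            self.finished = True


a = Player(team=0)
c, d = Player(team=1), Player(team=1)
game = Game(players=[a, c, d])
for target in (c, c, d, d):
    game.on_hit(a, target)
print(round(a.reward, 2), game.group_rewards, game.finished)
```

Separating the individual reward from the group reward is what lets a trainer credit a single good throw without losing sight of the team outcome.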
MA-POCA handles agents in a SimpleMultiAgentGroup differently than it does individual agents. MA-POCA pools their observations together to train in a centralized manner. It also handles the rewards given to the whole group, in addition to the individual rewards – no matter how many agents join or leave the group. You can monitor the cumulative rewards that agents receive as a group in TensorBoard.
Since both Elimination and Capture the Flag are adversarial games, we combined MA-POCA with self-play to pit agents against older versions of themselves and learn how to beat them. As with any self-play run in ML-Agents, we can monitor the agents’ learning progress by making sure the Elo rating continues to increase. After tens of millions of steps, the agents can play as well as any of us.
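For readers unfamiliar with Elo ratings, the sketch below shows the standard Elo update for a single match. It is meant to explain why a steadily rising rating indicates progress against past versions, not to reproduce how ML-Agents computes the value internally. The function names, the K-factor of 16, and the starting rating of 1200 are illustrative assumptions.

```python
def expected_score(rating_a, rating_b):
    """Probability that A beats B under the standard Elo model."""
    return 1.0 / (1.0 + 10 ** ((rating_b - rating_a) / 400.0))


def update_elo(rating_a, rating_b, score_a, k=16.0):
    """Return new ratings after one match.

    score_a is 1.0 for a win by A, 0.5 for a draw, and 0.0 for a loss.
    The K-factor `k` is an illustrative choice.
    """
    change = k * (score_a - expected_score(rating_a, rating_b))
    return rating_a + change, rating_b - change


# The current policy beats an equally rated snapshot of its past self.
new_a, new_b = update_elo(1200.0, 1200.0, score_a=1.0)
print(new_a, new_b)  # 1208.0 1192.0
```

A win against an equally rated opponent moves the rating up by half the K-factor, while a win against a much weaker opponent moves it very little, so the rating only keeps climbing if the agents keep beating increasingly capable past versions.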
This video shows how the agents progress over time when learning to play Elimination. You can see that, early into the training, the agents learn to shoot but have poor aim and tend to shoot at random. After 40 million timesteps, the agents’ aim improves, though they still wander somewhat randomly in hopes of running into an enemy. When they do meet an opponent, they typically engage them one-on-one. Finally, after another 120 million timesteps of training, the agents become much more aggressive and confident and develop sophisticated strategies, such as charging into enemy territory as a group.
And here are the agents learning how to play Capture the Flag: Early in the training, at 14 million steps, the agents learn to shoot each other, without actually capturing the flag. At 30 million, the agents learn how to pick up the enemy flag and return to base, but other than the flag-carrying agent, it’s not clear how the other agents contribute. By 80 million timesteps, however, the agents exhibit interesting strategies.
Agents who aren’t holding the enemy flag will sometimes guard their own base, chase down an enemy who has their flag, or wait in the enemy’s base for the flag-bearer to return and pummel them with balls. An agent holding the enemy flag might wait at its own base until its teammates retrieve their team’s flag so that it can score. The following video highlights some of the interesting emergent strategies that the agents have learned. Note that we never explicitly specified these behaviors – they were learned over the course of hundreds of iterations of self-play.
The DodgeBall environment is open source and available to download here. We’d love for you to try it out. If you’d like to work on this exciting intersection of Machine Learning and Games, we are hiring for several positions and encourage you to apply here.
Finally, we would love to hear your feedback. For any feedback regarding the Unity ML-Agents toolkit, please fill out the following survey or email us directly at firstname.lastname@example.org. If you encounter any issues, do not hesitate to reach out to us on the ML-Agents GitHub issues page. For any other general comments or questions, please let us know on the Unity ML-Agents forums.