We've trained an agent to achieve a high score of 74,500 on Montezuma's Revenge from a single human demonstration, better than any previously published result. Our algorithm is simple: the agent plays a sequence of games starting from carefully chosen states from the demonstration, and learns from them by optimizing the game score using PPO, the same reinforcement learning algorithm that underpins OpenAI Five.
Our agent playing Montezuma’s Revenge. The agent achieves a final score of 74,500 over approximately 12 minutes of play (video is double speed). Although much of the agent’s game mirrors our demonstration, the agent surpasses the demonstration score of 71,500 by picking up more diamonds along the way. In addition, the agent learns to exploit a flaw in the emulator to make a key re-appear at minute 4:25 of the video, something not present in the demonstration.
Exploration and Learning
In order to succeed at a reinforcement learning problem, an AI needs to do two things:
- Find a sequence of actions that leads to positive reward. This is the exploration problem.
- Remember the sequence of actions to take, and generalize to related but slightly different situations. This is the learning problem.
The exploration problem can largely be bypassed in Montezuma's Revenge by starting each RL episode by resetting from a state in a demonstration. By starting from demonstration states, the agent needs to perform much less exploration to learn to play the game compared to when it starts from the beginning of the game at every episode. Doing so enables us to disentangle exploration and learning. Our results suggest that exploration is the hardest of the two problems for games like Montezuma’s Revenge and certain similar Atari games like PrivateEye.
Why Exploration is Difficult
Model-free RL methods like policy gradients and Q-learning explore by taking actions randomly. If, by chance, the random actions lead to a reward, they are reinforced, and the agent becomes more likely to take these beneficial actions in the future. This works well if rewards are dense enough for random actions to lead to a reward with reasonable probability. However, many of the more complicated games require long sequences of very specific actions to experience any reward, and such sequences are extremely unlikely to occur randomly.
An agent playing Montezuma’s Revenge by taking random actions. As this (sped-up) video shows, random exploration is very unlikely to lead to an agent experiencing positive reward in Montezuma’s Revenge.
Consider a game which takes a precise sequence of N actions to experience the first reward. If each of those actions is taken with a fixed probability, a random agent will need to play the game for a duration that scales as exp(N) before it can expect to experience the first reward.
In the case of Montezuma’s Revenge, for example, the probability of getting the first key can be decomposed as
p(get key) = p(get down ladder 1) * p(get down rope) * p(get down ladder 2) * p(jump over skull) * p(get up ladder 3).
By multiplying N of these probabilities together, we end up with the resulting probability p(get key) that is exponentially smaller than any of the individual input probabilities. Algorithms with exponential scaling break down very quickly as your problem becomes more challenging, which limits the tasks current reinforcement learning techniques can solve.
Simplifying Exploration with Demonstrations
Although model-free RL methods have difficulty finding long sequences of actions, they work well for shorter sequences. Our main insight is that we can make our task easier to solve by decomposing it into a curriculum of subtasks requiring short action sequences; we construct this curriculum by starting each RL episode from a demonstration state. A variant of the same idea was used recently for reverse curriculum generation for robotics, where a curriculum was constructed by iteratively perturbing a set of starting states using random actions, and selecting the resulting states with the right level of difficulty.
Our approach works by letting each RL episode start from a state in a previously recorded demonstration. Early on in training, the agent begins every episode near the end of the demonstration. Once the agent is able to beat or at least tie the score of the demonstrator on the remaining part of the game in at least 20% of the rollouts, we slowly move the starting point back in time. We keep doing this until the agent is playing from the start of the game, without using the demo at all, at which point we have an RL-trained agent beating or tying the human expert on the entire game.
By slowly moving our starting state from the end of the demonstration to the beginning, we ensure that at every point the agent faces an easy exploration problem where it is likely to succeed, since it has already learned to solve most of the remaining game. We can interpret solving the RL problem in this way as a form of dynamic programming. If a specific sequence of N actions is required to reach a reward, this sequence may now be learned in a time that is linear in N, rather than exponential.
Starting episodes by resetting from demonstration states was previously proposed, but without constructing a curriculum that gradually moves the starting state back from the end of the demonstration to the beginning. When combined with imitation learning, several researchers report benefit from this approach. For our use case we found such a curriculum to be vitally important for deriving benefit from the demonstration.
Impression of our agent learning to reach the first key in Montezuma’s Revenge using RL and starting each episode from a demonstration state. When our agent starts playing the game, we place it right in front of the key, requiring it to only take a single jump to find success. After our agent has learned to do this consistently, we slowly move the starting point back in time. Our agent might then find itself halfway up the ladder that leads to the key. Once it learns to climb the ladder from there, we can have it start at the point where it needs to jump over the skull. After it learns to do that, we can have it start on the rope leading to the floor of the room, etc. Eventually, the agent starts in the original starting state of the game and is able to reach the key completely by itself.
Comparison to imitation-based approaches
Recently, DeepMind has shown an agent learning Montezuma's Revenge by imitation learning from a demonstration; one approach trains an agent to achieve the same states seen in a YouTube video of Montezuma's Revenge, and another technique combines a sophisticated version of Q-learning with maximizing the likelihood of actions taken in a demonstration. The advantage of these approaches is that they do not require as much control over the environment our technique does: they do not reset the environment to states other than the starting state of the game, and they do not presume access to the full game states encountered in the demonstration. Our method differs by directly optimizing what we care about — the game score, rather than making the agent imitate the demonstration; our method will thus not overfit to a potentially sub-optimal demonstration and could offer benefits in multi-player games where we want to optimize performance against other opponents than just the one from the demonstration.
Although the step-by-step learning done by our agent is much simpler than learning to play from scratch, it is still far from trivial. One challenge our RL agent faces is that it is generally unable to reach the exact state from later on in the demo when it starts from an earlier state. This is because the agent plays the game at a different frameskip from what we used for recording the demonstration, but it is also due to the randomness in the actions which make it very unlikely to exactly reproduce any specific sequence of actions. The agent will thus need to be able to generalize between states that are very similar, but not identical. We found that this works well for Montezuma’s Revenge, but much less well for some other Atari games we tried, like Gravitar and Pitfall. One reason for this may be that these latter games require solving a harder vision problem: we found these games difficult to play from a downsampled screen ourselves, and we saw some improvement when using larger and deeper neural network policies.
Another challenge we encountered is that standard RL algorithms like policy gradients require striking a careful balance between exploration and exploitation: if the agent's actions are too random, it makes too many mistakes to ever achieve the required final score when starting from the beginning of the game; if the actions are too deterministic, the agent stops learning because it does not explore alternative actions. Achieving the reported result on Montezuma's Revenge thus required careful tuning of the coefficient of the entropy bonus used in PPO, in combination with other hyperparameters such as the learning rate and the scaling of rewards. For some other games like Gravitar and Pitfall we were unable to find hyperparameters that worked for training the full curriculum. The algorithm also still shows substantial random variation from run to run, with some runs failing to converge for Montezuma's Revenge. We hope that future advances in RL will yield algorithms that are more robust to random noise and to the choice of hyperparameters.
Finally, like is often the case in reinforcement learning, we find that our trained neural net policy does not yet generalize at the level of a human player. One method to test for generalization ability is to perturb the policy by making actions sticky and repeating the last action with probability of 0.25 at every frame. Using this evaluation method our trained policy obtains a score of 10,000 on Montezuma's Revenge on average. Alternatively, we can take random actions with probability 0.01 (repeated for 4 frameskipped steps), which leads to an average score of 8,400 for our policy. Anecdotally, we find that such perturbations also significantly reduce the score of human players on Montezuma's Revenge, but to a lesser extent. As far as we are aware, our results using perturbed policies are still better than all those published previously. Perturbing the learned policy by starting with between 0 and 30 random no-ops did not significantly hurt results, with the majority of rollouts achieving the final score obtained in our demonstration.
Where most previous work on learning from demonstrations focused on imitation, which encourages identical behavior to that seen in the demonstration, we have shown that good results can be achieved by optimizing returns directly. This allows the agent to deviate from the demonstrated behavior, which lets it find new and exciting solutions that the human demonstrator may not have considered. By training on a curriculum of subtasks, created by resetting from demonstration states, we used this technique to solve a difficult reinforcement learning problem requiring long sequences of actions.