Model-Based Reinforcement Learning from Pixels

School of Electrical Engineering | Master's thesis
Control, Robotics and Autonomous Systems
Degree programme
AEE - Master’s Programme in Automation and Electrical Engineering (TS2013)
People learn skills by interacting with their surroundings from birth. Reinforcement learning (RL), which learns a decision-making strategy (policy) that maximizes a scalar reward signal by trial and error, offers such a paradigm for learning from one's surroundings. However, most current RL algorithms suffer from sample inefficiency: training an agent typically requires millions of samples. This thesis discusses model-based RL, which can learn a policy to control robots from scratch with significantly fewer samples. In particular, it focuses on the case where observations are high-dimensional pixels. To this end, we first explain the essential components for learning a latent dynamics model from high-dimensional observations and for making decisions based on the learned model. We then reproduce an algorithm called Dreamer, which learns behaviors by latent imagination from pixels, and test the reproduced algorithm on four benchmark tasks. Furthermore, we extend Dreamer in two ways. The first is decision-time policy refinement, in which we refine the predicted policy with a planning algorithm called the cross-entropy method (CEM). The second is discretizing the continuous action space, which makes Dreamer more flexible. Our results show that, when combined with an ordinal architecture, the discrete policy achieves similar performance on most tasks. This allows a wide array of RL algorithms that were previously limited to the discrete domain to be used for continuous control tasks. Finally, we discuss representation learning in reinforcement learning and explore the possibility of learning the dynamics model behind pixels without reconstruction by partially reproducing the MuZero algorithm. MuZero learns a value-focused model, a fully abstract dynamics model that does not reconstruct observations, and uses Monte Carlo tree search (MCTS) to make decisions based on the learned model.
Also, we extend the MuZero algorithm to solve a continuous control task called Cartpole-balance by discretizing the action space.
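The cross-entropy method named in the abstract can be illustrated with a minimal sketch. This is not the thesis implementation: the function names `cem_plan` and `toy_score`, the hyperparameters, and the action bounds of [-1, 1] are all assumptions made for illustration. In a Dreamer-style setting the score function would presumably evaluate imagined returns under the learned latent dynamics model, and the initial mean could come from the predicted policy; here a simple quadratic stands in for both.

```python
import numpy as np


def cem_plan(score_fn, horizon, action_dim, iterations=15,
             population=300, n_elite=30, init_mean=None, seed=0):
    """Cross-entropy method over action sequences (illustrative sketch).

    score_fn maps an array of shape (horizon, action_dim) to a scalar
    score, where higher is better. All defaults are hypothetical.
    """
    rng = np.random.default_rng(seed)
    # Start from a supplied mean (e.g. a policy's proposal) or from zeros.
    if init_mean is None:
        mean = np.zeros((horizon, action_dim))
    else:
        mean = np.array(init_mean, dtype=float)
    std = np.ones((horizon, action_dim))

    for _ in range(iterations):
        # Sample candidate action sequences around the current mean.
        noise = rng.standard_normal((population, horizon, action_dim))
        samples = np.clip(mean + std * noise, -1.0, 1.0)
        # Score every candidate and keep the best n_elite of them.
        scores = np.array([score_fn(s) for s in samples])
        elite = samples[np.argsort(scores)[-n_elite:]]
        # Refit the sampling distribution to the elite set.
        mean = elite.mean(axis=0)
        std = elite.std(axis=0) + 1e-6
    return mean


def toy_score(actions):
    # Hypothetical stand-in for a model-based return estimate:
    # the best action sequence is 0.5 everywhere.
    return -np.sum((actions - 0.5) ** 2)


plan = cem_plan(toy_score, horizon=5, action_dim=2)
```

In decision-time refinement, only the first action of the returned sequence would typically be executed before replanning at the next step; that receding-horizon choice is a common convention and an assumption here, not a detail stated in the abstract.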
Kannala, Juho
Thesis advisor
Boney, Rinu
model-based reinforcement learning, world model, value-focused model, continuous control