Browsing by Author "Zhao, Yi"
Now showing 1 - 8 of 8
- Analysis of Maximum a Posteriori Policy Optimization
Sähkötekniikan korkeakoulu | Bachelor's thesis (2022-05-15) Muhonen, Ville
- Continuous Monte Carlo Graph Search
A4 Article in conference proceedings (2024) Kujanpää, Kalle; Kannala, Juho; Babadi, Amin; Ilin, Alexander; Zhao, Yi; Pajarinen, Joni
Online planning is crucial for high performance in many complex sequential decision-making tasks. Monte Carlo Tree Search (MCTS) employs a principled mechanism for trading off exploration and exploitation for efficient online planning, and it outperforms comparison methods in many discrete decision-making domains such as Go, Chess, and Shogi. Extensions of MCTS to continuous domains have subsequently been developed, but the inherent high branching factor and the resulting explosion of the search tree size limit the existing methods. To address this problem, we propose Continuous Monte Carlo Graph Search (CMCGS), an extension of MCTS to online planning in environments with continuous state and action spaces. CMCGS takes advantage of the insight that, during planning, sharing the same action policy between several states can yield high performance. To implement this idea, at each time step CMCGS clusters similar states into a limited number of stochastic action bandit nodes, which produce a layered directed graph instead of an MCTS search tree. Experimental evaluation shows that CMCGS outperforms comparable planning methods in several complex continuous DeepMind Control Suite benchmarks and in 2D navigation and exploration tasks with limited sample budgets. Furthermore, CMCGS can be scaled up through parallelization, and it outperforms the Cross-Entropy Method (CEM) in continuous control with learned dynamics models.
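As a rough illustration of the clustering idea in the abstract, the sketch below groups states reached at one planning depth into a few shared Gaussian "action bandit" nodes. It is a minimal hypothetical example (plain k-means plus a CEM-style bandit update, with made-up shapes), not the authors' CMCGS implementation.

```python
# Illustrative sketch (not the authors' code): one planning layer of the
# clustering idea behind CMCGS. Similar states reached at the same depth are
# grouped into a small number of clusters, and each cluster maintains its own
# Gaussian "action bandit" that is refit from the returns observed so far.
import numpy as np

rng = np.random.default_rng(0)

def cluster_states(states, n_clusters, n_iters=10):
    """Plain k-means; assigns each state to one of n_clusters bandit nodes."""
    centers = states[rng.choice(len(states), n_clusters, replace=False)]
    for _ in range(n_iters):
        dists = np.linalg.norm(states[:, None] - centers[None], axis=-1)
        labels = dists.argmin(axis=1)
        for k in range(n_clusters):
            if np.any(labels == k):
                centers[k] = states[labels == k].mean(axis=0)
    return labels, centers

class GaussianBandit:
    """Stochastic action node shared by all states assigned to its cluster."""
    def __init__(self, action_dim):
        self.mu = np.zeros(action_dim)
        self.std = np.ones(action_dim)

    def sample(self):
        return rng.normal(self.mu, self.std)

    def update(self, actions, returns, top_frac=0.25):
        """Refit the Gaussian to the best-returning actions (CEM-style update)."""
        elite = np.argsort(returns)[-max(1, int(len(returns) * top_frac)):]
        self.mu = actions[elite].mean(axis=0)
        self.std = actions[elite].std(axis=0) + 1e-3

# Usage: cluster 64 hypothetical 4-D states into 3 bandit nodes and draw actions.
states = rng.normal(size=(64, 4))
labels, _ = cluster_states(states, n_clusters=3)
bandits = [GaussianBandit(action_dim=2) for _ in range(3)]
actions = np.stack([bandits[labels[i]].sample() for i in range(len(states))])
```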
- Deep Learning Methods for Cross-Modal Retrieval in Real World
Perustieteiden korkeakoulu | Master's thesis (2021-05-17) Ye, Rongtian
- HSCNet++ : Hierarchical Scene Coordinate Classification and Regression for Visual Localization with Transformer
A1 Original article in a scientific journal (2024-07) Wang, Shuzhe; Laskar, Zakaria; Melekhov, Iaroslav; Li, Xiaotian; Zhao, Yi; Tolias, Giorgos; Kannala, Juho
Visual localization is critical to many applications in computer vision and robotics. To address single-image RGB localization, state-of-the-art feature-based methods match local descriptors between a query image and a pre-built 3D model. Recently, deep neural networks have been exploited to regress the mapping between raw pixels and 3D coordinates in the scene, so that the matching is performed implicitly by the forward pass through the network. However, in a large and ambiguous environment, learning such a regression task directly can be difficult for a single network. In this work, we present a new hierarchical scene coordinate network to predict pixel scene coordinates in a coarse-to-fine manner from a single RGB image. The proposed method, an extension of HSCNet, allows us to train compact models that scale robustly to large environments. It sets a new state of the art for single-image localization on the 7-Scenes, 12-Scenes, and Cambridge Landmarks datasets, as well as on the combined indoor scenes.
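The coarse-to-fine conditioning described above can be pictured with a small sketch. The module below is an assumed, simplified head (one classification level feeding one regression level); the real HSCNet++ architecture, its layer sizes, and its transformer components are not reproduced here.

```python
# Illustrative sketch (assumed architecture, not the published HSCNet++ model):
# a coarse-to-fine scene-coordinate head. A coarse branch classifies each pixel
# into a region label; the finer regression branch is conditioned on that label,
# so it only has to resolve 3D coordinates within the predicted region.
import torch
import torch.nn as nn

class CoarseToFineHead(nn.Module):
    def __init__(self, feat_dim=128, n_regions=25):
        super().__init__()
        self.region_cls = nn.Conv2d(feat_dim, n_regions, kernel_size=1)
        # The fine branch sees image features plus the (soft) region assignment.
        self.coord_reg = nn.Conv2d(feat_dim + n_regions, 3, kernel_size=1)

    def forward(self, feats):
        region_logits = self.region_cls(feats)            # B x R x H x W
        region_probs = region_logits.softmax(dim=1)
        fine_in = torch.cat([feats, region_probs], dim=1)
        coords = self.coord_reg(fine_in)                   # B x 3 x H x W
        return region_logits, coords

# Usage with a dummy feature map.
feats = torch.randn(1, 128, 60, 80)
region_logits, coords = CoarseToFineHead()(feats)
```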
- Model-Based Reinforcement Learning from Pixels
Sähkötekniikan korkeakoulu | Master's thesis (2020-10-19) Zhao, Yi
People learn skills by interacting with their surroundings from the time of their birth. Reinforcement learning (RL), which learns a decision-making strategy (policy) to maximize a scalar reward signal by trial and error, offers such a paradigm for learning from one's surroundings. However, most current RL algorithms suffer from sample inefficiency: training an agent typically requires millions of samples. This thesis discusses model-based RL, which is able to learn a policy to control robots from scratch with significantly fewer samples. In particular, the thesis focuses on the case where observations are high-dimensional pixels. To achieve this goal, we first explain the essential components needed to learn a latent dynamics model from high-dimensional observations and to make decisions based on the learned model. We then reproduce the Dreamer algorithm, which learns behaviors by latent imagination from pixels, and test the reproduction on four benchmark tasks. Furthermore, we extend Dreamer in two ways. The first is decision-time policy refinement, where the predicted policy is refined by a planning algorithm, the cross-entropy method (CEM). The second extends the flexibility of Dreamer by discretizing the continuous action space; our results show that, combined with an ordinal architecture, the discrete policy achieves similar performance on most tasks. This allows a wide array of RL algorithms previously limited to discrete domains to be applied to continuous control tasks. Finally, we discuss representation learning in reinforcement learning and explore the possibility of learning the dynamics model behind the pixels without reconstruction by partially reproducing the MuZero algorithm. MuZero learns a value-focused, fully abstract dynamics model without reconstructing the observations and uses Monte Carlo tree search (MCTS) to make decisions based on the learned model. We also extend MuZero to solve a continuous control task, Cartpole-balance, by discretizing the action space.
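The decision-time policy refinement mentioned in the abstract can be sketched as follows. This is a hypothetical illustration of CEM refinement around a policy-proposed plan; `evaluate_plan` stands in for rolling a candidate plan through a learned latent dynamics model, and none of the names come from the thesis code.

```python
# Illustrative sketch (hypothetical rollout function, not the thesis code):
# decision-time refinement with the cross-entropy method (CEM). The learned
# policy proposes an action sequence, and CEM refines it against a model-based
# return estimate before the first action is executed.
import numpy as np

rng = np.random.default_rng(0)

def cem_refine(policy_plan, evaluate_plan, n_iters=5, pop=64, elite_frac=0.1,
               init_std=0.3):
    """policy_plan: (H, action_dim) plan proposed by the policy.
    evaluate_plan: callable mapping a plan to a scalar predicted return
    (in a Dreamer-style agent this would roll the plan through the latent
    dynamics model and sum predicted rewards)."""
    mu, std = policy_plan.copy(), np.full_like(policy_plan, init_std)
    n_elite = max(1, int(pop * elite_frac))
    for _ in range(n_iters):
        plans = mu + std * rng.standard_normal((pop, *mu.shape))
        returns = np.array([evaluate_plan(p) for p in plans])
        elite = plans[np.argsort(returns)[-n_elite:]]
        mu, std = elite.mean(axis=0), elite.std(axis=0) + 1e-4
    return mu[0]  # execute only the first refined action (MPC-style)

# Usage with a toy return estimate that prefers actions close to 0.5.
plan = np.zeros((12, 2))
action = cem_refine(plan, lambda p: -np.sum((p - 0.5) ** 2))
```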
- Optimistic Multi-Agent Policy Gradient
A4 Article in conference proceedings (2024) Zhao, Wenshuai; Zhao, Yi; Li, Zhiyuan; Kannala, Juho; Pajarinen, Joni
Relative overgeneralization (RO) occurs in cooperative multi-agent learning tasks when agents converge towards a suboptimal joint policy due to overfitting to the suboptimal behaviors of other agents. No methods have been proposed for addressing RO in multi-agent policy gradient (MAPG) methods, although these methods produce state-of-the-art results. To address this gap, we propose a general yet simple framework that enables optimistic updates in MAPG methods and alleviates the RO problem. Our approach involves clipping the advantage to eliminate negative values, thereby facilitating optimistic updates in MAPG. The optimism prevents individual agents from quickly converging to a local optimum. Additionally, we provide a formal analysis showing that the proposed method retains optimality at a fixed point. In extensive evaluations on a diverse set of tasks, including the Multi-agent MuJoCo and Overcooked benchmarks, our method outperforms strong baselines on 13 out of 19 tested tasks and matches their performance on the rest.
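The optimistic update described above is concrete enough to sketch: clip the advantage at zero before forming the usual PPO-style surrogate. The snippet below is an illustrative rendering with assumed tensor names, not the authors' implementation.

```python
# Illustrative sketch (assumed tensor names, not the authors' code): clipping
# the advantage at zero before the PPO-style surrogate, so agents are not
# pushed away from actions that only look bad because teammates currently
# behave suboptimally.
import torch

def optimistic_surrogate(log_probs, old_log_probs, advantages, clip_eps=0.2):
    adv = torch.clamp(advantages, min=0.0)        # drop negative advantages
    ratio = torch.exp(log_probs - old_log_probs)
    unclipped = ratio * adv
    clipped = torch.clamp(ratio, 1 - clip_eps, 1 + clip_eps) * adv
    return -torch.min(unclipped, clipped).mean()  # loss to minimize

# Usage with a dummy per-agent batch.
lp, old_lp, adv = torch.randn(256), torch.randn(256), torch.randn(256)
loss = optimistic_surrogate(lp, old_lp, adv)
```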
- Robust Proximal Policy Optimization for Reinforcement Learning
Sähkötekniikan korkeakoulu | Master's thesis (2022-10-17) Moazzeni Bikani, Pooya
Reinforcement learning is a family of machine learning algorithms in which a system learns to make optimal sequential decisions by interacting with its environment. Reinforcement learning problems are modelled as Markov decision processes, identified by their transition probabilities and reward functions. Most reinforcement learning algorithms are designed under the assumption that the transition probability and reward function do not vary over time. However, this is not in line with real-world settings, where the environment is subject to change, which makes it more challenging for the system (agent) to learn the optimal policy and act accordingly. This scenario is known as non-stationary reinforcement learning, where the characteristics of the environment change between design and deployment and over time. This work begins with a review of policy gradient methods that exploit function approximation and are suitable for problems with large state and action spaces. Then, a robust algorithm based on the Proximal Policy Optimization (PPO) actor-critic algorithm is proposed to address the non-stationary reinforcement learning problem. The algorithm is tested in various reinforcement learning simulation environments and compared with several baselines, including PPO.
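The non-stationary setting the thesis targets, where transition dynamics and rewards drift over time, can be emulated with a simple environment wrapper. The sketch below is a hypothetical example built on Gymnasium's Pendulum-v1, perturbing its mass and length parameters between episodes; it is not taken from the thesis.

```python
# Illustrative sketch (hypothetical wrapper, not the thesis code): a
# non-stationary environment whose dynamics drift over time. Every `period`
# episodes the pendulum's mass and length are perturbed, so the MDP the agent
# faces at deployment differs from the one it was trained on.
import gymnasium as gym
import numpy as np

class NonStationaryPendulum(gym.Wrapper):
    def __init__(self, period=50, drift=0.1, seed=0):
        super().__init__(gym.make("Pendulum-v1"))
        self.period, self.drift = period, drift
        self.rng = np.random.default_rng(seed)
        self.episodes = 0

    def reset(self, **kwargs):
        self.episodes += 1
        if self.episodes % self.period == 0:
            # Drift the physical parameters: the transition dynamics change.
            self.unwrapped.m *= 1 + self.drift * self.rng.standard_normal()
            self.unwrapped.l *= 1 + self.drift * self.rng.standard_normal()
        return self.env.reset(**kwargs)

# Usage: train any PPO-style agent against this wrapper instead of the raw env.
env = NonStationaryPendulum()
obs, info = env.reset()
```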
- Simplified Temporal Consistency Reinforcement Learning
A4 Article in conference proceedings (2023-07) Zhao, Yi; Zhao, Wenshuai; Boney, Rinu; Kannala, Juho; Pajarinen, Joni
Reinforcement learning (RL) is able to solve complex sequential decision-making tasks but is currently limited by sample efficiency and the required computation. To improve sample efficiency, recent work focuses on model-based RL, which interleaves model learning with planning. Recent methods further utilize policy learning, value estimation, and self-supervised learning as auxiliary objectives. In this paper we show that, surprisingly, a simple representation learning approach relying only on a latent dynamics model trained by latent temporal consistency is sufficient for high-performance RL. This holds when using pure planning with a dynamics model conditioned on the representation, but also when utilizing the representation as policy and value function features in model-free RL. In experiments, our approach learns an accurate dynamics model to solve challenging high-dimensional locomotion tasks with online planners while being 4.1× faster to train compared to ensemble-based methods. With model-free RL without planning, especially on high-dimensional tasks such as the DeepMind Control Suite Humanoid and Dog tasks, our approach outperforms model-free methods by a large margin and matches model-based methods’ sample efficiency while training 2.4× faster.
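The latent temporal-consistency objective at the core of the abstract can be sketched compactly: encode the observation, predict the next latent with a dynamics model, and regress it onto the encoded next observation, with no pixel reconstruction involved. The snippet below uses assumed module sizes and a plain stop-gradient target in place of the paper's exact setup.

```python
# Illustrative sketch (assumed module sizes, not the published code): a latent
# temporal-consistency loss. An encoder maps observations to latents, a latent
# dynamics model predicts the next latent from the current latent and action,
# and the loss matches the prediction to the encoded next observation.
import torch
import torch.nn as nn
import torch.nn.functional as F

obs_dim, act_dim, z_dim = 24, 6, 50
encoder = nn.Sequential(nn.Linear(obs_dim, 256), nn.ELU(), nn.Linear(256, z_dim))
dynamics = nn.Sequential(nn.Linear(z_dim + act_dim, 256), nn.ELU(),
                         nn.Linear(256, z_dim))

def temporal_consistency_loss(obs, actions, next_obs):
    z = encoder(obs)
    z_pred = dynamics(torch.cat([z, actions], dim=-1))
    with torch.no_grad():                 # stop-gradient on the target latent
        z_target = encoder(next_obs)
    return F.mse_loss(z_pred, z_target)

# Usage with a dummy batch of transitions.
B = 32
loss = temporal_consistency_loss(torch.randn(B, obs_dim),
                                 torch.randn(B, act_dim),
                                 torch.randn(B, obs_dim))
loss.backward()
```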