Robust Proximal Policy Optimization for Reinforcement Learning

dc.contributor: Aalto-yliopisto (fi)
dc.contributor: Aalto University (en)
dc.contributor.advisor: Babadi, Amin
dc.contributor.advisor: Zhao, Yi
dc.contributor.author: Moazzeni Bikani, Pooya
dc.contributor.school: Sähkötekniikan korkeakoulu (fi)
dc.contributor.supervisor: Pajarinen, Joni
dc.date.accessioned: 2022-10-23T17:06:19Z
dc.date.available: 2022-10-23T17:06:19Z
dc.date.issued: 2022-10-17
dc.description.abstract: Reinforcement learning is a family of machine learning algorithms in which the system (agent) learns to make optimal sequential decisions by interacting with its environment. Reinforcement learning problems are modelled as a Markov decision process, which is characterized by its transition probabilities and reward function. Most reinforcement learning algorithms are designed under the assumption that the transition probabilities and reward function do not vary over time. However, this assumption is not in line with real-world settings, where the environment is subject to change. Such change makes it harder for the agent to learn the optimal policy and act accordingly. This scenario is known as non-stationary reinforcement learning, in which the characteristics of the environment change from design to deployment and over time. This work begins with a review of policy gradient methods that exploit function approximation and are suitable for problems with large state and action spaces. Then, a robust algorithm based on the Proximal Policy Optimization (PPO) actor-critic algorithm is proposed to address the non-stationary reinforcement learning problem. The algorithm is tested on various reinforcement learning simulation environments and compared with several baselines, including PPO. (en)
dc.format.extent: 55+5
dc.format.mimetype: (en)
dc.identifier.uri: https://aaltodoc.aalto.fi/handle/123456789/117373
dc.identifier.urn: URN:NBN:fi:aalto-202210236159
dc.language.iso: en (en)
dc.location: P1 (fi)
dc.programme: CCIS - Master’s Programme in Computer, Communication and Information Sciences (TS2013) (fi)
dc.programme.major: Communications Engineering (fi)
dc.programme.mcode: ELEC3029 (fi)
dc.subject.keyword: reinforcement learning (en)
dc.subject.keyword: non-stationary environment (en)
dc.subject.keyword: proximal policy optimization (en)
dc.subject.keyword: trust region policy optimization (en)
dc.subject.keyword: robust proximal policy optimization (en)
dc.title: Robust Proximal Policy Optimization for Reinforcement Learning (en)
dc.type: G2 Pro gradu, diplomityö (fi)
dc.type.ontasot: Master's thesis (en)
dc.type.ontasot: Diplomityö (fi)
local.aalto.electroniconly: yes
local.aalto.openaccess: no
