Risk Estimation Using Offline Reinforcement Learning in the Football Domain
School of Electrical Engineering | Master's thesis
Unless otherwise stated, all rights belong to the author. You may download, display and print this publication for your own personal use. Commercial use is prohibited.
Authors
Date
2022-08-22
Department
Major/Subject
Autonomous Systems
Mcode
ELEC3055
Degree programme
Master's Programme in ICT Innovation
Language
en
Pages
94+0
Series
Abstract
Interest in machine learning has grown across many sectors, and Reinforcement Learning (RL) has been used to solve complex challenges. However, due to its exploratory nature, RL is data-inefficient and cannot guarantee safety in many complex tasks. This thesis proposes a novel approach, OfSaCRE, that aims to reduce the number of constraint violations an agent commits during deployment. First, an Offline Safety Critic, which encodes the risk estimate, is trained from a dataset of past transitions using Offline RL techniques. The critic is then deployed alongside an RL agent through a safety control module, which selects the final action based on the estimated safety of each candidate action. In addition, an alternative training architecture is explored that enables the use of OfSaCRE during learning by penalizing the RL agent for relying on the safety critic. In football, statistics confirm that teams with more ball possession are more likely to win the match, so not losing the ball is of utmost importance. This thesis therefore evaluates OfSaCRE in the football domain, where the constraint is defined as losing the ball. The performance of different Offline RL algorithms and the effect of adding noise to the datasets are analyzed. The results show that Offline DQN combined with a noisy dataset is the most suitable configuration for this application, since it reduces the number of constraint violations without overly penalizing reward exploitation. For the training architecture, the results indicate that the number of constraint violations is reduced by more than half, but at the cost of not learning any useful reward-exploiting behaviour.
Description
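The abstract's safety-gating idea can be sketched as follows. This is a minimal illustration under stated assumptions, not the thesis's implementation: the names SafetyGate, risk_threshold, and safety_critic are hypothetical, and the critic is assumed to map a state to per-action risk estimates in [0, 1] (e.g. the estimated probability of losing the ball).

```python
import numpy as np

class SafetyGate:
    """Illustrative safety control module (hypothetical names, not the
    thesis's actual code). It combines an RL agent's action values with
    an offline-trained safety critic: actions whose estimated risk
    exceeds a threshold are masked out, and the agent's best remaining
    action is executed.
    """

    def __init__(self, safety_critic, risk_threshold=0.1):
        # safety_critic(state) -> np.ndarray of per-action risk
        # estimates in [0, 1], e.g. probability of losing the ball.
        self.safety_critic = safety_critic
        self.risk_threshold = risk_threshold

    def select_action(self, state, q_values):
        """Pick the highest-value action deemed safe enough.

        q_values: np.ndarray of the RL agent's action values for state.
        """
        risk = self.safety_critic(state)          # per-action risk
        safe_mask = risk <= self.risk_threshold   # within risk budget
        if safe_mask.any():
            # Among safe actions, follow the agent's own preference.
            masked_q = np.where(safe_mask, q_values, -np.inf)
            return int(np.argmax(masked_q))
        # No action clears the threshold: fall back to the least risky.
        return int(np.argmin(risk))
```

The alternative training architecture mentioned in the abstract could, under the same assumptions, be approximated by subtracting a penalty from the agent's reward whenever the gate overrides its preferred action, discouraging reliance on the safety critic during learning.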
Supervisor
Kyrki, Ville
Thesis advisor
Dzibela, Daniel
Keywords
reinforcement learning, safe reinforcement learning, offline reinforcement learning, risk estimation, football simulation, safety critic