Real-Time GPU Usage Alert Service on Pre-Exascale HPC Clusters
School of Science | Master's thesis
Unless otherwise stated, all rights belong to the author. You may download, display and print this publication for Your own personal use. Commercial use is prohibited.
Authors
Date
2024-07-31
Department
Major/Subject
Security and Cloud Computing
Mcode
SCI3113
Degree programme
Master’s Programme in Security and Cloud Computing (SECCLO)
Language
en
Pages
75 + 20
Series
Abstract
Improving observability in large-scale distributed computing clusters has always been a complex problem, particularly in High-Performance Computing (HPC). Despite the growing popularity of GPU-accelerated jobs, traditional HPC workload managers such as Slurm lack a feature for collecting GPU usage history at the job level. In addition, as more workloads rely on extensive GPU resources, especially AI training jobs, GPUs have become the most power-consuming hardware in HPC systems, so it is essential to reduce resource waste on these devices. To address these issues, in this master's thesis we design and implement a real-time GPU usage alert service on top of the Slurm-based job monitoring system for the supercomputers at CSC - IT Center for Science: Puhti, Mahti, and the pre-exascale supercomputer LUMI (the fastest supercomputer in Europe according to the TOP500 list as of June 2024). We aim to have complete control over the data pipeline and to tailor it to the characteristics of HPC systems so that it performs well. To that end, we build our own GPU metrics collection infrastructure on the libraries provided by multiple GPU vendors, together with an in-memory real-time alert status checker service that relies on database triggers and LISTEN & NOTIFY. We also develop an alert algorithm to spot inefficient jobs that show only a small amount of GPU usage. In addition, we benchmarked the alert service with random data under extreme conditions designed for pre-exascale supercomputers, and the whole system remained stable. Finally, we deployed the entire system in production for Puhti and Mahti, where it had been working well for months before we submitted the thesis.
The outcome of this master's thesis enables supercomputer administrators at CSC - IT Center for Science to learn about sub-optimal GPU resource utilization by specific jobs in real time and to find its cause, thus improving energy efficiency and significantly reducing resource waste in HPC clusters.
Description
Supervisor
Zhao, Bo
Thesis advisor
Ilvonen, Sami
Alfthan, Sebastian von
Keywords
HPC, GPU, monitoring system, alert service, observability