Real-Time GPU Usage Alert Service on Pre-Exascale HPC Clusters

dc.contributor: Aalto-yliopisto [fi]
dc.contributor: Aalto University [en]
dc.contributor.advisor: Ilvonen, Sami
dc.contributor.advisor: Alfthan, Sebastian von
dc.contributor.author: Jiang, Songlin
dc.contributor.school: Perustieteiden korkeakoulu [fi]
dc.contributor.supervisor: Zhao, Bo
dc.date.accessioned: 2025-01-12T17:30:42Z
dc.date.available: 2025-01-12T17:30:42Z
dc.date.issued: 2024-07-31
dc.description.abstract: Improving observability in large-scale distributed computing clusters has always been a complex problem, particularly in High-Performance Computing (HPC). Despite the growing popularity of GPU-accelerated jobs, traditional HPC workload managers such as Slurm lack a feature for collecting GPU usage history at the job level. In addition, as more workloads rely on extensive GPU resources, especially AI training jobs, GPUs have become the most power-consuming hardware in HPC systems, so it is essential to reduce resource waste on these devices. To address these issues, in this master's thesis we design and implement a real-time GPU usage alert service on top of the Slurm-based job monitoring system for the supercomputers at CSC - IT Center for Science: Puhti, Mahti, and the pre-exascale supercomputer LUMI (the fastest supercomputer in Europe according to the TOP500 list as of June 2024). We aim to have complete control over the data pipeline and to tailor it to the characteristics of HPC systems so that it performs well. To that end, we build our own GPU metrics collection infrastructure on the libraries provided by multiple GPU vendors, together with an in-memory real-time alert status checker service that relies on database triggers and LISTEN & NOTIFY. We also develop an alert algorithm to spot inefficient jobs that make only slight use of their GPUs. In addition, we benchmarked the alert service with random data under extreme conditions designed for pre-exascale supercomputers, and the whole system remained stable. Finally, we deployed the entire system in production for Puhti and Mahti, where it had been working well for months before we submitted the thesis.
The outcome of this master's thesis enables supercomputer administrators at CSC - IT Center for Science to learn about sub-optimal GPU resource utilization for specific jobs in real time and to find its cause, thus improving energy efficiency and significantly reducing resource waste in HPC clusters. [en]
dc.format.extent: 75 + 20
dc.format.mimetype: application/pdf [en]
dc.identifier.uri: https://aaltodoc.aalto.fi/handle/123456789/132845
dc.identifier.urn: URN:NBN:fi:aalto-202501121140
dc.language.iso: en [en]
dc.programme: Master’s Programme in Security and Cloud Computing (SECCLO) [fi]
dc.programme.major: Security and Cloud Computing [fi]
dc.programme.mcode: SCI3113 [fi]
dc.subject.keyword: HPC [en]
dc.subject.keyword: GPU [en]
dc.subject.keyword: monitoring system [en]
dc.subject.keyword: alert service [en]
dc.subject.keyword: observability [en]
dc.title: Real-Time GPU Usage Alert Service on Pre-Exascale HPC Clusters [en]
dc.type: G2 Pro gradu, diplomityö [fi]
dc.type.ontasot: Master's thesis [en]
dc.type.ontasot: Diplomityö [fi]
local.aalto.electroniconly: yes
local.aalto.openaccess: yes

Files

Original bundle

Name: master_Jiang_Songlin_2024.pdf
Size: 19.08 MB
Format: Adobe Portable Document Format