Real-Time GPU Usage Alert Service on Pre-Exascale HPC Clusters

School of Science (Perustieteiden korkeakoulu) | Master's thesis

Date

2024-07-31

Major/Subject

Security and Cloud Computing

Mcode

SCI3113

Degree programme

Master’s Programme in Security and Cloud Computing (SECCLO)

Language

en

Pages

75 + 20

Abstract

Improving observability in large-scale distributed computing clusters has long been a complex problem, particularly in High-Performance Computing (HPC). Despite the growing popularity of GPU-accelerated jobs, traditional HPC workload managers such as Slurm lack a feature for collecting GPU usage history at the job level. In addition, as more workloads rely on extensive GPU resources, especially AI training jobs, GPUs have become the most power-consuming hardware in HPC systems, so it is essential to reduce resource waste on these devices. To address these issues, in this master's thesis we design and implement a real-time GPU usage alert service on top of the Slurm-based job monitoring system for the supercomputers at CSC - IT Center for Science: Puhti, Mahti, and the pre-exascale supercomputer LUMI (the fastest supercomputer in Europe according to the TOP500 list as of June 2024). We aim to have complete control over the data pipeline and to tailor it to the characteristics of HPC systems so that it performs well. To that end, we build our own GPU metrics collection infrastructure on the libraries provided by multiple GPU vendors, together with an in-memory real-time alert status checker service that relies on database triggers and LISTEN & NOTIFY. We also develop an alert algorithm to spot inefficient jobs with low GPU usage. In addition, we benchmarked the alert service with random data under extreme conditions designed for pre-exascale supercomputers, and the whole system remained stable. Finally, we deployed the entire system in production on Puhti and Mahti, where it had been working well for months before we submitted the thesis.
The outcome of this master's thesis enables supercomputer administrators at CSC - IT Center for Science to learn about sub-optimal GPU resource utilization by specific jobs in real time and to find its causes, thus improving energy efficiency and significantly reducing resource waste in HPC clusters.
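
As a rough illustration of what an in-memory alert status check for low GPU usage could look like, the Python sketch below keeps a short sliding window of utilization samples per job and reports when a job enters or leaves the alert state. It is not the algorithm developed in the thesis, which the abstract does not specify. The class name `LowUsageChecker`, the window length, and the threshold are all hypothetical, and the delivery of samples (for example through database notifications) is left out.

```python
from collections import defaultdict, deque
from statistics import mean


class LowUsageChecker:
    """Hypothetical checker: flags a job whose recent mean GPU utilization is low."""

    def __init__(self, window=5, threshold=10.0):
        self.window = window          # number of recent samples to average (assumed)
        self.threshold = threshold    # utilization percent below which a job is flagged (assumed)
        self.samples = defaultdict(lambda: deque(maxlen=window))
        self.alerting = set()         # job ids currently in the alert state

    def observe(self, job_id, gpu_util_percent):
        """Record one sample; return 'raised', 'cleared', or None if nothing changed."""
        history = self.samples[job_id]
        history.append(gpu_util_percent)
        if len(history) < self.window:
            return None               # not enough data yet to judge the job
        low = mean(history) < self.threshold
        if low and job_id not in self.alerting:
            self.alerting.add(job_id)
            return "raised"
        if not low and job_id in self.alerting:
            self.alerting.discard(job_id)
            return "cleared"
        return None

    def finish(self, job_id):
        """Drop all state for a job that has ended."""
        self.samples.pop(job_id, None)
        self.alerting.discard(job_id)
```

Keeping only a bounded window per job means memory grows with the number of running jobs rather than with the length of their history, which is one plausible reason to hold alert state in memory.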

Supervisor

Zhao, Bo

Thesis advisor

Ilvonen, Sami
Alfthan, Sebastian von

Keywords

HPC, GPU, monitoring system, alert service, observability
