Evaluating dynamic batching strategies for energy-efficient inference serving: A performance study

School of Electrical Engineering | Master's thesis

Language

en

Pages

46

Abstract

The integration of artificial intelligence (AI) into real-time applications at scale has shifted performance bottlenecks from model training to inference serving, making energy-efficient inference serving increasingly relevant. While mainstream inference serving frameworks such as NVIDIA Triton support dynamic batching strategies to optimize latency and throughput, their impact on energy efficiency remains underexplored. This study evaluates how dynamic batching mechanisms shape the trade-off between energy consumption and performance in GPU-based inference serving environments. Using a systematic experimental approach with two widely used deep learning models, ResNet50 and MobileNet, the research quantifies the influence of dynamic batching on key performance metrics, namely latency, throughput, power consumption, and GPU utilization, across a range of workload profiles and batching configurations. The results show that although aggressive dynamic batching configurations can reduce average energy consumption and improve GPU utilization, they introduce latency penalties, especially for lightweight models and in low-traffic scenarios. By jointly analysing the batching strategy, workload conditions, system-level telemetry, and batching frequency, the results offer practical guidance for optimizing inference deployments for both performance and sustainability. The study contributes to the broader discourse on sustainable AI by providing empirical evidence and design guidance for energy-efficient inference serving in industry-scale deployments.
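
For context, Triton enables dynamic batching per model through its configuration file. The following is a minimal, illustrative config.pbtxt sketch; the model name, platform, batch sizes, and queue delay are assumed example values, not settings taken from this thesis.

    name: "resnet50"
    platform: "onnxruntime_onnx"
    max_batch_size: 32
    dynamic_batching {
      # Prefer dispatching batches of these sizes when enough requests queue up.
      preferred_batch_size: [ 8, 16 ]
      # Wait up to 500 us for more requests before sending a smaller batch;
      # longer delays yield fuller batches but add queueing latency.
      max_queue_delay_microseconds: 500
    }

The max_queue_delay_microseconds setting captures the trade-off the abstract describes: a longer delay produces fuller batches and lower average energy per inference, at the cost of added latency under light traffic.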

Supervisor

Premsankar, Gopika
