Building an auto-scaling and fault tolerant system using RabbitMQ and KEDA

School of Science | Master's thesis

Language

en

Pages

48

Abstract

The proliferation of Large Language Models (LLMs) has revolutionized AI applications, yet their deployment faces significant challenges in achieving optimal latency and throughput due to their massive computational demands, network delays, and hardware limitations. This thesis addresses these issues by proposing and evaluating an asynchronous processing architecture integrated with automatic horizontal scaling within a Kubernetes cluster. Leveraging RabbitMQ for reliable message queuing and KEDA for event-driven autoscaling based on queue metrics, the study aimed to construct a fault-tolerant and highly scalable LLM inference system. Through empirical experimentation and performance evaluations, the research demonstrates the system's feasibility in mitigating latency and throughput challenges. However, the findings critically highlight that the full realization of performance gains and efficient resource utilization is contingent upon the meticulous configuration and precise tuning of all system components. This work serves as a comprehensive guideline for implementing and optimizing auto-scaling systems for high-performance LLM applications.
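The event-driven autoscaling mechanism described in the abstract — KEDA scaling inference workers based on RabbitMQ queue depth — is typically expressed as a KEDA `ScaledObject` with a `rabbitmq` trigger. The sketch below is illustrative only: the deployment name, queue name, replica bounds, and queue-length threshold are assumptions, not values taken from the thesis.

```yaml
apiVersion: keda.sh/v1alpha1
kind: ScaledObject
metadata:
  name: llm-inference-scaler        # hypothetical name
spec:
  scaleTargetRef:
    name: llm-inference-deployment  # hypothetical Deployment running the LLM workers
  minReplicaCount: 1                # keep one warm replica to bound cold-start latency
  maxReplicaCount: 10               # cap scale-out to available GPU capacity
  triggers:
    - type: rabbitmq
      metadata:
        queueName: inference-requests   # hypothetical request queue
        mode: QueueLength               # scale on backlog size
        value: "20"                     # target messages per replica (assumed tuning value)
        hostFromEnv: RABBITMQ_HOST      # AMQP connection string from the pod environment
```

With this configuration, KEDA polls the queue and adjusts the worker deployment so that each replica handles roughly `value` pending messages; as the abstract notes, the threshold and replica bounds require careful tuning for the performance gains to materialize.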

Supervisor

Ylä-Jääski, Antti

Thesis advisor

Lundell, Tommi
