Minimizing blast radius of chaos engineering experiments via steady-state metrics forecasting

School of Electrical Engineering (Sähkötekniikan korkeakoulu) | Master's thesis

Mcode

ELEC3055

Language

en

Pages

61 + 3

Abstract

Chaos Engineering (CE) deliberately injects faults into distributed systems to better understand and improve their resilience. Studying these intentional disruptions yields insights that help improve system performance and the overall user experience. Two main challenges remain: reducing the negative impact, or "blast radius", of a CE experiment without diluting its value, and identifying a standardized set of metrics to monitor while the experiment runs. This research addresses both challenges by monitoring the application- and system-level metrics known as the Golden Signals, together with a steady-state metric called the Apdex score, during a CE experiment. Pearson and Spearman correlation analyses, alongside Granger causality tests, identify a strong connection between the Golden Signals and the Apdex score. The study also introduces a new health-check system design that uses the Apdex score to stop a CE experiment automatically when a preset threshold is violated. The design further introduces a method for terminating the CE experiment early based on forecasted Apdex scores. This method limits potential system damage while still revealing key system weaknesses, striking a balance between risk and discovery.
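The abort logic described in the abstract can be illustrated with a minimal Python sketch. It is not the thesis's implementation: the Apdex formula is the standard one (satisfied samples plus half of the tolerating samples, divided by all samples, with the tolerating band ending at four times the target time T), while the function names, the 0.5 s target, the 0.8 threshold, and the linear-trend forecaster are all assumptions made here for illustration. The abstract does not say which forecasting model the thesis uses.

```python
from typing import Sequence


def apdex(latencies_s: Sequence[float], t: float = 0.5) -> float:
    """Standard Apdex score for one window of response times (seconds).

    Satisfied: latency <= t. Tolerating: t < latency <= 4t.
    The target time t = 0.5 s is an illustrative assumption.
    """
    if not latencies_s:
        raise ValueError("need at least one latency sample")
    satisfied = sum(1 for x in latencies_s if x <= t)
    tolerating = sum(1 for x in latencies_s if t < x <= 4 * t)
    return (satisfied + tolerating / 2) / len(latencies_s)


def forecast_apdex(scores: Sequence[float], horizon: int = 1) -> float:
    """Extrapolate a least-squares linear trend `horizon` windows ahead.

    This is a stand-in forecaster (an assumption), clamped to [0, 1].
    """
    n = len(scores)
    if n == 0:
        raise ValueError("need at least one score")
    if n == 1:
        return scores[0]
    mean_x = (n - 1) / 2
    mean_y = sum(scores) / n
    sxx = sum((i - mean_x) ** 2 for i in range(n))
    sxy = sum((i - mean_x) * (y - mean_y) for i, y in enumerate(scores))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    predicted = intercept + slope * (n - 1 + horizon)
    return min(1.0, max(0.0, predicted))


def should_abort(scores: Sequence[float], threshold: float = 0.8,
                 horizon: int = 3) -> bool:
    """Abort if the latest Apdex violates the threshold, or if the
    forecasted Apdex is expected to violate it within `horizon` windows.
    Threshold and horizon values are hypothetical."""
    if not scores:
        return False
    if scores[-1] < threshold:
        return True  # reactive stop: threshold already violated
    return forecast_apdex(scores, horizon) < threshold  # early stop
```

With these assumed values, a declining series such as `[1.0, 0.95, 0.9]` triggers an early stop even though the latest score is still above 0.8, whereas a flat series at 0.95 does not.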

Description

Supervisor

Monperrus, Martin

Thesis advisor

Ron Arteaga, Javier
Stenbock, Tore
