Log Analysis and Anomaly Detection in Log Files with Natural Language Processing Techniques

School of Electrical Engineering | Master's thesis

Date

2023-10-09

Major/Subject

Control, Robotics and Automation System

Mcode

ELEC3025

Degree programme

AEE - Master’s Programme in Automation and Electrical Engineering (TS2013)

Language

en

Pages

78

Abstract

Log analysis is crucial for maintaining and improving the performance, security, and reliability of modern computer systems. The increasing complexity of these systems, together with the exponential growth of log data, has driven the need for more advanced techniques for understanding and analyzing logs. In this project, we propose a log management infrastructure built on Elastic Stack, which provides statistical analysis and visualization features, combined with natural language processing (NLP) based approaches for log analysis and anomaly detection. We first build a classification model with 4 classes on a small sampled dataset as a proof-of-concept (POC) to validate that the proposed solution addresses the problem statement. We then scale the solution up to the full dataset to perform anomaly detection on real-world syslog data generated in industrial settings. This enables faster and more effective decision-making, which in turn frees the human workforce from the manual, repetitive process of log inspection. First, raw, unstructured textual logs in various formats are collected from different sources such as Continuous Integration/Continuous Deployment (CI/CD) servers, ambulatory monitoring devices, and automated test builds. Then, the collected logs are preprocessed: cleaned, normalized, and tokenized. Tokenization is carried out with both word and sub-word techniques, yielding word and sub-word tokens respectively. The tokens are then converted into meaningful numerical representations (word embeddings) using static and contextual embedding models, namely Word2Vec and the pretrained BERT and DistilBERT models. Finally, the word embeddings are fed into neural networks that classify log lines into designated labels.
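The preprocessing and tokenization steps described above can be illustrated with a minimal Python sketch. It is not the thesis implementation: the masking rules, the placeholder names (`<ts>`, `<ip>`, `<hex>`, `<num>`), the sample log line, and the toy vocabulary are all assumptions made for illustration. The sub-word step uses greedy longest-match-first splitting in the style of WordPiece, the scheme used by BERT-family tokenizers; a real pipeline would load the pretrained tokenizer and its vocabulary instead.

```python
import re

# Hypothetical masking rules (the abstract does not list the actual ones).
# Order matters: specific patterns run before the generic number rule.
MASKS = [
    (re.compile(r"\b\d{4}-\d{2}-\d{2}[T ]\d{2}:\d{2}:\d{2}(?:\.\d+)?\b"), "<ts>"),
    (re.compile(r"\b(?:\d{1,3}\.){3}\d{1,3}\b"), "<ip>"),
    (re.compile(r"\b0x[0-9a-fA-F]+\b"), "<hex>"),
    (re.compile(r"\b\d+\b"), "<num>"),
]


def normalize(line):
    """Mask variable fields, then lowercase the line."""
    line = line.strip()
    for pattern, placeholder in MASKS:
        line = pattern.sub(placeholder, line)
    return line.lower()


def word_tokenize(line):
    """Split a normalized line into placeholders and alphabetic words."""
    return re.findall(r"<[a-z]+>|[a-z_]+", line)


def wordpiece(word, vocab, unk="[UNK]"):
    """Greedy longest-match-first sub-word split (WordPiece style).

    Continuation pieces carry the '##' prefix. If any part of the word
    cannot be matched, the whole word maps to the unknown token.
    """
    pieces, start = [], 0
    while start < len(word):
        end = len(word)
        piece = None
        while end > start:
            candidate = word[start:end]
            if start > 0:
                candidate = "##" + candidate
            if candidate in vocab:
                piece = candidate
                break
            end -= 1
        if piece is None:
            return [unk]
        pieces.append(piece)
        start = end
    return pieces


# Hypothetical log line and toy vocabulary, for illustration only.
raw = "2023-10-09 12:00:01 ERROR sensor 7 at 10.0.0.5 returned 0x1F"
tokens = word_tokenize(normalize(raw))
toy_vocab = {"time", "##out", "##s"}
subwords = wordpiece("timeouts", toy_vocab)
```

Masking timestamps, addresses, and counters before tokenization keeps the vocabulary small, because lines that differ only in such fields map to the same token sequence.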
Experiments combining a DistilBERT embedding model with an LSTM classifier network on logs generated by patient monitoring devices achieved an accuracy of 0.99, with a macro-averaged precision of 0.96, recall of 0.93, and F1-score of 0.94 in a multi-label classification. These results are promising for automating the analysis of syslogs generated by test builds and patient monitoring systems.
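For readers unfamiliar with macro averaging, the following Python sketch shows how such scores are commonly computed: precision, recall, and F1 are calculated per label and then averaged with equal weight, so rare labels count as much as frequent ones. The label names and the tiny data set are hypothetical and are not taken from the thesis. The macro F1 here is the mean of the per-label F1 scores, which is one common convention and is assumed rather than confirmed by the source.

```python
def macro_scores(y_true, y_pred, labels):
    """Return macro-averaged (precision, recall, F1).

    y_true and y_pred are sequences of label sets, one set per log line,
    so the same function covers single-label and multi-label data.
    """
    precisions, recalls, f1s = [], [], []
    for label in labels:
        pairs = list(zip(y_true, y_pred))
        tp = sum(1 for t, p in pairs if label in t and label in p)
        fp = sum(1 for t, p in pairs if label not in t and label in p)
        fn = sum(1 for t, p in pairs if label in t and label not in p)
        # Scores default to 0.0 when the denominator is zero.
        precision = tp / (tp + fp) if tp + fp else 0.0
        recall = tp / (tp + fn) if tp + fn else 0.0
        if precision + recall:
            f1 = 2 * precision * recall / (precision + recall)
        else:
            f1 = 0.0
        precisions.append(precision)
        recalls.append(recall)
        f1s.append(f1)
    n = len(labels)
    return sum(precisions) / n, sum(recalls) / n, sum(f1s) / n


# Hypothetical labels and predictions, for illustration only.
labels = ["normal", "error"]
y_true = [{"normal"}, {"normal"}, {"error"}, {"error"}]
y_pred = [{"normal"}, {"error"}, {"error"}, {"error"}]
macro_p, macro_r, macro_f1 = macro_scores(y_true, y_pred, labels)
```

On imbalanced log data, accuracy can stay high while a rare label is frequently missed, which is why macro-averaged scores are usually reported alongside it.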

Supervisor

Alku, Paavo

Thesis advisor

Samiee, Kaveh
Sun, Meng

Keywords

log analysis, anomaly detection, elastic stack, natural language processing, machine learning, word embeddings