Log Analysis and Anomaly Detection in Log Files with Natural Language Processing Techniques
School of Electrical Engineering (Sähkötekniikan korkeakoulu)
Master's thesis
Unless otherwise stated, all rights belong to the author. You may download, display and print this publication for Your own personal use. Commercial use is prohibited.
Authors
Date
2023-10-09
Department
Major/Subject
Control, Robotics and Automation System
Mcode
ELEC3025
Degree programme
AEE - Master’s Programme in Automation and Electrical Engineering (TS2013)
Language
en
Pages
78
Series
Abstract
Log analysis is crucial for maintaining and improving the performance, security, and reliability of modern computer systems. The increasing complexity of these systems, along with the exponential growth of log data, has driven the need for more advanced techniques for understanding and analyzing logs. In this project, we propose a log management infrastructure built on Elastic Stack that supports statistical analysis with visualization features, together with natural language processing (NLP) based approaches to log analysis and anomaly detection. We build upon a classification model with 4 different classes on a small sampled dataset to develop a proof of concept (POC) and validate that the proposed solution addresses the problem statement. We then scale the solution up to the full dataset to develop anomaly detection for real-world syslog data generated in industrial settings. This enables faster and more effective decision-making, which in turn frees the human workforce from the manual, repetitive process of log inspection. First, raw, unstructured textual logs in various formats are collected from different sources such as Continuous Integration/Continuous Deployment (CI/CD) servers, ambulatory monitoring devices, and automated test builds. Then, the collected logs are preprocessed: cleaned, normalized, and tokenized. Tokenization is carried out with both word and sub-word techniques, yielding word and sub-word tokens respectively. The tokens are then converted into numerical representations (word embeddings) using static and contextual embedding models such as Word2Vec and the pretrained BERT and DistilBERT models. Finally, the word embeddings are fed into neural networks that classify log lines into designated labels.
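The preprocessing and tokenization steps described above can be illustrated with a minimal Python sketch. It is not the thesis implementation: the masking rules, the sample log line, and the tiny sub-word vocabulary are all hypothetical, and the sub-word splitter is only a toy greedy longest-match routine in the style of WordPiece (the scheme used by BERT and DistilBERT tokenizers), not a real pretrained tokenizer.

```python
import re

# Hypothetical masking rules; the abstract does not list the exact
# normalization steps, so these are illustrative assumptions.
MASKS = [
    (re.compile(r"\b\d{4}-\d{2}-\d{2}[T ]\d{2}:\d{2}:\d{2}\b"), "<ts>"),
    (re.compile(r"\b\d{1,3}(?:\.\d{1,3}){3}\b"), "<ip>"),
    (re.compile(r"\b0x[0-9a-f]+\b", re.IGNORECASE), "<hex>"),
    (re.compile(r"\b\d+\b"), "<num>"),
]


def normalize(line):
    """Replace volatile fields with placeholders, then lowercase."""
    text = line.strip()
    for pattern, placeholder in MASKS:
        text = pattern.sub(placeholder, text)
    return text.lower()


def word_tokenize(text):
    """Split into word tokens, keeping placeholders such as <num> intact."""
    return re.findall(r"<\w+>|\w+", text)


def subword_tokenize(word, vocab):
    """Greedy longest-match-first split; non-initial pieces carry '##'."""
    pieces = []
    start = 0
    while start < len(word):
        end = len(word)
        match = None
        while end > start:
            candidate = word[start:end]
            if start > 0:
                candidate = "##" + candidate
            if candidate in vocab:
                match = candidate
                break
            end -= 1
        if match is None:
            # No vocabulary entry covers this position: give up on the word.
            return ["[UNK]"]
        pieces.append(match)
        start = end
    return pieces


# Hypothetical log line and vocabulary, for illustration only.
raw = "2023-10-09 12:30:45 ERROR sensor 17 timeout at 0x1F from 10.0.0.5"
words = word_tokenize(normalize(raw))
toy_vocab = {"time", "##out", "sensor", "##s"}
pieces = subword_tokenize("timeout", toy_vocab)
```

In this sketch, word-level tokens are the kind of input a static model such as Word2Vec would consume, while a contextual model would normally apply its own pretrained sub-word vocabulary rather than a hand-written one.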
Experiments combining the DistilBERT embedding model with an LSTM classifier network on logs generated by patient monitoring devices achieved an accuracy of 0.99, with macro-averaged precision of 0.96, recall of 0.93, and F1-score of 0.94, in a multi-label classification setting. The results show promise for automating the analysis of syslogs generated by test builds and patient monitoring systems.
Description
Supervisor
Alku, Paavo
Thesis advisor
Samiee, Kaveh
Sun, Meng
Keywords
log analysis, anomaly detection, elastic stack, natural language processing, machine learning, word embeddings