Domain adapting LLMs for cybersecurity awareness
School of Science
Master's thesis
Language
en
Pages
52
Abstract
While Large Language Models (LLMs) have shown exceptional performance in natural language tasks, they struggle with domain-specialized queries. This thesis investigates the effectiveness of Domain-Adaptive Continuous Pretraining (DAP) for enhancing the cybersecurity awareness of three open-source pretrained LLMs—Llama-3.1-8B, DeepSeek-Distill-Qwen-14B, and Llama-3.3-70B—using relatively small domain-specific corpora (1M, 50M, and 118.8M tokens). The adapted models are evaluated against their base counterparts and a cybersecurity LLM baseline, Llama-Primus-Base (8B parameters, 2.77B tokens). Across three benchmarks—CTI-MCQ, CyberMetric, and SecEval—the DAP models outperformed both the base models and Llama-Primus-Base, with the 70B model achieving better results than the open-source baseline models. These results indicate that DAP can enhance LLMs' cybersecurity understanding with a small dataset and without Supervised Fine-Tuning (SFT) or Reinforcement Learning with Human Feedback (RLHF).
Supervisor
Hellas, Arto
Thesis advisors
Papadimitratos, Panagiotis
Hussain, Ahmed