Domain-adapting LLMs for cybersecurity awareness


School of Science | Master's thesis

Language

en

Pages

52

Abstract

While Large Language Models (LLMs) have shown exceptional performance on natural language tasks, they struggle with domain-specialized queries. This thesis investigates the effectiveness of Domain-Adaptive Continuous Pretraining (DAP) for enhancing the cybersecurity awareness of three open-source pretrained LLMs (Llama-3.1-8B, DeepSeek-Distill-Qwen-14B, and Llama-3.3-70B) using relatively small domain-specific corpora (1M, 50M, and 118.8M tokens). The adapted models are evaluated against their base counterparts and a cybersecurity LLM baseline, Llama-Primus-Base (8B parameters, pretrained on 2.77B tokens). Across three benchmarks (CTI-MCQ, CyberMetric, and SecEval), the DAP models outperformed both the base models and Llama-Primus-Base, with the 70B model achieving the best results among the open-source baselines. These results indicate that DAP can enhance LLMs' cybersecurity understanding with a small dataset and without Supervised Fine-Tuning (SFT) or Reinforcement Learning from Human Feedback (RLHF).
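Unlike SFT or RLHF, DAP continues training on raw domain text with the standard next-token prediction (causal language modeling) objective, so no instruction-response pairs or reward model are needed. A minimal sketch of that objective on a toy vocabulary with hand-picked logits (an illustration only, not the thesis's actual training setup):

```python
import math

def next_token_loss(logits, targets):
    """Average cross-entropy of predicting each next token.

    logits:  list of per-position score lists over the vocabulary
    targets: list of correct next-token ids, one per position
    """
    total = 0.0
    for scores, target in zip(logits, targets):
        # numerically stable log-sum-exp for the log-softmax denominator
        m = max(scores)
        log_z = m + math.log(sum(math.exp(s - m) for s in scores))
        # cross-entropy at this position: -log p(target)
        total += log_z - scores[target]
    return total / len(targets)

# Toy example: 3 positions, vocabulary of 4 tokens.
logits = [
    [2.0, 0.1, 0.1, 0.1],   # model strongly favours token 0
    [0.5, 0.5, 0.5, 0.5],   # uniform scores: per-position loss = ln(4)
    [0.1, 0.1, 3.0, 0.1],   # model favours token 2
]
targets = [0, 1, 2]
loss = next_token_loss(logits, targets)
```

During DAP, minimizing this loss over the cybersecurity corpus shifts the model's distribution toward domain text while leaving the pretraining objective unchanged.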

Supervisor

Hellas, Arto

Thesis advisor

Papadimitratos, Panagiotis
Hussain, Ahmed
