Domain adapting LLMs for cybersecurity awareness
| dc.contributor | Aalto-yliopisto | fi |
| dc.contributor | Aalto University | en |
| dc.contributor.advisor | Papadimitratos, Panagiotis | |
| dc.contributor.advisor | Hussain, Ahmed | |
| dc.contributor.author | Salahuddin, Salahuddin | |
| dc.contributor.school | Perustieteiden korkeakoulu | fi |
| dc.contributor.school | School of Science | en |
| dc.contributor.supervisor | Hellas, Arto | |
| dc.date.accessioned | 2025-10-20T17:01:38Z | |
| dc.date.available | 2025-10-20T17:01:38Z | |
| dc.date.issued | 2025-09-26 | |
| dc.description.abstract | While Large Language Models (LLMs) have shown exceptional performance in natural language tasks, they struggle with domain-specialized queries. This thesis investigates the effectiveness of Domain-Adaptive Continuous Pretraining (DAP) for enhancing the cybersecurity awareness of three open-source pretrained LLMs—Llama-3.1-8B, DeepSeek-Distill-Qwen-14B, and Llama-3.3-70B—on relatively small domain-specific corpora (1M, 50M, and 118.8M tokens). The adapted models are evaluated against their base counterparts and a cybersecurity LLM baseline, Llama-Primus-Base (8B parameters, 2.77B tokens). Across three benchmarks—CTI-MCQ, CyberMetric, and SecEval—the DAP models outperformed both the base models and Llama-Primus-Base, with the 70B model demonstrating better results than the open-source baseline models. These results indicate that DAP can enhance LLMs' cybersecurity understanding with a small dataset and no Supervised Fine-Tuning (SFT) or Reinforcement Learning from Human Feedback (RLHF). | en |
| dc.format.extent | 52 | |
| dc.format.mimetype | application/pdf | en |
| dc.identifier.uri | https://aaltodoc.aalto.fi/handle/123456789/140103 | |
| dc.identifier.urn | URN:NBN:fi:aalto-202510208272 | |
| dc.language.iso | en | en |
| dc.programme | Master's Programme in Security and Cloud Computing | en |
| dc.programme.major | Security and Cloud Computing | en |
| dc.subject.keyword | generative AI | en |
| dc.subject.keyword | cybersecurity | en |
| dc.subject.keyword | large language models | en |
| dc.subject.keyword | domain adaptive continuous pretraining | en |
| dc.subject.keyword | threat intelligence | en |
| dc.subject.keyword | foundation models | en |
| dc.title | Domain adapting LLMs for cybersecurity awareness | en |
| dc.type | G2 Pro gradu, diplomityö | fi |
| dc.type.ontasot | Master's thesis | en |
| dc.type.ontasot | Diplomityö | fi |
| local.aalto.electroniconly | yes | |
| local.aalto.openaccess | no | |