Soft Token Attacks Cannot Reliably Audit Unlearning in Large Language Models
Access rights
openAccess
CC BY
publishedVersion
A4 Article in conference proceedings
This publication is imported from Aalto University research portal.
View publication in the Research portal
View/Open full text file from the Research portal
Other link related to publication
Unless otherwise stated, all rights belong to the author. You may download, display and print this publication for your own personal use. Commercial use is prohibited.
Language
en
Series
Findings of the Association for Computational Linguistics: EMNLP 2025, pp. 2183–2192
Abstract
Large language models (LLMs) are trained on massive datasets. However, these datasets often contain undesirable content, e.g., harmful texts, personal information, and copyrighted material. To address this, machine unlearning aims to remove information from trained models. Recent work has shown that soft token attacks (STAs) can successfully extract unlearned information from LLMs. In this work, we show that STAs can be an inadequate tool for auditing unlearning. Using common unlearning benchmarks, i.e., Who Is Harry Potter? and TOFU, we demonstrate that, in a strong auditor setting, such attacks can elicit any information from the LLM, regardless of (1) the deployed unlearning algorithm, and (2) whether the queried content was originally present in the training corpus. We also show that an STA with just a few soft tokens (1–10) can elicit random strings over 400 characters long, demonstrating that STAs must be used carefully to effectively audit unlearning. Example code can be found at https://github.com/IntelLabs/LLMart/tree/main/examples/unlearning
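For intuition, the sketch below shows the general shape of a soft token attack: a small number of trainable embedding vectors are prepended to the input and optimized, against a frozen model, to maximize the likelihood of an arbitrary target string. This is a minimal illustration assuming a Hugging Face causal LM (with gpt2 as a hypothetical stand-in model), not the authors' implementation; see the LLMart examples linked above for that.

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "gpt2"  # hypothetical stand-in; the paper audits unlearned LLMs
tok = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name).eval()
for p in model.parameters():
    p.requires_grad_(False)  # the model stays frozen; only soft tokens train

# Arbitrary target: the paper's point is that a strong auditor can elicit
# *any* string this way, even content never present in the training corpus.
target = "Completely arbitrary text the auditor wants the model to emit."
target_ids = tok(target, return_tensors="pt").input_ids          # (1, T)

embed = model.get_input_embeddings()
n_soft = 10  # the paper finds 1-10 soft tokens suffice for >400-char strings
init = torch.randint(0, embed.num_embeddings, (n_soft,))
soft = torch.nn.Parameter(embed.weight[init].clone().unsqueeze(0))  # (1, n, d)
opt = torch.optim.Adam([soft], lr=1e-2)

target_emb = embed(target_ids).detach()                          # (1, T, d)
for step in range(500):
    inputs = torch.cat([soft, target_emb], dim=1)  # soft prefix + target
    logits = model(inputs_embeds=inputs).logits
    # position i predicts token i+1, so take the T positions that precede
    # each target token
    pred = logits[:, n_soft - 1 : -1, :]
    loss = torch.nn.functional.cross_entropy(
        pred.reshape(-1, pred.size(-1)), target_ids.reshape(-1)
    )
    opt.zero_grad()
    loss.backward()
    opt.step()
```

Because the soft prefix lives in continuous embedding space rather than the discrete vocabulary, gradient descent can drive even a frozen model toward essentially any continuation; this is why, per the abstract, a strong auditor equipped with STAs cannot distinguish content that was unlearned from content that was never learned at all.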
Citation
Chen, H, Szyller, S, Xu, W & Himayat, N 2025, Soft Token Attacks Cannot Reliably Audit Unlearning in Large Language Models. in Findings of the Association for Computational Linguistics: EMNLP 2025. Association for Computational Linguistics, pp. 2183–2192, Conference on Empirical Methods in Natural Language Processing, Suzhou, China, 04/11/2025. <https://aclanthology.org/2025.findings-emnlp.117/>