Author attribution as a privacy adversary: An evaluation of re-identification risk in transcribed speech
School of Electrical Engineering | Master's thesis
Unless otherwise stated, all rights belong to the author. You may download, display and print this publication for your own personal use. Commercial use is prohibited.
Language
en
Pages
66
Abstract
Automatic transcription introduces a neglected privacy risk: speakers can be re-identified from the linguistic content of speech alone. This thesis evaluates that risk by treating state-of-the-art Author Attribution (AA) models as privacy adversaries across three settings and five datasets. In closed-set conditions on datasets with consistent user topics, models attain Macro F1-scores above 0.90. In open-set few-shot conditions, they correctly re-identify unseen speakers from as few as five transcripts, reaching a peak Macro F1-score of 0.87. A compositional analysis reveals that the linguistic fingerprint is a composite of topic and linguistic style, with the latter potentially providing a stable cue for re-identification under topic-consistent conditions. This distinction has critical implications for anonymization. While named-entity masking alone fails to provide meaningful protection, paraphrasing with a large language model, which alters style, reduces adversary performance by more than 50% on conversational data. We conclude that robust privacy for transcribed speech requires style-aware anonymization and systematic evaluation against text-based re-identification, extending efforts beyond the acoustic signal to the linguistic channel.
Supervisor
Bäckström, Tom
Thesis advisor
Truong, Lucy
Rech, Silas