Author attribution as a privacy adversary: An evaluation of re-identification risk in transcribed speech

School of Electrical Engineering | Master's thesis

Language

en

Pages

66

Abstract

Automatic transcription introduces a neglected privacy risk: speakers can be re-identified from the linguistic content of speech alone. This thesis evaluates that risk by treating state-of-the-art Author Attribution (AA) models as privacy adversaries across three settings and five datasets. In closed-set conditions on datasets with consistent user topics, models attain Macro F1-scores above 0.90. In open-set few-shot conditions, they correctly re-identify unseen speakers from as few as five transcripts, reaching a peak Macro F1-score of 0.87. A compositional analysis reveals that the linguistic fingerprint is a composite of topic and linguistic style, with the latter potentially providing a stable cue for re-identification when topics remain consistent. This distinction has critical implications for anonymization. While named-entity masking alone fails to provide meaningful protection, paraphrasing with a large language model, which alters style, reduces adversary performance by more than 50% on conversational data. We conclude that robust privacy for transcribed speech requires style-aware anonymization and systematic evaluation against text-based re-identification, extending efforts beyond the acoustic signal to the linguistic channel.
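
For readers unfamiliar with the metric, the Macro F1-score reported in the abstract is conventionally the unweighted mean of per-class F1 scores, so every speaker counts equally regardless of how many transcripts they contribute. The sketch below illustrates that standard definition only; the function name `macro_f1`, the toy speaker labels, and the choice to score classes with no true or predicted samples as 0 are assumptions for illustration, not details taken from the thesis.

```python
from collections import Counter


def macro_f1(y_true, y_pred):
    """Unweighted mean of per-class F1 (hypothetical helper, not from the thesis)."""
    # Score every label that appears in either the truth or the predictions.
    labels = sorted(set(y_true) | set(y_pred))
    tp, fp, fn = Counter(), Counter(), Counter()
    for truth, pred in zip(y_true, y_pred):
        if truth == pred:
            tp[truth] += 1
        else:
            fp[pred] += 1   # predicted speaker gains a false positive
            fn[truth] += 1  # true speaker gains a false negative
    scores = []
    for label in labels:
        # F1 = 2TP / (2TP + FP + FN); assumed 0.0 when the denominator is zero.
        denom = 2 * tp[label] + fp[label] + fn[label]
        scores.append(2 * tp[label] / denom if denom else 0.0)
    return sum(scores) / len(scores)


# Toy example: four transcripts, three hypothetical speakers.
true_speakers = ["a", "a", "b", "c"]
predicted = ["a", "b", "b", "c"]
score = macro_f1(true_speakers, predicted)
```

Because each speaker contributes equally to the average, a high Macro F1 indicates that the adversary re-identifies most speakers well, not just the most frequent ones.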

Description

Supervisor

Bäckström, Tom

Thesis advisor

Truong, Lucy
Rech, Silas
