Decoupling popularity bias and user fairness in LLM-based recommendation systems
School of Science
Master's thesis
Unless otherwise stated, all rights belong to the author. You may download, display and print this publication for your own personal use. Commercial use is prohibited.
Language
en
Pages
107
Abstract
Large Language Models (LLMs) are rapidly being adopted as "plug-and-play" recommenders that require no task-specific training, yet their recommendations can still exhibit two long-standing problems: popularity bias (overexposing blockbusters) and consumer unfairness (unequal treatment of users who differ only in sensitive attributes). This thesis investigates whether these problems can be decoupled and simultaneously mitigated purely through prompt engineering, with no access to model weights. Working with the MovieLens-1M corpus, we generate 434,880 prompts that vary along three dimensions: how a user's historical tastes are sampled (top-rated, most recent, or a newly proposed "polarized" mix of likes and dislikes), whether sensitive attributes are disclosed (neutral versus gender-age, occupation, or all), and which popularity debiaser is applied (from a hard "exclude-popular" order to a gentle "temporal-diverse" request). We evaluate every prompt with a triad of metrics: Hit-Rate for accuracy, log-popularity difference (LPD) for popularity bias, and Jaccard similarity for the stability of recommendations when sensitive attributes are toggled on or off.

The results reveal four insights. First, supplying the LLM with a rich, polarized taste signal increases accuracy by 42%. Second, temporal diversity reduces popularity bias by 0.6 log-units while incurring only a 1% loss in accuracy, whereas hard "exclude-popular" filters decrease accuracy by up to 65%. Third, popularity bias and user fairness are orthogonal; once popularity is neutralized, adding even minimal demographic information still halves list overlap, confirming that the two dimensions must be audited separately. Finally, only one configuration, combining the polarized sampling strategy, the temporal-diverse debiaser, and an attribute-neutral prompt, simultaneously satisfies strict thresholds on accuracy (HR ≈ 0.85) and popularity bias (|LPD|
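The evaluation triad named in the abstract can be sketched in a few lines of Python. This is an illustrative reading, not the thesis's actual implementation: the function names, the set-based Jaccard computation, and the convention of one held-out relevant item per user for Hit-Rate are all assumptions made here.

```python
import math

def hit_rate(recommended_lists, relevant_items):
    """Fraction of users whose held-out relevant item (assumed: one per
    user) appears in their recommendation list."""
    hits = sum(1 for recs, rel in zip(recommended_lists, relevant_items)
               if rel in recs)
    return hits / len(recommended_lists)

def log_popularity_difference(recs, history, popularity):
    """Mean log-popularity of the recommended items minus that of the
    user's own history; positive values mean the recommendations skew
    more popular than the user's taste."""
    mean_log = lambda items: sum(math.log(popularity[i]) for i in items) / len(items)
    return mean_log(recs) - mean_log(history)

def jaccard(list_a, list_b):
    """Overlap between two recommendation lists, e.g. the lists produced
    with and without sensitive attributes disclosed; 1.0 = identical sets."""
    a, b = set(list_a), set(list_b)
    return len(a & b) / len(a | b)
```

Under this reading, "halves list overlap" corresponds to the Jaccard score dropping by roughly a factor of two when demographic attributes are added to an otherwise identical prompt.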
Supervisor
Korpi-Lagg, Maarit
Thesis advisor
Hossein Payberah, Amir
Author
Tahmasebinotarki, Shirin