Leveraging Uncertainty for Finnish L2 Speech Scoring with LLMs
Access rights
openAccess
CC BY-NC-ND
publishedVersion
A4 Article in conference proceedings
This publication is imported from Aalto University research portal.
Language
en
Pages
9
Series
The Workshop on Automatic Assessment of Atypical Speech (AAAS-2025). Proceedings of the Workshop
Abstract
Automatic speech assessment (ASA) supports learning but often requires extensive data, which is scarce for languages with fewer learners. Recent research shows that Large Language Models (LLMs) can generalize to new tasks with minimal training data using in-context learning (ICL). We find LLMs effective in estimating the proficiency of individuals learning Finnish as a second language (L2) when given a few examples of human expert grading. The proficiency grades produced by the model, when evaluating verbatim transcripts from an automatic speech recognition (ASR) system, agree with human ratings at a level comparable to the agreement between the human raters. Our experiments reveal that adding more grading demonstrations in ICL improves the model’s accuracy but, counterintuitively, increases its uncertainty when selecting an appropriate proficiency level. We show that this uncertainty can be leveraged further by creating soft labels: instead of assigning the most probable level (hard label), we aggregate the model’s confidence across all possible levels, resulting in noticeable performance improvements. Further analysis reveals that the sources of model uncertainty differ across ICL settings. In zero-shot, uncertainty stems from intrinsic response properties, such as proficiency level. In few-shot, it is driven by the relationship between the sample and the demonstrations.
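The contrast between hard and soft labels described in the abstract can be illustrated with a minimal sketch. The abstract does not state the exact aggregation rule, so the probability-weighted average used here is an assumption, as are the four-level scale, the example probabilities, and the function names `hard_label`, `soft_label`, and `entropy`. Entropy is included only as one common way to quantify the uncertainty of a distribution over levels; it is not confirmed as the paper's measure.

```python
import math


def hard_label(level_probs):
    """Return the single most probable proficiency level."""
    return max(level_probs, key=level_probs.get)


def soft_label(level_probs):
    """Return the probability-weighted average level (assumed aggregation)."""
    total = sum(level_probs.values())
    if total <= 0:
        raise ValueError("probabilities must sum to a positive value")
    return sum(level * p for level, p in level_probs.items()) / total


def entropy(level_probs):
    """Shannon entropy (in nats) of the normalized level distribution."""
    total = sum(level_probs.values())
    if total <= 0:
        raise ValueError("probabilities must sum to a positive value")
    return -sum(
        (p / total) * math.log(p / total)
        for p in level_probs.values()
        if p > 0
    )


# Hypothetical model confidence over an assumed scale of levels 1 to 4.
probs = {1: 0.05, 2: 0.40, 3: 0.45, 4: 0.10}

print("hard label:", hard_label(probs))
print("soft label:", round(soft_label(probs), 3))
print("entropy:", round(entropy(probs), 3))
```

In this hypothetical case the hard label picks level 3 even though level 2 is almost as likely, while the soft label falls between the two and so retains the information that the model was undecided.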
Citation
Voskoboinik, E, Phan, N, Grósz, T & Kurimo, M 2025, Leveraging Uncertainty for Finnish L2 Speech Scoring with LLMs. in The Workshop on Automatic Assessment of Atypical Speech (AAAS-2025). Proceedings of the Workshop. University of Tartu Library, Workshop on Automatic Assessment of Atypical Speech, Tallinn, Estonia, 05/03/2025. < https://hdl.handle.net/10062/107137 >