Explainable Publication Year Prediction of Eighteenth Century Texts with the BERT Model
Access rights
openAccess
publishedVersion
A4 Article in conference proceedings
This publication is imported from Aalto University research portal.
Date
2022
Language
en
Pages
10
Series
Proceedings of the 3rd Workshop on Computational Approaches to Historical Language Change, pp. 68-77
Abstract
In this paper, we describe a BERT model trained on the Eighteenth Century Collections Online (ECCO) dataset of digitized documents. The ECCO dataset poses unique modelling challenges due to the presence of Optical Character Recognition (OCR) artifacts. We establish the performance of the BERT model on a publication year prediction task against linear baseline models and human judgement, finding the BERT model to be superior to both and able to date the works, on average, with less than 7 years absolute error. We also explore how language change over time affects the model by analyzing the features the model uses for publication year predictions as given by the Integrated Gradients model explanation method.
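As a rough illustration of the Integrated Gradients idea mentioned in the abstract, the sketch below attributes a prediction to input features by averaging gradients along a straight path from a baseline to the input. It is only a toy under stated assumptions: `predict_year`, `predict_year_grad`, and `WEIGHTS` are hypothetical stand-ins, not the paper's BERT model, and the integral is approximated with a midpoint Riemann sum.

```python
import math

# Hypothetical toy "year predictor": NOT the paper's BERT model.
# The weights are made up purely for illustration.
WEIGHTS = [0.8, -0.5, 0.3]


def _sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def predict_year(x):
    """Map a feature vector to a year in the range 1700-1800."""
    z = sum(w * xi for w, xi in zip(WEIGHTS, x))
    return 1700.0 + 100.0 * _sigmoid(z)


def predict_year_grad(x):
    """Analytic gradient of predict_year with respect to each feature."""
    s = _sigmoid(sum(w * xi for w, xi in zip(WEIGHTS, x)))
    return [100.0 * s * (1.0 - s) * w for w in WEIGHTS]


def integrated_gradients(grad_f, x, baseline, steps=200):
    """Approximate Integrated Gradients with a midpoint Riemann sum."""
    totals = [0.0] * len(x)
    for k in range(steps):
        alpha = (k + 0.5) / steps
        # Point on the straight line from the baseline to the input.
        point = [b + alpha * (xi - b) for xi, b in zip(x, baseline)]
        for i, g in enumerate(grad_f(point)):
            totals[i] += g
    # Scale the averaged gradient by the input-baseline difference.
    return [(xi - b) * t / steps for xi, b, t in zip(x, baseline, totals)]


x = [1.0, 2.0, 0.5]
baseline = [0.0, 0.0, 0.0]
attributions = integrated_gradients(predict_year_grad, x, baseline)
```

A useful sanity check is the completeness property: the attributions should sum, up to approximation error, to the difference between the prediction at the input and the prediction at the baseline.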
Citation
Rastas, I, Ciarán Ryan, Y, Tiihonen, I, Mohammadnia Qaraei, M, Repo, L, Babbar, R, Mäkelä, E, Tolonen, M & Ginter, F 2022, Explainable Publication Year Prediction of Eighteenth Century Texts with the BERT Model. in Proceedings of the 3rd Workshop on Computational Approaches to Historical Language Change. Association for Computational Linguistics, pp. 68-77, Workshop on Computational Approaches to Historical Language Change, Dublin, Ireland, 26/05/2022. https://doi.org/10.18653/v1/2022.lchange-1.7