Explainable Publication Year Prediction of Eighteenth Century Texts with the BERT Model
Repo Liina; Tiihonen Iiro; Ginter Filip; Babbar Rohit; Tolonen Mikko; Rastas Iiro; Qaraei Mohammedreza; Mäkelä Eetu; Ryan Yann
The permanent address of this publication is:
https://urn.fi/URN:NBN:fi-fe2022110164053
Abstract
In this paper, we describe a BERT model trained on the Eighteenth Century Collections Online (ECCO) dataset of digitized documents. The ECCO dataset poses unique modelling challenges due to the presence of Optical Character Recognition (OCR) artifacts. We evaluate the BERT model on a publication year prediction task against linear baseline models and human judgement, finding it superior to both and able to date the works with a mean absolute error of less than 7 years. We also explore how language change over time affects the model by analyzing the features it uses for publication year predictions, as identified by the Integrated Gradients model explanation method.