Low-Resource Active Learning of Morphological Segmentation

Access rights
openAccess
A1 Original article in a scientific journal
This publication is imported from Aalto University research portal.
Date
2016
Language
en
Pages
26
47-72
Series
NORTHERN EUROPEAN JOURNAL OF LANGUAGE TECHNOLOGY, Volume 4
Abstract
Many Uralic languages have a rich morphological structure, but lack the morphological analysis tools needed for efficient language processing. While creating a high-quality morphological analyzer requires a significant amount of expert labor, data-driven approaches may provide sufficient quality for many applications. We study how to create a statistical model for morphological segmentation with a large unannotated corpus and a small amount of annotated word forms selected using an active learning approach. We apply the procedure to two Finno-Ugric languages: Finnish and North Sámi. The semi-supervised Morfessor FlatCat method is used for statistical learning. For Finnish, we set up a simulated scenario to test various active learning query strategies. The best performance is provided by a coverage-based strategy on word-initial and word-final substrings. For North Sámi, we collect a set of human-annotated data. With 300 words annotated using our active learning setup, we see a relative improvement in morph boundary F1-score of 19% compared to unsupervised learning and 7.8% compared to random selection.
Citation
Grönroos, S-A, Hiovain, K, Smit, P, Rauhala, I, Jokinen, K, Kurimo, M & Virpioja, S 2016, 'Low-Resource Active Learning of Morphological Segmentation', NORTHERN EUROPEAN JOURNAL OF LANGUAGE TECHNOLOGY, vol. 4, 4, pp. 47-72. https://doi.org/10.3384/nejlt.2000-1533.1644