Low-Resource Active Learning of Morphological Segmentation

dc.contributor Aalto-yliopisto fi
dc.contributor Aalto University en
dc.contributor.author Grönroos, Stig-Arne
dc.contributor.author Hiovain, Katri
dc.contributor.author Smit, Peter
dc.contributor.author Rauhala, Ilona
dc.contributor.author Jokinen, Kristiina
dc.contributor.author Kurimo, Mikko
dc.contributor.author Virpioja, Sami
dc.date.accessioned 2017-05-11T09:09:52Z
dc.date.available 2017-05-11T09:09:52Z
dc.date.issued 2016
dc.identifier.citation Grönroos, S-A, Hiovain, K, Smit, P, Rauhala, I, Jokinen, K, Kurimo, M & Virpioja, S 2016, 'Low-Resource Active Learning of Morphological Segmentation' NORTHERN EUROPEAN JOURNAL OF LANGUAGE TECHNOLOGY, vol 4, 4, pp. 47-72. DOI: 10.3384/nejlt.2000-1533.1644 en
dc.identifier.other PURE UUID: de2319c1-df1d-4694-9fe9-70fc49598b57
dc.identifier.other PURE ITEMURL: https://research.aalto.fi/en/publications/lowresource-active-learning-of-morphological-segmentation(de2319c1-df1d-4694-9fe9-70fc49598b57).html
dc.identifier.other PURE FILEURL: https://research.aalto.fi/files/11718564/gronroos_et_al_nejlt16v4a4.pdf
dc.identifier.uri https://aaltodoc.aalto.fi/handle/123456789/25909
dc.description.abstract Many Uralic languages have a rich morphological structure, but lack the morphological analysis tools needed for efficient language processing. While creating a high-quality morphological analyzer requires a significant amount of expert labor, data-driven approaches may provide sufficient quality for many applications. We study how to create a statistical model for morphological segmentation with a large unannotated corpus and a small amount of annotated word forms selected using an active learning approach. We apply the procedure to two Finno-Ugric languages: Finnish and North Sámi. The semi-supervised Morfessor FlatCat method is used for statistical learning. For Finnish, we set up a simulated scenario to test various active learning query strategies. The best performance is provided by a coverage-based strategy on word-initial and word-final substrings. For North Sámi, we collect a set of human-annotated data. With 300 words annotated with our active learning setup, we see a relative improvement in morph boundary F1-score of 19% compared to unsupervised learning and 7.8% compared to random selection. en
dc.format.extent 26
dc.format.extent 47-72
dc.format.mimetype application/pdf
dc.language.iso en en
dc.relation.ispartofseries Volume 4 en
dc.rights openAccess en
dc.subject.other 113 Computer and information sciences en
dc.title Low-Resource Active Learning of Morphological Segmentation en
dc.type A1 Alkuperäisartikkeli tieteellisessä aikakauslehdessä fi
dc.description.version Peer reviewed en
dc.contributor.department Department of Signal Processing and Acoustics
dc.contributor.department University of Helsinki
dc.contributor.department Department of Computer Science
dc.subject.keyword 113 Computer and information sciences
dc.identifier.urn URN:NBN:fi:aalto-201705114284
dc.identifier.doi 10.3384/nejlt.2000-1533.1644
dc.type.version publishedVersion
