Advances in Weakly Supervised Learning of Morphology


dc.contributor Aalto-yliopisto fi
dc.contributor Aalto University en
dc.contributor.advisor Lagus, Krista, Dr., Aalto University, Department of Computer Science, Finland
dc.contributor.author Kohonen, Oskar
dc.date.accessioned 2015-08-06T09:01:22Z
dc.date.available 2015-08-06T09:01:22Z
dc.date.issued 2015
dc.identifier.isbn 978-952-60-6271-6 (electronic)
dc.identifier.isbn 978-952-60-6270-9 (printed)
dc.identifier.issn 1799-4942 (electronic)
dc.identifier.issn 1799-4934 (printed)
dc.identifier.issn 1799-4934 (ISSN-L)
dc.identifier.uri https://aaltodoc.aalto.fi/handle/123456789/17332
dc.description.abstract Morphological analysis provides a decomposition of words into smaller constituents. It is an important problem in natural language processing (NLP), particularly for morphologically rich languages whose large vocabularies make statistical modeling difficult. Morphological analysis has traditionally been approached with rule-based methods that yield accurate results but are expensive to produce. More recently, unsupervised machine learning methods have been shown to perform sufficiently well to benefit applications such as speech recognition and machine translation. Unsupervised methods, however, do not typically model allomorphy, that is, non-concatenative structure, for example pretty/prettier. Moreover, the accuracy of unsupervised methods remains far behind that of rule-based methods, with the best unsupervised methods yielding F-scores between 50% and 66% in Morpho Challenge 2010. We examine these problems with two approaches that have not previously attracted much attention in the field. First, we propose a novel extension to the popular unsupervised morphological segmentation method Morfessor Baseline to model allomorphy via the use of string transformations. Second, we examine the effect of weak supervision on accuracy by training on a small annotated data set in addition to a large unannotated data set. We propose two novel semi-supervised morphological segmentation methods, namely a semi-supervised extension of Morfessor Baseline and morphological segmentation with conditional random fields (CRF). The methods are evaluated on several languages with different morphological characteristics, including English, Estonian, Finnish, German and Turkish. The proposed methods are compared empirically to recently proposed weakly supervised methods. For the non-concatenative extension, we find that, while the string transformations identified by the model have high precision, their recall is low. In the overall evaluation, the non-concatenative extension improves accuracy on English, but not on other languages. For weak supervision, we find that the semi-supervised extension of Morfessor Baseline improves the accuracy of segmentation markedly over the unsupervised baseline. We find, however, that the discriminatively trained CRFs perform even better. In the empirical comparison, the CRF approach outperforms all other approaches on all included languages. Error analysis reveals that the CRF excels especially on affix accuracy. en
dc.format.extent 148 + app. 92
dc.format.mimetype application/pdf en
dc.language.iso en en
dc.publisher Aalto University en
dc.publisher Aalto-yliopisto fi
dc.relation.ispartofseries Aalto University publication series DOCTORAL DISSERTATIONS en
dc.relation.ispartofseries 91/2015
dc.relation.haspart [Publication 1]: Oskar Kohonen, Sami Virpioja, and Mikaela Klami. Allomorfessor: Towards Unsupervised Morpheme Analysis. In Evaluating Systems for Multilingual and Multimodal Information Access: 9th Workshop of the Cross-Language Evaluation Forum, CLEF 2008, Revised Selected Papers, volume 5706 of Lecture Notes in Computer Science, Aarhus, Denmark, pages 975-982, September 2009.
dc.relation.haspart [Publication 2]: Sami Virpioja, Oskar Kohonen, and Krista Lagus. Unsupervised Morpheme Analysis with Allomorfessor. In Multilingual Information Access Evaluation I. Text Retrieval Experiments, CLEF 2009, volume 6241 of Lecture Notes in Computer Science, Corfu, Greece, pages 609-616, September 2010.
dc.relation.haspart [Publication 3]: Sami Virpioja, Oskar Kohonen, and Krista Lagus. Evaluating the Effect of Word Frequencies in a Probabilistic Generative Model of Morphology. In Proceedings of the 18th Nordic Conference of Computational Linguistics NODALIDA 2011, Riga, Latvia, pages 230-237, May 2011.
dc.relation.haspart [Publication 4]: Oskar Kohonen, Sami Virpioja, and Krista Lagus. Semi-Supervised Learning of Concatenative Morphology. In Proceedings of the 11th Meeting of the ACL Special Interest Group on Computational Morphology and Phonology, Uppsala, Sweden, pages 78-86, July 2010.
dc.relation.haspart [Publication 5]: Teemu Ruokolainen, Oskar Kohonen, Sami Virpioja, and Mikko Kurimo. Supervised Morphological Segmentation in a Low-Resource Learning Setting using Conditional Random Fields. In Proceedings of the Seventeenth Conference on Computational Natural Language Learning (CoNLL), Sofia, Bulgaria, pages 29-37, August 2013.
dc.relation.haspart [Publication 6]: Teemu Ruokolainen, Oskar Kohonen, Sami Virpioja, and Mikko Kurimo. Painless Semi-Supervised Morphological Segmentation using Conditional Random Fields. In Proceedings of the 14th Conference of the European Chapter of the Association for Computational Linguistics (EACL), Gothenburg, Sweden, pages 84-89, May 2014.
dc.relation.haspart [Publication 7]: Teemu Ruokolainen, Oskar Kohonen, Kairit Sirts, Stig-Arne Grönroos, Sami Virpioja, and Mikko Kurimo. A Comparative Study on Semi-Supervised Morphological Segmentation. Submitted, Computational Linguistics, 27 pages, 2014.
dc.subject.other Linguistics en
dc.title Advances in Weakly Supervised Learning of Morphology en
dc.type G5 Artikkeliväitöskirja fi
dc.contributor.school Perustieteiden korkeakoulu fi
dc.contributor.school School of Science en
dc.contributor.department Tietotekniikan laitos fi
dc.contributor.department Department of Computer Science en
dc.subject.keyword morphology en
dc.subject.keyword allomorphy en
dc.subject.keyword machine learning en
dc.subject.keyword unsupervised learning en
dc.subject.keyword semi-supervised learning en
dc.identifier.urn URN:ISBN:978-952-60-6271-6
dc.type.dcmitype text en
dc.type.ontasot Doctoral dissertation (article-based) en
dc.type.ontasot Väitöskirja (artikkeli) fi
dc.contributor.supervisor Oja, Erkki, Distinguished Prof. Emeritus, Aalto University, Department of Information and Computer Science, Finland
dc.opn Borin, Lars, Prof., University of Gothenburg, Sweden
dc.date.dateaccepted 2015-03-30
dc.contributor.lab Computational Cognitive Systems group en
dc.contributor.lab Laskennalliset kognitiiviset järjestelmät fi
dc.rev Dyer, Chris, Prof.
dc.rev Ginter, Filip, Dr.
dc.date.defence 2015-08-26

