Induction of the morphology of natural language : unsupervised morpheme segmentation with application to automatic speech recognition

dc.contributor: Aalto-yliopisto [fi]
dc.contributor: Aalto University [en]
dc.contributor.author: Creutz, Mathias
dc.contributor.department: Department of Computer Science and Engineering [en]
dc.contributor.department: Tietotekniikan osasto [fi]
dc.contributor.lab: Laboratory of Computer and Information Science [en]
dc.contributor.lab: Informaatiotekniikan laboratorio [fi]
dc.date.accessioned: 2012-02-17T07:46:59Z
dc.date.available: 2012-02-17T07:46:59Z
dc.date.issued: 2006-06-15
dc.description.abstract: In order to develop computer applications that successfully process natural language data (text and speech), one needs good models of the vocabulary and grammar of as many languages as possible. According to standard linguistic theory, words consist of morphemes, which are the smallest individually meaningful elements in a language. Since an immense number of word forms can be constructed by combining a limited set of morphemes, the capability of understanding and producing new word forms depends on knowing which morphemes are involved (e.g., "water, water+s, water+y, water+less, water+less+ness, sea+water"). Morpheme boundaries are not normally marked in text unless they coincide with word boundaries. The main objective of this thesis is to devise a method that discovers the likely locations of the morpheme boundaries in words of any language. The method proposed, called Morfessor, learns a simple model of concatenative morphology (word forming) in an unsupervised manner from plain text. Morfessor is formulated as a Bayesian, probabilistic model. That is, it does not rely on predefined grammatical rules of the language, but makes use of statistical properties of the input text. Morfessor situates itself between two types of existing unsupervised methods: morphology learning vs. word segmentation algorithms. In contrast to existing morphology learning algorithms, Morfessor can handle words consisting of a varying and possibly high number of morphemes. This is a requirement for coping with highly-inflecting and compounding languages, such as Finnish. In contrast to existing word segmentation methods, Morfessor learns a simple grammar that takes into account sequential dependencies, which improves the quality of the proposed segmentations.
Morfessor is evaluated in two complementary ways in this work: directly by comparing to linguistic reference morpheme segmentations of Finnish and English words and indirectly as a component of a large (or virtually unlimited) vocabulary Finnish speech recognition system. In both cases, Morfessor is shown to outperform state-of-the-art solutions. The linguistic reference segmentations were produced as part of the current work, based on existing linguistic resources. This has resulted in a morphological gold standard, called Hutmegs, containing analyses of a large number of Finnish and English word forms. [en]
dc.description.version: reviewed [en]
dc.format.extent: 110, [130]
dc.format.mimetype: application/pdf
dc.identifier.isbn: 951-22-8211-9
dc.identifier.issn: 1459-7020
dc.identifier.uri: https://aaltodoc.aalto.fi/handle/123456789/2715
dc.identifier.urn: urn:nbn:fi:tkk-007028
dc.language.iso: en [en]
dc.publisher: Helsinki University of Technology [en]
dc.publisher: Teknillinen korkeakoulu [fi]
dc.relation.haspart: Mathias Creutz and Krista Lagus. Unsupervised Discovery of Morphemes. In: Proceedings of the 6th Meeting of the ACL Special Interest Group in Computational Phonology in cooperation with the ACL Special Interest Group in Natural Language Learning: Workshop on Morphological and Phonological Learning, held in conjunction with the 40th Annual Meeting of the Association for Computational Linguistics (ACL-02), pages 21-30, Philadelphia, Pennsylvania, USA, July 2002. [article1.pdf] © 2002 Association for Computational Linguistics. By permission.
dc.relation.haspart: Mathias Creutz. Unsupervised Segmentation of Words Using Prior Distributions of Morph Length and Frequency. In: Proceedings of the 41st Annual Meeting of the Association for Computational Linguistics (ACL-03), pages 280-287, Sapporo, Japan, July 2003. [article2.pdf] © 2003 Association for Computational Linguistics. By permission.
dc.relation.haspart: Mathias Creutz and Krista Lagus. Induction of a Simple Morphology for Highly-Inflecting Languages. In: Proceedings of the 7th Meeting of the ACL Special Interest Group in Computational Phonology: Workshop on Current Themes in Computational Phonology and Morphology, held in conjunction with the 42nd Annual Meeting of the Association for Computational Linguistics (ACL-04), pages 43-51, Barcelona, Spain, July 2004. [article3.pdf] © 2004 Association for Computational Linguistics. By permission.
dc.relation.haspart: Mathias Creutz and Krista Lagus. Inducing the Morphological Lexicon of a Natural Language from Unannotated Text. In: Proceedings of the International and Interdisciplinary Conference on Adaptive Knowledge Representation and Reasoning (AKRR05), pages 106-113, Espoo, Finland, June 2005. [article4.pdf] © 2005 by authors.
dc.relation.haspart: Mathias Creutz and Krista Lagus. Unsupervised Models for Morpheme Segmentation and Morphology Learning. ACM Transactions on Speech and Language Processing, accepted for publication, 2006.
dc.relation.haspart: Vesa Siivola, Teemu Hirsimäki, Mathias Creutz, and Mikko Kurimo. Unlimited Vocabulary Speech Recognition Based on Morphs Discovered in an Unsupervised Manner. In: Proceedings of the 8th European Conference on Speech Communication and Technology (EUROSPEECH 2003), pages 2293-2296, Geneva, Switzerland, September 2003. [article6.pdf] © 2003 International Speech Communication Association (ISCA). By permission.
dc.relation.haspart: Teemu Hirsimäki, Mathias Creutz, Vesa Siivola, Mikko Kurimo, Sami Virpioja, and Janne Pylkkönen. Unlimited Vocabulary Speech Recognition with Morph Language Models Applied to Finnish. Computer Speech and Language, in press, 2006. [article7.pdf] © 2006 Elsevier Science. By permission.
dc.relation.haspart: Mathias Creutz and Krister Lindén. Morpheme Segmentation Gold Standards for Finnish and English. Helsinki University of Technology, Publications in Computer and Information Science, Report A77, October 2004. [article8.pdf] © 2004 by authors.
dc.relation.ispartofseries: Dissertations in computer and information science. Report D [en]
dc.relation.ispartofseries: 13 [en]
dc.subject.keyword: morpheme segmentation [en]
dc.subject.keyword: morphology induction [en]
dc.subject.keyword: unsupervised learning [en]
dc.subject.keyword: probabilistic models [en]
dc.subject.keyword: concatenative morphology [en]
dc.subject.keyword: agglutinative languages [en]
dc.subject.keyword: unlimited vocabulary speech recognition [en]
dc.subject.keyword: Finnish [en]
dc.subject.keyword: English [en]
dc.subject.other: Computer science [en]
dc.title: Induction of the morphology of natural language : unsupervised morpheme segmentation with application to automatic speech recognition [en]
dc.type: G5 Artikkeliväitöskirja [fi]
dc.type.dcmitype: text [en]
dc.type.ontasot: Väitöskirja (artikkeli) [fi]
dc.type.ontasot: Doctoral dissertation (article-based) [en]
local.aalto.digiauth: ask
local.aalto.digifolder: Aalto_67340

Files

Original bundle

Name: isbn9512282119.pdf; Size: 767.25 KB; Format: Adobe Portable Document Format
Name: article1.pdf; Size: 132.95 KB; Format: Adobe Portable Document Format
Name: article2.pdf; Size: 95.81 KB; Format: Adobe Portable Document Format
Name: article3.pdf; Size: 93.89 KB; Format: Adobe Portable Document Format
Name: article4.pdf; Size: 107.12 KB; Format: Adobe Portable Document Format
Name: article6.pdf; Size: 104.8 KB; Format: Adobe Portable Document Format
Name: article7.pdf; Size: 265.98 KB; Format: Adobe Portable Document Format
Name: article8.pdf; Size: 156.56 KB; Format: Adobe Portable Document Format