Studies on Training Text Selection for Conversational Finnish Language Modeling



dc.contributor Aalto-yliopisto fi
dc.contributor Aalto University en
dc.contributor.author Enarvi, Seppo
dc.contributor.author Kurimo, Mikko
dc.date.accessioned 2017-08-03T12:08:40Z
dc.date.available 2017-08-03T12:08:40Z
dc.date.issued 2013
dc.identifier.citation Enarvi, S & Kurimo, M 2013, Studies on Training Text Selection for Conversational Finnish Language Modeling. In 10th International Workshop on Spoken Language Translation (IWSLT 2013), Heidelberg, 5 Dec 2013 - 6 Dec 2013, pp. 256-263. en
dc.identifier.other PURE UUID: 50ee6ee5-0608-48b3-a658-219c719b3bb7
dc.identifier.other PURE ITEMURL: https://research.aalto.fi/en/publications/studies-on-training-text-selection-for-conversational-finnish-language-modeling(50ee6ee5-0608-48b3-a658-219c719b3bb7).html
dc.identifier.other PURE LINK: http://workshop2013.iwslt.org/downloads/Studies_on_Training_Text_Selection_for_Conversational_Finnish_Language_Modeling.pdf
dc.identifier.other PURE FILEURL: https://research.aalto.fi/files/14166819/Studies_on_Training_Text_Selection_for_Conversational_Finnish_Language_Modeling.pdf
dc.identifier.uri https://aaltodoc.aalto.fi/handle/123456789/27374
dc.description VK: coin
dc.description.abstract Current ASR and MT systems do not operate on conversational Finnish, because training data for colloquial Finnish has not been available. Although speech recognition performance on literary Finnish is already quite good, those systems have very poor baseline performance on conversational speech. Text data for relevant vocabulary and language models can be collected from the Internet, but web data is very noisy and most of it is not helpful for learning good models. The Finnish language is highly agglutinative and is written phonetically. Even phonetic reductions and sandhi are often written down in informal discussions. This increases vocabulary size dramatically and causes word-based selection methods to fail. Our selection method explicitly optimizes the perplexity of a subword language model on the development data, and requires only a very limited amount of speech transcripts as development data. The language models have been evaluated for speech recognition using a new data set consisting of generic colloquial Finnish. en
dc.format.extent 8
dc.format.extent 256-263
dc.format.mimetype application/pdf
dc.language.iso en en
dc.relation.ispartofseries 10th International Workshop on Spoken Language Translation (IWSLT 2013), Heidelberg, 5 Dec 2013 - 6 Dec 2013 en
dc.rights openAccess en
dc.subject.other 213 Electronic, automation and communications engineering, electronics en
dc.subject.other 113 Computer and information sciences en
dc.subject.other 114 Physical sciences en
dc.subject.other 111 Mathematics en
dc.title Studies on Training Text Selection for Conversational Finnish Language Modeling en
dc.type A4 Artikkeli konferenssijulkaisussa (article in conference proceedings) fi
dc.description.version Peer reviewed en
dc.contributor.department Speech Recognition
dc.contributor.department Department of Signal Processing and Acoustics en
dc.subject.keyword 213 Electronic, automation and communications engineering, electronics
dc.subject.keyword 113 Computer and information sciences
dc.subject.keyword 114 Physical sciences
dc.subject.keyword 111 Mathematics
dc.identifier.urn URN:NBN:fi:aalto-201708036342
dc.type.version publishedVersion


