Constructing Word Representations using Subword Embeddings

Sähkötekniikan korkeakoulu | Master's thesis

Date

2019-06-17

Major/Subject

Signal, Speech and Language Processing

Mcode

ELEC0007

Degree programme

CCIS - Master’s Programme in Computer, Communication and Information Sciences (TS2013)

Language

en

Pages

78+1

Abstract

Language gives humans the ability to construct a new, previously unused word in such a way that other speakers can immediately understand its meaning. For example, when encountering a word like 'anti-neo-sovietism', one can derive its meaning from the meanings of its subparts. In this case, one might conclude that the word describes a movement that opposes a movement for reviving the Soviet Union.

In NLP tasks, it is very valuable to represent words in a meaningful way, so that the representations contain semantic and grammatical information about the words. Recently, methods that represent words as points in a multidimensional space have become very popular. Algorithms that produce such representations are based on the distributional hypothesis, which states that words appearing in similar contexts have similar meanings. However, some methods for creating such word representations cannot produce representations for previously unseen words, while others rely on subword units that have no connection to meaning. For morphologically rich languages like Finnish, the problem of representing unseen words is very pressing: the abundance of word forms and the possibility of creating new ones make it impossible to collect all words in one training corpus.

This thesis proposes creating distributional vector space models with morphs as their units and then combining the morph vectors into word representations. This approach is motivated by the distributional hypothesis and by the compositionality of meaning. We suggest using morphs because, unlike characters or n-grams, they are language units that bear meaning.

The results of the experiments conducted in this study show that, in tasks involving a large number of unknown words, this method outperforms methods that use n-grams as their subword units.

Finally, to test the models, a new intruder evaluation dataset was introduced: we suggest using the synsets of WordNet as the categories from which to sample words when creating the tasks.
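The composition step described in the abstract can be sketched in a few lines. This is a minimal illustration, not the thesis's actual implementation: the morph vectors here are hypothetical toy values (in practice they would be trained with a distributional model over morph-segmented text), and element-wise summation is assumed as one simple composition function.

```python
# Hypothetical 2-dimensional morph embeddings; real ones would be
# trained on a morph-segmented corpus and have hundreds of dimensions.
morph_vecs = {
    "anti":   [1.0, 0.0],
    "neo":    [0.0, 1.0],
    "soviet": [1.0, 1.0],
    "ism":    [0.5, 0.5],
}

def word_vector(morphs):
    """Compose a word representation from its morphs by element-wise
    summation of their embeddings (one simple composition function)."""
    dim = len(next(iter(morph_vecs.values())))
    vec = [0.0] * dim
    for m in morphs:
        vec = [v + x for v, x in zip(vec, morph_vecs[m])]
    return vec

# An unseen word like 'anti-neo-sovietism' still gets a vector,
# because each of its morphs was seen during training.
print(word_vector(["anti", "neo", "soviet", "ism"]))  # prints [2.5, 2.5]
```

Because the vocabulary of morphs is far smaller and more stable than the vocabulary of full word forms, any new word whose morphs are known can be represented this way, which is exactly what makes the approach attractive for out-of-vocabulary words in Finnish.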

Supervisor

Kurimo, Mikko

Thesis advisor

Virpioja, Sami
Grönroos, Stig-Arne

Keywords

distributional semantics, vector space models, subword representations, morphological segmentation, out of vocabulary
