Constructing Word Representations using Subword Embeddings
Sähkötekniikan korkeakoulu (School of Electrical Engineering) | Master's thesis
Authors
Date
2019-06-17
Department
Major/Subject
Signal, Speech and Language Processing
Mcode
ELEC0007
Degree programme
CCIS - Master’s Programme in Computer, Communication and Information Sciences (TS2013)
Language
en
Pages
78+1
Abstract
Language gives humans the ability to construct a new, previously unused word in such a way that other speakers can immediately understand its meaning. For example, when encountering a word like 'anti-neo-sovietism', one can derive its meaning from the meanings of its subparts: in this case, one might conclude that the word describes a movement that opposes a movement for reviving the Soviet Union.

In NLP tasks, it is very valuable to be able to represent words in a meaningful way, so that these representations contain semantic and grammatical information about the words. Recently, methods of representing words as points in a multidimensional space have become very popular. Algorithms that produce such representations are based on the distributional hypothesis, which states that words appearing in similar contexts have similar meanings. However, some methods for creating such word representations lack the ability to produce representations for previously unseen words, while others rely on subword units that have no connection to meaning. For morphologically rich languages like Finnish, the problem of representing unseen words is very pressing: the abundance of word forms and the possibility of creating new ones make it impossible to collect all words in one training corpus.

This thesis suggests creating distributional vector space models with morphs as their units and then combining them into word representations. This approach is motivated by the distributional hypothesis and by the compositionality of meaning. We suggest using morphs because, unlike characters or n-grams, they are language units that bear meaning. The results of the experiments conducted in this study showed that, in tasks involving a large number of unknown words, this method outperforms methods based on n-grams as their subword units. Finally, to test the models, a new intruder evaluation dataset was introduced: we suggest using the synsets of WordNet as categories from which to sample words to create tasks.
Description
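To make the composition idea in the abstract concrete, here is a minimal sketch in plain Python: a word that never occurred in training can still receive a representation by combining the vectors of its morphs. The segmentation, the toy three-dimensional vectors, and additive composition are all illustrative assumptions; the thesis itself trains distributional models over morph units and may compare other combination schemes.

```python
# Hypothetical morph embeddings. In practice these would come from a
# distributional model trained on morph-segmented text; the vectors and
# the segmentation below are toy assumptions for illustration only.
morph_vectors = {
    "anti":   [1.0, 0.0, 0.0],
    "neo":    [0.0, 1.0, 0.0],
    "soviet": [0.0, 0.0, 1.0],
    "ism":    [0.5, 0.5, 0.0],
}

def word_vector(morphs):
    """Compose a word representation by summing its morph vectors.

    Summation is one simple additive composition scheme, used here
    purely to illustrate how an out-of-vocabulary word can be
    represented from known, meaning-bearing subword units.
    """
    return [sum(dims) for dims in zip(*(morph_vectors[m] for m in morphs))]

# The unseen word 'anti-neo-sovietism' still gets a vector
# from its (assumed) morph segmentation.
vec = word_vector(["anti", "neo", "soviet", "ism"])
```

Because morphs carry meaning, the composed vector inherits semantic content from its parts, which is the motivation for preferring morphs over arbitrary character n-grams.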
Supervisor
Kurimo, Mikko
Thesis advisor
Virpioja, Sami
Grönroos, Stig-Arne
Keywords
distributional semantics, vector space models, subword representations, morphological segmentation, out of vocabulary