Text classification Based on Machine Learning Methods

dc.contributor: Aalto-yliopisto [fi]
dc.contributor: Aalto University [en]
dc.contributor.advisor: Kurimo, Mikko
dc.contributor.author: Li, Saihan
dc.contributor.school: Perustieteiden korkeakoulu [fi]
dc.contributor.supervisor: Kurimo, Mikko
dc.date.accessioned: 2019-08-25T15:13:29Z
dc.date.available: 2019-08-25T15:13:29Z
dc.date.issued: 2019-08-19
dc.description.abstract: With the rapid development of Internet technology, the volume of text data on the Internet is growing rapidly, and traditional manual text classification can no longer cope with it. Automatic text classification, which can effectively address this problem, has become a research hot spot, and advances in machine learning have accelerated its progress. This thesis describes the text classification process and divides it into three parts: text preprocessing, word embedding, and classification models. The methods and models used in each part are described in detail. The dataset consists of Chinese news text; unlike English, Chinese sentences have no spaces between words. In the preprocessing part, punctuation, numbers, and stop words are removed, and the Jieba library is used for word segmentation. In the second part, four methods are used to produce embeddings: word2vec, doc2vec, TF-IDF, and an embedding layer. The doc2vec and TF-IDF representations are used as input to the machine learning classification models. The deep learning models take one of two kinds of input: pretrained word2vec embeddings, or an embedding layer that is trained as the first layer of the model. In the classification part, 10 models are used: two machine learning models, Naive Bayes and SVM, and deep learning models that include MLP, CNN, RNN, and their variants. Among all the models, the two-layer GRU model with pretrained word2vec embeddings achieves the highest accuracy. The thesis also uses a half-sized and a double-sized dataset to explore whether dataset size affects classification accuracy. Models using the half-sized dataset achieve lower accuracy, whereas most models using the double-sized dataset achieve higher accuracy than with the normal-sized dataset. [en]
dc.format.extent: 48
dc.format.mimetype: application/pdf [en]
dc.identifier.uri: https://aaltodoc.aalto.fi/handle/123456789/39907
dc.identifier.urn: URN:NBN:fi:aalto-201908254968
dc.language.iso: en [en]
dc.programme: Master’s Programme in Computer, Communication and Information Sciences [fi]
dc.programme.major: Macadamia [fi]
dc.programme.mcode: SCI3044 [fi]
dc.subject.keyword: text classification [en]
dc.subject.keyword: word embedding [en]
dc.subject.keyword: machine learning [en]
dc.subject.keyword: data mining [en]
dc.title: Text classification Based on Machine Learning Methods [en]
dc.type: G2 Pro gradu, diplomityö [fi]
dc.type.ontasot: Master's thesis [en]
dc.type.ontasot: Diplomityö [fi]
local.aalto.electroniconly: yes
local.aalto.openaccess: yes
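
The abstract describes a preprocessing step (removing punctuation, numbers, and stop words, then segmenting into words) followed by a TF-IDF representation. The sketch below illustrates those two steps using only the Python standard library. It is not the thesis's code: the stop-word list, the function names, and the plain `tf * log(N / df)` weighting are assumptions made for illustration, and whitespace splitting stands in for Jieba segmentation.

```python
import math
import re
from collections import Counter

# Hypothetical stop-word list for illustration; the thesis's own list is not given.
STOP_WORDS = {"the", "a", "of", "is"}

# Matches digits and punctuation (anything that is neither a word character
# nor whitespace), including full-width CJK punctuation.
_NOISE = re.compile(r"\d|[^\w\s]")


def preprocess(text, segment=str.split, stop_words=STOP_WORDS):
    """Strip punctuation and digits, segment into words, drop stop words.

    `segment` defaults to whitespace splitting as a stand-in; for Chinese
    text a real segmenter such as `jieba.lcut` could be passed instead.
    """
    cleaned = _NOISE.sub(" ", text)
    # Skip whitespace-only tokens, which some segmenters return.
    return [w for w in segment(cleaned) if w.strip() and w not in stop_words]


def tfidf(docs):
    """Return one {term: weight} dict per tokenized document.

    Assumed weighting: tf = count / document length, idf = log(N / df).
    Libraries often use smoothed variants, so values may differ.
    """
    n = len(docs)
    # Document frequency: number of documents containing each term.
    df = Counter(term for doc in docs for term in set(doc))
    vectors = []
    for doc in docs:
        counts = Counter(doc)
        vectors.append(
            {t: (c / len(doc)) * math.log(n / df[t]) for t, c in counts.items()}
        )
    return vectors


tokens = [preprocess(t) for t in ["the price of oil rose 3% !", "oil is a commodity ."]]
vectors = tfidf(tokens)
```

Each resulting dictionary maps a term to its weight in one document; a term that occurs in every document gets weight zero under this weighting. Such vectors would then be fed to a classifier such as Naive Bayes or SVM, as the abstract describes.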

Files

Original bundle

Name: master_Li_Saihan_2019.pdf
Size: 960.66 KB
Format: Adobe Portable Document Format