Text Classification Based on Machine Learning Methods

Perustieteiden korkeakoulu (School of Science) | Master's thesis
Degree programme
Master’s Programme in Computer, Communication and Information Sciences
With the rapid development of Internet technology, the volume of text data on the Internet is growing rapidly, and traditional manual text classification can no longer cope with the current data volume. Automatic text classification, which can effectively solve this problem, has become a research hot spot, and advances in machine learning have further accelerated its development. This thesis introduces the text classification process and divides it into three parts: text preprocessing, word embedding, and classification models. The methods and models used in each part are described in detail. Chinese news text is used as the dataset; unlike English, a Chinese sentence has no spaces between words. In the preprocessing part, punctuation, numbers, and stop words are removed, and the Jieba library is used for word segmentation. In the word embedding part, four methods are used: word2vec, doc2vec, tf-idf, and an embedding layer. The doc2vec and tf-idf embeddings are used in the machine learning classification models. The deep learning models take input in two ways: pretrained word2vec embeddings, or an embedding layer trained as the first layer of the model. In the classification model part, ten models are utilized: two machine learning models, Naive Bayes and SVM, and eight deep learning models including MLP, CNN, RNN, and their variants. Among all the algorithms, the two-layer GRU model with pretrained word2vec embeddings achieves the highest accuracy. This thesis also uses a half-sized dataset and a double-sized dataset to explore whether dataset volume affects classification accuracy. The result is that models trained on the half-sized dataset achieve lower accuracy, while most models trained on the double-sized dataset achieve higher accuracy than with the normal-sized dataset.
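To make the tf-idf weighting mentioned in the abstract concrete, the following is a minimal pure-Python sketch, not code from the thesis itself: the toy token lists stand in for Jieba-segmented Chinese news text, and the weighting uses raw term counts with `idf = log(N / df)`, one common variant of the scheme.

```python
import math
from collections import Counter

def tfidf(docs):
    """Compute tf-idf weights for a list of tokenized documents.

    tf is the raw term count within a document; idf is log(N / df),
    where N is the number of documents and df is the number of
    documents containing the term.
    """
    n = len(docs)
    df = Counter()
    for doc in docs:
        df.update(set(doc))  # count each term once per document
    weights = []
    for doc in docs:
        tf = Counter(doc)
        weights.append({t: c * math.log(n / df[t]) for t, c in tf.items()})
    return weights

# Toy "documents" standing in for word-segmented news articles.
docs = [["sports", "match", "goal"],
        ["finance", "market", "goal"],
        ["finance", "stock", "market"]]
w = tfidf(docs)
```

Each document is mapped to a sparse term-weight dictionary; in the thesis pipeline such vectors feed the Naive Bayes and SVM classifiers, for which a library implementation such as scikit-learn's `TfidfVectorizer` would normally be used instead.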
Kurimo, Mikko
Thesis advisor
Kurimo, Mikko
text classification, word embedding, machine learning, data mining