Exploration of Knowledge Distillation Methods on Transformer Language Models for Sentiment Analysis

Loading...
Thumbnail Image
Journal Title
Journal ISSN
Volume Title
Perustieteiden korkeakoulu | Master's thesis
Date
2023-01-23
Department
Major/Subject
Data Science
Mcode
SCI3095
Degree programme
Master's Programme in ICT Innovation
Language
en
Pages
40
Series
Abstract
Despite the outstanding performances of the large Transformer-based language models, it proposes a challenge to compress the models and put them into the industrial environment. This degree project explores model compression methods called knowledge distillation in the sentiment classification task on Transformer models. Transformers are neural models having stacks of identical layers. In knowledge distillation for Transformer, a student model with fewer layers will learn to mimic intermediate layer vectors from a teacher model with more layers by designing and minimizing loss. We implement a framework to compare three knowledge distillation methods: MiniLM, TinyBERT, and Patient-KD. Student models produced by the three methods are evaluated by accuracy score on the SST-2 and SemEval sentiment classification dataset. The student models’ attention matrices are also compared with the teacher model to find the best student model for capturing dependencies in the input sentences. The comparison results show that the distillation method focusing on the Attention mechanism can produce student models with better performances and less variance. We also discover the over-fitting issue in Knowledge Distillation and propose a Two-Step Knowledge Distillation with Transformer Layer and Prediction Layer distillation to alleviate the problem. The experiment results prove that our method can produce robust, effective, and compact student models without introducing extra data. In the future, we would like to extend our framework to support more distillation methods on Transformer models and compare performances in tasks other than sentiment classification.
Description
Supervisor
Chatterjee, Saikat
Thesis advisor
Ghosh, Anubhab
Truong, Michael
Keywords
natural language processing, sentiment analysis, language model, transformers, knowledge distillation
Other note
Citation