Automatic Facial Expression Recognition in the Presence of Speech

School of Science | Master's thesis
Computer Science
Degree programme
Master’s Programme in Computer, Communication and Information Sciences
49 + 3
Automatic facial expression recognition (AFER) has been studied extensively because it applies to a wide range of everyday communication scenarios. AFER recognizes emotions from facial expressions displayed in still images and video streams. However, numerous observations reveal a noticeable performance degradation of such approaches when the face is talking. This phenomenon is known as the speaking effect: spontaneous articulatory movements significantly change facial appearance and thereby obscure the blend of intricate emotional information conveyed visually. Most contemporary AFER solutions excel at classifying facial images that show a strong intensity of a particular emotion but largely ignore the speaking effect, which leads to inaccurate output. This work investigates whether incorporating information derived from machine lipreading (LR) into existing facial expression recognition models can mitigate the speaking effect. We developed an automatic facial expression recognition model that recognizes the facial expressions of a talking face from video streams and outputs predicted probabilities for the seven basic emotions. The model consists of a preprocessing module and an end-to-end deep neural network (DNN). The preprocessing module extracts facial regions from consecutive frames and is compatible with various video formats. The DNN combines an image-based facial expression recognition model (MobileNet) with a lipreading model (LipNet), using the additional knowledge obtained from machine lipreading to compensate for the changes in facial appearance caused by speaking.
Our evaluation of the model on the RAVDESS and Oulu-CASIA datasets shows a significant improvement in image classification accuracy, a 46 percent increase over the baseline model, which demonstrates its potential for further analysis of facial expressions and emotions.
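As a rough illustration of how features from an expression branch and a lipreading branch could be combined into probabilities over seven emotion classes, the sketch below concatenates two feature vectors and applies a linear layer followed by a softmax. This is only an assumption about one possible fusion scheme: the abstract does not state how MobileNet and LipNet are combined, and the label names, feature sizes, random weights, and the function `fuse_and_classify` are all hypothetical placeholders.

```python
import math
import random

# Hypothetical label set: the abstract only says "seven basic emotions".
EMOTIONS = ["anger", "disgust", "fear", "happiness",
            "sadness", "surprise", "neutral"]


def softmax(logits):
    """Turn raw scores into probabilities that sum to 1."""
    peak = max(logits)  # subtract the maximum for numerical stability
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def fuse_and_classify(face_features, lip_features, weights, bias):
    """Concatenate both feature vectors, apply one linear layer, then softmax.

    Assumed fusion scheme for illustration only; not the thesis architecture.
    """
    fused = list(face_features) + list(lip_features)
    logits = [sum(w * x for w, x in zip(row, fused)) + b
              for row, b in zip(weights, bias)]
    return dict(zip(EMOTIONS, softmax(logits)))


# Stand-in inputs: random vectors in place of real network outputs.
rng = random.Random(0)
face = [rng.uniform(-1, 1) for _ in range(8)]   # placeholder expression features
lip = [rng.uniform(-1, 1) for _ in range(4)]    # placeholder lipreading features
weights = [[rng.uniform(-0.5, 0.5) for _ in range(len(face) + len(lip))]
           for _ in EMOTIONS]
bias = [0.0] * len(EMOTIONS)

probs = fuse_and_classify(face, lip, weights, bias)
for name, p in probs.items():
    print(f"{name}: {p:.3f}")
```

In a trained model the weights would be learned and the feature vectors would come from the two sub-networks; here they are random, so the printed probabilities carry no meaning beyond showing the output format.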
Aledavood, Talayeh
Thesis advisor
Ikäheimonen, Arsi
automatic facial expression recognition, speaking effect, deep neural network, emotion recognition, deep learning