Automatic Facial Expression Recognition in the Presence of Speech

Perustieteiden korkeakoulu (School of Science) | Master's thesis

Date

2023-01-23

Major/Subject

Computer Science

Mcode

SCI3042

Degree programme

Master’s Programme in Computer, Communication and Information Sciences

Language

en

Pages

49 + 3

Abstract

Automatic facial expression recognition (AFER) has been studied extensively, as it has a wide range of applications in everyday communication. AFER systems recognize emotions from the facial expressions displayed in still images and video streams. However, numerous observations reveal a noticeable performance degradation of such approaches in the presence of a talking face. This phenomenon is known as the speaking effect: spontaneous articulatory movements can substantially alter facial appearance and thereby obscure the emotional information the face conveys. Most contemporary AFER solutions excel at identifying facial images showing a single emotion at strong intensity but largely ignore the speaking effect, which leads to inaccurate output. This work investigates whether information derived from machine lipreading (LR) can be incorporated into existing facial expression recognition models to mitigate the speaking effect. We developed an automatic facial expression recognition model that recognizes the facial expressions of a talking face from video streams and outputs predicted probabilities for the seven basic emotions. The model consists of a preprocessing module and an end-to-end deep neural network (DNN). The preprocessing module extracts facial regions from consecutive frames and is compatible with various video formats. The DNN combines an image-based facial expression recognition model (MobileNet) with a lipreading model (LipNet), using the additional knowledge obtained from machine lipreading to compensate for the changes in facial appearance caused by speech. Our evaluation of the model on the RAVDESS and Oulu-CASIA datasets shows a significant improvement in classification accuracy, a 46 percent increase over the baseline model, demonstrating its potential for further analysis of facial expressions and emotions.
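The abstract describes a two-branch architecture but this record does not reproduce the thesis code, so the PyTorch sketch below is only an illustration of the fusion idea: an image-based expression branch (a MobileNetV2 trunk standing in for the MobileNet mentioned above) combined with a LipNet-style spatiotemporal branch over the mouth region to predict the seven basic emotions. The class name, layer sizes, input shapes, simplified lipreading branch, and average-based feature fusion are all assumptions made for illustration, not the thesis's actual implementation.

```python
# Illustrative sketch only: the exact architecture from the thesis is not
# published on this page, so the layer sizes, fusion strategy, and input
# shapes below are assumptions.
import torch
import torch.nn as nn
from torchvision.models import mobilenet_v2


class SpeakingAwareFER(nn.Module):
    """Toy two-branch model: per-frame MobileNetV2 features for facial
    expression, plus a LipNet-style 3D-convolution branch over the mouth
    region, fused to predict probabilities for seven basic emotions."""

    def __init__(self, num_emotions: int = 7):
        super().__init__()
        # Expression branch: MobileNetV2 trunk applied frame by frame.
        self.face_branch = mobilenet_v2(weights=None).features  # 1280 output channels
        # Lipreading branch: a small stand-in for LipNet's spatiotemporal front end.
        self.lip_branch = nn.Sequential(
            nn.Conv3d(3, 32, kernel_size=(3, 5, 5), padding=(1, 2, 2)),
            nn.ReLU(inplace=True),
            nn.MaxPool3d((1, 2, 2)),
            nn.Conv3d(32, 64, kernel_size=3, padding=1),
            nn.ReLU(inplace=True),
            nn.AdaptiveAvgPool3d(1),  # -> (B, 64, 1, 1, 1)
        )
        self.classifier = nn.Linear(1280 + 64, num_emotions)

    def forward(self, faces: torch.Tensor, lips: torch.Tensor) -> torch.Tensor:
        # faces: (B, T, 3, 224, 224) cropped face frames
        # lips:  (B, 3, T, 64, 128)  cropped mouth-region clip
        b, t = faces.shape[:2]
        face_feat = self.face_branch(faces.flatten(0, 1))   # (B*T, 1280, h, w)
        face_feat = face_feat.mean(dim=(2, 3))               # spatial average
        face_feat = face_feat.view(b, t, -1).mean(dim=1)     # temporal average
        lip_feat = self.lip_branch(lips).flatten(1)          # (B, 64)
        fused = torch.cat([face_feat, lip_feat], dim=1)      # (B, 1344)
        return torch.softmax(self.classifier(fused), dim=1)  # (B, 7)


if __name__ == "__main__":
    model = SpeakingAwareFER()
    probs = model(torch.randn(2, 8, 3, 224, 224), torch.randn(2, 3, 8, 64, 128))
    print(probs.shape)  # torch.Size([2, 7]); each row sums to 1
```

Given a batch of cropped face frames and the matching mouth-region clip, the sketch returns one emotion-probability row per video. Fusing the two branches at the feature level, as shown, is one plausible way to let lipreading features account for articulation-driven appearance changes before classification.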

Supervisor

Aledavood, Talayeh

Thesis advisor

Ikäheimonen, Arsi

Keywords

automatic facial expression recognition, speaking effect, deep neural network, emotion recognition, deep learning
