Facial Expression Analysis: Integrating Multimodal Information with Transformer
Human emotional behavior analysis has attracted much attention in human-computer interaction (HCI). This article introduces the paper we submitted to the CVPR 2022 Affective Behavior Analysis in-the-wild (ABAW) competition. To fully exploit emotional cues, we employ multi-modal features, including spoken language, speech prosody, and facial expressions, extracted from video clips in the Aff-Wild2 dataset. Based on these features, we propose a transformer-based multi-modal framework for action unit (AU) detection and expression recognition. The framework contributes to a more comprehensive understanding of human emotional behavior and suggests new research directions for human-computer interaction.
For the current frame, we first encode the image to extract static visual features. At the same time, we use a sliding window to crop adjacent frames and extract three kinds of multi-modal features from the resulting image, audio, and text sequences. We then introduce a transformer-based fusion module to combine the static visual features with the dynamic multi-modal features. The cross-attention module inside this fusion module focuses the fused output on the parts of the sequence most useful for the downstream detection tasks. To further improve performance, we adopt data balancing, data augmentation, and post-processing techniques. In the official evaluation of the ABAW3 competition, our model ranked first on both the EXPR and AU tracks. We demonstrate the effectiveness of the proposed method through extensive quantitative evaluation and ablation studies on the Aff-Wild2 dataset.
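To make the fusion step more concrete, below is a minimal PyTorch sketch of the idea described above: a static feature from the current frame attends, via cross-attention, to a sequence of dynamic multi-modal features (image, audio, text) gathered from the sliding window. This is not the authors' released implementation; the module name, feature dimension, window length, and layer counts are illustrative assumptions.

import torch
import torch.nn as nn


class CrossAttentionFusion(nn.Module):
    """Hypothetical sketch of a transformer-based multi-modal fusion block."""

    def __init__(self, dim=512, num_heads=8, num_layers=2):
        super().__init__()
        # Self-attention over the concatenated dynamic multi-modal sequence.
        encoder_layer = nn.TransformerEncoderLayer(
            d_model=dim, nhead=num_heads, batch_first=True
        )
        self.temporal_encoder = nn.TransformerEncoder(encoder_layer, num_layers)
        # Cross-attention: static frame feature as query, dynamic sequence as key/value.
        self.cross_attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.norm = nn.LayerNorm(dim)

    def forward(self, static_feat, visual_seq, audio_seq, text_seq):
        # static_feat: (B, dim)    feature of the current frame from the image encoder
        # *_seq:       (B, T, dim) sliding-window features, already projected to a common dim
        dynamic = torch.cat([visual_seq, audio_seq, text_seq], dim=1)  # (B, 3T, dim)
        dynamic = self.temporal_encoder(dynamic)
        query = static_feat.unsqueeze(1)                               # (B, 1, dim)
        fused, _ = self.cross_attn(query, dynamic, dynamic)
        # Residual connection keeps the static cue; the fused vector would feed
        # the downstream AU-detection / expression-recognition heads.
        return self.norm(fused.squeeze(1) + static_feat)


# Usage with illustrative shapes: batch of 4 clips, 8-frame window, 512-d features.
fusion = CrossAttentionFusion(dim=512)
static = torch.randn(4, 512)
vis, aud, txt = (torch.randn(4, 8, 512) for _ in range(3))
out = fusion(static, vis, aud, txt)   # (4, 512)

Treating the static frame feature as the query and the windowed multi-modal sequence as keys and values is one straightforward way to realize the cross-attention described in the paper; the exact head structure and dimensionalities in the actual system may differ.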
https://arxiv.org/abs/2203.12367