Concepedia

Publication | Closed Access

CEPrompt: Cross-Modal Emotion-Aware Prompting for Facial Expression Recognition

Year: 2024 | Citations: 22 | References: 54

Abstract

Facial expression recognition (FER) remains a challenging task due to the ambiguity and subtlety of expressions. To address this challenge, current FER methods predominantly prioritize visual cues while inadvertently neglecting the potential insights offered by other modalities. Recently, vision-language pre-training (VLP) models have integrated textual cues as guidance, culminating in a powerful multi-modal solution that has proven effective for a range of computer vision tasks. In this paper, we propose a Cross-Modal Emotion-Aware Prompting (CEPrompt) framework for FER based on VLP models. To make VLP models sensitive to expression-relevant visual discrepancies, we devise an Emotion Conception-guided Visual Adapter (EVA) that captures category-specific appearance representations under emotion-conception guidance. Moreover, knowledge distillation is employed to prevent the model from forgetting its pre-trained category-invariant knowledge. In addition, we design a Conception-Appearance Tuner (CAT) that facilitates the interaction of multi-modal information by cooperatively tuning emotion-conception and appearance prompts. In this way, semantic information from the emotion text conception is infused directly into facial appearance images, enabling a comprehensive and precise understanding of expression-related facial details. Quantitative and qualitative experiments show that CEPrompt outperforms state-of-the-art approaches on three real-world FER datasets. The code is available at https://github.com/HaoliangZhou/CEPrompt.
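To make the general idea concrete, the following is a minimal numpy sketch of the VLP-style recognition setup the abstract builds on: an image embedding is compared against one text embedding per emotion prompt (CLIP-style zero-shot classification), and a knowledge-distillation loss keeps a tuned model's predictions close to the frozen pre-trained model's. This is not the authors' implementation; the function names, the random stand-in embeddings, and the seven-class emotion list are illustrative assumptions.

```python
import numpy as np

# Hypothetical emotion categories, as commonly used in FER benchmarks.
EMOTIONS = ["anger", "disgust", "fear", "happiness", "neutral", "sadness", "surprise"]

def l2_normalize(x, axis=-1):
    """Scale vectors to unit length so dot products become cosine similarities."""
    return x / np.linalg.norm(x, axis=axis, keepdims=True)

def emotion_logits(image_emb, text_embs, temperature=0.07):
    """CLIP-style classification: cosine similarity between one image
    embedding and one text embedding per emotion prompt, scaled by a
    temperature as in contrastive pre-training."""
    sims = l2_normalize(image_emb) @ l2_normalize(text_embs).T
    return sims / temperature

def softmax(z):
    z = z - z.max()  # subtract max for numerical stability
    e = np.exp(z)
    return e / e.sum()

def distillation_loss(student_logits, teacher_logits, tau=2.0):
    """KL divergence between temperature-softened teacher and student
    distributions; in the paper's setting this kind of term discourages
    the tuned model from drifting away from pre-trained knowledge."""
    p = softmax(teacher_logits / tau)
    q = softmax(student_logits / tau)
    return float(np.sum(p * (np.log(p) - np.log(q))))

# Random stand-ins for a real image encoder and prompt/text encoder output.
rng = np.random.default_rng(0)
dim = 512
image_emb = rng.normal(size=dim)
text_embs = rng.normal(size=(len(EMOTIONS), dim))

logits = emotion_logits(image_emb, text_embs)
probs = softmax(logits)
pred = EMOTIONS[int(np.argmax(probs))]
```

The prompt embeddings here are frozen random vectors; the paper's contribution is precisely that the emotion-conception and appearance prompts are learned and tuned cooperatively rather than fixed.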
