A vector quantized masked autoencoder for audiovisual speech emotion recognition

Samir Sadok1, 2   Simon Leglaive1   Renaud Séguier1  

1CentraleSupélec, IETR UMR CNRS 6164, France   
2Inria at Univ. Grenoble Alpes, CNRS, LJK, France   

Computer Vision and Image Understanding
Volume 257, June 2025, 104362

Article | Code

Abstract
Overview of the VQ-MAE-AV model.

An important challenge in emotion recognition is to develop methods that can leverage unlabeled training data. In this paper, we propose the VQ-MAE-AV model, a self-supervised multimodal model that leverages masked autoencoders to learn representations of audiovisual speech without labels. The model includes vector quantized variational autoencoders that compress raw audio and visual speech data into discrete tokens. The audiovisual speech tokens are used to train a multimodal masked autoencoder that consists of an encoder–decoder architecture with attention mechanisms. The model is designed to extract both local (i.e., at the frame level) and global (i.e., at the sequence level) representations of audiovisual speech. During self-supervised pre-training, the VQ-MAE-AV model is trained on a large-scale unlabeled dataset of audiovisual speech, on the task of reconstructing randomly masked audiovisual speech tokens, combined with a contrastive learning strategy. During this pre-training, the encoder learns to extract a representation of audiovisual speech that can subsequently be leveraged for emotion recognition. During the supervised fine-tuning stage, a small classification model is trained on top of the VQ-MAE-AV encoder for an emotion recognition task. The proposed approach achieves state-of-the-art emotion recognition results across several datasets, in both controlled and in-the-wild conditions.
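For readers who prefer code, the following is a minimal sketch of the self-supervised pre-training step described above, assuming PyTorch-style modules. The names `audio_vqvae`, `visual_vqvae`, `mae_encoder`, and `mae_decoder` are hypothetical stand-ins for the frozen VQ-VAE tokenizers and the multimodal masked autoencoder; the contrastive term is only indicated in a comment, not implemented.

```python
import torch
import torch.nn.functional as F

def pretraining_step(audio_vqvae, visual_vqvae, mae_encoder, mae_decoder,
                     audio, video, mask_ratio=0.5):
    # Discretize raw audio and visual speech into codebook indices with
    # the pre-trained (frozen) VQ-VAEs.
    with torch.no_grad():
        audio_tokens = audio_vqvae.encode(audio)    # (B, N_a) integer indices
        visual_tokens = visual_vqvae.encode(video)  # (B, N_v) integer indices

    tokens = torch.cat([audio_tokens, visual_tokens], dim=1)  # (B, N)

    # Randomly mask a fraction of the audiovisual speech tokens.
    mask = torch.rand(tokens.shape, device=tokens.device) < mask_ratio

    # The encoder sees the partially masked sequence; the decoder predicts
    # the discrete indices of the masked positions.
    latent = mae_encoder(tokens, mask)   # (B, N, D) continuous features
    logits = mae_decoder(latent, mask)   # (B, N, codebook_size)

    # Cross-entropy on the masked positions only.
    recon_loss = F.cross_entropy(logits[mask], tokens[mask])
    # A contrastive loss on the global (sequence-level) representation is
    # added to this reconstruction objective during pre-training (omitted here).
    return recon_loss
```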

Qualitative Results

The visualization interface enables users to assess the reconstruction quality of VQ-MAE-AV by controlling variables such as the audio and visual masking ratios, the speaker identity, and the resolution. For the audio modality, the interface displays the original, masked, and reconstructed spectrograms with their corresponding audio time signals. For the visual modality, it shows the original, masked, and reconstructed image sequences. Additionally, users can click on the animation to view the reconstructed video. Together, these views give a comprehensive picture of the reconstruction quality of VQ-MAE-AV for both modalities.
To generate the masked spectrograms or images, we replace the masked tokens with index 0 and then use the VQ-VAE decoder to reconstruct the corresponding spectrograms or images.
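Concretely, this visualization step can be reproduced with a few lines of the following form. This is only a sketch: the `encode`/`decode` methods operating on codebook indices are assumed, not taken from the released code.

```python
import torch

def decode_masked(vqvae, tokens, mask):
    """Render a masked token sequence for visualization.

    tokens : (B, N) integer codebook indices from the VQ-VAE encoder.
    mask   : (B, N) boolean tensor, True where a token is masked.
    """
    # Replace the masked tokens with index 0, as described above, so the
    # decoder renders them with a single fixed codebook entry.
    masked_tokens = tokens.masked_fill(mask, 0)
    # Decode back to a spectrogram (audio) or an image sequence (visual).
    return vqvae.decode(masked_tokens)
```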



Original, Masked, and VQ-MAE-AV-12 reconstructions (audio masking ratio: 50%).

For the examples below, we mask between 90% and 100% of the visual tokens in a non-random manner, starting from the first frame. The more tokens are retained, the more detail is visible throughout the sequence.
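For illustration, such a deterministic mask can be built as follows. This is a sketch under stated assumptions: the function name and the flattened token layout (frames laid out in temporal order) are not taken from the authors' code.

```python
import torch

def sequential_visual_mask(num_tokens, mask_ratio=0.9):
    """Non-random mask hiding the first `mask_ratio` fraction of the visual
    tokens in temporal order, i.e., starting from the first frame."""
    num_masked = int(round(mask_ratio * num_tokens))
    mask = torch.zeros(num_tokens, dtype=torch.bool)
    mask[:num_masked] = True   # the earliest frames are hidden first
    return mask                # True = masked, False = kept (last frames)

# Example: 16 frames of 64 tokens each, with 95% of the tokens masked.
mask = sequential_visual_mask(16 * 64, mask_ratio=0.95)
```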

Original, Masked, and VQ-MAE-AV-12 reconstructions with 90–100% of the visual tokens masked.