A vector quantized masked autoencoder for audiovisual speech emotion recognition

Code will be made available upon publication of the paper

Abstract

The limited availability of labeled data is a major challenge in audiovisual speech emotion recognition (SER). Self-supervised learning approaches have recently been proposed to mitigate the need for labeled data in various applications. This paper proposes the VQ-MAE-AV model, a vector quantized masked autoencoder (MAE) designed for audiovisual speech self-supervised representation learning and applied to SER. Unlike previous approaches, the proposed method employs a self-supervised paradigm based on discrete audio and visual speech representations learned by vector quantized variational autoencoders. A multimodal MAE with self- or cross-attention mechanisms is proposed to fuse the audio and visual speech modalities and to learn local and global representations of the audiovisual speech sequence, which are then used for an SER downstream task. Experimental results show that the proposed approach, which is pre-trained on the VoxCeleb2 database and fine-tuned on several standard emotional audiovisual speech datasets, outperforms the state-of-the-art audiovisual SER methods. Extensive ablation experiments are also provided to assess the contribution of the different model components.
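To make the fusion mechanism mentioned in the abstract more concrete, here is a minimal, hypothetical sketch of how audio and visual token embeddings could be fused with cross-attention. Module names, dimensions, and the block structure are assumptions for illustration only and do not reproduce the actual VQ-MAE-AV implementation.

```python
# Minimal sketch (assumption): cross-attention fusion of audio and visual token
# embeddings, in the spirit of the multimodal MAE described in the abstract.
import torch
import torch.nn as nn

class CrossModalBlock(nn.Module):
    """Tokens of one modality attend to tokens of the other (hypothetical block)."""
    def __init__(self, dim=256, num_heads=4):
        super().__init__()
        self.norm_q = nn.LayerNorm(dim)
        self.norm_kv = nn.LayerNorm(dim)
        self.attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.mlp = nn.Sequential(nn.Linear(dim, 4 * dim), nn.GELU(), nn.Linear(4 * dim, dim))

    def forward(self, queries, context):
        # queries: (B, Nq, dim) tokens of one modality; context: (B, Nc, dim) of the other.
        kv = self.norm_kv(context)
        attn_out, _ = self.attn(self.norm_q(queries), kv, kv)
        x = queries + attn_out            # residual connection
        return x + self.mlp(x)            # feed-forward with residual

# Toy usage: audio token embeddings attend to visual token embeddings.
audio_tokens = torch.randn(2, 50, 256)    # (batch, audio tokens, embedding dim)
visual_tokens = torch.randn(2, 30, 256)   # (batch, visual tokens, embedding dim)
fused_audio = CrossModalBlock()(audio_tokens, visual_tokens)
```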

Qualitative Results

The visualization interface lets users assess the reconstruction quality of VQ-MAE-AV by controlling variables such as the audio or visual masking ratio, the speaker identity, and the resolution. For the audio modality, the interface displays the original, masked, and reconstructed spectrograms together with the corresponding audio time signals. For the visual modality, it shows the original, masked, and reconstructed image sequences. Users can also click on an animation to view the reconstructed video, providing a comprehensive view of the reconstruction quality of VQ-MAE-AV for both modalities.
To generate the masked spectrograms and images, we replace the masked tokens with index 0 and then use the VQ-VAE decoder to reconstruct the masked spectrograms or images.
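As an illustration of this procedure, here is a minimal, hypothetical sketch: masked positions in the discrete token map are set to index 0 before decoding. The function and decoder interface (`decode_with_mask`, `vqvae_decoder`) are assumptions for illustration and not the released code.

```python
# Minimal sketch (assumption): visualize masked inputs by zeroing the masked
# discrete token indices and decoding them with the pre-trained VQ-VAE decoder.
import torch

def decode_with_mask(token_indices, mask, vqvae_decoder):
    """
    token_indices: (B, N) LongTensor of discrete VQ-VAE code indices.
    mask:          (B, N) BoolTensor, True where a token is masked.
    vqvae_decoder: pre-trained decoder mapping code indices back to a
                   spectrogram or an image (hypothetical interface).
    """
    masked_indices = token_indices.clone()
    masked_indices[mask] = 0                 # replace masked tokens with index 0
    return vqvae_decoder(masked_indices)     # reconstruct the masked spectrogram/image

# Toy usage with a dummy decoder standing in for the real VQ-VAE decoder.
dummy_decoder = torch.nn.Embedding(512, 64)
tokens = torch.randint(0, 512, (1, 100))
mask = torch.rand(1, 100) < 0.5              # mask ~50% of the tokens
masked_view = decode_with_mask(tokens, mask, dummy_decoder)
```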



[Interactive examples: original, masked, and VQ-MAE-AV-12 reconstructed spectrograms and visual token sequences (z_v), with an audio masking ratio of 50%.]

For the examples below, we mask between 90% and 100% of the visual tokens in a non-random manner, starting from the first frame. The more tokens are retained, the more detail remains visible throughout the entire sequence.

[Interactive examples: original, masked, and VQ-MAE-AV-12 reconstructed visual token sequences (z_v) under this non-random visual masking.]
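For completeness, here is a minimal sketch of the non-random masking used in these examples: a contiguous fraction of the visual tokens, ordered from the first frame, is masked. The function name and the tokens-per-frame layout are assumptions for illustration.

```python
# Minimal sketch (assumption): non-random masking of a fraction of the visual
# tokens, starting from the first frame.
import torch

def prefix_mask(num_frames, tokens_per_frame, mask_ratio=0.9):
    """Return a (num_frames * tokens_per_frame,) BoolTensor that masks the
    first `mask_ratio` fraction of visual tokens, in temporal order."""
    total = num_frames * tokens_per_frame
    num_masked = int(round(mask_ratio * total))
    mask = torch.zeros(total, dtype=torch.bool)
    mask[:num_masked] = True                  # mask a contiguous prefix of tokens
    return mask

# Example: mask 90% of the visual tokens of a 20-frame sequence.
mask = prefix_mask(num_frames=20, tokens_per_frame=16, mask_ratio=0.9)
print(mask.float().mean())                    # ≈ 0.90
```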