A VECTOR QUANTIZED MASKED AUTOENCODER FOR SPEECH EMOTION RECOGNITION

Samir Sadok1   Simon Leglaive1   Renaud Séguier1  

1CentraleSupélec, IETR UMR CNRS 6164, France   

Article | Code

Abstract

Recent years have seen remarkable progress in speech emotion recognition (SER), thanks to advances in deep learning techniques. However, the limited availability of labeled data remains a significant challenge in the field. Self-supervised learning has recently emerged as a promising solution to address this challenge. In this paper, we propose the vector quantized masked autoencoder for speech (VQ-MAE-S), a self-supervised model that is fine-tuned to recognize emotions from speech signals. The VQ-MAE-S model is based on a masked autoencoder (MAE) that operates in the discrete latent space of a vector quantized variational autoencoder. Experimental results show that the proposed VQ-MAE-S model, pre-trained on the VoxCeleb2 dataset and fine-tuned on emotional speech data, outperforms existing MAE methods that rely on speech spectrogram representations as input.
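
To make the self-supervised objective concrete, below is a minimal PyTorch sketch of a masked autoencoder trained on the discrete tokens of a VQ-VAE, which is the core idea behind VQ-MAE-S. All names, sizes, and the masking scheme are illustrative assumptions, and for brevity the sketch keeps masked positions in the encoder (BERT-style) rather than using the encoder/decoder split of a full MAE; it is not the authors' released implementation.

```python
import torch
import torch.nn as nn


class ToyVQEncoder(nn.Module):
    """Hypothetical stand-in for a pre-trained VQ-VAE encoder + quantizer:
    maps a spectrogram to a sequence of discrete codebook indices."""

    def __init__(self, n_mels=80, codebook_size=256, dim=64):
        super().__init__()
        self.proj = nn.Linear(n_mels, dim)
        self.codebook = nn.Embedding(codebook_size, dim)

    @torch.no_grad()
    def forward(self, spec):                              # spec: (B, T, n_mels)
        z = self.proj(spec)                               # (B, T, dim)
        flat = z.reshape(-1, z.size(-1))                  # (B*T, dim)
        dists = torch.cdist(flat, self.codebook.weight)   # (B*T, K)
        return dists.argmin(dim=-1).view(z.size(0), z.size(1))  # (B, T) indices


class MaskedTokenModel(nn.Module):
    """Transformer that embeds the discrete tokens, masks a random subset,
    and is trained to predict the original indices at the masked positions."""

    def __init__(self, codebook_size=256, dim=64, depth=4, heads=4, max_len=512):
        super().__init__()
        self.embed = nn.Embedding(codebook_size + 1, dim)  # extra [MASK] token
        self.mask_id = codebook_size
        self.pos = nn.Parameter(torch.zeros(1, max_len, dim))
        layer = nn.TransformerEncoderLayer(dim, heads, dim * 4, batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, depth)
        self.head = nn.Linear(dim, codebook_size)

    def forward(self, tokens, mask_ratio=0.5):
        # Randomly mask a fraction of the discrete tokens.
        mask = torch.rand_like(tokens, dtype=torch.float) < mask_ratio
        corrupted = tokens.masked_fill(mask, self.mask_id)
        x = self.embed(corrupted) + self.pos[:, : corrupted.size(1)]
        logits = self.head(self.encoder(x))                # (B, T, K)
        # Cross-entropy on the masked positions only.
        return nn.functional.cross_entropy(logits[mask], tokens[mask])


# Toy pre-training step on a random "spectrogram" batch.
spec = torch.randn(2, 100, 80)            # (batch, frames, mel bins)
tokens = ToyVQEncoder()(spec)             # discrete VQ-VAE indices
loss = MaskedTokenModel()(tokens, mask_ratio=0.5)
loss.backward()
```

Fine-tuning for emotion recognition would then typically replace the reconstruction head with a classifier on top of the encoder outputs.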

Qualitative Results
We present some qualitative results below to demonstrate the effectiveness of the proposed VQ-MAE-S method. The figure shows the original spectrograms on the left, the masked spectrograms in the middle, and the reconstructed spectrograms on the right. The masking ratio can be varied from 50% to 90%, and we provide the corresponding time-domain audio signals for listening. To generate the masked spectrograms, we replace the masked tokens with the 0 index and then decode the result with the VQ-VAE decoder. These results showcase the ability of our model to accurately reconstruct the missing portions of the spectrograms, even with a high degree of masking.
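
A sketch of this rendering step is shown below, assuming a VQ-VAE decoder callable that maps discrete indices back to a spectrogram (the `vqvae_decode` argument and the toy decoder are hypothetical stand-ins, not the released API).

```python
import torch

def render_masked_spectrogram(tokens, mask_ratio, vqvae_decode):
    """tokens: (B, T) discrete VQ-VAE indices; returns the decoded spectrogram
    obtained after overwriting a random subset of tokens with index 0."""
    mask = torch.rand(tokens.shape, device=tokens.device) < mask_ratio
    corrupted = tokens.masked_fill(mask, 0)   # masked positions -> index 0
    return vqvae_decode(corrupted)            # decode back to a spectrogram

# Toy usage with a stand-in "decoder" that simply embeds the indices.
decoder = torch.nn.Embedding(256, 80)         # indices -> (B, T, 80)
toy_tokens = torch.randint(0, 256, (1, 100))
masked_spec = render_masked_spectrogram(toy_tokens, 0.5, decoder)
```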

[Spectrogram reconstruction examples at a 50% masking ratio. Panels: Original, Masked, VQ-MAE-S-12 (ours).]