Residual Tokens Enhance Masked Autoencoders For Speech Modeling

Authors

Affiliation

Samir Sadok

Inria at Univ. Grenoble Alpes, CNRS, LJK, France

Stéphane Lathuilière

Xavier Alameda-Pineda

Audio Comparison

This section presents speech analysis/resynthesis results. RT-MAE maps a Mel-spectrogram to the corresponding speech attributes (analysis stage) and then back to a Mel-spectrogram (generation stage), using either only the discrete attributes, as in (Sadok et al. 2025), or the discrete attributes together with the additional information provided by the residual tokens (Ours).

Caution

You need headphones to hear all the details clearly.

LibriSpeech-test

Listening to these signals shows that adding residual tokens noticeably improves synthesis quality. The speaker’s identity is also preserved more faithfully, while content fidelity remains high.

Original	Without residual tokens	With residual tokens

Emo-V

These examples illustrate that the residual tokens capture additional emotional and non-verbal details not encoded by the explicit attributes, such as breathing/exhaling in the first example or laughter in the second.

Original	Without residual tokens	With residual tokens

LibriMix-test

Here, the residual tokens capture characteristics of the background noise that are not represented in the set of explicit attributes.

Original	Without residual tokens	With residual tokens

What is encoded in the residual latent space?

We reconstruct the Mel-spectrograms using only the residual tokens, completely masking out all the speech attribute tokens.
This experiment isolates the information captured by the residual latent space, revealing what remains when phonetic and prosodic attributes are removed.
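
The masking step described above can be sketched as follows. This is only a minimal illustration under stated assumptions, not the RT-MAE implementation: the function name `keep_only_residual`, the mask id `MASK_ID`, the attribute names `content` and `pitch`, and the token shapes are all hypothetical, and the residual tokens are assumed here to be continuous vectors.

```python
import numpy as np

MASK_ID = -1  # hypothetical mask-token id; the real value depends on the model


def keep_only_residual(attribute_ids, residual_tokens, mask_id=MASK_ID):
    """Mask every attribute token and leave the residual tokens untouched.

    attribute_ids: dict mapping an attribute name to a 1-D integer array of token ids.
    residual_tokens: float array of shape (num_residual_tokens, dim) (assumed layout).
    """
    # Replace each attribute stream by a same-shaped array filled with the mask id.
    masked = {name: np.full_like(ids, mask_id) for name, ids in attribute_ids.items()}
    # Return a copy so the caller's residual tokens are never modified in place.
    return masked, residual_tokens.copy()


# Toy inputs: two illustrative attribute streams and four residual tokens.
attribute_ids = {
    "content": np.array([12, 7, 7, 30]),
    "pitch": np.array([3, 3, 4, 5]),
}
residual_tokens = np.random.default_rng(0).normal(size=(4, 8))

masked_ids, kept_residual = keep_only_residual(attribute_ids, residual_tokens)
```

In a setup like this, the decoder would receive `masked_ids` and `kept_residual`, so anything it reconstructs can only come from the residual tokens.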

Original	Synthesis using only the residual tokens

References

Sadok, Samir, Simon Leglaive, Laurent Girin, Gaël Richard, and Xavier Alameda-Pineda. 2025. “AnCoGen: Analysis, Control and Generation of Speech with a Masked Autoencoder.” In ICASSP.