RESIDUAL TOKENS ENHANCE MASKED AUTOENCODERS
FOR SPEECH MODELING

Samir Sadok, Stéphane Lathuilière, Xavier Alameda-Pineda
Inria, Univ. Grenoble Alpes, CNRS, LJK, France

Abstract & Context

Natural speech is far richer than what explicit attributes (pitch, content, identity) can represent. Current methods often miss residual factors, such as timbre variation, noise, emotion, and micro-prosody, that these labeled attributes do not capture.

"RT-MAE captures the information not explained by explicit attributes for richer, more natural synthesis."

RT-MAE Architecture

Building on the Masked Autoencoder (MAE) framework, our system processes three modalities (see the sketch after this list):

  • Quantized mel-spectrograms
  • Explicit attributes (pitch, loudness, PPG)
  • Continuous residual tokens
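
To make this concrete, here is a minimal PyTorch sketch of how the three streams could be embedded and fused into a single token sequence. Module names, dimensions, and the additive fusion of attributes into the mel tokens are illustrative assumptions, not the paper's exact design.

```python
import torch
import torch.nn as nn

class RTMAEInputs(nn.Module):
    """Fuse the three input streams into one token sequence (hypothetical dims)."""
    def __init__(self, d_model=256, n_mel_codes=1024, attr_dim=16):
        super().__init__()
        # Quantized mel-spectrogram frames -> token embeddings
        self.mel_embed = nn.Embedding(n_mel_codes, d_model)
        # Explicit per-frame attributes (pitch, loudness, PPG features);
        # attr_dim is a placeholder for their combined feature size
        self.attr_proj = nn.Linear(attr_dim, d_model)

    def forward(self, mel_codes, attrs, residuals):
        # mel_codes: (B, T) int64; attrs: (B, T, attr_dim);
        # residuals: (B, n_res, d_model) continuous residual tokens
        mel = self.mel_embed(mel_codes)   # (B, T, d)
        attr = self.attr_proj(attrs)      # (B, T, d)
        # Attributes are added frame-wise; residual tokens are appended
        return torch.cat([mel + attr, residuals], dim=1)
```
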
Cross-Attention

Learnable queries (Perceiver-style) aggregate relevant mel-spectrogram information into a compact residual representation.
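
A minimal sketch of this module, assuming standard multi-head cross-attention; the number of queries, heads, and dimensions are placeholders rather than the paper's settings.

```python
import torch
import torch.nn as nn

class ResidualExtractor(nn.Module):
    """Perceiver-style: learnable queries cross-attend over mel frames."""
    def __init__(self, d_model=256, n_queries=8, n_heads=4):
        super().__init__()
        # One learnable query per residual token slot
        self.queries = nn.Parameter(torch.randn(n_queries, d_model) * 0.02)
        self.attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        self.norm = nn.LayerNorm(d_model)

    def forward(self, mel_tokens):
        # mel_tokens: (B, T, d) embedded mel-spectrogram frames
        q = self.queries.unsqueeze(0).expand(mel_tokens.size(0), -1, -1)
        residuals, _ = self.attn(q, mel_tokens, mel_tokens)  # (B, n_queries, d)
        return self.norm(residuals)  # compact residual representation
```

Because the queries, not the frames, set the output length, the residual representation stays compact regardless of utterance duration.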

Dropout Regularization

To prevent residual tokens from encoding all information and bypassing the attributes, we apply a dropout-based strategy:

\[ \text{Residuals} = \begin{cases} \mathbf{0} & \text{if } u < \tau \\ \mathbf{R} & \text{otherwise} \end{cases} \]
where \( u \sim \mathcal{U}(0,1) \) and \( \tau = 0.5 \).

This forces the model to rely on the explicit attributes for structure, while the residuals add naturalness and expressivity.
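
In code, the gate is a single uniform draw that zeroes the entire residual block during training; whether the draw is made per utterance or per batch is an implementation choice, and this sketch uses one draw per forward pass.

```python
import torch

def residual_dropout(residuals, tau=0.5, training=True):
    """Apply the gating rule above: drop all residual tokens if u < tau."""
    if training and torch.rand(()).item() < tau:
        return torch.zeros_like(residuals)  # decoder must rely on attributes
    return residuals
```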

Speech Denoising

Noise Modeling

A dedicated token \( R_{\text{noise}} \) is activated only when noise is present in the input.
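
One way to implement this gating, sketched below under the assumption that a binary noise-presence flag is available for each training utterance (how that flag is obtained is not specified here):

```python
import torch
import torch.nn as nn

class NoiseToken(nn.Module):
    """Append a dedicated learnable noise token, zeroed for clean inputs."""
    def __init__(self, d_model=256):
        super().__init__()
        self.r_noise = nn.Parameter(torch.randn(1, 1, d_model) * 0.02)

    def forward(self, tokens, is_noisy):
        # tokens: (B, N, d); is_noisy: (B,) float flags in {0., 1.}
        gate = is_noisy.view(-1, 1, 1)                    # broadcast over (N, d)
        noise = self.r_noise.expand(tokens.size(0), -1, -1) * gate
        return torch.cat([tokens, noise], dim=1)          # (B, N + 1, d)
```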

Denoising highlights: +13% speaker similarity and a 4.25 N-MOS score.

Experimental Results

Model            N-MOS ↑   COS ↑
AnCoGen          4.04      0.81
RT-MAE (Ours)    4.32      0.92

Expressivity (EmoV-DB)

Emotion classification accuracy: 98.7%

Conclusion

RT-MAE demonstrates that a structured residual space is essential for high-fidelity speech synthesis. It enables fine-grained control while maintaining natural vocal characteristics.