Residual Tokens Enhance Masked Autoencoders For Speech Modeling
Accepted at ICASSP 2026
Abstract
Recent speech modeling relies on explicit attributes such as pitch, content, and speaker identity, but these alone cannot capture the full richness of natural speech. We introduce RT-MAE, a novel masked autoencoder framework that augments supervised attribute-based modeling with unsupervised residual trainable tokens, designed to encode the information not explained by explicit labeled factors (e.g., timbre variations, noise, emotion). Experiments show that RT-MAE improves reconstruction quality, preserving content and speaker similarity while enhancing expressivity. We further demonstrate its applicability to speech enhancement, removing noise at inference while maintaining controllability and naturalness.
Summary
The RT-MAE (Residual-Token Masked Autoencoder) framework addresses a fundamental limitation in current speech modeling: explicit attributes \(A\) (such as pitch, content, and speaker identity) cannot capture the full richness and nuance of natural human speech. While traditional models rely on these labeled factors, they often miss “residual” information such as timbre variations, micro-prosody, and emotional cues. To bridge this gap, RT-MAE introduces unsupervised residual trainable tokens (\(R\)) that encode the information not explained by the supervised attributes. These tokens are extracted with a cross-attention mechanism inspired by the Perceiver architecture (Jaegle et al. 2021), in which a fixed set of learnable queries compresses the Mel-spectrogram into a compact latent representation without the computational overhead of full self-attention over every frame.
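To make the token-extraction step concrete, below is a minimal PyTorch sketch of Perceiver-style cross-attention pooling. The module name, token count, and dimensions are illustrative assumptions, not the paper's exact configuration.

```python
import torch
import torch.nn as nn

class ResidualTokenExtractor(nn.Module):
    """Perceiver-style pooling: a fixed set of learnable queries
    cross-attends over Mel frames to produce residual tokens R
    (hypothetical sketch; sizes are assumptions)."""

    def __init__(self, n_tokens=8, d_model=256, n_mels=80, n_heads=4):
        super().__init__()
        # One learnable query per residual token.
        self.queries = nn.Parameter(torch.randn(n_tokens, d_model))
        self.mel_proj = nn.Linear(n_mels, d_model)  # project Mel frames to d_model
        self.cross_attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)

    def forward(self, mel):
        # mel: (batch, frames, n_mels)
        kv = self.mel_proj(mel)  # (batch, frames, d_model)
        q = self.queries.unsqueeze(0).expand(mel.size(0), -1, -1)
        # Queries attend over all frames: cost scales with
        # n_tokens * frames rather than frames**2 (full self-attention).
        tokens, _ = self.cross_attn(q, kv, kv)  # (batch, n_tokens, d_model)
        return tokens
```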
To prevent the model from over-relying on these residual tokens and ignoring the controllable attributes, we implement a dropout-based regularization strategy. By randomly “dropping” the residual tokens during training (using a threshold \(\tau\)), the model is forced to prioritize the explicit attributes, ensuring that the final output remains interpretable and controllable.
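A minimal sketch of this regularization, assuming the whole residual-token set is dropped as a block with probability \(\tau\) (the paper may drop tokens at a different granularity):

```python
import torch

def drop_residual_tokens(tokens, tau=0.5, training=True):
    """Zero out the residual tokens with probability `tau` during training
    (illustrative; the exact drop granularity is an assumption)."""
    if training and torch.rand(()).item() < tau:
        # With the residuals gone, the decoder must reconstruct the
        # spectrogram from the explicit attributes A alone.
        return torch.zeros_like(tokens)
    return tokens
```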
👉 RT-MAE introduces continuous residual tokens extracted via cross-attention and regularized by dropout, combining controllability from explicit attributes with flexibility from residuals within an MAE framework for speech generation (Sadok et al. 2025).
Key Contributions & Results
- Improved Quality: Experiments on LibriSpeech and EmoV-DB show significant gains in reconstruction quality, speech naturalness (N-MOS), and emotional accuracy compared to attribute-only models like AnCoGen.
- Preserved Controllability: The inclusion of residual tokens enhances expressivity without compromising the ability to manipulate specific features like pitch or loudness.
- Speech Denoising: By isolating noise into a dedicated residual vector (\(R_{\text{noise}}\)) and deactivating it during inference, RT-MAE effectively performs speech enhancement, removing background noise while maintaining the natural characteristics of the speaker’s voice (a minimal sketch follows this list).
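As a sketch of how this deactivation might look at inference, assuming one residual slot is reserved for noise (the `decoder` signature and `noise_idx` are hypothetical, not the paper's interface):

```python
import torch

def denoise(decoder, attributes, residual_tokens, noise_idx=0):
    """Decode with the noise-specific residual token zeroed out
    (hypothetical interface; slot index and decoder API are assumptions)."""
    clean = residual_tokens.clone()
    clean[:, noise_idx] = 0.0  # deactivate R_noise, keep the other residuals
    return decoder(attributes, clean)
```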