Natural speech is far richer than what explicit attributes (pitch, content, identity) can represent. Existing methods often miss residual factors, such as timbre variation, noise, emotion, and micro-prosody, that these labeled attributes do not capture.
Building on the Masked Autoencoder (MAE) framework, our system processes three modalities. Learnable queries (Perceiver-style) aggregate the relevant Mel-spectrogram information into a compact residual representation.
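A minimal PyTorch sketch of this kind of Perceiver-style aggregation (the dimensions, number of queries, and module names below are illustrative assumptions, not RT-MAE's actual configuration):

```python
import torch
import torch.nn as nn

class ResidualAggregator(nn.Module):
    """Learnable queries cross-attend over Mel-spectrogram frames (sketch)."""

    def __init__(self, num_queries=8, dim=256, num_heads=4):  # assumed sizes
        super().__init__()
        # One learnable query per residual token (Perceiver-style latents).
        self.queries = nn.Parameter(torch.randn(num_queries, dim))
        self.cross_attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)

    def forward(self, mel_features):
        # mel_features: (batch, num_frames, dim) encoded Mel-spectrogram frames.
        q = self.queries.unsqueeze(0).expand(mel_features.size(0), -1, -1)
        # Each query pools information from all frames into one compact token.
        residual_tokens, _ = self.cross_attn(q, mel_features, mel_features)
        return residual_tokens  # (batch, num_queries, dim)

# Example: 400 frames of 256-dim features -> 8 residual tokens.
# tokens = ResidualAggregator()(torch.randn(2, 400, 256))  # (2, 8, 256)
```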
To prevent the residual tokens from encoding all of the information and bypassing the explicit attributes, we apply a dropout-based strategy during training. This forces the model to rely on the explicit attributes for structure while reserving the residual tokens for naturalness and expressivity.
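One plausible realization of this bottleneck, sketched below, zeroes whole residual tokens with some probability during training (the token-level granularity and the rate `p_drop` are assumptions, not details from the paper):

```python
import torch

def drop_residual_tokens(residual_tokens, p_drop=0.5, training=True):
    """Randomly zero whole residual tokens so the decoder cannot rely on them alone.

    residual_tokens: (batch, num_tokens, dim)
    """
    if not training or p_drop == 0.0:
        return residual_tokens
    # One keep/drop decision per token, broadcast across the feature dimension.
    keep = torch.rand(residual_tokens.shape[:2], device=residual_tokens.device) > p_drop
    return residual_tokens * keep.unsqueeze(-1).to(residual_tokens.dtype)
```

With residual tokens randomly unavailable, the decoder can only reconstruct reliably by reading structure from the explicit attributes.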
A dedicated token \( R_{\text{noise}} \) is activated only when noise is present in the input.
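One way to implement this gating, assuming a binary noise-presence label is available at training time (the function and its names are hypothetical):

```python
import torch

def gate_noise_token(r_noise, noise_present):
    """Zero the dedicated noise token for clean inputs (sketch).

    r_noise: (batch, dim) noise token.
    noise_present: (batch,) flag, 1.0 if the input is noisy, else 0.0.
    """
    return r_noise * noise_present.unsqueeze(-1)
```

Because the token is zeroed on clean utterances, it can only ever carry noise-related information, keeping the rest of the residual space clean.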
| Model | N-MOS (naturalness) ↑ | COS (cosine similarity) ↑ |
|---|---|---|
| AnCoGen | 4.04 | 0.81 |
| RT-MAE (Ours) | 4.32 | 0.92 |
RT-MAE demonstrates that a structured residual space is essential for high-fidelity speech synthesis: it enables fine-grained control while preserving natural vocal characteristics.
Code & Demo: samsad35.github.io/site-residual