Residual Tokens Enhance Masked Autoencoders For Speech Modeling
Audio Comparison
This section presents speech analysis/resynthesis results. They are obtained by using RT-MAE to map a Mel-spectrogram to the corresponding speech attributes (analysis stage) and then back to the Mel-spectrogram (generation stage), using either only the discrete attributes, as in (Sadok et al., 2025), or the additional information provided by the residual tokens (Ours).
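The sketch below illustrates this two-stage pipeline under a hypothetical PyTorch-style interface; `model.encode`, `model.decode`, and the token names are illustrative assumptions, not a released API, and are only meant to convey the flow of information.

```python
import torch

def resynthesize(model, mel, use_residual=True):
    """Analysis/resynthesis: Mel-spectrogram -> tokens -> Mel-spectrogram."""
    with torch.no_grad():
        # Analysis stage: discrete speech-attribute tokens plus residual tokens.
        attribute_tokens, residual_tokens = model.encode(mel)
        if not use_residual:
            # Baseline: keep only the discrete attributes, as in (Sadok et al., 2025).
            residual_tokens = None
        # Generation stage: map the tokens back to a Mel-spectrogram.
        return model.decode(attribute_tokens, residual_tokens)
```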
You need headphones to hear all the details clearly.
LibriSpeech-test
Analysis of these signals demonstrates that the integration of residual tokens leads to significantly improved synthesis quality. Furthermore, the speaker’s identity is more faithfully preserved, while high content fidelity is maintained.
[Audio samples: Original | Without residual tokens | With residual tokens]
Emo-V
These examples illustrate that the residual tokens capture additional emotional and non-verbal details not encoded by the explicit attributes, such as breathing/exhaling in the first example or laughter in the second.
[Audio samples: Original | Without residual tokens | With residual tokens]
LibriMix-test
We note here that the residual tokens capture specific characteristics of the background noise that are not represented within the set of explicit attributes.
[Audio samples: Original | Without residual tokens | With residual tokens]
What is encoded in the residual latent space?
We reconstruct the Mel-spectrograms using only the residual tokens, completely masking out all the speech attribute tokens.
This experiment isolates the information captured by the residual latent space, revealing what remains when phonetic and prosodic attributes are removed.
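A minimal sketch of this probing experiment, using the same hypothetical interface as above: every speech-attribute token is replaced by a mask token before decoding, so only the residual tokens can drive the reconstruction. `mask_token_id` is an assumed placeholder value.

```python
import torch

def resynthesize_residual_only(model, mel, mask_token_id=0):
    """Reconstruct the Mel-spectrogram from the residual tokens alone."""
    with torch.no_grad():
        attribute_tokens, residual_tokens = model.encode(mel)
        # Completely mask out the explicit speech-attribute tokens.
        masked_attributes = torch.full_like(attribute_tokens, mask_token_id)
        return model.decode(masked_attributes, residual_tokens)
```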
[Audio samples: Original | Synthesis using only the residual tokens]