AnCoGen: Analysis, Control and Generation of Speech with a Masked Autoencoder

Samir Sadok1   Simon Leglaive2   Laurent Girin3   Gaël Richard4   Xavier Alameda-Pineda1  

1Inria at Univ. Grenoble Alpes, CNRS, LJK, France
2CentraleSupélec, IETR UMR CNRS 6164, France
3Univ. Grenoble Alpes, CNRS, Grenoble INP, GIPSA-lab, France
4LTCI, Télécom Paris, Institut polytechnique de Paris, France

Article | Code (will be released upon acceptance)



[Figure: overview of the AnCoGen masked autoencoder]

This website presents qualitative results obtained with AnCoGen, a masked autoencoder model for the analysis, control, and generation of speech. As illustrated in the above figure, AnCoGen relies on a masking strategy to provide a bidirectional mapping between a Mel-spectrogram and a set of speech attributes representing not only the linguistic content, prosody (pitch and loudness), and speaker identity, but also the acoustic recording conditions in terms of noise level and reverberation.

For speech analysis, AnCoGen takes a Mel-spectrogram as input and estimates the speech attributes. Conversely, for speech generation, the model takes the speech attributes as input and predicts a Mel-spectrogram. The speech waveform is then reconstructed with a HiFi-GAN neural vocoder. Importantly, analysis and generation are performed by the same model, by masking either the speech attributes or the Mel-spectrogram at the input of AnCoGen. As shown in the qualitative examples below, various speech processing tasks can be addressed by controlling the speech attributes between the analysis and generation steps: pitch shifting, speech denoising, dereverberation, and voice conversion.
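To make the two masked passes concrete, here is a minimal sketch of the analysis-resynthesis pipeline described above. The `ancogen` and `hifigan` callables, their argument names, and the shape of the attribute output are illustrative assumptions, not the released API (the code will be released upon acceptance).

```python
# Hypothetical sketch of the analysis/generation interface described above.
# The `ancogen` and `hifigan` callables and their argument names are assumptions,
# not the official implementation.

def analysis_resynthesis(mel, ancogen, hifigan):
    """Map a Mel-spectrogram to speech attributes and back, then vocode."""
    # Analysis: the attribute tokens are masked, the Mel-spectrogram is observed.
    attributes = ancogen(mel=mel, attributes=None)

    # Generation: the Mel-spectrogram is masked, the attributes are observed.
    mel_hat = ancogen(mel=None, attributes=attributes)

    # Waveform reconstruction with the HiFi-GAN neural vocoder.
    waveform = hifigan(mel_hat)
    return attributes, mel_hat, waveform
```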

For more details about the model and the experimental setup, please refer to the paper.


Analysis

In this section, we present analysis results, which correspond to the estimation of the speech attributes from a Mel-spectrogram. In each figure below, the blue line represents the ground-truth attribute and the red line the prediction of AnCoGen. Please see the paper for a complete description of the attributes.

Scroll to the right to see the different attributes.



Analysis-resynthesis

This section presents speech analysis/resynthesis results, which are simply obtained by using AnCoGen to map a Mel-spectrogram to the corresponding speech attributes (analysis stage) and then back to the Mel-spectrogram (generation stage).

Example 1/5

Original

Reconstruction with AnCoGen


Analysis-transformation-synthesis

This section presents analysis, transformation, and synthesis results, where the speech attributes are controlled between the analysis and generation stages to perform speech denoising (by increasing the SNR attribute), pitch shifting, dereverberation (by increasing the C50 attribute), or voice conversion (by controlling the speaker identity attribute), as sketched below. Note that the paper only includes quantitative results for the speech denoising and pitch shifting tasks.
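The following sketch illustrates how the estimated attributes could be edited between the analysis and generation stages. The attribute names (`pitch`, `snr`, `c50`, `speaker`) and their units are assumptions made for this example; please refer to the paper for the exact parameterization.

```python
# Hypothetical sketch of attribute control between the analysis and generation stages.
# Attribute names and units (semitones, dB) are assumptions for illustration.

def transform(mel, ancogen, hifigan, pitch_shift_semitones=0.0,
              target_snr_db=None, target_c50_db=None, target_speaker=None):
    attributes = ancogen(mel=mel, attributes=None)             # analysis

    if pitch_shift_semitones:                                   # pitch shifting
        attributes["pitch"] *= 2.0 ** (pitch_shift_semitones / 12.0)
    if target_snr_db is not None:                               # denoising: raise the SNR
        attributes["snr"] = target_snr_db
    if target_c50_db is not None:                               # dereverberation: raise the C50
        attributes["c50"] = target_c50_db
    if target_speaker is not None:                              # voice conversion
        attributes["speaker"] = target_speaker

    mel_hat = ancogen(mel=None, attributes=attributes)          # generation
    return hifigan(mel_hat)                                     # HiFi-GAN vocoder
```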

Speech denoising

Example 1/5

Noisy speech

Estimated clean speech

Ground-truth clean speech


Pitch shifting


Dereverberation

Example 1/5

Reverberant speech

Estimated clean speech

Ground-truth clean speech



Dereverberation + denoising



Voice conversion

Example 1/9

Target identity

Source signal

Voice conversion with AnCoGen

Both the source and target signals have been passed through an analysis-resynthesis process with AnCoGen.