AnCoGen: Analysis, Control and Generation of Speech with a Masked Autoencoder

Samir Sadok1   Simon Leglaive2   Laurent Girin3   Gaël Richard4   Xavier Alameda-Pineda1  

1Inria at Univ. Grenoble Alpes, CNRS, LJK, France
2CentraleSupélec, IETR UMR CNRS 6164, France
3Univ. Grenoble Alpes, CNRS, Grenoble INP, GIPSA-lab, France
4LTCI, Télécom Paris, Institut polytechnique de Paris, France

Article | Code (will be released upon acceptance)



[Figure: overview of the AnCoGen masked autoencoder]

This website presents qualitative results obtained with AnCoGen, a masked autoencoder model for the analysis, control, and generation of speech. As illustrated in the above figure, AnCoGen relies on a masking strategy to provide a bidirectional mapping between a Mel-spectrogram and a set of speech attributes representing not only the linguistic content, prosody (pitch and loudness), and speaker identity, but also the acoustic recording conditions in terms of noise level and reverberation.

For speech analysis, AnCoGen takes a Mel-spectrogram as input and estimates the speech attributes. Conversely, for speech generation, the model takes the speech attributes as input and predicts a Mel-spectrogram. The speech waveform is then reconstructed with a HiFi-GAN neural vocoder. Importantly, analysis and generation are performed by the same model, by masking either the speech attributes or the Mel-spectrogram at the input of AnCoGen. As shown in the qualitative examples below, various speech processing tasks can be addressed by controlling the speech attributes between the analysis and generation steps: pitch shifting, speech denoising, dereverberation, and voice conversion.
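To make the two masked passes concrete, here is a minimal sketch of the analysis-resynthesis pipeline described above. The `ancogen` and `hifigan` callables, their argument names, and the shape of the attribute output are illustrative assumptions, not the released API (the code will be released upon acceptance).

```python
# Hypothetical sketch of the analysis/generation interface described above.
# The `ancogen` and `hifigan` callables and their argument names are assumptions,
# not the official implementation.

def analysis_resynthesis(mel, ancogen, hifigan):
    """Map a Mel-spectrogram to speech attributes and back, then vocode."""
    # Analysis: the attribute tokens are masked, the Mel-spectrogram is observed.
    attributes = ancogen(mel=mel, attributes=None)

    # Generation: the Mel-spectrogram is masked, the attributes are observed.
    mel_hat = ancogen(mel=None, attributes=attributes)

    # Waveform reconstruction with the HiFi-GAN neural vocoder.
    waveform = hifigan(mel_hat)
    return attributes, mel_hat, waveform
```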

For more details about the model and the experimental setup, please refer to the paper.


Analysis

In this section, we present analysis results, which correspond to the estimation of the speech attributes from a Mel-spectrogram. In each figure below, the blue line represents the ground-truth attribute and the red line the prediction of AnCoGen. Please see the paper for a complete description of the attributes.

Scroll to the right to see the different attributes.



Analysis-resynthesis

This section presents speech analysis/resynthesis results, which are simply obtained by using AnCoGen to map a Mel-spectrogram to the corresponding speech attributes (analysis stage) and then back to the Mel-spectrogram (generation stage).

Example 1/5

Original

Reconstruction with AnCoGen


Analysis-transformation-synthesis

This section presents analysis, transformation, and synthesis results, where the speech attributes are controlled between the analysis and generation stages to perform speech denoising (by increasing the SNR attribute), pitch shifting, dereverberation (by increasing the C50 attribute), or voice conversion (by controlling the speaker identity attribute), as sketched below. Note that the paper only includes quantitative results for the speech denoising and pitch shifting tasks.
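The following sketch illustrates how the estimated attributes could be edited between the analysis and generation stages. The attribute names (`pitch`, `snr`, `c50`, `speaker`) and their units are assumptions made for this example; please refer to the paper for the exact parameterization.

```python
# Hypothetical sketch of attribute control between the analysis and generation stages.
# Attribute names and units (semitones, dB) are assumptions for illustration.

def transform(mel, ancogen, hifigan, pitch_shift_semitones=0.0,
              target_snr_db=None, target_c50_db=None, target_speaker=None):
    attributes = ancogen(mel=mel, attributes=None)             # analysis

    if pitch_shift_semitones:                                   # pitch shifting
        attributes["pitch"] *= 2.0 ** (pitch_shift_semitones / 12.0)
    if target_snr_db is not None:                               # denoising: raise the SNR
        attributes["snr"] = target_snr_db
    if target_c50_db is not None:                               # dereverberation: raise the C50
        attributes["c50"] = target_c50_db
    if target_speaker is not None:                              # voice conversion
        attributes["speaker"] = target_speaker

    mel_hat = ancogen(mel=None, attributes=attributes)          # generation
    return hifigan(mel_hat)                                     # HiFi-GAN vocoder
```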

Speech denoising

Example 1/5

Noisy speech

Estimated clean speech

Ground-truth clean speech


Pitch shifting


Dereverberation

Example 1/5

Reverberant speech

Estimated clean speech

Ground-truth clean speech



Dereverberation + denoising



Voice conversion

Example 1/9

Target identity

Source signal

Voice conversion with AnCoGen

Both the source and target signals have been passed through an analysis-resynthesis process with AnCoGen.