1Inria at Univ. Grenoble Alpes, CNRS, LJK, France
2CentraleSupélec, IETR UMR CNRS 6164, France
3Univ. Grenoble Alpes, CNRS, Grenoble INP, GIPSA-lab, France
4LTCI, Télécom Paris, Institut polytechnique de Paris, France
Article | Code (will be released upon acceptance)
This website presents qualitative results obtained with AnCoGen, a masked autoencoder model for the analysis, control, and generation of speech. As illustrated in the figure above, AnCoGen relies on a masking strategy to provide a bidirectional mapping between a Mel-spectrogram and a set of speech attributes that represent not only the linguistic content, prosody (pitch and loudness), and speaker identity, but also the acoustic recording conditions in terms of noise level and reverberation.
For speech analysis, AnCoGen takes a Mel-spectrogram as input and estimates the speech attributes. Conversely, for speech generation, the model takes the speech attributes as input and predicts a Mel-spectrogram. The speech waveform is then reconstructed with a HiFi-GAN neural vocoder. Importantly, analysis and generation are performed by the same model, simply by masking either the speech attributes or the Mel-spectrogram at the input of AnCoGen. As shown in the qualitative examples below, various speech processing tasks can be addressed by controlling the speech attributes between the analysis and generation steps: pitch shifting, speech denoising, dereverberation, and voice conversion.
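To make the masking mechanism concrete, here is a minimal PyTorch sketch of a bidirectional masked autoencoder in the spirit of AnCoGen. The module names, dimensions, and attribute layout are illustrative assumptions and do not reflect the released implementation.

```python
# Hypothetical sketch: one shared encoder serves both analysis and generation,
# depending on which input stream is replaced by a learned mask token.
import torch
import torch.nn as nn

class ToyAnCoGen(nn.Module):
    def __init__(self, d_model=256, n_mel_bins=80, attr_dim=8):
        super().__init__()
        self.mask_token = nn.Parameter(torch.zeros(1, 1, d_model))
        self.mel_proj = nn.Linear(n_mel_bins, d_model)   # Mel frames -> tokens
        self.attr_proj = nn.Linear(attr_dim, d_model)    # attribute vectors -> tokens
        layer = nn.TransformerEncoderLayer(d_model, nhead=4, batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, num_layers=4)
        self.mel_head = nn.Linear(d_model, n_mel_bins)   # predicts Mel frames
        self.attr_head = nn.Linear(d_model, attr_dim)    # predicts attributes

    def forward(self, mel=None, attrs=None, n_mel=128, n_attr=16):
        # Whichever stream is None is filled with the mask token, so the same
        # model does analysis (mel given) and generation (attributes given).
        b = (mel if mel is not None else attrs).shape[0]
        mel_tok = self.mel_proj(mel) if mel is not None else self.mask_token.expand(b, n_mel, -1)
        attr_tok = self.attr_proj(attrs) if attrs is not None else self.mask_token.expand(b, n_attr, -1)
        h = self.encoder(torch.cat([mel_tok, attr_tok], dim=1))
        return self.mel_head(h[:, :n_mel]), self.attr_head(h[:, n_mel:])

model = ToyAnCoGen()
mel = torch.randn(1, 128, 80)          # analysis: Mel in, attributes out
_, attrs_hat = model(mel=mel)
mel_hat, _ = model(attrs=attrs_hat)    # generation: attributes in, Mel out
```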
For more details about the model and the experimental setup, please refer to the paper.
In this section we present analysis results, which correspond to the estimation of the speech attributes from a Mel-spectrogram. In each of the figures below, the blue line represents the ground-truth attribute and the red line represents the prediction of AnCoGen. Please see the paper for a complete description of the attributes.
This section presents speech analysis/resynthesis results, which are simply obtained by using AnCoGen to map a Mel-spectrogram to the corresponding speech attributes (analysis stage) and then back to the Mel-spectrogram (generation stage).
Original | Reconstruction with AnCoGen
This section presents analysis, transformation, and synthesis results, where the speech attributes are controlled between the analysis and generation stages to perform speech denoising (by increasing the SNR attribute), pitch shifting (by modifying the pitch attribute), dereverberation (by increasing the C50 attribute), or voice conversion (by controlling the speaker identity attribute). Note that the paper only includes quantitative results for the speech denoising and pitch shifting tasks.
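The analyze-edit-generate workflow can be summarized as in the sketch below. The `analyze`, `generate`, and `vocoder` callables stand in for the AnCoGen analysis stage, the generation stage, and the HiFi-GAN vocoder, and the attribute names and values are assumptions chosen for illustration.

```python
# Hypothetical sketch of the analyze -> edit -> generate pipeline described above.
# All callables and attribute names below are placeholders, not a released API.

def transform(mel, analyze, generate, vocoder, edit):
    attrs = analyze(mel)        # analysis: Mel-spectrogram -> speech attributes
    attrs = edit(attrs)         # control: modify the selected attribute(s)
    mel_hat = generate(attrs)   # generation: speech attributes -> Mel-spectrogram
    return vocoder(mel_hat)     # waveform reconstruction with the neural vocoder

# Illustrative attribute edits for the tasks below (names/values are assumptions):
denoise  = lambda a: {**a, "snr_db": 100.0}                          # raise SNR
dereverb = lambda a: {**a, "c50_db": 60.0}                           # raise C50
pitch_up = lambda a: {**a, "pitch_hz": a["pitch_hz"] * 2 ** (4/12)}  # +4 semitones

def voice_convert(attrs, target_speaker):
    # Replace the speaker-identity attribute while keeping content and prosody.
    return {**attrs, "speaker": target_speaker}
```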
Speech denoising
Noisy speech | Estimated clean speech | Ground-truth clean speech
Speech dereverberation
Reverberant speech | Estimated clean speech | Ground-truth clean speech
Voice conversion
Target identity ✱ | Source signal ✱ | Voice conversion with AnCoGen