Samir Sadok
I'm a postdoctoral research scientist at INRIA in
Grenoble, France, under the supervision of Xavier
Alameda-Pineda.
I did my PhD at CentraleSupélec,
where I was advised by Simon Leglaive.
My research interests span generative modeling, multimodal learning, speech processing, and
information geometry. I also have teaching experience, having designed and taught courses on
multimodal deep learning and attention mechanisms, and supervised several AI-related student
projects.
Email /
CV /
Scholar /
GitHub
Research
My research focuses on multimodal generative modeling for audiovisual speech. I aim to develop
interpretable generative models that support the analysis, control, and generation of speech data.
Bringing Interpretability to Neural Audio Codecs
Samir Sadok*,
Julien Hauret*,
Eric Bavu
Interspeech, 2025 (Oral Presentation)
project page
/
arXiv
Neural audio codecs efficiently encode continuous speech waveforms into low-rate discrete
units but often lack interpretability because they are optimized mainly for reconstruction.
This work introduces a two-step approach—analysis and synthesis—using AnCoGen to understand
how speech attributes (content, identity, pitch) are encoded and to extract them directly
from codec tokens.
AnCoGen: Analysis, Control and Generation of Speech with a Masked Autoencoder
Samir Sadok,
Simon Leglaive, Laurent Girin, Gaël Richard, Xavier Alameda-Pineda
ICASSP, 2025
project page
/
arXiv
This article presents AnCoGen, a unified masked-autoencoder model that can analyze, control,
and generate speech from key attributes such as speaker identity, pitch, content, and signal
quality, demonstrating strong performance across analysis-resynthesis, pitch estimation and
modification, and speech enhancement tasks.
A vector quantized masked autoencoder for audiovisual speech emotion recognition
Samir Sadok,
Simon Leglaive, Renaud Séguier
Journal: Computer Vision and Image Understanding, May 2025
project page
/
arXiv
This paper introduces VQ-MAE-AV, a self-supervised multimodal model that
learns discrete audiovisual speech representations via masked autoencoders and
vector-quantized VAEs, enabling state-of-the-art emotion recognition with minimal labeled
data.
Can AI Decode the Circumplex Model of Affect? A Data-driven Study
Amdjed Belaref, Samir Sadok, Karim M. Ibrahim, Zineb Noumir, Renaud Séguier
ICPR, 2025
arXiv
This study uses Transformer-based models on text and audio to uncover emotional latent
spaces, showing that multimodal representations more closely replicate Russell’s circumplex
model of affect and highlighting the benefits of combining modalities in emotion analysis.
A multimodal dynamical variational autoencoder for audiovisual speech representation learning
Samir Sadok,
Simon Leglaive, Laurent Girin, Xavier Alameda-Pineda, Renaud Séguier
Journal: Neural Networks, April 2024
project page
/
arXiv
This paper introduces MDVAE, a multimodal and dynamical variational autoencoder that learns
disentangled audiovisual speech representations by separating static, dynamic,
modality-specific, and shared latent factors through a two-stage unsupervised training
pipeline—leveraging VQ-VAE features—and demonstrates strong performance in speech
manipulation, audiovisual denoising, and low-label emotion recognition.
A vector quantized masked autoencoder for speech emotion recognition
Samir Sadok,
Simon Leglaive, Renaud Séguier
ICASSPW, 2023
project page
/
arXiv
This paper presents VQ-MAE-S, a self-supervised speech model combining masked autoencoders
and vector-quantized VAEs, which, when pre-trained on VoxCeleb2 and fine-tuned on emotional
speech, achieves state-of-the-art performance in speech emotion recognition.
Learning and controlling the source-filter representation of speech with a variational autoencoder
Samir Sadok,
Simon Leglaive, Laurent Girin, Xavier Alameda-Pineda, Renaud Séguier
Journal: Speech Communication, March 2023
project page
/
arXiv
This work demonstrates that a VAE trained on unlabeled speech naturally encodes
source-filter factors in orthogonal latent subspaces, allowing accurate, independent control
of fundamental frequency and formants for speech transformation using only a few seconds of
labeled synthetic data.
Recorded Talks
Academic Service
Teaching
Feel free to steal this website's source code. Do
not scrape the HTML from this page itself, as it includes analytics tags that
you do not want on your own website; use the GitHub code instead. Also, consider
using Leonid Keselman's Jekyll fork of this page.