Samir Sadok
I am a postdoctoral research scientist at INRIA in Grenoble, France, under the supervision of Xavier Alameda-Pineda. I completed my PhD at CentraleSupélec, advised by Simon Leglaive.
News
Jun 2026
Paper
🎉 Three papers have been accepted at INTERSPEECH 2026! Our contributions span self-supervised speech representation analysis with "InsideSSL", discrete speech representations for 3D facial animation, and "HybridCodec", a hybrid discrete-continuous framework for efficient Speech Language Models.
May 2026
Paper
Our paper "Equalizer" has been uploaded to arXiv! Check it out.
Mar 2026
Talk
I will be giving a presentation on audio-visual speech representation learning at INRIA's seminar.
Oct 2025
Event
Attending Interspeech 2025 to present our work on interpretability in neural audio codecs. Come say hi!
Thesis
Audiovisual Speech Representation Learning Applied to Emotion Recognition
This thesis develops unsupervised and self-supervised generative models for multimodal and sequential audiovisual speech, learning disentangled latent representations that enhance interpretability and enable effective emotion recognition, signal analysis, transformation, and generation.
Research
My research focuses on multimodal generative models for audiovisual speech. I aim to develop interpretable generative models to enhance data analysis, control, and generation.
InsideSSL: Understanding Self-Supervised Speech Representations using a Model-Centric Perspective
This study provides a layer-wise analysis of speech SSL models (Wav2Vec2, HuBERT, and WavLM), characterizing their compression, geometry, and robustness properties while assessing cross-layer functional transferability to reveal how their internal structure shapes the encoding of phonemes, pitch, and speaker identity.
From Tokens to Faces: Investigating Discrete Speech Representations for 3D Facial Animation
Evaluating four speech representation families for 3D facial synthesis, we found that phonetic encoding in semantic and label-based representations is vital for accurate facial animation. Building on this, we introduce an AVTTS pipeline using shared discrete representations to simultaneously decode high-quality speech and 3D facial motion.
HybridCodec: Modeling Discrete and Continuous Representations For Efficient Speech Language Models
To address discretization loss in multimodal LLMs, we propose a hybrid framework combining temporally compressed discrete tokens with dimensionality-reduced continuous residuals via a focal modulation codec.
The Equalizer: Introducing Shape-Gain Decomposition in Neural Audio Codecs
Neural audio codecs jointly encode gain and shape in a single latent space, making them sensitive to level variations and inefficient in bitrate–distortion performance. We introduce a shape–gain decomposition that separately quantizes the gain and processes the normalized shape with the codec, significantly improving efficiency and reducing complexity.
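The shape-gain idea described above can be illustrated with a minimal NumPy sketch. This is not the paper's implementation: using the frame RMS as the gain, quantizing it uniformly on a dB scale, and the helper names `shape_gain_split` and `quantize_gain_db` are all assumptions made for illustration, and the codec that would process the shape is omitted.

```python
import numpy as np

def shape_gain_split(frame, eps=1e-12):
    """Split a frame into a scalar gain and a normalized shape.

    Assumption: the gain is the frame's RMS level; eps avoids division by zero.
    """
    gain = np.sqrt(np.mean(frame ** 2)) + eps
    return gain, frame / gain

def quantize_gain_db(gain, step_db=1.0):
    """Uniform scalar quantization of the gain on a dB scale (illustrative choice)."""
    gain_db = 20.0 * np.log10(gain)
    return 10.0 ** (np.round(gain_db / step_db) * step_db / 20.0)

rng = np.random.default_rng(0)
frame = rng.standard_normal(320)   # stand-in for a short audio frame

g1, s1 = shape_gain_split(frame)          # original level
g2, s2 = shape_gain_split(0.01 * frame)   # same frame, 40 dB quieter

# The shape is unchanged by the level change; only the gain differs.
print(np.allclose(s1, s2))

# A real system would pass s1 through the codec; here it is kept as-is
# and simply rescaled by the quantized gain.
reconstructed = quantize_gain_db(g1) * s1
```

In this sketch the level change is absorbed entirely by the scalar gain, so whatever model processes the shape sees the same input at both levels.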
Residual tokens enhance masked autoencoders for speech modeling
RT-MAE leverages Perceiver-style cross-attention to extract unsupervised residual tokens that capture natural speech nuances like timbre and emotion. A dropout-based regularization strategy (τ) ensures these tokens enhance expressivity without compromising the interpretability or controllability of explicit attributes.
Bringing Interpretability to Neural Audio Codecs
Neural audio codecs efficiently encode continuous speech waveforms into low-rate discrete units but often lack interpretability because they are optimized mainly for reconstruction. This work introduces a two-step approach—analysis and synthesis—using AnCoGen to understand how speech attributes (content, identity, pitch) are encoded and to extract them directly from codec tokens.
AnCoGen: Analysis, Control and Generation of Speech with a Masked Autoencoder
This article presents AnCoGen, a unified masked-autoencoder model that can analyze, control, and generate speech from key attributes such as speaker identity, pitch, content, and signal quality, demonstrating strong performance across analysis-resynthesis, pitch estimation and modification, and speech enhancement tasks.
A vector quantized masked autoencoder for audiovisual speech emotion recognition
This paper introduces VQ-MAE-AV, a self-supervised multimodal model that learns discrete audiovisual speech representations via masked autoencoders and vector-quantized VAEs, enabling state-of-the-art emotion recognition with minimal labeled data.
Can AI Decode the Circumplex Model of Affect? A Data-driven Study
This study uses Transformer-based models on text and audio to uncover emotional latent spaces, showing that multimodal representations more closely replicate Russell’s circumplex model of affect and highlighting the benefits of combining modalities in emotion analysis.
A multimodal dynamical variational autoencoder for audiovisual speech representation learning
This paper introduces MDVAE, a multimodal and dynamical variational autoencoder that learns disentangled audiovisual speech representations by separating static, dynamic, modality-specific, and shared latent factors through a two-stage unsupervised training pipeline—leveraging VQ-VAE features—and demonstrates strong performance in speech manipulation, audiovisual denoising, and low-label emotion recognition.
A vector quantized masked autoencoder for speech emotion recognition
This paper presents VQ-MAE-S, a self-supervised speech model combining masked autoencoders and vector-quantized VAEs, which, when pre-trained on VoxCeleb2 and fine-tuned on emotional speech, achieves state-of-the-art performance in speech emotion recognition.
Learning and controlling the source-filter representation of speech with a variational autoencoder
This work demonstrates that a VAE trained on unlabeled speech naturally encodes source-filter factors in orthogonal latent subspaces, allowing accurate, independent control of fundamental frequency and formants for speech transformation using only a few seconds of labeled synthetic data.
A glimpse into my creative side outside of research.