A task-agnostic framework for analyzing internal layer-wise dynamics of self-supervised Transformers.
Self-supervised learning (SSL) models, such as Wav2Vec2, HuBERT, and WavLM, have become foundational across a wide range of speech and audio tasks. Despite their success, their internal layer-wise dynamics are still not fully understood. To address this, we propose a two-part model-centric analysis. First, we establish a task-agnostic framework from three intrinsic per-layer perspectives: compression (entropy), geometry (curvature), and robustness to perturbations. We show that different training objectives induce distinct regimes of acoustic compression and manifold unfolding. Second, we introduce the Generative Compatibility Matrix (GCM) to evaluate functional transferability across layers, exposing stable phonetic cores, identity volatility, and deep-layer semantic pruning. Finally, linear probing connects the model-centric perspective to downstream tasks, demonstrating how layer topology dictates phoneme, pitch, and speaker encoding.
Explore intrinsic perspectives through the InsideSSL framework.
Evolution of embedding curvature during the first training iteration of a HuBERT model, from initialization to convergence.
Optimization begins with a surge in curvature (up to ≈ 1.4), reflecting the rapid encoding of acoustic complexity from MFCC labels.
Early layers (0-40% of depth) maintain high curvature to encode intricate features, while deeper layers begin to diverge into a separate regime.
Deeper layers (> 40% of depth) undergo a systematic curvature drop, "flattening" representations to facilitate linear separability.
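To make the curvature measurements above concrete, here is a minimal sketch of one common way to quantify the curvature of an embedding trajectory: the mean turning angle between consecutive frame-to-frame displacement vectors. This definition is an assumption for illustration; the page does not state the exact formula used by InsideSSL, and the function name `mean_curvature` is hypothetical. It uses plain Python lists for portability, whereas real layer activations would typically be NumPy or PyTorch arrays.

```python
import math


def mean_curvature(frames):
    """Mean turning angle (in radians) along a sequence of embedding vectors.

    frames: list of equal-length lists of floats, one vector per time step.
    Assumed definition (not confirmed by the source): the angle between
    consecutive displacement vectors, averaged over the sequence.
    """
    # Displacement between each pair of consecutive frames.
    diffs = [[b - a for a, b in zip(x, y)] for x, y in zip(frames, frames[1:])]

    angles = []
    for u, v in zip(diffs, diffs[1:]):
        norm_u = math.sqrt(sum(a * a for a in u))
        norm_v = math.sqrt(sum(a * a for a in v))
        if norm_u == 0.0 or norm_v == 0.0:
            continue  # repeated frame: the turning angle is undefined
        cos = sum(a * b for a, b in zip(u, v)) / (norm_u * norm_v)
        # Clamp to [-1, 1] to guard against floating-point drift.
        angles.append(math.acos(max(-1.0, min(1.0, cos))))

    return sum(angles) / len(angles) if angles else 0.0


# A straight trajectory has zero curvature; a right-angle zigzag turns by pi/2.
straight = [[0.0, 0.0], [1.0, 0.0], [2.0, 0.0], [3.0, 0.0]]
zigzag = [[0.0, 0.0], [1.0, 0.0], [1.0, 1.0], [2.0, 1.0]]
print(mean_curvature(straight), mean_curvature(zigzag))
```

Under this definition, a "flattened" layer is one whose frame sequence moves in a nearly straight line, so its mean turning angle approaches zero.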
Evaluating representational transferability for Wav2Vec2 across four key dimensions.
A stable phonetic core is observed across the middle layers (1-10).
Speaker identity is often discarded in deep layers in favor of linguistic content.
Sharp semantic pruning occurs specifically at the terminal layer (layer 11).
Listen to representations decoded with $D^{(i)}$ (the decoder for layer $i$) using tokens $z^{(j)}$ from layer $j$.
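The cell structure described above can be sketched as a matrix whose entry $(i, j)$ scores the output of decoder $i$ when it is fed tokens from layer $j$. The sketch below is only an illustration under stated assumptions: it reads $D^{(i)}$ as a decoder associated with layer $i$ and $z^{(j)}$ as tokens from layer $j$, and the names `compatibility_matrix`, `decoders`, `tokens_per_layer`, and `score` are hypothetical, as are the toy decoders and the toy score. The actual InsideSSL decoders and metrics are not specified on this page.

```python
def compatibility_matrix(decoders, tokens_per_layer, score):
    """Return G with G[i][j] = score(decoders[i](tokens_per_layer[j])).

    decoders: list of callables, one per layer (hypothetical stand-ins).
    tokens_per_layer: list of token sequences, one per layer.
    score: callable mapping a decoded output to a number (higher is better).
    """
    return [
        [score(decode(tokens)) for tokens in tokens_per_layer]
        for decode in decoders
    ]


# Toy setup (purely illustrative): layer j shifts a reference signal by j,
# and decoder i undoes a shift of i, so matched pairs reconstruct perfectly.
reference = [1.0, 2.0, 3.0]
tokens_per_layer = [[x + j for x in reference] for j in range(3)]
decoders = [lambda t, i=i: [x - i for x in t] for i in range(3)]


def score(output):
    # Negative mean absolute error against the reference signal.
    return -sum(abs(a - b) for a, b in zip(output, reference)) / len(reference)


G = compatibility_matrix(decoders, tokens_per_layer, score)
for row in G:
    print(row)
```

In this toy case the diagonal scores highest and the score decays with the distance between $i$ and $j$; a band of high off-diagonal scores would correspond to a group of layers whose tokens are interchangeable for a given decoder.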
The complete implementation of the InsideSSL analysis framework is available on GitHub.
If you find the InsideSSL framework useful in your research, please cite our paper.
@inproceedings{insidessl_2026,
title={InsideSSL: Understanding Self-Supervised Speech Representations using a Model-Centric Perspective},
author={Anonymous Authors},
year={2026}
}