InsideSSL: Understanding Self-Supervised Speech Representations using a Model-Centric Perspective

A task-agnostic framework for analyzing internal layer-wise dynamics of self-supervised Transformers.

Scientific Abstract

Self-supervised learning (SSL) models, such as Wav2Vec2, HuBERT, and WavLM, have become foundational across a wide range of speech and audio tasks. Despite their success, fully understanding their internal layer-wise dynamics remains an ongoing challenge. To address this, we propose a two-part model-centric analysis. First, we establish a task-agnostic framework from three intrinsic per-layer perspectives: compression (entropy), geometry (curvature), and robustness to perturbations. We show that varying training objectives induce distinct regimes of acoustic compression and manifold unfolding. Second, we introduce the Generative Compatibility Matrix (GCM) to evaluate functional transferability across layers, exposing stable phonetic cores, identity volatility, and deep-layer semantic pruning. Finally, linear probing connects the model-centric perspective to downstream tasks, demonstrating how layer topology dictates phoneme, pitch, and speaker encoding.

Project page and code: https://inside-ssl.github.io/

Contribution Summary

I. Task-Agnostic Framework: Multi-perspective analysis via entropy, curvature, and invariance metrics.
II. Functional Transfer: Using GCM to expose representational shifts and pruning zones.
III. Downstream Connection: Linking model-centric topology to specific encoding capabilities.
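
The page does not state how the entropy and invariance metrics are estimated. As a rough sketch only, assuming that compression can be approximated by the entropy of a layer's normalized covariance spectrum and that robustness can be approximated by the cosine similarity between clean and perturbed embeddings, the two quantities could be computed as follows. The function names `spectral_entropy` and `cosine_invariance` are hypothetical and are not taken from the InsideSSL code.

```python
import numpy as np


def spectral_entropy(frames: np.ndarray) -> float:
    """Shannon entropy (in nats) of the normalized covariance eigenvalue spectrum.

    frames: array of shape (T, D), one embedding per time step of one layer.
    Assumed proxy for compression; not the paper's confirmed estimator.
    """
    frames = np.asarray(frames, dtype=np.float64)
    centered = frames - frames.mean(axis=0, keepdims=True)
    # Squared singular values of the centered data are proportional
    # to the eigenvalues of its covariance matrix.
    singular_values = np.linalg.svd(centered, compute_uv=False)
    power = singular_values ** 2
    total = power.sum()
    if total == 0.0:
        return 0.0  # all frames identical: no spread to measure
    p = power / total
    p = p[p > 0]  # drop exact zeros so log() stays finite
    return float(-(p * np.log(p)).sum())


def cosine_invariance(clean: np.ndarray, perturbed: np.ndarray,
                      eps: float = 1e-12) -> float:
    """Mean frame-wise cosine similarity between clean and perturbed embeddings.

    Both inputs have shape (T, D). A value near 1 means the layer is
    largely unaffected by the perturbation (assumed robustness proxy).
    """
    clean = np.asarray(clean, dtype=np.float64)
    perturbed = np.asarray(perturbed, dtype=np.float64)
    dots = np.sum(clean * perturbed, axis=1)
    norms = np.linalg.norm(clean, axis=1) * np.linalg.norm(perturbed, axis=1)
    return float(np.mean(dots / np.maximum(norms, eps)))
```

Under these assumptions, a layer whose frames lie on a single direction has entropy close to 0, while frames spread evenly over k orthogonal directions give an entropy of log k.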

Layer-wise Dynamics

Explore intrinsic perspectives through the InsideSSL framework.

Training Dynamics & Geometric Relaxation

Evolution of embedding curvature during the first training iteration of a HuBERT model, from initialization to convergence.

Rapid Encoding

Optimization begins with a surge in curvature (up to ≈ 1.4), reflecting the rapid encoding of acoustic complexity from MFCC labels.

Structural Separation

Early layers (0-40% of depth) maintain high curvature to encode intricate features, while deeper layers begin to diverge into a separate regime.

Deep Relaxation

Deeper layers (> 40% of depth) undergo a systematic drop in curvature, "flattening" the representations to facilitate linear separability.
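
The page does not give the formula behind the curvature values. A minimal sketch, assuming curvature is measured as the mean turning angle (in radians) between consecutive displacement vectors of a layer's frame embeddings, might look like this. The function name `mean_curvature` is hypothetical, and the actual InsideSSL definition may differ.

```python
import numpy as np


def mean_curvature(frames: np.ndarray, eps: float = 1e-12) -> float:
    """Mean turning angle (radians) along a trajectory of frame embeddings.

    frames: array of shape (T, D) with T >= 3, one embedding per time step.
    Assumed definition: the angle between consecutive displacement vectors.
    """
    frames = np.asarray(frames, dtype=np.float64)
    if frames.ndim != 2 or frames.shape[0] < 3:
        raise ValueError("expected a (T, D) array with T >= 3")
    # Displacement between consecutive frames, shape (T - 1, D).
    steps = np.diff(frames, axis=0)
    norms = np.linalg.norm(steps, axis=1, keepdims=True)
    # Zero-length steps become zero vectors, which yield an angle of pi / 2.
    units = steps / np.maximum(norms, eps)
    # Cosine of the angle between each pair of consecutive unit displacements.
    cosines = np.sum(units[:-1] * units[1:], axis=1)
    angles = np.arccos(np.clip(cosines, -1.0, 1.0))
    return float(angles.mean())
```

With this definition a perfectly straight trajectory scores 0 and a trajectory that turns at a right angle at every step scores π/2, so lower values correspond to the "flattened" representations described above.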

GCM: Generative Compatibility Matrix

Evaluating representational transferability for Wav2Vec2 across four key dimensions.

1. Phonetic Consistency

Stable phonetic core observed across middle layers (1-10).

2. Identity Volatility

Deep layers often discard speaker identity in favor of linguistic content.

4. Functional Rupture

Sharp semantic pruning occurs at the terminal layer (layer 11).

GCM Matrix: Audio Explorer

Listen to representations decoded from layer $D^{(i)}$ using tokens from layer $z^{(j)}$.
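
The pairing above suggests a matrix whose entry (i, j) scores the output obtained when $D^{(i)}$ is fed the tokens $z^{(j)}$. The sketch below illustrates only that bookkeeping, assuming $D^{(i)}$ is a per-layer decoder and that some scalar quality score is available. The function `compatibility_matrix`, the toy decoders, and the negative mean squared error score are all hypothetical stand-ins, not the decoders or metrics used by InsideSSL.

```python
from typing import Callable, Sequence

import numpy as np


def compatibility_matrix(
    decoders: Sequence[Callable[[np.ndarray], np.ndarray]],
    tokens: Sequence[np.ndarray],
    score: Callable[[np.ndarray], float],
) -> np.ndarray:
    """Return M with M[i, j] = score(decoders[i](tokens[j])).

    Row i is the decoder associated with layer i; column j is the token
    source layer j. The diagonal holds the matched (i == j) case.
    """
    n = len(decoders)
    if len(tokens) != n:
        raise ValueError("need one token set per decoder")
    matrix = np.empty((n, n), dtype=np.float64)
    for i, decode in enumerate(decoders):
        for j, z in enumerate(tokens):
            matrix[i, j] = score(decode(z))
    return matrix


# Toy setup (illustrative only): each "layer" rescales a reference signal,
# and each decoder undoes the scale of its own layer.
reference = np.array([1.0, -2.0, 0.5])
scales = [1.0, 2.0, 4.0]
toy_tokens = [s * reference for s in scales]
toy_decoders = [lambda z, s=s: z / s for s in scales]


def toy_score(output: np.ndarray) -> float:
    """Negative mean squared error against the reference (higher is better)."""
    return -float(np.mean((output - reference) ** 2))


gcm = compatibility_matrix(toy_decoders, toy_tokens, toy_score)
```

In this toy case the diagonal is 0 (each decoder reconstructs its own layer exactly) and the off-diagonal entries are negative, which is the kind of pattern a compatibility matrix makes visible when tokens transfer poorly between layers.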

Open Source Code

The complete implementation of the InsideSSL analysis framework is available on GitHub.

Cite This Work

If you find the InsideSSL framework useful in your research, please cite our paper.

@inproceedings{insidessl_2026,
  title={InsideSSL: Understanding Self-Supervised Speech Representations using a Model-Centric Perspective},
  author={Samir Sadok and Xavier Alameda-Pineda},
  booktitle={INTERSPEECH},
  year={2026}
}