Compositional Interpretability

Ward Gauderis, Thomas Dooms, Steven T. Holmer, Kola Ayonrinde, Geraint A. Wiggins

Abstract

We present a category-theoretic framework, compositional interpretability, that formalizes mechanistic explanations of neural networks. Rather than relying on informal reasoning, we introduce pairs of syntactic and semantic mappings that must commute to enforce consistency between a model’s decomposition and its observed behaviour. The work separates explanation quality into faithfulness and complexity measures, frames interpretability as an optimization problem, and introduces compressive refinement for model restructuring. The framework demonstrates how existing mechanistic methods fit as special cases and explains why compression heuristics align with human understanding, providing a measurable, optimisable foundation for automating the discovery and evaluation of mechanistic explanations.
