Abstract
We characterize the pre-softmax attention matrix $\mathbf{Q}\mathbf{K}^\top$ in transformers as an associative memory matrix encoding pairwise associations between input features. By decomposing this matrix into its symmetric and skew-symmetric parts, we interpret the symmetric component as governing the structure of the energy landscape, and the skew-symmetric component as driving circulation on that landscape. Leveraging the energy formulation induced by the symmetric component, we derive Hopfield-style stability measures that quantify the stability of retrieved features. We observe meaningful correlations between these measures and the fidelity–diversity trade-off in generation. Finally, we propose a controllable knob to modulate this trade-off by modifying the circulation of the underlying dynamics.
Why? — The Metastable-Mixture Problem
Diffusion models often blend incompatible features — materials mixed across distinct objects, anatomically implausible structures, attribute leakage. The culprit is the same global connectivity in attention that helps compositional generation: it can also settle into incoherent combinations of distinct patterns.
Most prior analyses operate at a token-wise level — treating attention as retrieval and reading off a single token at a time. That view misses the interaction dynamics encoded by the attention matrix itself: how features collectively settle, and whether they settle into fixed points or cycles.
We take an associative-memory view of $\mathbf{Q}\mathbf{K}^\top$, building on the observation that transformer self-attention approximates the update rule of a modern Hopfield network. Through this lens, spurious mixing is entrapment in metastable states: local energy minima where the model rests on an incoherent superposition of patterns.
Associative Memory View of Self-Attention
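Treat the input features $\mathbf{X}\in\mathbb{R}^{L\times d_{\rm in}}$ as a collection of $d_{\rm in}$ feature vectors $\boldsymbol{x}^{(i)}\in\mathbb{R}^L$ (the columns of $\mathbf{X}$). With $\mathbf{Q}=\mathbf{X}\mathbf{W}_Q$ and $\mathbf{K}=\mathbf{X}\mathbf{W}_K$, the pre-softmax attention is $\mathbf{Q}\mathbf{K}^\top = \mathbf{X}\mathbf{W}\mathbf{X}^\top$ with $\mathbf{W}=\mathbf{W}_Q\mathbf{W}_K^\top$. The entries of $\mathbf{W}$ then read as pairwise association strengths between feature pairs $(i,j)$, which is the natural object in a Hopfield-style associative memory.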
Figure (overview). (a)–(b) Inputs as feature vectors; the learned interaction matrix $\mathbf{W}$ encodes association strength between pairs. (c) $\mathbf{Q}\mathbf{K}^\top$ decomposes into a symmetric part (static energy landscape governing stability) and a skew-symmetric part (circulation, a directional force). (d) Standard retrieval can settle into metastable mixtures, e.g., the incoherent “three legs” configuration; amplifying the skew component injects circulation that perturbs the metastable state and restores coherence.
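The identity $\mathbf{Q}\mathbf{K}^\top=\mathbf{X}\mathbf{W}\mathbf{X}^\top$ can be checked numerically. The following is a minimal sketch using only the Python standard library; the toy sizes and the values of `X`, `W_Q` and `W_K` are hypothetical and chosen only so the arithmetic is easy to follow.

```python
def matmul(A, B):
    """Multiply two matrices given as lists of rows."""
    return [[sum(a * b for a, b in zip(row, col)) for col in zip(*B)] for row in A]

def transpose(A):
    """Return the transpose of a matrix given as a list of rows."""
    return [list(col) for col in zip(*A)]

# Hypothetical toy sizes: L = 3 tokens, d_in = 2 input features, d_k = 2.
X = [[1.0, 0.0],
     [0.0, 1.0],
     [1.0, 1.0]]
W_Q = [[1.0, 2.0],
       [0.0, 1.0]]
W_K = [[1.0, 0.0],
       [3.0, 1.0]]

# Route 1: the usual attention computation, Q K^T.
Q = matmul(X, W_Q)
K = matmul(X, W_K)
scores = matmul(Q, transpose(K))                    # shape L x L

# Route 2: fold the projections into one d_in x d_in interaction matrix W.
W = matmul(W_Q, transpose(W_K))
scores_via_W = matmul(matmul(X, W), transpose(X))   # X W X^T

print(W)                        # W[i][j]: association strength for feature pair (i, j)
print(scores == scores_via_W)   # both routes give the same L x L matrix
```

In this toy example `W[0][1]` and `W[1][0]` differ, so the association from feature 0 to feature 1 is not the same as the reverse; that asymmetry is what the skew-symmetric part captures.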
Symmetric / Skew Decomposition
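Every pre-softmax attention matrix decomposes uniquely as $\mathbf{Q}\mathbf{K}^\top=\mathbf{S}+\mathbf{N}$, with $\mathbf{S}=\tfrac{1}{2}(\mathbf{Q}\mathbf{K}^\top+\mathbf{K}\mathbf{Q}^\top)$ and $\mathbf{N}=\tfrac{1}{2}(\mathbf{Q}\mathbf{K}^\top-\mathbf{K}\mathbf{Q}^\top)$: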
- Symmetric part $\mathbf{S}$ — defines a Hopfield-style energy $E_{\mathbf{X}}(\xi)$ on retrieved features $\xi$. Its local minima are the stable attractors; some of those minima are incoherent metastable mixtures.
- Skew part $\mathbf{N}$: satisfies $\mathbf{u}^\top\mathbf{N}\mathbf{u}=0$ for every $\mathbf{u}$, so it contributes nothing to the energy. Instead it acts as a directional circulation field on top of the landscape, and, by classical asymmetric-Hopfield results, the number of stable attractors shrinks exponentially as its strength increases.
This split cleanly separates what is stable (read from $\mathbf{S}$) from what perturbs the stable structure (read from $\mathbf{N}$).
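The two properties used above, that the split is exact and that the skew part drops out of any quadratic form, can be verified in a few lines. This is a sketch on a hypothetical $3\times 3$ score matrix `A` standing in for $\mathbf{Q}\mathbf{K}^\top$; the helper names are illustrative, not taken from the authors' code.

```python
def sym_skew(A):
    """Split a square matrix into symmetric S and skew-symmetric N with A = S + N."""
    n = len(A)
    S = [[0.5 * (A[i][j] + A[j][i]) for j in range(n)] for i in range(n)]
    N = [[0.5 * (A[i][j] - A[j][i]) for j in range(n)] for i in range(n)]
    return S, N

def quad_form(M, u):
    """Return the quadratic form u^T M u."""
    n = len(u)
    return sum(u[i] * M[i][j] * u[j] for i in range(n) for j in range(n))

# Hypothetical pre-softmax scores (stand-in for Q K^T).
A = [[1.0, 5.0, 6.0],
     [0.0, 1.0, 1.0],
     [1.0, 6.0, 7.0]]
S, N = sym_skew(A)
u = [2.0, -1.0, 3.0]

print(quad_form(N, u))                    # the skew part contributes nothing
print(quad_form(A, u), quad_form(S, u))   # so A and S give the same quadratic form
```

Because $\mathbf{u}^\top\mathbf{N}\mathbf{u}$ vanishes for every $\mathbf{u}$, any energy built from a quadratic form in the scores depends on $\mathbf{S}$ alone.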
Hopfield Stability Measures
From the symmetric energy $E_{\mathbf{X}}(\xi)$ and the induced local field $\boldsymbol{h}_{\mathbf{X}}(\xi)$ we read off three complementary diagnostics, each capturing a different aspect of how settled the retrieval is:
- Hopfield energy $E_{\mathbf{X}}$ — global self-consistency of the retrieved configuration. Lower is more coherent.
- Instability fraction $r_{\mathbf{X}}$ — share of features in local conflict (sign disagreement between $\xi$ and its driving field).
- Alignment score $\mathbf{Align}_{\mathbf{X}}$ — mean directional agreement between $\xi$ and $\boldsymbol{h}_{\mathbf{X}}(\xi)$.
These three are internal — computed purely from the model's own attention — and together they distinguish coherent retrievals from metastable mixtures.
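To make the three diagnostics concrete, here is a minimal sketch for a single retrieved vector. The exact definitions are not given in this section, so the forms below are assumptions borrowed from the classical Hopfield setting: $E=-\tfrac{1}{2}\xi^\top\mathbf{S}\xi$, local field $\boldsymbol{h}=\mathbf{S}\xi$, instability as the share of coordinates where $\xi_i h_i<0$, and alignment as the cosine similarity between $\xi$ and $\boldsymbol{h}$. The stored pattern and the test vectors are hypothetical.

```python
import math

def local_field(S, xi):
    """Assumed local field h = S xi."""
    return [sum(S[i][j] * xi[j] for j in range(len(xi))) for i in range(len(xi))]

def hopfield_energy(S, xi):
    """Assumed classical energy E = -1/2 xi^T S xi (lower means more self-consistent)."""
    return -0.5 * sum(x * h for x, h in zip(xi, local_field(S, xi)))

def instability_fraction(S, xi):
    """Share of coordinates whose sign disagrees with the local field."""
    h = local_field(S, xi)
    return sum(1 for x, hi in zip(xi, h) if x * hi < 0) / len(xi)

def alignment(S, xi):
    """Cosine similarity between xi and its local field (an assumed form of Align)."""
    h = local_field(S, xi)
    dot = sum(x * hi for x, hi in zip(xi, h))
    norm = math.sqrt(sum(x * x for x in xi)) * math.sqrt(sum(hi * hi for hi in h))
    return dot / norm if norm > 0 else 0.0

# Hypothetical memory: one stored pattern p, written Hebbian-style as S = p p^T.
p = [1.0, -1.0, 1.0, -1.0]
S = [[a * b for b in p] for a in p]

stable = p                        # retrieval that matches the stored pattern
mixed = [1.0, 1.0, 1.0, -1.0]     # same pattern with one coordinate flipped

for name, xi in [("stable", stable), ("mixed", mixed)]:
    print(name, hopfield_energy(S, xi), instability_fraction(S, xi), alignment(S, xi))
```

In this toy case the flipped coordinate raises the energy, produces a non-zero instability fraction and lowers the alignment, which is the qualitative pattern the three measures are meant to detect.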
The Stability Measures Track External Quality
We report the Spearman rank correlation $\rho$ between the three internal stability measures (computed at each block of the SDXL UNet) and standard external metrics, over 1K MSCOCO prompts. The pattern is consistent and intuitive:
- Aesthetic Score — positively correlates with stability at every depth (more coherent ⇒ higher quality).
- LPIPS Diversity — inversely correlates with stability — metastable mixtures are the source of diversity (and of hallucinations).
- CLIPScore / ImageReward — depth-dependent, with strongest signal in the Down and Up blocks.
| Metric | Down (UNet 0–47): $-E_{\mathbf X}$ | Down: $r_{\mathbf X}$ | Down: $\mathbf{Align}_{\mathbf X}$ | Mid (48–67): $-E_{\mathbf X}$ | Mid: $r_{\mathbf X}$ | Mid: $\mathbf{Align}_{\mathbf X}$ | Up (68–139): $-E_{\mathbf X}$ | Up: $r_{\mathbf X}$ | Up: $\mathbf{Align}_{\mathbf X}$ | All: $-E_{\mathbf X}$ | All: $r_{\mathbf X}$ | All: $\mathbf{Align}_{\mathbf X}$ |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| Aesthetic Score | +0.181 | −0.162 | +0.151 | +0.207 | −0.229 | +0.204 | +0.255 | −0.255 | +0.280 | +0.265 | −0.273 | +0.296 |
| LPIPS Diversity | −0.074 | +0.192 | −0.194 | −0.336 | +0.283 | −0.250 | −0.270 | +0.237 | −0.238 | −0.279 | +0.283 | −0.297 |
| CLIPScore | +0.040 | +0.155 | −0.202 | −0.158 | +0.088 | −0.042 | −0.006 | −0.073 | +0.142 | −0.010 | +0.030 | −0.014 |
| ImageReward | +0.129 | −0.161 | +0.146 | −0.168 | +0.102 | −0.090 | −0.122 | +0.114 | −0.192 | −0.074 | +0.046 | −0.074 |
Aesthetic Score has the same sign pattern in every block (positive for $-E_{\mathbf X}$ and $\mathbf{Align}_{\mathbf X}$, negative for $r_{\mathbf X}$): high coherence and high perceived quality co-occur at every depth. LPIPS Diversity is the mirror image, with the opposite sign in every block, which is consistent with metastability being a source of diversity. CLIPScore and ImageReward are weaker and change sign across blocks. The framework is therefore not just descriptive: the measures carry a signal about where quality lives, although the correlations are modest ($|\rho|\le 0.34$).
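For readers who want to reproduce this kind of table on their own samples, Spearman's $\rho$ is the Pearson correlation of the ranks. The sketch below uses only the standard library; the per-sample values are hypothetical toy numbers, not data from the paper.

```python
import math

def average_ranks(values):
    """Return 1-based ranks; tied values share the mean of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        mean_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = mean_rank
        i = j + 1
    return ranks

def pearson(a, b):
    """Pearson correlation; assumes neither input is constant."""
    n = len(a)
    mean_a, mean_b = sum(a) / n, sum(b) / n
    cov = sum((x - mean_a) * (y - mean_b) for x, y in zip(a, b))
    var_a = sum((x - mean_a) ** 2 for x in a)
    var_b = sum((y - mean_b) ** 2 for y in b)
    return cov / math.sqrt(var_a * var_b)

def spearman(a, b):
    """Spearman rank correlation: Pearson correlation of the ranks."""
    return pearson(average_ranks(a), average_ranks(b))

# Hypothetical per-sample values: one stability measure and two external metrics.
align = [0.10, 0.40, 0.35, 0.80, 0.90]
aesthetic = [4.0, 5.1, 5.0, 6.2, 7.5]       # rises with alignment in this toy data
diversity = [0.70, 0.50, 0.60, 0.30, 0.20]  # falls with alignment in this toy data

print(spearman(align, aesthetic))
print(spearman(align, diversity))
```

The toy data is perfectly monotone, so the two correlations sit at the extremes of the range; real per-sample measurements would give intermediate values like those in the table.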
A Natural Consequence: Circulation as a Knob
The framework immediately suggests an intervention. Since $\mathbf{N}$ controls circulation without altering the energy landscape, we scale it by a scalar $\alpha$, then blend the perturbed retrieval $\Xi_\alpha$ back into the baseline $\Xi$ with a second scalar $\beta$.
No retraining, no architectural change. The intervention is regime-dependent: on Unstable (low Alignment) samples it breaks spurious mixtures and improves quality; on Stable samples it injects benign variation. Quantitative tables and the $\alpha{\times}\beta$ operating curve are in the paper / code repo; below we show the regime-dependent qualitative behavior.
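The exact update is not reproduced in this section, so the following is only one plausible reading of the description, offered as a sketch. It assumes the perturbed scores are $\mathbf{S}+\alpha\mathbf{N}$, that retrieval is a row-wise softmax over the scores applied to a value matrix, and that the blend is the convex combination $(1-\beta)\,\Xi+\beta\,\Xi_\alpha$. The function names and toy inputs are hypothetical.

```python
import math

def matmul(A, B):
    """Multiply two matrices given as lists of rows."""
    return [[sum(a * b for a, b in zip(row, col)) for col in zip(*B)] for row in A]

def softmax_rows(A):
    """Row-wise softmax, shifted by the row maximum for numerical stability."""
    out = []
    for row in A:
        m = max(row)
        exps = [math.exp(v - m) for v in row]
        total = sum(exps)
        out.append([e / total for e in exps])
    return out

def sym_skew(A):
    """Split a square matrix into symmetric S and skew-symmetric N."""
    n = len(A)
    S = [[0.5 * (A[i][j] + A[j][i]) for j in range(n)] for i in range(n)]
    N = [[0.5 * (A[i][j] - A[j][i]) for j in range(n)] for i in range(n)]
    return S, N

def retrieve(A, V):
    """Baseline retrieval: softmax(A) V."""
    return matmul(softmax_rows(A), V)

def skew_knob(A, V, alpha, beta):
    """Assumed intervention: rescale the skew part by alpha, then blend with weight beta."""
    S, N = sym_skew(A)
    n = len(A)
    A_alpha = [[S[i][j] + alpha * N[i][j] for j in range(n)] for i in range(n)]
    base = retrieve(A, V)
    pert = retrieve(A_alpha, V)
    return [[(1 - beta) * b + beta * p for b, p in zip(rb, rp)]
            for rb, rp in zip(base, pert)]

# Hypothetical scores and values.
A = [[1.0, 5.0, 6.0],
     [0.0, 1.0, 1.0],
     [1.0, 6.0, 7.0]]
V = [[1.0, 0.0],
     [0.0, 1.0],
     [1.0, 1.0]]

baseline = retrieve(A, V)
unchanged = skew_knob(A, V, alpha=1.0, beta=0.5)   # alpha = 1 leaves the scores as they are
amplified = skew_knob(A, V, alpha=2.0, beta=0.5)   # alpha > 1 amplifies circulation
print(baseline)
print(amplified)
```

Under these assumptions, $\alpha=1$ or $\beta=0$ recovers the baseline exactly, which gives a convenient sanity check before sweeping the two scalars.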
Qualitative Comparison
Five Unstable baselines (left) and five Stable baselines (right). Top row is the baseline; bottom row is the same prompt with our skew-perturbation applied. On Unstable samples the perturbation breaks spurious mixtures and yields cleaner, object-centric structure. On Stable samples the same perturbation injects local variation (texture, composition) — illustrating the operating-point trade-off.
[Figure: image grid with two rows, Baseline (top) and Proposed (bottom). Left panel: Skew Perturbation on Unstable Samples (Tab. 2). Right panel: Skew Perturbation on Stable Samples.]
Left: on unstable baselines, our intervention fixes spurious mixtures (extra limbs, blended objects, fragmented compositions) and recovers a coherent, object-centric subject. Right: on stable baselines, the same intervention injects benign variation in texture / background / composition — sometimes useful, sometimes drift. This regime-dependent behavior is the operating-point trade-off discussed in the paper.
BibTeX
@inproceedings{
coming soon...
}



















