Balancing Fidelity and Diversity in Diffusion Models via Symmetric Attention Decomposition: Hopfield Perspective

Cho, Hyunmin; Han, Woo Kyoung; Jin, Kyong Hwan

Hyunmin Cho¹, Woo Kyoung Han¹, Kyong Hwan Jin¹†

¹Department of Electrical Engineering, Korea University
†Corresponding author

ICML 2026

Figure: Decomposition of $\mathbf{Q}\mathbf{K}^\top$ into symmetric (energy) and skew (circulation) components. Panel labels: ☹ div ☺ fid; ☹ div ☹ fid; ☺ div ☺ fid; ☺ div ☹ fid. Axis labels: Low $\mathbf{E}$ (Stable), High $\mathbf{E}$ (Unstable).

We decompose the pre-softmax attention $\mathbf{Q}\mathbf{K}^\top$ into a symmetric part (Hopfield-style energy landscape) and a skew-symmetric part (circulation). Moderate skew perturbation breaks metastable mixtures while preserving stable retrievals; excessive perturbation destabilizes everything. Icons: ☺/☹ denote positive/negative diversity (div) and fidelity (fid).

Abstract

We characterize the pre-softmax attention matrix $\mathbf{Q}\mathbf{K}^\top$ in transformers as an associative memory matrix encoding pairwise associations between input features. By decomposing this matrix into its symmetric and skew-symmetric parts, we interpret the symmetric component as governing the structure of the energy landscape, and the skew-symmetric component as driving circulation on that landscape. Leveraging the energy formulation induced by the symmetric component, we derive Hopfield-style stability measures that quantify the stability of retrieved features. We observe meaningful correlations between these measures and the fidelity–diversity trade-off in generation. Finally, we propose a controllable knob to modulate this trade-off by modifying the circulation of the underlying dynamics.

TL;DR: The pre-softmax attention $\mathbf{Q}\mathbf{K}^\top$ acts as an associative memory. Its symmetric part defines a Hopfield-style energy landscape whose minima can be incoherent metastable mixtures. Its skew part is a circulation field on that landscape. From the symmetric part we read off three stability measures that correlate with external quality metrics (Spearman $|\rho|$ up to about 0.34), giving a diagnostic for when attention is stuck. The circulation knob then follows naturally: amplify the skew part to escape the stuck state.

Why? — The Metastable-Mixture Problem

Diffusion models often blend incompatible features — materials mixed across distinct objects, anatomically implausible structures, attribute leakage. The culprit is the same global connectivity in attention that helps compositional generation: it can also settle into incoherent combinations of distinct patterns.

Most prior analyses operate at a token-wise level — treating attention as retrieval and reading off a single token at a time. That view misses the interaction dynamics encoded by the attention matrix itself: how features collectively settle, and whether they settle into fixed points or cycles.

We take an associative-memory view of $\mathbf{Q}\mathbf{K}^\top$, building on the observation that transformer self-attention approximates the update rule of a modern Hopfield network. Through this lens, spurious mixing is entrapment in metastable states: local energy minima where the model rests on an incoherent superposition of patterns.

Associative Memory View of Self-Attention

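Treat the input features $\mathbf{X}\in\mathbb{R}^{L\times d_{\rm in}}$ as a collection of $d_{\rm in}$ feature vectors $\boldsymbol{x}^{(i)}\in\mathbb{R}^L$, one per column. The pre-softmax attention $\mathbf{Q}\mathbf{K}^\top = \mathbf{X}\mathbf{W}\mathbf{X}^\top$, where $\mathbf{W}$ collects the query and key projections ($\mathbf{W}=\mathbf{W}_Q\mathbf{W}_K^\top$ up to the usual scaling), is then a sum of outer products $\boldsymbol{x}^{(i)}\boldsymbol{x}^{(j)\top}$ weighted by $W_{ij}$. Each $W_{ij}$ therefore reads as the association strength between the feature pair $(i,j)$, which is the natural object in a Hopfield-style associative memory.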
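The identity above can be checked numerically. The following is a minimal sketch in NumPy, not the authors' code: the sizes, the random projections `W_Q` and `W_K`, and the omission of the $1/\sqrt{d_k}$ scaling are all assumptions made for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)
L, d_in, d_k = 6, 4, 3                    # hypothetical toy sizes

X = rng.standard_normal((L, d_in))        # column i is the feature vector x^(i) in R^L
W_Q = rng.standard_normal((d_in, d_k))    # assumed query projection
W_K = rng.standard_normal((d_in, d_k))    # assumed key projection

Q, K = X @ W_Q, X @ W_K
W = W_Q @ W_K.T                           # d_in x d_in interaction matrix

logits = Q @ K.T                          # pre-softmax attention, shape (L, L)

# The same matrix written as a sum over feature pairs (i, j):
# W[i, j] weights the outer product of feature vectors x^(i) and x^(j).
pairwise = sum(
    W[i, j] * np.outer(X[:, i], X[:, j])
    for i in range(d_in)
    for j in range(d_in)
)

assert np.allclose(logits, X @ W @ X.T)
assert np.allclose(logits, pairwise)
```

Both assertions express the same fact: the $L\times L$ attention logits are fully determined by the $d_{\rm in}\times d_{\rm in}$ matrix of pairwise feature associations.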

Associative memory framework: (a) input features as vectors, (b) interaction matrix W, (c) decomposition of QK^T into symmetric + skew, (d) circulation perturbs metastable mixtures.

(a)–(b) Inputs as feature vectors; learned interaction matrix $\mathbf{W}$ encodes association strength between pairs. (c) $\mathbf{Q}\mathbf{K}^\top$ decomposes into a symmetric part (static energy landscape governing stability) and a skew-symmetric part (circulation, a directional force). (d) Standard retrieval can settle into metastable mixtures — e.g., the incoherent “three legs” configuration; amplifying the skew component injects circulation that perturbs the metastable state and restores coherence.

Symmetric / Skew Decomposition

Every pre-softmax attention matrix uniquely decomposes:

$\mathbf{Q}\mathbf{K}^\top \;=\; \underbrace{\tfrac{1}{2}(\mathbf{Q}\mathbf{K}^\top+\mathbf{K}\mathbf{Q}^\top)}_{\text{symmetric } \mathbf{X}\mathbf{S}\mathbf{X}^\top \text{ (energy)}} \;+\; \underbrace{\tfrac{1}{2}(\mathbf{Q}\mathbf{K}^\top-\mathbf{K}\mathbf{Q}^\top)}_{\text{skew } \mathbf{X}\mathbf{N}\mathbf{X}^\top \text{ (circulation)}}.$

Symmetric part $\mathbf{S}$ — defines a Hopfield-style energy $E_{\mathbf{X}}(\xi)$ on retrieved features $\xi$. Its local minima are the stable attractors; some of those minima are incoherent metastable mixtures.
Skew part $\mathbf{N}$ — satisfies $\mathbf{u}^\top\mathbf{N}\mathbf{u}=0$ for every $\mathbf{u}$, so it contributes nothing to the energy. Instead it acts as a directional circulation field on top of the landscape — and, by classical asymmetric-Hopfield results, increasing it exponentially reduces the number of stable attractors.

This split cleanly separates what is stable (read from $\mathbf{S}$) from what perturbs the stable structure (read from $\mathbf{N}$).
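As a small numerical illustration of the split, the sketch below decomposes a random stand-in for $\mathbf{W}$ and checks the two properties used above: splitting $\mathbf{W}$ is equivalent to splitting $\mathbf{Q}\mathbf{K}^\top$, and the skew part has a vanishing quadratic form. The helper name `split_sym_skew` and the toy sizes are assumptions, not part of the paper.

```python
import numpy as np

def split_sym_skew(A):
    """Return the (symmetric, skew-symmetric) parts of a square matrix A."""
    return 0.5 * (A + A.T), 0.5 * (A - A.T)

rng = np.random.default_rng(1)
L, d_in = 5, 4                            # hypothetical toy sizes
X = rng.standard_normal((L, d_in))
W = rng.standard_normal((d_in, d_in))     # random stand-in for the interaction matrix

S, N = split_sym_skew(W)
sym_logits, skew_logits = split_sym_skew(X @ W @ X.T)

# Splitting W and splitting QK^T = X W X^T give the same two pieces.
assert np.allclose(sym_logits, X @ S @ X.T)
assert np.allclose(skew_logits, X @ N @ X.T)

# The skew part contributes nothing to any quadratic form u^T W u.
u = rng.standard_normal(d_in)
quad_skew = u @ N @ u                     # zero up to floating-point rounding
assert abs(quad_skew) < 1e-12
assert np.isclose(u @ W @ u, u @ S @ u)
```

The vanishing quadratic form follows from $\mathbf{u}^\top\mathbf{N}\mathbf{u} = (\mathbf{u}^\top\mathbf{N}\mathbf{u})^\top = \mathbf{u}^\top\mathbf{N}^\top\mathbf{u} = -\mathbf{u}^\top\mathbf{N}\mathbf{u}$, which forces the value to be zero.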

Hopfield Stability Measures

From the symmetric energy $E_{\mathbf{X}}(\xi)$ and the induced local field $\boldsymbol{h}_{\mathbf{X}}(\xi)$ we read off three complementary diagnostics, each capturing a different aspect of how settled the retrieval is:

Hopfield energy $E_{\mathbf{X}}$ — global self-consistency of the retrieved configuration. Lower is more coherent.
Instability fraction $r_{\mathbf{X}}$ — share of features in local conflict (sign disagreement between $\xi$ and its driving field).
Alignment score $\mathbf{Align}_{\mathbf{X}}$ — mean directional agreement between $\xi$ and $\boldsymbol{h}_{\mathbf{X}}(\xi)$.

These three are internal — computed purely from the model's own attention — and together they distinguish coherent retrievals from metastable mixtures.
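To make the three diagnostics concrete, here is a toy sketch that uses classical Hopfield definitions as stand-ins. These are assumptions, and the paper's exact formulas may differ: energy $-\tfrac12\,\xi^\top\mathbf{M}\xi$ for a symmetric coupling matrix $\mathbf{M}$, local field $\boldsymbol{h}=\mathbf{M}\xi$, instability fraction as the share of entries where $\xi_i h_i<0$, and alignment as the cosine similarity between $\xi$ and $\boldsymbol{h}$. The function name `stability_measures` and the single-pattern Hebbian coupling are hypothetical.

```python
import numpy as np

def stability_measures(M, xi):
    """Toy Hopfield-style diagnostics for a state xi under symmetric coupling M.

    Assumed definitions (classical Hopfield forms, not taken from the paper).
    """
    h = M @ xi                                # local field driving each entry
    energy = -0.5 * xi @ h                    # lower means more self-consistent
    instability = np.mean(xi * h < 0)         # share of entries opposing their field
    align = xi @ h / (np.linalg.norm(xi) * np.linalg.norm(h) + 1e-12)
    return energy, instability, align

n = 8
p = np.array([1, -1, 1, 1, -1, -1, 1, -1], dtype=float)  # one stored pattern
M = np.outer(p, p) / n                                    # symmetric Hebbian coupling

coherent = p.copy()        # state equal to the stored pattern
mixed = p.copy()
mixed[:2] *= -1            # state with two entries in conflict

e_coh, r_coh, a_coh = stability_measures(M, coherent)
e_mix, r_mix, a_mix = stability_measures(M, mixed)

# The coherent state has lower energy, no conflicts, and better alignment.
assert e_coh < e_mix and r_coh < r_mix and a_coh > a_mix
```

In this toy setting the three numbers move together: the conflicted state has higher energy, a nonzero instability fraction, and lower alignment than the stored pattern.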

The Stability Measures Track External Quality

We compute the Spearman rank correlation $\rho$ between the three internal stability measures (evaluated at each block of the SDXL UNet) and standard external metrics, over 1K MSCOCO prompts. The pattern is consistent and intuitive:

Aesthetic Score — positively correlates with stability at every depth (more coherent ⇒ higher quality).
LPIPS Diversity — inversely correlates with stability — metastable mixtures are the source of diversity (and of hallucinations).
CLIPScore / ImageReward — depth-dependent, with strongest signal in the Down and Up blocks.

Block group	Down (UNet 0–47)			Mid (UNet 48–67)			Up (UNet 68–139)			All
Metric \ Measure	$-E_{\mathbf X}$	$r_{\mathbf X}$	$\mathbf{Align}_{\mathbf X}$	$-E_{\mathbf X}$	$r_{\mathbf X}$	$\mathbf{Align}_{\mathbf X}$	$-E_{\mathbf X}$	$r_{\mathbf X}$	$\mathbf{Align}_{\mathbf X}$	$-E_{\mathbf X}$	$r_{\mathbf X}$	$\mathbf{Align}_{\mathbf X}$
Aesthetic Score	+0.181	−0.162	+0.151	+0.207	−0.229	+0.204	+0.255	−0.255	+0.280	+0.265	−0.273	+0.296
LPIPS Diversity	−0.074	+0.192	−0.194	−0.336	+0.283	−0.250	−0.270	+0.237	−0.238	−0.279	+0.283	−0.297
CLIPScore	+0.040	+0.155	−0.202	−0.158	+0.088	−0.042	−0.006	−0.073	+0.142	−0.010	+0.030	−0.014
ImageReward	+0.129	−0.161	+0.146	−0.168	+0.102	−0.090	−0.122	+0.114	−0.192	−0.074	+0.046	−0.074

Color legend: blue = stability ↔ higher metric (positive Hopfield correlation); red = stability ↔ lower metric (associated with diversity/conflict).

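Aesthetic Score lights up the entire top row in blue: high coherence and high perceived quality co-occur at every depth (positive $\rho$ with $-E_{\mathbf X}$ and $\mathbf{Align}_{\mathbf X}$, negative with $r_{\mathbf X}$). LPIPS Diversity is the mirror image: its entire row is red, with the opposite signs throughout, which is consistent with metastability being the source of diversity. CLIPScore and ImageReward show the same theme with depth-dependent signs. The framework is not just descriptive: the internal measures carry a usable signal about output quality and about the depths at which that signal is strongest.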
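For readers unfamiliar with the statistic, the sketch below shows how a Spearman rank correlation of this kind can be computed. It uses synthetic stand-in data, not the paper's measurements, and the helper `spearman_rho` is a hypothetical NumPy-only implementation that assumes no tied values.

```python
import numpy as np

def spearman_rho(a, b):
    """Spearman rank correlation of two 1-D arrays, assuming no ties."""
    rank_a = np.argsort(np.argsort(a))        # rank of each entry of a
    rank_b = np.argsort(np.argsort(b))        # rank of each entry of b
    return np.corrcoef(rank_a, rank_b)[0, 1]  # Pearson correlation of the ranks

rng = np.random.default_rng(0)
n_prompts = 1000                              # matches the 1K-prompt scale only

# Synthetic stand-ins: one internal measure and one weakly coupled external metric.
align = rng.standard_normal(n_prompts)
metric = 0.3 * align + rng.standard_normal(n_prompts)

rho = spearman_rho(align, metric)

# Spearman depends only on ranks, so a monotone transform leaves it unchanged
# and negating one variable flips its sign.
assert np.isclose(rho, spearman_rho(align, np.exp(metric)))
assert np.isclose(spearman_rho(align, -metric), -rho)
```

Because only ranks matter, the statistic is insensitive to the different scales of the stability measures and of metrics such as Aesthetic Score or LPIPS.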

A Natural Consequence: Circulation as a Knob

The framework immediately suggests an intervention. Since $\mathbf{N}$ controls circulation without altering the energy landscape, we modulate it by a scalar $\alpha$, then blend the perturbed retrieval $\Xi_\alpha$ back into the baseline $\Xi$ with a second scalar $\beta$:

$\Xi_\alpha \;\triangleq\; \Phi\!\left(\mathbf{X}\mathbf{S}\mathbf{X}^\top + \alpha\,\mathbf{X}\mathbf{N}\mathbf{X}^\top\right)\mathbf{X}, \qquad \Xi_{\rm blended} \;\triangleq\; \Xi + \beta\,(\Xi_\alpha - \Xi).$

No retraining, no architectural change. The intervention is regime-dependent: on Unstable (low Alignment) samples it breaks spurious mixtures and improves quality; on Stable samples it injects benign variation. Quantitative tables and the $\alpha{\times}\beta$ operating curve are in the paper / code repo; below we show the regime-dependent qualitative behavior.
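The two-scalar intervention can be sketched in a few lines of NumPy. This is an illustrative reading of the equation above, not the released implementation. It assumes that $\Phi$ is a row-wise softmax, that the baseline $\Xi$ is the unmodified retrieval (equivalently $\alpha=1$), and that scaling factors and multi-head structure can be ignored. The function names are hypothetical.

```python
import numpy as np

def softmax_rows(A):
    """Numerically stable row-wise softmax (assumed form of Phi)."""
    A = A - A.max(axis=1, keepdims=True)
    e = np.exp(A)
    return e / e.sum(axis=1, keepdims=True)

def skew_perturbed_retrieval(X, W, alpha, beta):
    """Blend the baseline retrieval with a retrieval whose skew logits are scaled by alpha."""
    S = 0.5 * (W + W.T)                       # symmetric part: energy landscape
    N = 0.5 * (W - W.T)                       # skew part: circulation
    baseline = softmax_rows(X @ W @ X.T) @ X
    perturbed = softmax_rows(X @ S @ X.T + alpha * (X @ N @ X.T)) @ X
    return baseline + beta * (perturbed - baseline)

rng = np.random.default_rng(2)
X = rng.standard_normal((6, 4))               # hypothetical toy features
W = rng.standard_normal((4, 4))               # random stand-in for the interaction matrix

base = skew_perturbed_retrieval(X, W, alpha=1.0, beta=1.0)       # alpha = 1: unchanged logits
no_blend = skew_perturbed_retrieval(X, W, alpha=3.0, beta=0.0)   # beta = 0: baseline only
amplified = skew_perturbed_retrieval(X, W, alpha=3.0, beta=0.5)  # amplified circulation

assert np.allclose(base, no_blend)
assert not np.allclose(amplified, base)
```

Under these assumptions, $\alpha=1$ or $\beta=0$ recovers the baseline, so the two scalars span a family of operating points around the unmodified model.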

Qualitative Comparison

Five Unstable baselines (left) and five Stable baselines (right). Top row is the baseline; bottom row is the same prompt with our skew-perturbation applied. On Unstable samples the perturbation breaks spurious mixtures and yields cleaner, object-centric structure. On Stable samples the same perturbation injects local variation (texture, composition) — illustrating the operating-point trade-off.

Figure: top row Baseline, bottom row Proposed. Left panel: Skew Perturbation on Unstable Samples (Tab. 2). Right panel: Skew Perturbation on Stable Samples.

Left: on unstable baselines, our intervention fixes spurious mixtures (extra limbs, blended objects, fragmented compositions) and recovers a coherent, object-centric subject. Right: on stable baselines, the same intervention injects benign variation in texture / background / composition — sometimes useful, sometimes drift. This regime-dependent behavior is the operating-point trade-off discussed in the paper.

BibTeX

@inproceedings{
cho2026balancing,
title={Balancing Fidelity and Diversity in Diffusion Models via Symmetric Attention Decomposition: Hopfield Perspective},
author={Hyunmin Cho and Woo Kyoung Han and Kyong Hwan Jin},
booktitle={Forty-third International Conference on Machine Learning},
year={2026},
url={https://openreview.net/forum?id=E0MKfKmQkT}
}