Gated Attention: A Simple Fix for Softmax Attention

Notes from “Gated Attention for Large Language Models”, showing how a simple post-attention sigmoid gate improves performance, training stability, and long-context generalization by introducing non-linearity and sparsity while removing attention sinks.

This post summarizes key takeaways from Gated Attention for Large Language Models: Non-linearity, Sparsity, and Attention-Sink-Free (arXiv:2505.06708v1, May 2025).

What problem does it target?

Standard softmax attention can develop attention sinks, where early tokens absorb a disproportionate share of attention regardless of their content. This distorts the attention distribution and makes long-context extension brittle.

The main idea

Add a head-specific sigmoid gate after the SDPA output (i.e., gate each head's attention output before concatenation and the output projection). Among the many gating placements tested, this "post-SDPA gate" is the most consistently effective.
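
A minimal sketch of this placement in PyTorch. The module name, `gate_proj`, and the exact shapes are my assumptions for illustration, not the paper's reference implementation; the point is only where the sigmoid gate sits relative to SDPA and the output projection.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class GatedMultiheadAttention(nn.Module):
    """Multi-head attention with a head-specific sigmoid gate after SDPA (sketch)."""

    def __init__(self, d_model: int, n_heads: int):
        super().__init__()
        assert d_model % n_heads == 0
        self.n_heads = n_heads
        self.d_head = d_model // n_heads
        self.q_proj = nn.Linear(d_model, d_model)
        self.k_proj = nn.Linear(d_model, d_model)
        self.v_proj = nn.Linear(d_model, d_model)
        # Head-specific gate: one gating vector per head, per token.
        self.gate_proj = nn.Linear(d_model, d_model)
        self.out_proj = nn.Linear(d_model, d_model)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        B, T, D = x.shape

        # (B, T, D) -> (B, n_heads, T, d_head)
        def split(t: torch.Tensor) -> torch.Tensor:
            return t.view(B, T, self.n_heads, self.d_head).transpose(1, 2)

        q, k, v = split(self.q_proj(x)), split(self.k_proj(x)), split(self.v_proj(x))
        attn = F.scaled_dot_product_attention(q, k, v, is_causal=True)

        # Post-SDPA gate: sigmoid of a per-token, per-head projection of x,
        # multiplied elementwise into each head's attention output.
        gate = torch.sigmoid(split(self.gate_proj(x)))
        attn = attn * gate

        # Concatenate heads, then apply the output projection.
        attn = attn.transpose(1, 2).reshape(B, T, D)
        return self.out_proj(attn)
```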

Why it works (two mechanisms)

  • Non-linearity: The value projection and output projection form a low-rank linear mapping; inserting a gate adds non-linearity and increases expressiveness.
  • Query-dependent sparsity: The sigmoid gate produces sparse gating scores (many near 0), filtering irrelevant context conditioned on the current query token (see the sketch after this list).
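
In symbols (my notation, not necessarily the paper's): write a_{t,s} for the attention weight of query token t on token s, W_V and W_O for the value and output projections, x_t for the current token's hidden state, and W_g for the gate projection. Without the gate, a head's contribution to token t is a purely linear, low-rank map of the attended values; the sigmoid gate breaks that linearity and is frequently near zero:

```latex
\[
o_t = \Big(\sum_{s} a_{t,s}\, x_s W_V\Big) W_O
\;\;\longrightarrow\;\;
o_t = \Big(\sigma(x_t W_g) \odot \sum_{s} a_{t,s}\, x_s W_V\Big) W_O .
\]
```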

Empirical highlights

  • Better perplexity and benchmark scores across both dense (1.7B) and MoE (~15B) settings.
  • Improved training stability (fewer loss spikes), enabling larger learning rates.
  • Strong reduction of attention sinks (first-token attention drops dramatically).
  • Better long-context extrapolation after context extension (notably on RULER with 64k–128k).

Practical notes

  • Head-specific gates work better than gates shared across heads (see the shape sketch after this list).
  • Multiplicative sigmoid gating works better than additive variants in the reported setups.
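
For concreteness, here is one way the head-specific vs. shared contrast shows up at the parameter level; the shapes and names below are illustrative assumptions, not the paper's code.

```python
import torch.nn as nn

d_model, n_heads = 2048, 16
d_head = d_model // n_heads

# Head-specific elementwise gate: a separate gating vector per head per token
# (n_heads * d_head = d_model gate values per token).
gate_head_specific = nn.Linear(d_model, n_heads * d_head)

# Head-shared gate: one gating vector per token, broadcast across all heads.
gate_shared = nn.Linear(d_model, d_head)
```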