Gated Attention: A Simple Fix for Softmax Attention

Notes from “Gated Attention for Large Language Models”, showing how a simple post-attention sigmoid gate improves performance, training stability, and long-context generalization by introducing non-linearity and sparsity while removing attention sinks.

This post summarizes key takeaways from Gated Attention for Large Language Models: Non-linearity, Sparsity, and Attention-Sink-Free (arXiv:2505.06708v1, May 2025).

What problem does it target?

Standard softmax attention can develop attention sinks, where early tokens absorb a disproportionate share of attention regardless of their content. This distorts the attention distribution and makes long-context extension brittle.

The main idea

Add a head-specific sigmoid gate after the SDPA output (i.e., gate each head's attention output before concatenation and the output projection). Among the many gating placements tested, this "post-SDPA gate" is the most consistently effective.
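
A minimal sketch of this placement in PyTorch. The module name, `gate_proj`, and the exact shapes are my assumptions for illustration, not the paper's reference implementation; the point is only where the sigmoid gate sits relative to SDPA and the output projection.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class GatedMultiheadAttention(nn.Module):
    """Multi-head attention with a head-specific sigmoid gate after SDPA (sketch)."""

    def __init__(self, d_model: int, n_heads: int):
        super().__init__()
        assert d_model % n_heads == 0
        self.n_heads = n_heads
        self.d_head = d_model // n_heads
        self.q_proj = nn.Linear(d_model, d_model)
        self.k_proj = nn.Linear(d_model, d_model)
        self.v_proj = nn.Linear(d_model, d_model)
        # Head-specific gate: one gating vector per head, per token.
        self.gate_proj = nn.Linear(d_model, d_model)
        self.out_proj = nn.Linear(d_model, d_model)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        B, T, D = x.shape

        # (B, T, D) -> (B, n_heads, T, d_head)
        def split(t: torch.Tensor) -> torch.Tensor:
            return t.view(B, T, self.n_heads, self.d_head).transpose(1, 2)

        q, k, v = split(self.q_proj(x)), split(self.k_proj(x)), split(self.v_proj(x))
        attn = F.scaled_dot_product_attention(q, k, v, is_causal=True)

        # Post-SDPA gate: sigmoid of a per-token, per-head projection of x,
        # multiplied elementwise into each head's attention output.
        gate = torch.sigmoid(split(self.gate_proj(x)))
        attn = attn * gate

        # Concatenate heads, then apply the output projection.
        attn = attn.transpose(1, 2).reshape(B, T, D)
        return self.out_proj(attn)
```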

Why it works (two mechanisms)

  • Non-linearity: The value projection and output projection form a low-rank linear mapping; inserting a gate adds non-linearity and increases expressiveness.
  • Query-dependent sparsity: The sigmoid gate produces sparse gating scores (many near 0), filtering irrelevant context conditioned on the current query token (see the sketch after this list).
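
In symbols (my notation, not necessarily the paper's): write a_{t,s} for the attention weight of query token t on token s, W_V and W_O for the value and output projections, x_t for the current token's hidden state, and W_g for the gate projection. Without the gate, a head's contribution to token t is a purely linear, low-rank map of the attended values; the sigmoid gate breaks that linearity and is frequently near zero:

```latex
\[
o_t = \Big(\sum_{s} a_{t,s}\, x_s W_V\Big) W_O
\;\;\longrightarrow\;\;
o_t = \Big(\sigma(x_t W_g) \odot \sum_{s} a_{t,s}\, x_s W_V\Big) W_O .
\]
```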

Empirical highlights

  • Better perplexity and benchmark scores across both dense (1.7B) and MoE (~15B) settings.
  • Improved training stability (fewer loss spikes), enabling larger learning rates.
  • Strong reduction of attention sinks (first-token attention drops dramatically).
  • Better long-context extrapolation after context extension (notably on RULER with 64k–128k).

Practical notes

  • Head-specific gates work better than gates shared across heads (see the shape sketch after this list).
  • Multiplicative sigmoid gating works better than additive variants in the reported setups.
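
For concreteness, here is one way the head-specific vs. shared contrast shows up at the parameter level; the shapes and names below are illustrative assumptions, not the paper's code.

```python
import torch.nn as nn

d_model, n_heads = 2048, 16
d_head = d_model // n_heads

# Head-specific elementwise gate: a separate gating vector per head per token
# (n_heads * d_head = d_model gate values per token).
gate_head_specific = nn.Linear(d_model, n_heads * d_head)

# Head-shared gate: one gating vector per token, broadcast across all heads.
gate_shared = nn.Linear(d_model, d_head)
```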