MODEL
Gated DeltaNet-2
Overview
Gated DeltaNet-2 is a linear-attention architecture proposed in a May 2026 arXiv preprint (arXiv:2605.22791). It decouples the single scalar gate of Gated DeltaNet and KDA into separate channel-wise erase and write gates, and it is trained with a chunkwise WY parallel algorithm. At 1.3B parameters, trained on 100B FineWeb-Edu tokens, it is reported to beat Mamba-2, Gated DeltaNet, KDA, and Mamba-3 variants, with the strongest gains on long-context RULER. The contribution sits in the linear-attention-versus-softmax race: decoupled gating appears to close the retrieval gap that has historically held state-space models back.
Timeline
- 2026-05-22-AI-Digest — Preprint lands as part of a heavier-than-usual HuggingFace daily papers drop (alongside π-Bench and ACC); ▲7 on the daily index. Headline result: 1.3B-parameter Gated DeltaNet-2 trained on 100B FineWeb-Edu tokens beats Mamba-2, Gated DeltaNet, KDA, and Mamba-3 variants at matched scale, with the strongest gains on long-context RULER — the benchmark linear-attention architectures have historically struggled on.
Key Developments
- Decoupled Erase/Write Gating: The architectural change replaces the single scalar gate used in Gated DeltaNet and KDA with separate channel-wise erase and write gates. Combined with the chunkwise WY parallel training algorithm, the recipe trains at parity with prior linear-attention recipes while extending the retrieval-recovery curve further into long-context regimes.
- Long-Context RULER Lead at 1.3B: The strongest reported gains over Mamba-2, Gated DeltaNet, KDA, and Mamba-3 are on long-context RULER. That the lead is largest on this benchmark is the substantive claim: state-space and linear-attention models have closed gaps on perplexity and short-context evals before without those gains translating into retrieval improvements.
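
To make the erase/write decoupling concrete, here is a minimal NumPy sketch of a single recurrent state update. It is an illustration under assumptions, not the preprint's formulation: the function names, the `erase_gate` / `write_gate` / `alpha` parameters, and the exact place where the gates enter the update are all hypothetical, chosen as one natural channel-wise generalization of the delta rule. It also shows only the sequential recurrence, not the chunkwise WY parallel algorithm.

```python
import numpy as np

def scalar_gate_step(S, k, v, beta, alpha=1.0):
    """Delta-rule step where ONE scalar beta controls both erasing and writing.

    S: (d_k, d_v) memory state, k: (d_k,) key, v: (d_v,) value.
    alpha is a scalar decay on the retained state (assumed form).
    """
    erase = beta * np.outer(k, k @ S)  # remove what is currently stored under k
    write = beta * np.outer(k, v)      # store the new value under k
    return alpha * (S - erase) + write

def decoupled_gate_step(S, k, v, erase_gate, write_gate, alpha=1.0):
    """Hypothetical decoupled variant: separate per-channel gates of shape (d_k,).

    erase_gate scales how strongly each key channel is cleared,
    write_gate scales how strongly each key channel receives the new value.
    """
    erase = np.outer(erase_gate * k, k @ S)
    write = np.outer(write_gate * k, v)
    return alpha * (S - erase) + write

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    d_k, d_v = 4, 3
    S = rng.normal(size=(d_k, d_v))
    k = rng.normal(size=d_k)
    k /= np.linalg.norm(k)             # unit-norm key
    v = rng.normal(size=d_v)

    # Erase fully on every channel, but write only into the first two channels.
    erase_gate = np.ones(d_k)
    write_gate = np.array([1.0, 1.0, 0.0, 0.0])
    S_new = decoupled_gate_step(S, k, v, erase_gate, write_gate)
    print(S_new.shape)
```

In this sketch, setting both gate vectors to the same constant recovers the scalar-gate step, so the decoupled form strictly generalizes it. The extra freedom is that a channel can be cleared without being rewritten, or written without being cleared, which a single shared scalar cannot express.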