Gated DeltaNet-2

Overview

Gated DeltaNet-2 is a linear-attention architecture proposed in a May 2026 arXiv preprint (arXiv:2605.22791). It decouples the single scalar gate of Gated DeltaNet and KDA into channel-wise erase and write gates and is trained with a chunkwise WY parallel algorithm. At 1.3B parameters on 100B FineWeb-Edu tokens it is reported to beat Mamba-2, Gated DeltaNet, KDA, and Mamba-3 variants, with the strongest gains on long-context RULER. The work bears on the contest between linear attention and softmax attention: decoupled gating appears to close the retrieval gap that has historically held state-space models back.

Timeline

2026-05-22-AI-Digest: Preprint lands as part of a heavier-than-usual HuggingFace daily papers drop (alongside π-Bench and ACC); ▲7 on the daily index. Headline result: a 1.3B-parameter Gated DeltaNet-2 trained on 100B FineWeb-Edu tokens beats Mamba-2, Gated DeltaNet, KDA, and Mamba-3 variants at matched scale, with the strongest gains on long-context RULER, the benchmark on which linear-attention architectures have historically struggled.
2026-07-12-AI-Digest: Gated DeltaNet-2 surfaces in HuggingFace papers as one of four recurrent linear-attention variants compared in arXiv:2607.07953 (“Linear Attention Architectures: Mechanisms, Trade-offs, and Cross-Layer Routing”), a comparative study of softmax attention vs. DeltaNet, Gated DeltaNet, Kimi Delta Attention, and Gated DeltaNet-2 at 350M / 1.3B / 3B scale. Kimi Delta Attention paired with Muon reaches the lowest final validation loss; the paper’s proposed Cross-Layer Value Routing (CLVR) modestly improves DeltaNet variants, including Gated DeltaNet-2. The paper was flagged in the Papers section of 2026-07-11-AI-Digest at ▲13; today the same paper is ▲10 on HF daily-papers, so the twenty-four-hour upvote count went down, not up. Takeaway for the corpus: the comparative study is the durable read, not the CLVR mechanism itself, and Gated DeltaNet-2 now sits inside the reference cohort for linear-attention benchmarking rather than standing as the marquee architecture on its own.

Key Developments

Decoupled Erase/Write Gating: The architectural change replaces the single scalar gate used in Gated DeltaNet and KDA with separate channel-wise erase and write gates. Combined with the chunkwise WY parallel training algorithm, the recipe is reported to train at parity with prior linear-attention recipes while extending the retrieval-recovery curve further into long-context regimes.
Long-Context RULER Lead at 1.3B: The strongest reported gains over Mamba-2, Gated DeltaNet, KDA, and Mamba-3 are on long-context RULER. That the lead is largest on this benchmark is the substantive claim: state-space and linear-attention models have closed gaps on perplexity and short-context evals before without those gains translating into retrieval improvements.